Debdeep Mukhopadhyay - Rajat Subhra Chakraborty - Hardware Security - Design, Threats, and Safeguards-Chapman & Hall - CRC (2014)
HARDWARE SECURITY
Design, Threats,
and Safeguards
Debdeep Mukhopadhyay
Rajat Subhra Chakraborty
Indian Institute of Technology Kharagpur
West Bengal, India
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the valid-
ity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or uti-
lized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopy-
ing, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Foreword xvii
Preface xix
I Background 1
1 Mathematical Background 3
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Modular Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Groups, Rings, and Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Greatest Common Divisors and Multiplicative Inverse . . . . . . . . . . . . 8
1.4.1 Euclidean Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.2 Extended Euclidean Algorithm . . . . . . . . . . . . . . . . . . . . . 10
1.4.3 Chinese Remainder Theorem . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Subgroups, Subrings, and Extensions . . . . . . . . . . . . . . . . . . . . . 13
1.6 Groups, Rings, and Field Isomorphisms . . . . . . . . . . . . . . . . . . . . 14
1.7 Polynomials and Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.8 Construction of Galois Field . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.9 Extensions of Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.10 Cyclic Groups of Group Elements . . . . . . . . . . . . . . . . . . . . . . . 19
1.11 Efficient Galois Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.11.1 Binary Finite Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.12 Mapping between Binary and Composite Fields . . . . . . . . . . . . . . . 24
1.12.1 Constructing Isomorphisms between Composite Fields . . . . . . . . 25
1.12.1.1 Mapping from GF(2^k) to GF((2^n)^m), where k = nm . . . . 25
1.12.1.2 An Efficient Conversion Algorithm . . . . . . . . . . . . . . 26
1.13 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.3.4 AddRoundKey . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.3.4 Key-Scheduling in AES . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4 Rijndael in Composite Field . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.1 Expressing an Element of GF(2^8) in Subfield . . . . . . . . . . . . . 44
2.4.2 Inversion of an Element in Composite Field . . . . . . . . . . . . . . 45
2.4.3 The Round of AES in Composite Fields . . . . . . . . . . . . . . . . 46
2.5 Elliptic Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.5.1 Simplification of the Weierstraß Equation . . . . . . . . . . . . . . . 50
2.5.2 Singularity of Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.5.3 The Abelian Group and the Group Laws . . . . . . . . . . . . . . . 51
2.5.4 Elliptic Curves with Characteristic 2 . . . . . . . . . . . . . . . . . . 52
2.5.5 Projective Coordinate Representation . . . . . . . . . . . . . . . . . 57
2.6 Scalar Multiplications: LSB First and MSB First Approaches . . . . . . . . 58
2.7 Montgomery’s Algorithm for Scalar Multiplication . . . . . . . . . . . . . . 59
2.7.1 Montgomery’s Ladder . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.7.2 Faster Multiplication on EC without Pre-computations . . . . . . . 60
2.7.3 Using Projective co-ordinates to Reduce the Number of Inversions . 62
2.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8.2.1.3 Faults Are Injected between the Output of 7th and the Input
of 8th MixColumns . . . . . . . . . . . . . . . . . . . . . . 207
8.2.1.4 Faults Are Injected between the Output of 8th and the Input
of 9th MixColumns . . . . . . . . . . . . . . . . . . . . . . 208
8.2.2 Relationships between the Discussed Fault Models . . . . . . . . . . 209
8.2.2.1 Faults Are Injected in Any Location and Any Round . . . 209
8.2.2.2 Faults Are Injected in AddRoundKey in Round 0 . . . . . 209
8.2.2.3 Faults Are Injected between the Output of 7th and the Input
of 8th MixColumns . . . . . . . . . . . . . . . . . . . . . . 209
8.2.2.4 Faults Are Injected between the Output of 8th and the Input
of 9th MixColumns . . . . . . . . . . . . . . . . . . . . . . 210
8.3 Principle of Differential Fault Attacks on AES . . . . . . . . . . . . . . . . 211
8.3.1 Differential Properties of AES S-Box . . . . . . . . . . . . . . . . . . 211
8.3.2 DFA of AES Using Bit Faults . . . . . . . . . . . . . . . . . . . . . . 212
8.3.3 Bit-Level DFA of Last Round of AES . . . . . . . . . . . . . . . . . 212
8.3.4 Bit-Level DFA of First Round of AES . . . . . . . . . . . . . . . . . 213
8.4 State-of-the-art DFAs on AES . . . . . . . . . . . . . . . . . . . . . . . . . 214
8.4.1 Byte-Level DFA of Penultimate Round of AES . . . . . . . . . . . . 214
8.4.1.1 DFA Using Two Faults . . . . . . . . . . . . . . . . . . . . 216
8.4.1.2 DFA Using Only One Fault . . . . . . . . . . . . . . . . . . 216
8.4.1.3 DFA with Reduced Time Complexity . . . . . . . . . . . . 220
8.5 Multiple-Byte DFA of AES-128 . . . . . . . . . . . . . . . . . . . . . . . . 222
8.5.1 DFA According to Fault Model DM 0 . . . . . . . . . . . . . . . . . 222
8.5.1.1 Equivalence of Faults in the Same Diagonal . . . . . . . . . 222
8.5.2 DFA According to Fault Model DM 1 . . . . . . . . . . . . . . . . . 224
8.5.3 DFA According to Fault Model DM 2 . . . . . . . . . . . . . . . . . 224
8.6 Extension of the DFA to Other Variants of AES . . . . . . . . . . . . . . . 225
8.6.1 DFA on AES-192 States . . . . . . . . . . . . . . . . . . . . . . . . . 226
8.6.2 DFA on AES-256 States . . . . . . . . . . . . . . . . . . . . . . . . . 226
8.6.2.1 First Phase of the Attack on AES-256 States . . . . . . . . 226
8.6.2.2 Second Phase of the Attack on AES-256 States . . . . . . . 228
8.7 DFA of AES Targeting the Key Schedule . . . . . . . . . . . . . . . . . . . 230
8.7.1 Attack on AES-128 Key Schedule . . . . . . . . . . . . . . . . . . . . 231
8.7.1.1 First Phase of the Attack on AES-128 Key Schedule . . . . 231
8.7.1.2 Second Phase of the Attack on AES-128 Key Schedule . . . 234
8.7.1.3 Time Complexity Reduction . . . . . . . . . . . . . . . . . 234
8.7.2 Proposed Attack on AES-192 Key Schedule . . . . . . . . . . . . . . 236
8.7.2.1 First Phase of the Attack on AES-192 Key Schedule . . . . 236
8.7.2.2 Second Phase of the Attack on AES-192 Key Schedule . . . 239
8.7.3 Proposed Attack on AES-256 Key Schedule . . . . . . . . . . . . . . 240
8.7.3.1 First Phase of the Attack on AES-256 Key Schedule . . . . 241
8.7.3.2 Second Phase of the Attack on AES-256 Key Schedule . . . 243
8.8 DFA Countermeasures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
8.8.1 Hardware Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . 246
8.8.2 Time Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
8.8.3 Information Redundancy . . . . . . . . . . . . . . . . . . . . . . . . 248
8.8.3.1 Parity-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
8.8.3.2 Parity-16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
8.8.3.3 Parity-32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
8.8.3.4 Robust Code . . . . . . . . . . . . . . . . . . . . . . . . . . 250
8.8.4 Hybrid Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Bibliography 505
Index 535
In the past decade, the field of hardware security has grown into a major research topic,
attracting intense interest from academics, industry, and governments alike. Indeed, elec-
tronic information technology now controls, documents, and supports virtually every aspect
of our life. Be it electronic money, ID-cards, car electronics, industrial controllers, or Inter-
net routers, this world is governed by bits as much as it is governed by physical resources.
Hardware security addresses multiple key requirements in this information landscape: the
need to securely handle and store electronic information any time, anywhere; and the need
to do it very efficiently in terms of resource cost and energy cost.
Hardware security encompasses a rich field of research which combines multiple knowledge
domains including, among others, discrete mathematics, algorithm design and transformations,
digital architecture design with analog twists, and controlled production technologies. Hardware
security turns traditional notions of hardware design on their head. Examples are the beneficial
use of noise to produce random numbers and to feed security protocols, and the
design of hardware that has a constant, rather than minimal, power dissipation, to avoid
side-channel leakage.
The origins of hardware security lie within cryptographic engineering, a field young
enough that new artifacts and methods still carry their inventors' last names. Indeed, although
cryptography has been around for centuries, it is only in the last 50 years that it has
established itself as a field of science that enables participation by a broad community.
Cryptography is now a fundamental part of information technology. Cryptographic
Engineering is concerned with design and implementation aspects of cryptography. It is a
vibrant area with established conference venues such as CHES attracting hundreds of par-
ticipants. New ideas in cryptography, such as privacy-friendly subscriptions and multi-party
computation, provide a continuous influx of new design challenges to the hardware security
engineer.
Besides a continuous innovation at the level of applications, secure hardware has also
benefited, and is still benefiting, from tremendous improvements in technology. Although
the impending end of the road for traditional CMOS has been predicted for many years, the
technology has propelled secure hardware into applications that would have been unthink-
able just a decade ago. In fact, the ’Things’ in the Internet of Things would be impossible
without secure hardware, and this secure hardware would be impossible without the dense
integration and high efficiencies offered by technology.
A fourth factor that makes secure hardware an exciting field of research is in its role as
last-line-of-defense. Traditional software-based cyber-security has been plagued by security
issues for many years, and hardware has been touted as a safe haven for our private electronic
information. However, secure hardware is under attack, as well. Adversaries have learned to
pick up side-channel information from secure processing using low-cost, simple measurement
equipment; they have learned how faults can reveal the inner workings of algorithms that
are conceived as a black box by cryptographers. Adversaries have even learned to interfere
with the IC design flow itself, subverting the design with a Trojan and making the IC do
something other than what it was designed for.
It is with great excitement that I see this book on ’Hardware Security Design’ published.
In recent years, many schools have started graduate courses on this topic, often relying on
advanced research papers. There is a clear need for a general, comprehensive text that can
be used as the basis for a course. Furthermore, there is a great need for a reference text that
will help designers and practitioners in the field to understand the main issues. This text
stands out among others in that it has been written by a team of just two authors. They
have not only made an admirable effort, but have also ensured a consistent presentation
and discussion.
The authors have done an excellent job at systematically introducing the field of hard-
ware security. They start with the fundamentals, touch upon practical aspects of hardware
design, discuss the mapping of symmetric-key and public-key algorithms, and finally provide
an in-depth treatment of the last-line-of-defense aspects of secure hardware. The book is
supported by an extensive collection of references, testifying to the intensive level of research
ongoing in this exciting, innovative field. I hope the book will encourage new researchers and
practitioners to join the effort in building the fundamentals of secure information technology
that our future calls for.
Patrick Schaumont
Blacksburg, August 2014
With the ever-increasing proliferation of e-business practices, great volumes of secure busi-
ness transactions and data transmissions are routinely carried out in an encrypted form in
devices ranging in scale from personal smartcards to business servers. These cryptographic
algorithms are often computationally intensive and are implemented in hardware and embedded
systems to meet real-time requirements. Developing expertise in this field of hardware
security requires background in a wide range of topics. On one end is an understanding of the
mathematical principles behind the cryptographic algorithms and the underlying theory
of their working. These principles are necessary to innovate in the process of developing
efficient implementations, which is central to this discourse. This also needs to be
backed by exposure to the field of Very Large Scale Integration (VLSI) and embedded systems.
Understanding the platforms on which the designs are developed is needed to build
high-performance and compact implementations of the cryptographic algorithms. On the other
end are the threats which arise from the hardware implementations. History has taught
us that strong cryptographic algorithms and their efficient designs are just the beginning.
Naïve optimizations to develop efficient hardware devices can lead to embarrassing attacks
which can have catastrophic implications. Often these attacks are based on exploitation
of side-channels: covert channels that leak information of which the designers of the
cryptographic algorithms or conventional hardware engineers are often unaware. Further,
the complexity of hardware designs also makes it increasingly difficult to detect bugs
and modifications. These modifications can be malicious, as they can harbor hardware
Trojans, which when triggered can lead to disastrous system failures and/or security
breaches. Another pressing issue in the world of cyber-security arises from the threats of
counterfeit integrated circuits (ICs). Detecting and protecting against these vulnerabili-
ties requires “unclonable” novel hardware security primitives, which can act as fingerprint
generators for the manufactured IC instances.
This book thus attempts to bring together on a single platform a treatment of all these aspects
of hardware security, which we believe are quite challenging and unique. The book is
targeted at a senior undergraduate or postgraduate course on hardware security. It
is suitable not only for students from CS and EE backgrounds, but also for those from mathematics.
The book is also suitable for self-study by the practising professional who requires
exposure to state-of-the-art research on hardware security. Although we have strived to provide
a contemporary overview of the design, threats, and safeguards required for assurance in
hardware security, we believe that the field, because of its constant evolution, will always
keep this work dynamic. Having said that, the fundamentals, which we have attempted to
bring out, should stand out in the mind of the reader, assisting in future developments and
research.
As mentioned, the content of the book covers modern day hardware security issues,
along with fundamental aspects which are imperative for a comprehensive understanding.
We briefly describe these aspects:
• Mathematical Background and Cryptographic Algorithms: Modern-day cryp-
tographic systems rely heavily on field theory. Understanding of field arithmetic tech-
niques, definitions, constructions, and inter-relations of several fields are required to
endeavours. Debdeep would like to dedicate the book to his parents, Niharendu and Dipa
Mukhopadhyay, without whom nothing would be possible.
Rajat Subhra Chakraborty would like to thank his family members, especially his
wife Munmun, for their love, patience and understanding, and for allowing him to concen-
trate on the book during the weekends and vacations. Rajat would like to dedicate this
book to the memory of his father, Pratyush Kumar Chakraborty, who passed away while
this book was being written. Rest in peace, Baba.
1.1 Homomorphism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Squaring Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.3 Modular Reduction with Trinomial x^233 + x^74 + 1 . . . . . . . . . . . . 23
4.15 Different Data-Flow of (a) Encryption and (b) Decryption Datapath . . . . 111
4.16 (a) Architecture of Encryption Datapath, (b) Architecture of Decryption
Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.17 Single Chip Encryptor/Decryptor Core Design . . . . . . . . . . . . . . . . 113
7.10 Plot of DOM for a wrong guess of next bit of x compared to a correct guess
with simulated power traces . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
7.11 Effect of Faults on Secrets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
7.12 Memory Accesses in a Block Cipher Implementation . . . . . . . . . . . . . 195
7.13 Generic Structure of a Block Cipher . . . . . . . . . . . . . . . . . . . . . . 197
12.1 Schemes for Boolean function modification and modification cell. . . . . . . 354
12.2 The proposed functional and structural obfuscation scheme by modification
of the state transition function and internal node structure. . . . . . . . . . 356
12.3 Modification of the initialization state space to embed authentication signa-
ture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
12.4 Hardware obfuscation design flow along with steps of the iterative node rank-
ing algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
12.5 SoC design modification to support hardware obfuscation. An on-chip con-
troller combines the input patterns with the output of a PUF block to pro-
duce the activation patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . 360
12.6 Challenges and benefits of the HARPOON design methodology at different
stages of a hardware IP life cycle. . . . . . . . . . . . . . . . . . . . . . . . 361
12.7 Example of a Verilog RTL description and its obfuscated version [82]: a) orig-
inal RTL; b) technology independent, unoptimized gate-level netlist obtained
through RTL compilation; c) obfuscated gate-level netlist; d) decompiled ob-
fuscated RTL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
12.8 Design transformation steps in course of the proposed RTL obfuscation pro-
cess. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
12.9 Transformation of a block of RTL code into CDFG [83]. . . . . . . . . . . . 364
12.10 Example of hosting the registers of the mode-control FSM [83]. . . . . . . . 365
12.11 Examples of control-flow obfuscation: (a) original RTL, CDFG; (b) obfus-
cated RTL, CDFG [83]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
12.12 Example of datapath obfuscation allowing resource sharing [83]. . . . . . . 366
12.13 Example of RTL obfuscation by CDFG modification: (a) original RTL; (b)
obfuscated RTL [83]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
12.14 Binary Decision Diagram (BDD) of a modified node. . . . . . . . . . . . . . 370
12.15 Flow diagram for the proposed STG modification-based RTL obfuscation
methodology [82]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
12.16 Flow diagram for the proposed CDFG modification-based RTL obfuscation
methodology [83]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
12.17 Scheme with initialization key sequences of varying length (3, 4 or 5). . . . 376
13.12 c17 with an asynchronous binary counter-type Trojan. The reset signal for
the Trojan counter has not been shown for simplicity. . . . . . . . . . . . . 396
13.13 Failure trends of the c17 circuit with two types of inserted sequential Trojans
(described in Sec. 13.4.3). . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
13.14 Xilinx Virtex-II configuration bitstream file organization [159, 257]. . . . . 399
13.15 Proposed bitstream modification-based attack. . . . . . . . . . . . . . . . . 402
13.16 Examples of two possible cases of Type-I Trojan insertion. . . . . . . . . . 403
13.17 Demonstration of successful configuration of the RO-based Trojan insertion. 404
13.18 Effects of inserting 3-stage ring-oscillator-based MHL in Virtex-II FPGA:
(a) rise of temperature as a function of the number of inserted 3-stage ring
oscillators; (b) theoretical aging acceleration factor as a function of percent-
age FPGA resource utilization by ring-oscillator Trojans. . . . . . . . . . . 405
14.1 Impact of sample size on trigger and Trojan coverage for benchmarks c2670
and c3540, N = 1000 and q = 4: (a) deviation of trigger coverage, and (b)
deviation of Trojan coverage. . . . . . . . . . . . . . . . . . . . . . . . . . . 414
14.2 Impact of N (number of times a rare point satisfies its rare value) on the
trigger/Trojan coverage and test length for benchmarks (a) c2670 and (b)
c3540. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
14.3 Integrated framework for rare occurrence determination, test generation us-
ing MERO approach, and Trojan simulation. . . . . . . . . . . . . . . . . . 416
14.4 Trigger and Trojan coverage with varying number of trigger points (q) for
benchmarks (a) c3540 and (b) c7552, at N = 1000, θ = 0.2. . . . . . . . . . 418
14.5 Trigger and Trojan coverage with trigger threshold (θ) for benchmarks (a)
c3540 and (b) c7552, for N = 1000, q = 4. . . . . . . . . . . . . . . . . . . . 418
14.6 FSM model with no loop in state transition graph. . . . . . . . . . . . . . . 419
15.1 (a) Average IDDT values at 100 random process corners (with maximum
variation of ±20% in inter-die Vth ) for c880 circuit. The impact of Trojan
(8-bit comparator) in IDDT is masked by process noise. (b) Corresponding
Fmax values. (c) The Fmax vs. IDDT plot shows the relationship between
these parameters under inter-die process variations. Trojan-inserted chips
stand out from the golden trend line. (d) The approach remains effective
under both inter-die and random intra-die process variations. A limit line is
used to account for the spread in IDDT values from the golden trend line. . 424
15.2 Effect of process variations on device threshold voltage in an IC. . . . . . . 426
15.3 Major steps in the multiple-parameter Trojan detection approach. . . . . . 429
15.4 Schematic showing the functional modules of the AES cipher test circuit. The
AES “Key Expand” module is clock-gated and operand isolation is applied to
the “SBOX” modules to reduce the background current, thereby improving
detection sensitivity. The Trojan instance is assumed to be in the logic block
2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
15.5 The correlation among IDDT , IDDQ and Fmax can be used to improve Trojan
detection confidence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
15.6 IDDT vs. Fmax relationship for both golden and tampered AES and IEU
circuits showing the sensitivity of our approach for detecting different Trojan
circuits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
16.1 The obfuscation scheme for protection against hardware Trojans: (a) modi-
fied state transition graph and (b) modified circuit structure. . . . . . . . . 455
16.2 Fractional change in average number of test vectors required to trigger a
Trojan, for different values of average fractional mis-estimation of signal
probability f and Trojan trigger nodes (q). . . . . . . . . . . . . . . . . . . 459
16.3 Comparison of input logic cones of a selected flip-flop in s15850: (a) original
design and (b) obfuscated design. . . . . . . . . . . . . . . . . . . . . . . . . 459
16.4 Steps to find unreachable states for a given set of S state elements in a circuit. 460
16.5 Framework to estimate the effectiveness of the obfuscation scheme. . . . . . 462
16.6 Variation of protection against Trojans in s1196 as a function of (a), (b) and
(c): the number of added flip-flops in state encoding (S); (d), (e) and (f):
the number of original state elements used in state encoding (n). For (a), (b)
and (c), four original state elements were selected for state encoding, while
for (d), (e) and (f), four extra state elements were added. . . . . . . . . . . 463
16.7 Effect of obfuscation on Trojans: (a) 2-trigger node Trojans (q = 2), and (b)
4-trigger node Trojans (q = 4). . . . . . . . . . . . . . . . . . . . . . . . . . 465
16.8 Improvement of Trojan coverage in obfuscated design compared to the orig-
inal design for (a) Trojans with 2 trigger nodes (q = 2) and (b) Trojans with
4 trigger nodes (q = 4). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
16.9 Comparison of conventional and proposed SoC design flows. In the proposed
design flow, protection against malicious modification by untrusted CAD
tools can be achieved through obfuscation early in the design cycle. . . . . 467
16.10 Proposed FPGA design flow for protection against CAD tools. . . . . . . . 468
16.11 Obfuscation for large designs can be efficiently realized using multiple parallel
state machines which are constructed with new states due to additional state
elements as well as unreachable states of original state machine. . . . . . . 468
16.12 Functional block diagram of a crypto-SoC showing possible Trojan attack to
leak secret key stored inside the chip. Obfuscation coupled with bus scram-
bling can effectively prevent such attack. . . . . . . . . . . . . . . . . . . . 469
16.13 Module isolation through moats and drawbridges [156]. . . . . . . . . . . . 470
16.14 DEFENSE hardware infrastructure for runtime monitoring [7]. . . . . . . . . 471
4.1 XOR count of Square and Scaling circuit for GF(2^4) Polynomial Basis . . . 95
4.2 XOR count of Square and Scaling circuit for GF(2^4) Normal Basis . . . . . 96
4.3 Numbers of rounds (Nr ) as a function of the block and key length . . . . . 106
4.4 C2 C1 for different Key and Data . . . . . . . . . . . . . . . . . . . . . . . . 109
4.5 Throughput in FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.6 Throughput in ASIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.7 Performances of Compared Cores . . . . . . . . . . . . . . . . . . . . . . . . 110
14.1 Comparison of Trigger and Trojan Coverage Among ATPG Patterns [246],
Random (100K, input weights: 0.5), and MERO Patterns for q = 2 and
q = 4, N = 1000, θ = 0.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
15.1 Trojan Detection Sensitivity for Different Trojan Sizes in AES . . . . . . . 434
15.2 Trojan Detection Sensitivity for Different Trojan Sizes in IEU . . . . . . . . 435
15.3 Trojan Coverage for ISCAS-85 Benchmark Circuits . . . . . . . . . . . . . . 441
15.4 Probability of Detection and Probability of False Alarm (False Positives) . 450
Background
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Modular Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Groups, Rings, and Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Greatest Common Divisors and Multiplicative
Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Euclidean Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.2 Extended Euclidean Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.3 Chinese Remainder Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Subgroups, Subrings, and Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Groups, Rings, and Field Isomorphisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7 Polynomials and Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.8 Construction of Galois Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.9 Extensions of Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.10 Cyclic Groups of Group Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.11 Efficient Galois Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.11.1 Binary Finite Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.12 Mapping between Binary and Composite Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.12.1 Constructing Isomorphisms between Composite Fields . . . . . . . . . . . . . . . . . . . . . . . . 25
1.12.1.1 Mapping from GF(2^k) to GF((2^n)^m), where k = nm . . . . . . . . . . . . . . . . . . 25
1.12.1.2 An Efficient Conversion Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.13 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Galileo Galilei
1.1 Introduction
Mathematics is often referred to as the mother of all sciences. It is everywhere, and
without it no scientific study would have progressed. Mathematics defines not only the laws
of the universe, but also gives us insight into solutions to many unsolved mysteries around us.
Indeed, mathematical discoveries have often given rise to many new questions, a large share of
which remain open to our present knowledge. These unproven results often form the backbone
of science. To quote von Neumann, "In mathematics you don't understand things. You just
get used to them." Nevertheless, we start from these results and discover, by logical deduction,
further results which are used in several discourses. The study of any engineering discipline thus
relies on the application of these mathematical principles to solve practical problems for
the benefit of mankind.
The study of hardware security, like many other engineering subjects, relies heavily
on mathematics. To start with, it involves the efficient implementation of complex cryptographic
algorithms on various platforms. By the term efficiency, we often
imply the resource utilization and the time required by a hardware circuit. Although there
are other measures, like power consumption, even restricting ourselves to these two classical
objectives of hardware circuits poses several challenges, solutions to which are often found
in mathematical tricks. The choice of a suitable algorithm for a specific purpose
implies that one has to be aware of the contemporary algorithms in the crypto literature. But
these algorithms are often based on deep theories in mathematics: number theory, field
theory, and the like. Hence, to obtain a proper understanding and, most importantly, to
compare the algorithms, one needs to develop a grasp of these topics. Once an algorithm
is chosen, the underlying primitives must be understood: for example, a given algorithm
may employ a multiplication step, more specifically a finite field multiplier. So the question
is, which multiplier should be chosen? Each design has its positives and negatives; thus
a designer equipped with proper mathematical training and algorithm-analysis prowess
can make these choices in a prudent manner, which leads to efficient architectures. Today,
hardware designs of cryptographic algorithms are threatened by attacks that exploit the
properties of the implementations rather than the algorithms themselves. These attacks,
commonly referred to as side channel attacks, rely heavily on statistics. Thus, in order
to develop suitable defenses against these attacks, the designer also needs to understand
these statistical methods. Knowledge of these methods helps the designer not only to
improve the existing attacks, but finally to develop sound counter-measures. In short, the design of
efficient and secure implementations of cryptographic algorithms needs not only prudent
engineering practices and architectural knowledge, but also an understanding of the underlying
mathematical principles.
In this chapter, we present an overview of some of the important mathematical concepts
that are often useful for the understanding of hardware security designs.
multiplicative inverse, that is, another number which when multiplied with it gives the number 1,
known as the multiplicative identity. Since we can define arithmetic on a finite set, we can
envisage developing ciphering algorithms on these numbers. For this we have to generalize
this observation, and in particular answer questions such as whether there is anything special
about the number 5 that we chose. It turns out that the primality of the number 5 has a very
significant role in the theory that follows. We develop the results gradually in what follows.
It may be kept in mind that we shall often state important results without formal proofs,
which can be found in more detail in textbooks on number theory and algebra.
We first state some definitions in the following:
Definition 1.2.1 An integer a is said to be congruent to an integer b modulo m, when m
divides b − a and is denoted as a ≡ b mod m.
Congruence modulo m is an equivalence relation on the integers.
• (Reflexivity): any integer is congruent to itself modulo m.
• (Symmetry): a ≡ b mod m ⇒ b ≡ a mod m.
• (Transitivity): (a ≡ b mod m) ∧ (b ≡ c mod m) ⇒ a ≡ c mod m.
The expression a ≡ b mod m can also be written as ∃k ∈ Z st. a = b + km. Equivalently,
when divided by m, both a and b leave the same remainder.
Definition 1.2.2 The equivalence class of a mod m consists of all integers that are ob-
tained by adding a with integral multiples of m. This class is also called the residue class of
a mod m.
Example 1 Residue class of 1 mod 4 is the set {1, 1 ± 4, 1 ± 2 ∗ 4, 1 ± 3 ∗ 4, . . .}
The set of residue classes mod m is denoted by the symbol Z/mZ. Each class has a represen-
tative element among 0, 1, . . . , m − 1. The equivalence class for a representative element, say 0, is
denoted by [0]. The set {0, 1, . . . , m − 1} formed of the m incongruent residues is also called
a complete system.
Example 2 Complete systems for a given modulus m are not unique. For example, for m = 5,
the sets {0, 1, 2, 3, 4} and {−12, −15, 82, −1, 31} are both complete systems.
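The check for a complete system is mechanical; the short Python sketch below (the helper name `is_complete_system` is our own) verifies both sets of Example 2.

```python
def is_complete_system(nums, m):
    """True iff nums contains exactly one member of each residue class mod m."""
    return sorted(n % m for n in nums) == list(range(m))

print(is_complete_system([0, 1, 2, 3, 4], 5))         # True
print(is_complete_system([-12, -15, 82, -1, 31], 5))  # True
```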
The following theorem is straightforward and is stated without any proof.
Theorem 1 a ≡ b mod m and c ≡ d mod m, implies that −a ≡ −b mod m, a + c ≡
b + d mod m, and a ∗ c ≡ b ∗ d mod m, ∀a, b, c, d, m ∈ Z.
This result is particularly useful as it shows that operations in modular arithmetic can be
made much easier by performing intermediate modular reduction. The following example
illustrates this point.
Example 3 Prove that 2^{2^5} + 1 is divisible by 641.
We note that 641 = 640 + 1 = 5 ∗ 2^7 + 1. Thus,
5 ∗ 2^7 ≡ −1 mod 641
⇒ (5 ∗ 2^7)^4 ≡ (−1)^4 mod 641
⇒ 5^4 ∗ 2^28 ≡ 1 mod 641
⇒ (625 mod 641) ∗ 2^28 ≡ 1 mod 641
⇒ (−2^4) ∗ 2^28 ≡ 1 mod 641
⇒ 2^32 ≡ −1 mod 641
This example shows that the complicated computation can be simplified by performing
modular reductions and subsequently carrying on the computations. This fact holds true
for all the ciphering operations which work on finite sets of data that we shall subsequently
study.
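The same derivation can be checked numerically. The snippet below is plain Python, reducing after every step so that no intermediate value exceeds the modulus.

```python
m = 641
x = (5 * 2**7) % m             # 640, i.e. -1 mod 641
assert x == m - 1
assert pow(x, 4, m) == 1       # (5 * 2^7)^4 = 5^4 * 2^28 ≡ 1 (mod 641)
assert pow(2, 32, m) == m - 1  # 2^32 ≡ -1 (mod 641)
print((2**32 + 1) % m)         # 0: 641 divides 2^(2^5) + 1
```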
Example 6 The pairs (Z, +), (Z, ∗), (Z/mZ, +), (Z/mZ, ∗) are abelian semigroups.
It can be easily seen that a semigroup can have at most one identity element. This is so
because, if there were two identities e and e′, we would have e = e ◦ e′ = e′.
Example 7 (Z, +) has the identity element 0, and the inverse of a is −a. (Z, ∗) has the identity element
1, while the only invertible elements are +1 and −1. (Z/mZ, +) has as identity element
mZ. The inverse of a + mZ is −a + mZ. This monoid is often referred to as Z_m. (Z/mZ, ∗) has as
identity element 1 + mZ. The invertible elements are the classes t + mZ with gcd(t, m) = 1, i.e.,
those for which t and m are mutually co-prime. The invertible elements of this monoid form a set
denoted by Z*_m.
Definition 1.3.7 A group is a monoid in which every element is invertible. The group is
commutative or abelian if the monoid is commutative.
Example 8 (Z, +) and (Z/mZ, +) are abelian groups. However, (Z, ∗) is not an abelian
group. ((Z/mZ)\{0}, ∗) is an abelian group if and only if m is prime.
Definition 1.3.8 A ring is a triplet (R, +, ∗) such that (R, +) is an abelian group, (R, ∗) is
a monoid and the operation * distributes over the operation +, i.e. ∀x, y, z ∈ R, x ∗ (y + z) =
(x ∗ y) + (x ∗ z). The ring is called commutative if the monoid (R, ∗) is commutative.
Definition 1.3.9 A field is a commutative ring (R, +, ∗) in which every element in the
monoid (R, ∗) is invertible.
Example 9 The set of integers is not a field. The sets of real numbers and of complex numbers
form fields. The residue classes modulo a prime number, excluding the element 0 for multiplication,
form a field.
Thus, summarizing the above concepts, the definitions of groups, rings, and fields are
restated below:
Definition 1.3.10 A group denoted by {G, ·}, is a set of elements G with a binary operation
'·', such that for each ordered pair (a, b) of elements in G, the following axioms are obeyed
[122][369]:
• Closure: If a, b ∈ G, then a · b ∈ G.
• Associativity: For all a, b, c ∈ G, a · (b · c) = (a · b) · c.
• Identity: There exists an element e ∈ G such that a · e = e · a = a for all a ∈ G.
• Inverse: For each a ∈ G, there exists an element a^{-1} ∈ G such that a · a^{-1} = a^{-1} · a = e.
Definition 1.3.11 A ring denoted by {R, +, ∗} or simply R is a set of elements with two
binary operations called addition and multiplication, such that for all a, b, c ∈ R the following
are satisfied:
• R is an abelian group under addition.
• The closure property of R is satisfied under multiplication.
• The associativity property of R is satisfied under multiplication.
• There exists a multiplicative identity element denoted by 1 such that for every a ∈ R,
a ∗ 1 = 1 ∗ a = a.
• Distributive Law : For all a, b, c ∈ R, a∗(b+c) = a∗b+a∗c and (a+b)∗c = a∗c+b∗c.
The set of integers, rational numbers, real numbers, and complex numbers are all rings.
A ring is said to be commutative if the commutative property under multiplication holds.
That is, for all a, b ∈ R, a ∗ b = b ∗ a.
Definition 1.3.12 A field denoted by {F, +, ∗} or simply F is a commutative ring which
satisfies the following properties
• Multiplicative inverse : For every element a ∈ F except 0, there exists a unique element
a−1 such that a · (a−1 ) = (a−1 ) · a = 1. a−1 is called the multiplicative inverse of the
element a.
• No zero divisors : If a, b ∈ F and a · b = 0, then either a = 0 or b = 0.
As we have seen the set of rational numbers, real numbers and complex number are
examples of fields, while the set of integers is not. This is because the multiplicative inverse
property does not hold in the case of integers.
Definition 1.3.13 The characteristic of a field F is the smallest positive integer k
such that for every element a ∈ F, a + . . . + a (k times) = k.a = 0, where 0 ∈ F is the additive
identity of the field F. Since every non-zero a has an inverse, k.a = 0 for all a exactly when
k.1 = 0; hence an alternative way of defining the characteristic is by the equation k.1 = 0,
where 1 is the multiplicative identity of the field F.
The characteristic of a field, when it is finite, is a prime number. If k were not prime, one could
factor k = k1 k2, 1 < k1, k2 < k. Then, for 1 ∈ F, k.1 = (k1.1).(k2.1) = 0. Since in a field there are no zero-
divisors, either k1.1 = 0 or k2.1 = 0, contradicting the assumption that k is the smallest positive
integer with k.1 = 0. Hence k has to be prime.
The residue class Z/pZ is of extreme importance to cryptography. An element a of the
set has a multiplicative inverse if and only if gcd(a, p) = 1, and this holds for every non-zero
element if and only if p is prime. Thus,
Theorem 2 The residue class Z/ pZ is a field if and only if p is prime.
key algorithms. While there are several techniques for computing multiplicative inverses,
the Euclidean algorithm is one of the most well known. The original Euclidean
algorithm computes the greatest common divisor of two integers (elements).
Definition 1.4.1 If a and b are integers that are not both zero, then the greatest common
divisor d of a and b is the largest of the common divisors of a and b. We denote this as:
d = gcd(a, b).
Because every number divides 0, there is strictly no greatest common divisor of 0 and 0;
for convenience we set gcd(0, 0) = 0. Further, it may be easy to observe that gcd(a, a) = a
and gcd(a, 0) = a. The following facts are useful for computing the gcd of two integers:
This follows from the fact that the set of common divisors of a and b is the same set as
the common divisors of a + kb and b. Thus we have the following useful corollary:
The proof of correctness of the above algorithm is easy to verify. It follows because of
the following sequence of equations:
gcd(a, b) = gcd(r0 , r1 )
= gcd(q1 r1 + r2 , r1 )
= gcd(r2 , r1 )
= gcd(r1 , r2 )
= ...
= gcd(rm−1 , rm )
= rm
Thus the Euclidean algorithm can be used to compute the gcd of two positive integers. It
can also be used for checking the existence of the inverse of an element a modulo another
element n. An extended version of this algorithm can also be used for the computation of
the inverse, a^{-1} mod n, as explained in the next subsection.
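The remainder sequence of the Euclidean algorithm translates directly into code; the following is a minimal sketch, equivalent to Python's built-in `math.gcd`.

```python
def euclid_gcd(a, b):
    """Greatest common divisor via the Euclidean algorithm."""
    while b != 0:
        a, b = b, a % b   # gcd(a, b) = gcd(b, a mod b)
    return a

print(euclid_gcd(75, 28))   # 1, so 28 is invertible mod 75
```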
75 = 2 × 28 + 19
28 = 1 × 19 + 9
19 = 2 × 9 + 1
9 = 9 × 1 + 0
The algorithm terminates at this point and we obtain that the gcd is 1. As stated
previously, this implies the existence of the inverse of 28 mod 75. The inverse can be easily
obtained by observing the sequence of numbers generated while applying the EA above.
In order to compute the inverse, we first express the gcd as a linear combination of the
numbers, 28 and 75. This can be easily done as:
19 = 75 − 2 × 28
9 = 28 − 19 = 28 − (75 − 2 × 28) = −75 + 3 × 28
1 = 19 − 2 × 9 = (75 − 2 × 28) − 2 × (−75 + 3 × 28) = 3 × 75 − 8 × 28
It may be observed that this process yields the linear expression 1 = 3 × 75 − 8 × 28. Thus the inverse
of 28 mod 75 can be easily obtained by reducing the above expression modulo 75, and we
observe that −8 × 28 ≡ 1 mod 75. This shows that 28^{-1} mod 75 ≡ −8 ≡ 67.
Thus if the inverse of an integer a mod n exists (ie. gcd(a, n) = 1), one applies the
Euclidean Algorithm on a and n and generates a sequence of remainders. These remainders
can be expressed as unique linear combinations of the integers a and n. The Extended
Euclidean Algorithm (EEA) is a systematic method for generating the coefficients of these
linear combinations.
The coefficients are generated by the EEA as two series, s and t, with the final gcd(a, n) =
sa + tn. The series elements s_0, s_1, . . . , s_m = s and t_0, t_1, . . . , t_m = t are generated along
with the remainders. The two series are obtained through the following recurrences:

s_j = 1,                            if j = 0
s_j = 0,                            if j = 1        (1.1)
s_j = s_{j−2} − q_{j−1} s_{j−1},    if j ≥ 2

t_j = 0,                            if j = 0
t_j = 1,                            if j = 1        (1.2)
t_j = t_{j−2} − q_{j−1} t_{j−1},    if j ≥ 2

For 0 ≤ j ≤ m, we have that r_j = s_j r_0 + t_j r_1, where the r_j are defined in the EA,
and the s_j and t_j are as defined in the above recurrences. Note that the final remainder
r = gcd(a, n) = 1 is expressed as sa + tn, and hence 1 ≡ sa mod n. Thus s ≡ a^{-1} mod n.
Hence, while computing the inverse, the computation of the t-series is not required and can
be omitted.
The operations are summarized in algorithm 1.2.
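Algorithm 1.2 is not reproduced here; the following Python sketch implements the same recurrences (1.1) and (1.2) iteratively, tracking only the last two values of each series (the exact pseudocode of Algorithm 1.2 may differ in detail).

```python
def extended_gcd(a, n):
    """Return (g, s, t) with g = gcd(a, n) = s*a + t*n."""
    s_prev, s_cur = 1, 0   # s_0, s_1 of recurrence (1.1)
    t_prev, t_cur = 0, 1   # t_0, t_1 of recurrence (1.2)
    while n != 0:
        q = a // n
        a, n = n, a - q * n
        s_prev, s_cur = s_cur, s_prev - q * s_cur
        t_prev, t_cur = t_cur, t_prev - q * t_cur
    return a, s_prev, t_prev

g, s, t = extended_gcd(28, 75)
print(g, s, t)   # 1 -8 3, i.e. 1 = (-8)*28 + 3*75
print(s % 75)    # 67 = 28^{-1} mod 75
```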
In the next section, we present another application of the Euclidean Algorithm (EA),
called the Chinese Remainder Theorem (CRT). This theorem is often useful in development
of some implementations of the RSA algorithm.
x ≡ ai (mod mi ), i = 1, 2, . . . , t
Let us assume that the result holds for t = k. We prove the result (that such an x exists
and is unique in the sense asserted) for t = k + 1. Consider the system of congruences:
x ≡ a1 (mod m1 )
..
.
x ≡ ak (mod mk )
x ≡ ak+1 (mod mk+1 )
The moduli are relatively prime in pairs. From the inductive hypothesis, there is a
number x′ satisfying the first k congruences.
It is easy to check that the product of the first k moduli, m_1 m_2 . . . m_k, is relatively
prime to the integer m_{k+1}. Hence, from the Extended Euclidean Algorithm, there are
integers u and v such that:
u m_1 m_2 . . . m_k + v m_{k+1} = 1
x′′ ≡ x′ ≡ a_i (mod m_i), i = 1, . . . , k
The CRT problem is to find the original x from the parts a_i, where:
x ≡ a_i mod m_i, i = 0, 1, . . . , n − 1    (1.3)
where the m_i are pairwise relatively prime, and the a_i are integers. In order to solve Equations 1.3
we compute the values s_0, s_1, . . . , s_{n−1} satisfying:
M_i y_i ≡ 1 mod m_i    (1.5)
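Equations 1.3 and 1.5 translate into the following sketch of CRT reconstruction, assuming the standard choices M = Π m_i and M_i = M/m_i (the helper name `crt` is ours; Python 3.8+ supplies modular inverses via `pow(·, -1, m)`).

```python
from math import prod

def crt(residues, moduli):
    """Recover x mod prod(moduli) from x ≡ residues[i] (mod moduli[i])."""
    M = prod(moduli)
    x = 0
    for a_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        y_i = pow(M_i, -1, m_i)     # M_i * y_i ≡ 1 (mod m_i), Equation 1.5
        x += a_i * M_i * y_i
    return x % M

# x = 23: 23 ≡ 2 (mod 3), 3 (mod 5), 2 (mod 7)
print(crt([2, 3, 2], [3, 5, 7]))    # 23
```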
Every ring has two trivial subrings: itself and the set {0}.
Example 10 Consider the set R = Z/12Z and the subset S = {0, 3, 6, 9}. The following
tables (Tables 1.3, 1.4, 1.5) confirm that it is a subring.
It may be noted that the set S does not form a group wrt. multiplication. Further,
although the set R = Z/12Z possesses a multiplicative identity, S does not. It may be
interesting to note that it is also possible for S to have a multiplicative identity, but
not R. The two sets can also have the same or different identities wrt. multiplication. However,
as per our definition of rings, we impose the further condition that S has to be a ring for
it to qualify as a subring. Thus the subring also has to have a multiplicative identity. When
S is a subring of R, the latter is called the ring extension of the former. When both S
and R are fields, the latter is called the field extension of the former. Equivalently, one
also says that S ⊆ R is a field extension, or that R is a field extension over S.
[Figure: a homomorphism f : G1 → G2 maps x, y, and x ◦ y in G1 to f(x), f(y), and f(x) † f(y) in G2.]
The following theorems state two important properties for homomorphisms on groups.
Theorem 5 If f : G1 → G2 is a group homomorphism then f (e1 ) = e2 , where e1 is the
identity of G1 and e2 is the identity of G2 .
Let x be an element of G1. Thus f(x) is an element of G2. Thus, f(x) = f(x.e1) =
f(x).f(e1) ⇒ f(x).e2 = f(x).f(e1). Thus, f(e1) = e2. It may be noted that the cancellation
is allowed, owing to the existence of the inverse [f(x)]^{-1} and the associativity of
the groups.
Theorem 6 If f : G1 → G2 is a group homomorphism then for every x ∈ G1, f(x^{-1}) =
(f(x))^{-1}.
We have f(x.x^{-1}) = f(e1) = e2. Also, f(x.x^{-1}) = f(x).f(x^{-1}). Hence, we have f(x^{-1}) =
(f(x))^{-1}.
Example 11 Let G1 be the group of all positive real numbers under multiplication, and G2 be the
group of all real numbers under addition. The function defined as f : G1 → G2, where
f(x) = loge(x) = ln(x), is a group homomorphism.
A bijective (one-to-one and onto) homomorphism is called an isomorphism.
The idea of homomorphism and hence isomorphism can be extended to rings and fields
in a similar fashion. In these extensions the only difference is from the fact that a ring and
a field are defined wrt. two operations, denoted by + and ◦.
Let R1 and R2 be rings and consider a bijective function f : R1 → R2. It is called a
ring isomorphism if and only if:
1. f (a + b) = f (a) + f (b) for every a and b in R1
2. f (a ◦ b) = f (a) ◦ f (b) for every a and b in R1
An obvious extension of the previous two theorems to the rings R1 and R2 , is f (0) = 0,
and f (−x) = −f (x), for every x ∈ R1 . If R1 has a multiplicative identity denoted by 1, and
R2 has a multiplicative identity denoted by 1′ , we have f (1) = 1′ .
Further, if x is a unit in the ring R1, then f(x) is a unit in the ring R2, and f(x^{-1}) =
[f(x)]^{-1}. These properties also hold for fields. The property of isomorphism has been
found to be useful for developing efficient implementations of finite field-based algorithms.
The isomorphism is utilized to transform a given field into another isomorphic field,
perform operations in this field, and then transform the solutions back. The advantage in
such implementations comes from the fact that the operations in the new field are more
efficient to implement than in the initial field.
where X is the variable and the coefficients a0 , . . . , an of the polynomial are elements of R.
The set of all polynomials over R in the variable X is denoted by R[X].
If the leading coefficient a_n of the polynomial f is nonzero, then the degree
of the polynomial is said to be n. A monomial is a polynomial all of whose coefficients except
the leading one are zero.
If the value of a polynomial vanishes for a particular value of the variable, r ∈ R, i.e.,
f(r) = 0, then r is called a root or zero of f.
Consider two polynomials, f(X) = Σ_{i=0}^{n} a_i X^i and g(X) = Σ_{i=0}^{m} b_i X^i, defined over
R, and suppose n ≥ m. Then the sum of the polynomials is defined by (f + g)(X) =
Σ_{i=0}^{m} (a_i + b_i) X^i + Σ_{i=m+1}^{n} a_i X^i. The number of operations needed is O(n + 1).
The product of the polynomials f and g is (fg)(X) = Σ_{k=0}^{n+m} c_k X^k, where c_k =
Σ_{i=0}^{k} a_i b_{k−i}, 0 ≤ k ≤ n + m. The coefficients a_i and b_i which are not defined are set
to 0. The multiplication requires O(nm) computations, considering the products and addi-
tions.
Example 12 The set Z/3Z contains the elements 0, 1 and 2. Consider the polynomials
f(X) = X^2 + X + 1 and g(X) = X^3 + 2X^2 + X ∈ (Z/3Z)[X]. It can be checked that the
first polynomial has a zero at 1, while the latter has one at 2.
The sum of the polynomials is (f + g)(X) = X^3 + (1 + 2)X^2 + (1 + 1)X + 1 = X^3 + 2X + 1.
The product of the polynomials is f g(X) = X^5 + (1 + 2)X^4 + (1 + 2 + 1)X^3 +
(2 + 1)X^2 + X = X^5 + X^3 + X.
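Example 12 can be reproduced with coefficient lists, index i holding the coefficient of X^i and all arithmetic done in Z/3Z (the helper names below are ours).

```python
def poly_add(f, g, p):
    """Sum of two polynomials over Z/pZ, as coefficient lists."""
    n = max(len(f), len(g))
    f, g = f + [0] * (n - len(f)), g + [0] * (n - len(g))
    return [(a + b) % p for a, b in zip(f, g)]

def poly_mul(f, g, p):
    """Product of two polynomials over Z/pZ: c_k = sum of a_i * b_{k-i}."""
    c = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for k, b in enumerate(g):
            c[i + k] = (c[i + k] + a * b) % p
    return c

f = [1, 1, 1]      # X^2 + X + 1
g = [0, 1, 2, 1]   # X^3 + 2X^2 + X
print(poly_add(f, g, 3))   # [1, 2, 0, 1]      -> X^3 + 2X + 1
print(poly_mul(f, g, 3))   # [0, 1, 0, 1, 0, 1] -> X^5 + X^3 + X
```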
The set of polynomials R[X] forms a commutative ring with the operations addition and
multiplication. If K is a field, then the ring K[X] of polynomials over K contains no zero
divisors, that is, there do not exist two non-zero polynomials a(X) and b(X) st. a(X)b(X) =
0, the zero polynomial.
The following theorem is stated without proof but can be followed from any classic text
of number theory:
Theorem 7 Let f(X), g(X) ∈ K[X], g(X) ≠ 0. Then there are uniquely determined poly-
nomials q(X), r(X) ∈ K[X], with f(X) = q(X)g(X) + r(X) and r(X) = 0 or deg r(X) <
deg g(X). The polynomials q(X) and r(X) are referred to as the quotient and remainder
polynomials.
An important observation based on the above result, which is often called the division
algorithm on polynomials, is that if b ∈ K is a root of a polynomial f(X) ∈ K[X], then
(X − b) divides the polynomial f(X). This follows quite easily: polynomial
division of f(X) by (X − b) gives polynomials q(X) and r(X), st. deg(r(X)) < deg(X − b) = 1
and f(X) = (X − b)q(X) + r(X). Thus r(X) is a constant, which we denote by r ∈ K;
substituting X = b gives r = f(b) = 0, so (X − b) divides f(X).
The finite field is constructed much as in the context of modular
arithmetic. We define residue classes modulo f(X), i.e., we generate the set of polynomials
modulo f(X) and place them in separate classes. The set of representatives thus consists of
all polynomials of degree < deg f(X). Each of these polynomials is the representative element
of all the polynomials in the corresponding residue class. The residue class represented by the
polynomial h(X) is denoted as:
In other words, the polynomial h(X) and the elements of its residue class are con-
gruent modulo f(X). It is easy to see that the residue classes, denoted by
(Z/pZ)[X]/⟨f(X)⟩, form a ring under the standard operations of addition and multipli-
cation. However, they form a field if and only if the polynomial f(X) is irreducible.
We state this fact as the following theorem.
Theorem 9 For a non-constant polynomial f(X) ∈ (Z/pZ)[X], the ring (Z/pZ)[X]/⟨f(X)⟩
is a field if and only if f(X) is irreducible in (Z/pZ)[X].
The proof is quite straightforward. If f(X) is reducible over (Z/pZ)[X], we have
g(X), h(X) st. f(X) = g(X)h(X) and 1 ≤ degree(g(X)), degree(h(X)) < degree(f(X)).
Then both g(X) and h(X) are non-zero elements of (Z/pZ)[X]/⟨f(X)⟩ whose product is zero
modulo f(X). Thus, the ring (Z/pZ)[X]/⟨f(X)⟩ contains non-zero zero divisors and cannot
be a field.
If f(X) is irreducible over (Z/pZ)[X] and g(X) is a non-zero polynomial st.
degree(g(X)) < degree(f(X)), then gcd(f(X), g(X)) = 1. Thus, from the Euclidean algorithm,
∃ polynomials u(X), v(X) ∈ (Z/pZ)[X] with u(X)f(X) + v(X)g(X) = 1 and degree(v(X)) <
degree(f(X)).
Thus v(X)g(X) ≡ 1 (mod f(X)), i.e., g(X) has a multiplicative inverse in the ring
(Z/pZ)[X]/⟨f(X)⟩, which hence qualifies as a field.
Example 15 Consider the field GF(2^2) as an extension field of GF(2). Thus define
GF(2^2) = GF(2)[X]/⟨f(X)⟩, where f(X) = X^2 + X + 1 is an irreducible polynomial in
GF(2)[X]. It is clear that the representative polynomials are of the form aX + b, where a, b ∈ GF(2).
The four equivalence classes are 0, 1, X, X + 1, which are obtained by reducing the poly-
nomials of GF(2)[X] by the irreducible polynomial X^2 + X + 1.
If θ denotes the equivalence class of the polynomial X in GF(2^2), the classes can be
represented as 0, 1, θ, θ + 1.
Notice that setting θ^2 = θ + 1 (i.e., f(θ) = θ^2 + θ + 1 = 0) reduces any polynomial
in GF(2)[θ] modulo (θ^2 + θ + 1).
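Multiplication in this four-element field can be sketched with a 2-bit encoding, bit 1 holding the coefficient of θ and bit 0 the constant term (the encoding and the helper name `gf4_mul` are ours).

```python
def gf4_mul(a, b):
    """Multiply in GF(2^2) = GF(2)[X]/<X^2 + X + 1>, elements as 2-bit ints."""
    # schoolbook product of (a1*θ + a0)(b1*θ + b0) over GF(2)
    hi = (a >> 1) & (b >> 1)                            # coefficient of θ^2
    mid = ((a >> 1) & (b & 1)) ^ ((a & 1) & (b >> 1))   # coefficient of θ
    lo = (a & 1) & (b & 1)                              # constant term
    # reduce with θ^2 = θ + 1
    return ((mid ^ hi) << 1) | (lo ^ hi)

# θ * (θ + 1) = θ^2 + θ = (θ + 1) + θ = 1
print(gf4_mul(0b10, 0b11))   # 1
```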
If f (X) is not irreducible, then f (X) has a factor f1 (X) ∈ K[X], st. 1 ≤ deg(f1 (X)) <
deg(f (X)). Either, f1 (X) is irreducible, or f1 (X) has a factor f2 (X), such that 1 ≤
deg(f2 (X)) < deg(f1 (X)). Eventually, we will have an irreducible polynomial q(X), which
can be used to define the extension field. Even then, f (θ) = 0, because q(X) is a factor of
f (X) in K[X].
The number of elements in a field is called its order. Thus if the order of the field
K is p, the elements of the extension field K′ = K[X]/⟨f(X)⟩, where f(X) is an irreducible
polynomial of degree m, can be represented as a(X) = a_0 + a_1 X + . . . + a_{m−1} X^{m−1}.
Since there are exactly p choices for each of the coefficients, there are p^m values in the
field. Thus the order of the extension field is p^m.
Summarizing, every non-constant polynomial over a field has a root in some exten-
sion field.
Theorem 10 Let f (X) be a non-constant polynomial with coefficients in a field K. Then
there is a field L containing K that also contains a root of f (X).
We can apply the above result repeatedly to obtain further extensions of a given field,
and finally arrive at a bigger field. Consider a field K with order p. Let us start with the
polynomial f(X) = X^{p^n} − X over the field K. It can be extended to K_1, where a root θ_1
of f lies. In turn the field K_1 can be extended to the field K_2, where a root θ_2 lies. Thus
continuing we can write:

X^{p^n} − X = (X − θ_1) . . . (X − θ_{p^n})

The set of roots θ_1, θ_2, . . . , θ_{p^n} itself forms a field, called a splitting field of f(X). In
other words, a splitting field of a polynomial with coefficients in a field is the smallest field
extension of that field over which the polynomial splits, or decomposes into linear factors.
Next we introduce another important class of polynomials which are called minimal
polynomials.
Definition 1.9.1 Let K ⊆ L be a field extension, and θ an element of L. The minimal
polynomial of θ over K is the monic polynomial m(X) ∈ K[X] of smallest degree st. m(θ) =
0.
The minimal polynomial divides any polynomial that has θ as a root. This is easy to
see: dividing such a polynomial f(X) by m(X) gives a quotient q(X) and a remainder r(X), st.
deg(r(X)) < deg(m(X)), so that f(X) = q(X)m(X) + r(X). Substituting θ, we
have f(θ) = q(θ)m(θ) + r(θ) = r(θ), since m(θ) = 0; and since f(θ) = 0, we get r(θ) = 0.
As deg(r(X)) < deg(m(X)) and m(X) is the minimal polynomial, we must have r(X) = 0.
Thus the minimal polynomial m(X) of θ divides any polynomial which has θ as its root.
Let a be an element of a finite group G, and consider the list of powers of a,
{a^1, a^2, . . .}. As G is finite the list will eventually have duplicates, so there are positive
integers j < k, st. a^j = a^k.
Thus, we have 1 = a^{k−j} (multiplying both sides by (a^{-1})^j).
So, ∃ a positive integer t = k − j, st. a^t = 1. The smallest such positive integer is called
the order of a, and is denoted by ord(a). Thus, whenever we have an integer n, st. a^n = 1,
ord(a) divides n.
The subset S = {a, a^2, . . . , a^{ord(a)} = 1} is itself a group and thus qualifies as a subgroup.
If G is a finite group, order of the group G is defined as the number of elements in G.
Let S be a subgroup of a finite group G. For each element a ∈ G, the set aS = {as|s ∈
S} has the same number of elements as S. If a, b ∈ G, and a 6= b, then aS and bS are
either disjoint or equal. Thus the group G can be partitioned into m units, denoted by
a1 S, a2 S, . . . , am S, so that G = a1 S ∪ . . . ∪ am S, and ai S ∩ aj S = φ, ∀i 6= j.
Thus, we have ord(G) = m × ord(S). Hence, the order of a subgroup divides the order
of the group G.
Thus we have the following theorem known as Lagrange’s Theorem.
Theorem 11 If S is a subgroup of the finite group G, then the order of S divides the order
of G.
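Both ord(a) and Lagrange's theorem can be checked by brute force in a small group; the sketch below (the helper name `order` is ours) lists the element orders of (Z/7Z)*, all of which divide the group order 6.

```python
from math import gcd

def order(a, m):
    """Multiplicative order of a in (Z/mZ)*, by direct search over the powers of a."""
    assert gcd(a, m) == 1     # a must be invertible mod m
    t, x = 1, a % m
    while x != 1:
        x = (x * a) % m
        t += 1
    return t

# (Z/7Z)* has order 6; each element order divides 6 (Lagrange's theorem)
print([order(a, 7) for a in range(1, 7)])   # [1, 3, 6, 3, 6, 2]
```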
In particular, there is an element α such that every non-zero element can be written in
the form of αk . Such an element α is called the generator of the multiplicative group, and
is often referred to as the primitive element.
Specifically consider the field GF(p^n), where p is a prime number. The primitive element
of the field is defined as follows:
Definition 1.10.1 A generator of the multiplicative group of GF(p^n) is called a primitive
element.
The minimal polynomial of a primitive element is given a special name: the primitive
polynomial.
Definition 1.10.2 A polynomial of degree n over GF(p) is a primitive polynomial if it is
the minimal polynomial of a primitive element in GF(p^n).
The concepts of irreducibility, minimality and primitivity of polynomials play a central
role in the theory of fields. It can be seen that there are several interesting interrelations
and properties of these concepts. We state a few in the following sequel:
Theorem 12 The minimal polynomial of an element of GF(p^n) is irreducible.
Further, a minimal polynomial over GF(p) of any primitive element of GF(p^n) is an
irreducible polynomial of degree n. Thus, a primitive polynomial is irreducible, but not
vice versa. A primitive polynomial must have a non-zero constant term, for otherwise it
would be divisible by x. Over the field GF(2), x + 1 is a primitive polynomial and all other
primitive polynomials have an odd number of terms, since any polynomial mod 2 with an even
number of terms is divisible by x + 1.
If f(X) is an irreducible polynomial of degree n over GF(p), then f(X) divides g(X) =
X^{p^n} − X. The argument for the above comes from the fact that, from the theory of extension
of fields, there is a field of order p^n that contains an element θ st. f(θ) = 0. Since f(X) is
irreducible, it is a minimal polynomial of θ as well. Also, the polynomial g(X) = X^{p^n} − X
vanishes at X = θ. Writing g(X) = f(X)q(X) + r(X), where degree(r(X)) < degree(f(X)),
we get r(θ) = 0. Since f(X) is a minimal polynomial of θ, r(X) = 0.
Thus, we have an alternative definition of primitive polynomials, which are nothing but
the minimal polynomials of primitive elements.
Definition 1.10.3 An irreducible polynomial f(X) of degree n over GF(p) for prime p
is a primitive polynomial if the smallest positive integer m such that f(X) divides x^m − 1
is m = p^n − 1.
There are exactly φ(p^n − 1)/n primitive polynomials of degree n over GF(p), where φ is Euler's
Totient function. The roots of a primitive polynomial all have order p^n − 1. Thus the roots
of a primitive polynomial can be used to generate and represent the elements of a field.
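Definition 1.10.3 suggests a direct, if brute-force, primitivity test for small degrees over GF(2). In the sketch below (encoding and helper names are ours), a polynomial is a bitmask with bit i holding the coefficient of x^i; note that x^m − 1 = x^m + 1 over GF(2).

```python
def poly_mod(num, den):
    """Remainder of num divided by den, both polynomials over GF(2) as bitmasks."""
    dd = den.bit_length() - 1          # degree of the divisor
    while num.bit_length() - 1 >= dd:
        num ^= den << (num.bit_length() - 1 - dd)
    return num

def is_primitive(f, n):
    """f (degree n over GF(2)) is primitive iff the least m with f | x^m + 1 is 2^n - 1."""
    for m in range(1, 2**n):
        if poly_mod((1 << m) | 1, f) == 0:   # does f divide x^m + 1?
            return m == 2**n - 1
    return False

print(is_primitive(0b10011, 4))   # x^4 + x + 1: True
print(is_primitive(0b11111, 4))   # x^4 + x^3 + x^2 + x + 1: False (divides x^5 + 1)
```

The second polynomial is irreducible yet not primitive: its roots have order 5 rather than 15.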
We conclude this section with the comment that all fields of order p^n are essentially
the same. Hence, we can define isomorphisms between the elements of such fields. We explain
a specific case in the context of binary fields in a following section.
Definition 1.11.1 Let p(x) be an irreducible polynomial of degree m over GF(2) and let α be a root
of p(x). Then the set
{1, α, α^2, · · · , α^{m−1}}
is called the polynomial base of GF(2^m).
[Figure 1.2: Squaring of a(x) in GF(2^m): the coefficient bits of a(x) are spread out with interleaved zeros, and the result is reduced modulo p(x) to obtain a(x)^2.]
Definition 1.11.2 Let p(x) be an irreducible polynomial of degree m over GF(2), and let α be a
root of p(x), then the set
{α, α^2, α^{2^2}, · · · , α^{2^{m−1}}}
is called the normal base if the m elements are linearly independent.
The normal base representation is useful for arithmetic circuits, as squaring an element
is accomplished by cyclic shifts. More generally, for any field GF(p^m), the basis vector is
{b^{p^0}, b^{p^1}, . . . , b^{p^{m−1}}}, where b is chosen such that the elements are linearly independent.
Any element in the field GF(2^m) can be represented in terms of its base as shown
below.
a(x) = a_{m−1} α^{m−1} + · · · + a_1 α + a_0
Alternatively, the element a(x) can be represented as a binary string (a_{m−1}, · · · , a_1, a_0),
making it suited for representation on computer systems. For example, the polynomial
x^4 + x^3 + x + 1 in the field GF(2^8) is represented as (00011011)_2.
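The bit encoding can be spelled out directly; the snippet below builds the mask for x^4 + x^3 + x + 1 with bit i holding the coefficient of x^i (plain Python, our own variable name).

```python
# x^4 + x^3 + x + 1: set bits 4, 3, 1, 0
a = (1 << 4) | (1 << 3) | (1 << 1) | (1 << 0)
print(format(a, '08b'))   # 00011011
print(hex(a))             # 0x1b
```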
Various arithmetic operations such as addition, subtraction, multiplication, squaring and
inversion are carried out on binary fields. Addition and subtraction operations are identical
and are performed by XOR operations.
Let a(x), b(x) ∈ GF(2^m) be denoted by
a(x) = Σ_{i=0}^{m−1} a_i x^i,   b(x) = Σ_{i=0}^{m−1} b_i x^i
The squaring operation on binary finite fields is as easy as addition. The square of the
polynomial a(x) ∈ GF(2^m) is given by
a(x)^2 = Σ_{i=0}^{m−1} a_i x^{2i} mod p(x)    (1.7)
The squaring essentially spreads out the input bits by inserting zeroes in between two bits
as shown in Fig. 1.2.
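Equation 1.7 can be illustrated in software; the helper below (our own name, with bit i holding the coefficient of x^i) produces the pre-reduction square by interleaving zeros between the coefficient bits.

```python
def spread_bits(a, m):
    """Return the sum of a_i * x^{2i}: the square of a(x) before mod-p(x) reduction."""
    r = 0
    for i in range(m):
        r |= ((a >> i) & 1) << (2 * i)   # bit i moves to position 2i
    return r

print(format(spread_bits(0b1011, 4), '08b'))   # 01000101
```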
Multiplication is not as trivial as addition or squaring. The product of the two polynomials
a(x) and b(x) is given by

a(x) · b(x) = Σ_{i=0}^{m−1} a_i b(x) x^i mod p(x)        (1.8)
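Equation 1.8 translates directly into a shift-and-add (carry-less) multiplier with interleaved reduction. The sketch below uses the GF(2^8) field of AES with p(x) = x^8 + x^4 + x^3 + x + 1, purely as an illustrative choice of field and polynomial:

```python
def gf2m_mul(a, b, p, m):
    """Multiply a(x) * b(x) in GF(2^m): accumulate a(x) * x^i for
    each set bit b_i (Equation 1.8), reducing by the irreducible
    polynomial p on the fly."""
    r = 0
    for i in range(m):
        if (b >> i) & 1:
            r ^= a          # add a(x) * x^i
        a <<= 1             # multiply a by x ...
        if a & (1 << m):
            a ^= p          # ... and reduce when the degree hits m
    return r

# 0x53 and 0xCA are multiplicative inverses in the AES field:
print(hex(gf2m_mul(0x53, 0xCA, 0x11B, 8)))  # 0x1
```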
[FIGURE 1.3: Reduction modulo the trinomial x^233 + x^74 + 1: the terms of degree 233 to 464 of a product polynomial are folded back onto lower-degree positions.]
α^m       = 1 + α^n
α^(m+1)   = α + α^(n+1)
   ...                                                   (1.9)
α^(2m−3)  = α^(m−3) + α^(m+n−3)
α^(2m−2)  = α^(m−2) + α^(m+n−2)
For example, consider the irreducible trinomial x^233 + x^74 + 1. Multiplication or squaring
of a polynomial results in a polynomial of degree at most 464. This can be reduced as
shown in Fig. 1.3: the higher-order terms, of degree 233 to 464, are reduced using Equation 1.9.
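A sketch of this fold-back reduction, applying Equation 1.9 bit by bit from the highest degree downwards (the small trinomial used in the demonstration is our own choice, for readability):

```python
def reduce_trinomial(c, m=233, n=74):
    """Reduce a polynomial of degree at most 2m-2 modulo
    x^m + x^n + 1, substituting x^i = x^(i-m+n) + x^(i-m)
    for each set bit of degree i >= m (Equation 1.9)."""
    for i in range(2 * m - 2, m - 1, -1):
        if (c >> i) & 1:
            c ^= 1 << i              # remove the x^i term
            c ^= 1 << (i - m + n)    # add x^(i-m+n)
            c ^= 1 << (i - m)        # add x^(i-m)
    return c

# Small sanity check with x^7 + x + 1: x^7 reduces to x + 1.
print(bin(reduce_trinomial(1 << 7, m=7, n=1)))  # 0b11
```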
Definition 1.12.1 The pair of fields GF(2^n) and GF((2^n)^m) is called a composite field
if there exist irreducible polynomials, Q(Y) of degree n and P(X) of degree m, which are
used to extend GF(2) to GF(2^n), and GF(2^n) to GF((2^n)^m).
Composite fields are denoted by GF((2^n)^m). A composite field is isomorphic to the field
GF(2^k), where k = m × n. However, it is interesting to note that the underlying field
operations in the two fields have different complexity, which varies with the exact values of
n and m and the polynomials used to construct the fields.
Below we provide an example.
Example 16 Consider the field GF(2^4), whose elements are the following 16 polynomials
with binary coefficients:

0          z^2            z^3            z^3 + z^2
1          z^2 + 1        z^3 + 1        z^3 + z^2 + 1
z          z^2 + z        z^3 + z        z^3 + z^2 + z
z + 1      z^2 + z + 1    z^3 + z + 1    z^3 + z^2 + z + 1
There are 3 irreducible polynomials of degree 4 which can be used to construct the
field: f1(z) = z^4 + z + 1, f2(z) = z^4 + z^3 + 1, f3(z) = z^4 + z^3 + z^2 + z + 1.
The resulting fields F1, F2, F3 all have the same elements, i.e., the above 16 polynomials.
However, the operations differ: for instance, the same product z · z^3 evaluates to
z + 1, z^3 + 1, and z^3 + z^2 + z + 1 in the three fields F1, F2, and F3, respectively.
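A sketch that reproduces this observation (each bit encoding lists the coefficients of z^3 down to z^0):

```python
def mul_mod(a, b, f, m=4):
    """Multiply two GF(2^4) elements (bit-encoded polynomials in z)
    modulo the chosen degree-4 irreducible polynomial f."""
    r = 0
    for i in range(m):
        if (b >> i) & 1:
            r ^= a
        a <<= 1
        if a & (1 << m):
            a ^= f
    return r

z, z3 = 0b0010, 0b1000
f1, f2, f3 = 0b10011, 0b11001, 0b11111   # the three irreducibles
for f in (f1, f2, f3):
    print(format(mul_mod(z, z3, f), "04b"))
# 0011 (z+1), 1001 (z^3+1), 1111 (z^3+z^2+z+1)
```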
The fields are isomorphic, and one can establish a mapping between, say, F1 and F2
by computing c ∈ F2 such that f1(c) ≡ 0 (mod f2). The map z → c is then used to
construct an isomorphism T : F1 → F2.
The choices of c are z^2 + z, z^2 + z + 1, z^3 + z^2, and z^3 + z^2 + 1. One can verify that
c = z^2 + z ⇒ f1(c) = (z^2 + z)^4 + (z^2 + z) + 1 = z^8 + z^4 + z^2 + z + 1 ≡ 0 (mod f2). The
reduction modulo f2 can be performed by substituting z^3 + 1 for z^4, i.e., using f2(z) = 0.
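The verification that c = z^2 + z satisfies f1(c) ≡ 0 (mod f2) can also be scripted; the helper names below are ours, for illustration:

```python
def poly_mod(a, f, m=4):
    """Reduce a bit-encoded GF(2) polynomial a modulo f (degree m)."""
    while a.bit_length() - 1 >= m:
        a ^= f << (a.bit_length() - 1 - m)
    return a

def gf_mul(a, b, f, m=4):
    """Carry-less product of a and b, then reduction modulo f."""
    prod = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            prod ^= a << i
    return poly_mod(prod, f, m)

f2 = 0b11001              # z^4 + z^3 + 1
c = 0b0110                # candidate c = z^2 + z
c2 = gf_mul(c, c, f2)     # c^2 mod f2
c4 = gf_mul(c2, c2, f2)   # c^4 mod f2
print(c4 ^ c ^ 1)         # f1(c) = c^4 + c + 1 mod f2  ->  0
```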
TABLE 1.6: An Example Isomorphic Mapping between the Fields GF(2^4) and GF((2^2)^2)
The first primitive element γ ∈ GF(2^4) is 2. It can be checked that by raising 2 to
successively higher powers modulo Z^4 + Z + 1, all the non-zero elements of GF(2^4) are
generated. Likewise, the first primitive element of GF((2^2)^2), such that R(Z) ≡ 0 modulo
Q(Y) and P(X), is 4. Hence, we establish the mapping {02} → {04}.
The complete mapping, obtained by raising the above elements to higher powers, is
written in Table 1.6. It may be noted that, for completeness, we specify in the table that 0
is mapped to 0.
The above algorithm can be made more efficient by using suitable tests for primitivity
and by using storage. One such algorithm is presented in the following subsection.
T(γ^i) = α^(it),    i = 0, 1, . . . , k − 1

It may be noted that the choice of t cannot be arbitrary; it has to be made such that
the homomorphism with respect to addition and multiplication is established. For this we
use the property discussed before: there will be exactly k primitive elements which satisfy
the condition, namely α^t and α^(t·2^j), j = 1, 2, . . . , k − 1, where the exponents are computed
modulo 2^k − 1.
We summarize the above in Algorithm 1.4.
The algorithm can be explained as follows: Line 5 of the algorithm ensures that the
identity element in the field GF(2^k) is mapped to the identity element in GF((2^n)^m). Both
identity elements are represented by the polynomial 1. Line 6 checks for the equality of
R(α^t) to zero, which, if true, indicates that t is found. If α^t is not the element to which β is
mapped, then the elements α^(t·2^j), 1 ≤ j ≤ k − 1, are also not suitable. Hence we set the
corresponding elements of the array S to 0, and proceed by incrementing t by 1.
In lines 10 and 11, we continue the search for the appropriate t by checking whether the
corresponding entry in the S array has been set to 0 (indicating that it was found unsuitable
during a previous run of the while loop of line 6), or, if it is not previously marked, by
checking the primitivity of α^t through the gcd of t and 2^k − 1. If the gcd is greater
than 1, then α^t is not primitive; hence we increment t.
When the correct value of t is obtained, the matrix T is populated column-wise from the
right with the binary representations of α^(jt), where 2 ≤ j ≤ k.
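The gcd screening of lines 10 and 11 can be sketched in isolation (the bookkeeping array S of Algorithm 1.4 is omitted here; this only enumerates the exponents t for which α^t can be primitive):

```python
from math import gcd

def candidate_exponents(k):
    """alpha^t is primitive in GF(2^k) exactly when
    gcd(t, 2^k - 1) == 1; list the admissible exponents t."""
    order = 2 ** k - 1
    return [t for t in range(1, order) if gcd(t, order) == 1]

# For k = 4 there are phi(15) = 8 such exponents.
print(len(candidate_exponents(4)))  # 8
```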
1.13 Conclusions
In this chapter, we developed several mathematical concepts which form the foundations
of modern cryptography. The chapter presented discussions on modular arithmetic
and defined the concepts of mathematical groups, rings, and fields. Useful operations like
the Euclidean algorithm, and its extensions to evaluate the greatest common divisor and the
multiplicative inverse, were elaborated. We also discussed the Chinese Remainder Theorem,
which is a useful tool both to develop efficient designs for RSA-like algorithms and
to perform attacks on them. The chapter subsequently developed the important concept of
subfields and showed how to construct extension fields from a given field. As modern cryptographic
algorithms rely heavily on Galois (finite) fields, the chapter paid special attention to
them: the efficient representation of elements in Galois fields and their various properties,
such as the formation of cyclic groups. The chapter also elaborated, with examples, on how to
define isomorphic mappings between several equivalent fields, often technically referred to as
composite fields. All these concepts, and the techniques built on them, have useful impact on
efficient (hardware) designs and attacks. We show several such applications in the following
chapters.
2.1 Introduction
2.2 Cryptography: Some Technical Details
2.3 Block Ciphers
2.3.1 Inner Structures of a Block Cipher
2.3.2 The Advanced Encryption Standard
2.3.3 The AES Round Transformations
2.3.3.1 SubBytes
2.3.3.2 ShiftRows
2.3.3.3 MixColumns
2.3.3.4 AddRoundKey
2.3.4 Key-Scheduling in AES
2.4 Rijndael in Composite Field
2.4.1 Expressing an Element of GF(2^8) in Subfield
2.4.2 Inversion of an Element in Composite Field
2.4.3 The Round of AES in Composite Fields
2.5 Elliptic Curves
2.5.1 Simplification of the Weierstraß Equation
2.5.2 Singularity of Curves
2.5.3 The Abelian Group and the Group Laws
2.5.4 Elliptic Curves with Characteristic 2
2.5.5 Projective Coordinate Representation
2.6 Scalar Multiplications: LSB First and MSB First Approaches
2.7 Montgomery's Algorithm for Scalar Multiplication
2.7.1 Montgomery's Ladder
2.7.2 Faster Multiplication on EC without Pre-computations
2.7.3 Using Projective Co-ordinates to Reduce the Number of Inversions
2.8 Conclusions
Bhagavad Gita
Vibhuti-yoga, Sloka 38
2.1 Introduction
The art of keeping messages secret is cryptography, while cryptanalysis is the study
of techniques for defeating cryptographic protection. Cryptography is used to protect information
from illegal access. It largely encompasses the art of building schemes (ciphers) which allow
secret data exchange over insecure channels [351]. The need for secure information exchange
is as old as civilization itself. It is believed that the oldest use of cryptography was found in
non-standard hieroglyphics carved into monuments from Egypt's Old Kingdom. In the 5th century B.C.
the Spartans developed a cryptographic device, called the scytale, to send and receive secret
messages. The device was the basis of transposition ciphers, in which the letters remain
the same but their order is changed. This is still the basis for many modern-day ciphers.
The other major ingredient of many modern-day ciphers is the substitution cipher, which was
used by Julius Caesar and is popularly known as Caesar's shift cipher. In this cipher, each
plaintext character is replaced by the character 3 places to the right, modulo the 26-letter
alphabet. However, in the last three decades cryptography has grown beyond designing
ciphers to encompass other activities, like the design of signature schemes for signing digital
contracts. The design of cryptographic protocols for securely proving one's identity
has also been an important aspect of modern cryptography. Yet the construction of
encryption schemes remains, and is likely to remain, a central enterprise of cryptography
[134]. The primitive operation of cryptography is hence encryption. The inverse operation of
obtaining the original message from the encrypted data is known as decryption. Encryption
transforms messages into a representation that is meaningless to all parties other than the
intended receiver. Almost all cryptosystems rely upon the difficulty of reversing the encryption
transformation in order to provide security to communication [353]. Cryptanalysis is
the art and science of breaking encrypted messages. The branch of science encompassing
both cryptography and cryptanalysis is cryptology, and its practitioners are cryptologists.
One of the greatest triumphs of cryptanalysis over cryptography was the breaking of a
ciphering machine named Enigma, used during World War II. In short, cryptology evolves
from the long-lasting tussle between the cryptographer and the cryptanalyst.
For many years, fundamental developments in cryptology poured out of military
organizations around the world. One of the most influential cryptanalytic papers of the
twentieth century was William F. Friedman's monograph [123] entitled The Index of Coincidence
and its Applications in Cryptography. For the next fifty years, research in cryptography was
predominantly done in a secret fashion, with few exceptions like the revolutionary contri-
bution of Claude Shannon’s paper "The Communication Theory of Secrecy Systems", which
appeared in the Bell System Technical Journal in 1949 [359].
However, after the world wars cryptography became a science of interest to the research
community. The Code Breakers by David Kahn produced the remarkable history of cryp-
tography [171]. The significance of this classic text was that it raised the public awareness
of cryptography. The subsequent development of communication and hence the need of
privacy in message exchange also increased the impetus on research in this field. A large
number of cryptographers from various fields of study began to contribute leading to the
rebirth of this field. Horst Feistel [119] began the development of the US Data Encryption
Standard (DES) and laid the foundation of a class of ciphers called private or symmetric
key algorithms. The general structure of these ciphers became popular as the Feistel network.
Symmetric key algorithms use a single key to both encrypt and decrypt. In order to
establish the key, the sender and the receiver were required to meet once to decide on
the key. This problem, commonly known as the key exchange problem, was solved by Martin
Hellman and Whitfield Diffie [111] in 1976 in their ground-breaking paper New Directions
in Cryptography. The developed protocol allows two users to exchange a secret key over
an insecure medium without any prior secrets. The work not only solved the problem of
key exchange but also provided the foundation of a new class of cryptography, known as
the public key cryptography. As a result of this work the RSA algorithm, named after the
inventors Ron Rivest, Adi Shamir, and Leonard Adleman, was developed [307]. The security
of the protocol was based on the computational difficulty of factoring the product of large prime
numbers.
Cryptology has evolved further with the growing importance of communications and
the developments in both processor speeds and hardware. Modern-day cryptographers thus
have more work than merely jumbling up messages. They have to look into the application
areas in which the cryptographic algorithms have to work. The transistor has become
more powerful. The development of VLSI technology (now in the submicron regime) has made
the once cumbersome computers faster and smaller. More powerful computers and
devices allow complicated encryption algorithms to run faster. The same computing
power is also available to the cryptanalysts, who now try to break the ciphers both with
straightforward brute-force analysis and by leveraging the growth in cryptanalysis.
The world has thus changed since the DES was adopted as the standard cryptographic
algorithm, and DES was feeling its age. The large public literature on ciphers and the development
of tools for cryptanalysis urged the adoption of a new standard. The National
Institute for Standards and Technology (NIST) organized a contest for the new Advanced
Encryption Standard (AES) in 1997. The block cipher Rijndael emerged as the winner in
October 2000 because of its security, elegance of implementation, and principled
design approach. Simultaneously, Rijndael was evaluated by cryptanalysts, and many interesting
works were reported. Cryptosystems are inherently computationally complex, and
in order to satisfy the high-throughput requirements of many applications, they are often
implemented by means of either VLSI devices or highly optimized software routines. In recent
years such cryptographic implementations have been attacked using a class of attacks
which exploit the leakage of information through side-channels such as power consumption,
timing, and fault injection. In short, as technology progresses, new efficient encryption
algorithms and their implementations will be invented, which in turn shall be cryptanalyzed
in unconventional ways. Without doubt, cryptology promises to remain an interesting field
of research from both theoretical and application points of view.
[FIGURE 2.1: A conventional cryptosystem: a message from the source is encrypted, transmitted over an insecure channel observed by an eavesdropper, and decrypted at the destination; the key from the key source is distributed over a secure channel.]
A cryptanalyst is an entity who studies the cipher and uses algebraic and statistical techniques to attack a
cryptographic scheme. A cryptanalytic attack is a procedure through which the cryptanalyst
gains information about the secret decryption key. Attacks are classified according to the
level of a-priori knowledge available to the cryptanalyst.
A Ciphertext-only attack is an attack where the cryptanalyst has access to cipher-
texts generated using a given key but has no access to the corresponding plaintexts or the
key. A Known-plaintext attack is an attack where the cryptanalyst has access to both
ciphertexts and the corresponding plaintexts, but not the key.
A Chosen-plaintext attack (CPA) is an attack where the cryptanalyst can choose
plaintexts to be encrypted and has access to the resulting ciphertexts, the goal again
being to determine the key.
A Chosen-ciphertext attack (CCA) is an attack in which the cryptanalyst can choose
ciphertexts, apart from the challenge ciphertext, and can obtain the corresponding plaintexts.
The attacker has access to the decryption device.
In the case of CPA and CCA, the adversary can make a bounded number of queries to its
encryption or decryption device. The encryption device is often called an oracle: it behaves
like a black box which, unlike an algorithm, reveals no details of how an input is transformed
or used to obtain the output. Although this may seem a bit hypothetical, there are
enough real-life instances where such encryption and decryption oracles can be obtained.
Thus security analysis assuming the existence of such oracles is imperative.
The attacks are measured against a worst case referred to as the brute-force method.
The method is a trial-and-error approach whereby every possible key is tried until the
correct one is found. Any attack that permits the discovery of the correct key faster than
the brute-force method, on average, is considered successful. An important principle, known
as Kerckhoffs's principle, states that the secrecy of a cipher must reside entirely in the
key. Thus an enemy is assumed to have complete knowledge of the cipher but not of the
key. A secure cryptographic scheme should withstand the attack of such a well-informed
adversary.
A formal definition of a cryptosystem is stated below for the sake of completeness:
Definition 1 A cryptosystem is a five-tuple (P, C, K, E, D), where P is a finite set of
possible plaintexts, C is a finite set of possible ciphertexts, K (the keyspace) is a finite set
of possible keys, and for each K ∈ K there is an encryption rule eK ∈ E and a corresponding
decryption rule dK ∈ D such that dK(eK(x)) = x for every plaintext x ∈ P.
As a simple example, consider a cipher whose encryption and decryption rules are a bitwise
XOR with a key k:

eK(x) = x ⊕ k        dK(y) = y ⊕ k

Here the operator ⊕ is a bitwise and self-invertible operation. Not all
ciphers have Ka and Kb the same. In fact, depending on their equality we have two important
classes of ciphers, which are explained next:
• Private-key (or symmetric) ciphers: These ciphers have the same key shared between
the sender and the receiver. Thus, referring to Fig. 2.1, Ka = Kb.
• Public-key (or asymmetric) ciphers: In these ciphers we have Ka ≠ Kb: the encryption
key and the decryption key are different.
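The XOR rule shown above (with Ka = Kb = k, i.e., a symmetric cipher) can be sketched as follows; the byte values are illustrative:

```python
def e_k(x, k):
    """Encrypt: bitwise XOR with the key."""
    return x ^ k

def d_k(y, k):
    """Decrypt: XOR is self-invertible, so the same operation
    recovers the plaintext."""
    return y ^ k

x, k = 0b10110010, 0b01101100
assert d_k(e_k(x, k), k) == x
print(format(e_k(x, k), "08b"))  # 11011110
```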
These types differ mainly in the manner in which keys are shared. In symmetric-key or
private-key cryptography both the encryptor and the decryptor use the same key; thus, the key
must somehow be securely exchanged before secret-key communication can begin (through
a secure channel, Fig. 2.1). In public key cryptography the encryption and decryption
keys are different.
In such algorithms we have a key-pair, consisting of:
• Public key, which can be freely distributed and is used to encrypt messages. In
Fig. 2.1, this is denoted by the key Ka.
• Private key, which must be kept secret and is used to decrypt messages. The decryption
key is denoted by Kb in Fig. 2.1.
In public key or asymmetric ciphers, the two communicating parties, namely Alice and Bob,
each have their own key pair. They distribute their public
keys freely. Mallory has knowledge of not only the encryption function, the decryption
function, and the ciphertext, but also has the capability to encrypt messages using Bob's
public key. However, he is unaware of the secret decryption key, which is the private key
of the algorithm. The security of this class of algorithms relies on the assumption that it
is mathematically hard to obtain the private key from the public information.
Doing so would imply that the adversary solves a mathematical problem which is widely
believed to be difficult. It may be noted that we do not have proofs of their hardness;
however, we are unaware of any efficient techniques to solve them. The elegance of
constructing these ciphers lies in the fact that the public and private keys still have to be
related, in the sense that they perform inverse operations to recover the message.
This is achieved through a class of magical functions called one-way functions.
These functions are easy to compute in one direction, while computing the inverse from
the output is believed to be a difficult problem. We shall discuss this in more detail in a
following section. However, first let us see an example of this class of ciphers.
Example 20 This cipher is the famous RSA algorithm (named after Rivest, Shamir, and Adleman).
Let n = pq, where p and q are properly chosen large prime numbers. Here the proper
choice of p and q ensures that the factorization of n is computationally hard. The
plaintexts and ciphertexts are P = C = Z_n, the keys are Ka = {n, a} and Kb = {b, p, q},
such that ab ≡ 1 mod φ(n). The encryption and decryption functions are defined, ∀x ∈ P,
as eKa(x) = y = x^a mod n and dKb(y) = y^b mod n.
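A toy instantiation of Example 20 with tiny primes (real RSA uses primes of hundreds of digits; the numbers below are purely illustrative):

```python
p, q = 61, 53
n = p * q                   # 3233
phi = (p - 1) * (q - 1)     # 3120
a = 17                      # public exponent, gcd(a, phi) = 1
b = pow(a, -1, phi)         # private exponent, a*b = 1 mod phi

x = 1234                    # a plaintext in Z_n
y = pow(x, a, n)            # encryption: y = x^a mod n
assert pow(y, b, n) == x    # decryption: y^b mod n recovers x
print(y)
```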
The proof of correctness of the above algorithm follows from the combination of
Fermat's little theorem and the Chinese Remainder Theorem (CRT). The algorithm is
correct if ∀x ∈ P we have:

x^(ab) ≡ x mod n        (2.1)

It may be observed that since gcd(p, q) = 1, we have from the Extended Euclidean
Algorithm (EEA) that 1 = (q^(−1) mod p)q + (p^(−1) mod q)p. Thus, once Equation 2.1 is
established modulo both p and q, applying the CRT gives
x^(ab) ≡ x((q^(−1) mod p)q + (p^(−1) mod q)p) ≡ x mod n.
If x ≡ 0 mod p, then it is trivial that x^(ab) ≡ x mod p.
Otherwise, if x ≢ 0 mod p, then x^(p−1) ≡ 1 mod p. Also, since ab ≡ 1 mod φ(n) and φ(n) =
(p − 1)(q − 1), we have ab = 1 + k(p − 1)(q − 1) for some integer k.
Thus x^(ab) = x · x^(k(p−1)(q−1)) ≡ x mod p. Likewise, x^(ab) ≡ x mod q.
Combining the two facts by the CRT, we have x^(ab) ≡ x mod n. This shows the correctness
of the RSA cipher.
It may be observed that knowledge of the factors p and q helps to ascertain the
value of the decryption key Kb from the encryption key Ka. Likewise, if the decryption
key Kb is leaked, then n can be factored using a probabilistic algorithm with
probability of success at least 0.5.
Another kind of public key cipher is the ElGamal cryptosystem, which is based on
another hard problem, called the Discrete Log Problem (DLP).
Consider a finite mathematical group (G, ·). For an element α ∈ G of order n, the DLP is:
given β in the cyclic subgroup generated by α, find the unique integer i, 0 ≤ i ≤ n − 1, such
that α^i = β.
Example 21 Let p be a prime such that computing the DLP in (Z_p^*, ·) is hard. Let α ∈ Z_p^* be a
primitive element, and define the plaintext set as P = Z_p^* and the ciphertext set as C =
Z_p^* × Z_p^*. The key set is defined as K = {(p, α, a, β) : α^a ≡ β mod p}.
For a given k ∈ K, x ∈ P, and a secret number r ∈ Z_(p−1), define c =
ek(x, r) = (y1, y2), where y1 = α^r mod p and y2 = xβ^r mod p. This cryptosystem is called
the ElGamal cryptosystem. The decryption is straightforward: for a given ciphertext
c = (y1, y2), where y1, y2 ∈ Z_p^*, we have x = (y1^a)^(−1) · y2.
The plaintext x is thus masked by multiplying it by β^r in the second part of the ciphertext,
y2. The hint needed to decrypt is transmitted in the first part of the ciphertext in the form of
α^r. Only the receiver, who has the secret key a, can compute β^r by raising
α^r to the power a, as β ≡ α^a mod p. To decrypt and obtain x back, one just
needs to multiply y2 by the multiplicative inverse of β^r.
Thus one can observe that the ElGamal cipher is randomized: for the same
plaintext x one can obtain p − 1 different ciphertexts, depending on the choice of r.
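A toy instantiation of Example 21 (p = 467 with primitive element α = 2 is a small illustrative choice; deployed parameters are far larger):

```python
import random

p = 467
alpha = 2                   # a primitive element modulo 467
a = 127                     # receiver's secret key
beta = pow(alpha, a, p)

def encrypt(x, r):
    """c = (alpha^r mod p, x * beta^r mod p)."""
    return pow(alpha, r, p), (x * pow(beta, r, p)) % p

def decrypt(y1, y2):
    """x = (y1^a)^(-1) * y2 mod p, since y1^a = beta^r."""
    return (pow(pow(y1, a, p), -1, p) * y2) % p

x = 100
r = random.randrange(1, p - 1)   # fresh randomness per encryption
y1, y2 = encrypt(x, r)
assert decrypt(y1, y2) == x
```

Note that two encryptions of the same x under different r yield different ciphertexts, which is the randomization property observed above.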
An interesting point to note about the hardness of the DLP is that the difficulty arises
from the modular operation. Otherwise, α^i would be monotonically increasing,
and one could apply a binary search to obtain the value of i from given values of
α and β = α^i. However, as the operations are performed modulo p, there is no ordering
among the powers: a higher value of i can give a lower value of α^i. Thus, in the worst
case, one has to do a brute-force search among all the possible p − 1 values of i to obtain the
exact value (note that there is a unique value of i). Hence the time complexity is O(p). One
can instead use some storage and perform a time-memory trade-off.
An attacker can pre-compute and store all possible pairs (i, α^i), and then sort the
table on the second field using an efficient sorting method. The total storage
required is O(p) and the time to sort is O(p log p). Given a value of β, the time to search
is then O(log p). For complexity analysis of the DLP we sometimes neglect the logarithmic
factors, so in this case the time complexity is reduced to O(1) while the memory complexity is
increased to O(p). However, there are developments in cryptanalysis which allow us to solve
the DLP with a time-memory product of O(√p), but the study of these algorithms is beyond
the scope of this text.
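One such O(√p) method is Shanks' baby-step giant-step algorithm, sketched below with small illustrative parameters:

```python
from math import isqrt

def bsgs(alpha, beta, p):
    """Solve alpha^i = beta (mod p) for i with O(sqrt(p)) time and
    memory: store the 'baby steps' alpha^j, then walk 'giant steps'
    beta * alpha^(-m*i) until a table hit yields i*m + j."""
    m = isqrt(p - 1) + 1
    table = {pow(alpha, j, p): j for j in range(m)}   # baby steps
    factor = pow(alpha, -m, p)
    gamma = beta
    for i in range(m):
        if gamma in table:
            return i * m + table[gamma]
        gamma = (gamma * factor) % p                  # giant step
    return None

print(bsgs(2, pow(2, 199, 467), 467))  # 199
```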
Public (or asymmetric) and private (or symmetric) key algorithms have complementary
advantages and disadvantages, and they have their specific application areas. Symmetric key
ciphers have higher data throughput, but the key must remain secret at both ends. Thus,
in a large network there are many key pairs to be managed, and sound cryptographic
practice dictates that the key be changed frequently, ideally for each communication session.
The throughputs of the most popular public-key encryption methods, on the other hand,
are several orders of magnitude slower than the best-known symmetric key schemes; however,
in a large network public-key systems require considerably fewer keys, which need to be
changed less frequently. In practice, therefore, public-key cryptography is used for efficient
key management, while symmetric key algorithms are used for bulk data encryption.
In the next subsection, we highlight an application of public key systems to achieve
key exchange between two parties. The famous protocol known as the Diffie-Hellman key-exchange
is based on another hard problem, closely related to the DLP. This is called
the Diffie-Hellman Problem (DHP), and the key-exchange is called the Diffie-Hellman
(DH) key-exchange.
In this exchange, Alice and Bob (see Fig. 2.2) agree upon two public elements, p
and g. Alice has a secret element a, and Bob has a secret element b, where a, b ∈ Z_(p−1).
Alice computes x1 ≡ g^a mod p, while Bob computes x2 ≡ g^b mod p, and they exchange
these values over the network. Then Alice computes x2^a mod p, while Bob computes
x1^b mod p, both of which are the same. Apart from the agreement (which is quite evident),
the most important question is the secrecy of the agreed key: the untrusted third party
should not be able to compute the agreed key, which is numerically x1^b ≡ x2^a ≡ g^(ab) mod p.
The eavesdropper has to compute this value from the public information g and p
and the exchanged values x1 ≡ g^a mod p and x2 ≡ g^b mod p. This problem is known
as the Computational Diffie-Hellman Problem (CDH). As can be observed, this problem is
related to the DLP: anyone who can solve the DLP can obtain a or b and hence
solve the CDH problem as well. The other direction is, however, not so straightforward and
is beyond the current discussion.
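A sketch of the exchange with small illustrative numbers (deployed systems use primes of 2048 bits or more):

```python
p, g = 467, 2      # public parameters

a = 101            # Alice's secret
b = 211            # Bob's secret

x1 = pow(g, a, p)  # sent by Alice to Bob
x2 = pow(g, b, p)  # sent by Bob to Alice

k_alice = pow(x2, a, p)   # Alice's view of the shared key
k_bob = pow(x1, b, p)     # Bob's view of the shared key
assert k_alice == k_bob == pow(g, a * b, p)
print(k_alice)
```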
The classical DH key-exchange is subject to a simple man-in-the-middle (MiM)
attack. As an interceptor, Eve can modify the value x1 sent from Alice to Bob and hand
Bob a modified value x1′ ≡ g^t mod p, for some arbitrarily chosen t ∈ Z_(p−1). Similarly, she
modifies x2 received from Bob into x2′ = x1′ ≡ g^t mod p. Alice and Bob, unaware of this
attack scenario, go ahead with the DH key-exchange and compute the keys g^(ta) mod p and
g^(tb) mod p, respectively, which they use to communicate with each other. However, the
future messages encrypted with these keys can all be deciphered by Eve, since she too can
compute these keys using the exchanged values x1 and x2 and the public values g and p.
This simple attack motivates reinforcing the classical DH key-exchange, for example by
encrypting the exchanged messages with symmetric or asymmetric ciphers. Thus, for
end-to-end security, the interplay of symmetric and asymmetric ciphers is very important.
However, the objective of this text is to understand the design challenges of these primitives
in hardware.
One of the important classes of symmetric algorithms is block ciphers, which are used for
bulk data encryption. In the next section, we present an overview of block cipher structures.
As an important example, we present the Advanced Encryption Standard (AES), which
is the current standard block cipher. The AES algorithm uses finite field arithmetic, the
underlying field being of the form GF(2^8). Subsequently, in a later section, we describe a
present-day public key cipher, namely the elliptic curve cryptosystem. These ciphers rely on
arithmetic on elliptic curves, which can be defined over finite fields of characteristic 2
as well as prime fields.
[Figure: basic block cipher operation: a plaintext block Pj is encrypted with the key as Cj = E_K(Pj) and recovered as Pj = E_K^(−1)(Cj).]
In this mode, as shown in Fig. 2.5, the ciphertext of the previous block, Cj−1, is XORed with the next
plaintext block, Pj. Thus the ciphertext for the next block is Cj = EK(Pj ⊕ Cj−1). This
indicates that the output of the j-th instance depends on the output of the previous step.
Thus, although the block ciphers have an iterated structure (as we shall see in the following
sections), there is no benefit from pipelining: the next block encryption cannot start until
the encryption of the previous block is completed. However, there are other modes,
like the counter mode and Output Feedback (OFB) mode, where pipelining provides an advantage.
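A sketch of CBC chaining around a toy one-byte "block cipher" (an invertible affine map standing in for a real cipher such as AES; everything here is illustrative):

```python
def ek(x):
    """Toy block 'encryption': an invertible affine map on bytes."""
    return (x * 7 + 3) % 256

def dk(y):
    """Inverse of ek (7 is invertible modulo 256)."""
    return ((y - 3) * pow(7, -1, 256)) % 256

def cbc_encrypt(blocks, iv):
    out, prev = [], iv
    for p_j in blocks:
        prev = ek(p_j ^ prev)        # C_j = E_K(P_j XOR C_{j-1})
        out.append(prev)
    return out

def cbc_decrypt(blocks, iv):
    out, prev = [], iv
    for c_j in blocks:
        out.append(dk(c_j) ^ prev)   # P_j = D_K(C_j) XOR C_{j-1}
        prev = c_j
    return out

pt = [0x41, 0x42, 0x41]
ct = cbc_encrypt(pt, iv=0xA5)
assert cbc_decrypt(ct, iv=0xA5) == pt
assert ct[0] != ct[2]   # equal plaintext blocks encrypt differently
```

The serial dependence is visible in cbc_encrypt: each iteration needs `prev` from the one before, which is exactly why pipelining gives no benefit in this mode.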
[FIGURE 2.5: The Cipher Block Chaining (CBC) mode: C0 = IV; encryption computes Cj = E_K(Pj ⊕ Cj−1); decryption computes Pj = E_K^(−1)(Cj) ⊕ Cj−1.]
[Figure: general structure of an iterated block cipher: the plaintext block is first combined with the secret key (key whitening), then passes through the rounds to produce the ciphertext block.]
1. Addition with Round Key: The message state is typically XORed with the round
key.
2. Diffusion Layer: This layer provides diffusion to the cipher. The step is typically a
linear transformation with respect to the XOR operation; hence each output bit can be
expressed in terms of the input using only XOR gates. These transformations are thus
easy to implement and can be applied on larger block lengths, as the resource
requirement is typically small.
3. S-Box: This is generally a key-less transformation, commonly referred to as the substitution
box, or S-Box. It provides the much-needed confusion to the cipher, as it
makes the algebraic relations of the ciphertext bits in terms of the message state bits
and the key bits more complex. S-Boxes are typically non-linear with respect to the XOR
operation; they require AND gates in addition to XOR gates. These
transformations are mathematically complex and pose a large overhead; hence they
are often performed in smaller chunks. The hardware required also grows fast with
the input size and thus calls for special implementation techniques.
The rounds combine the diffusion and the substitution layers suitably for achieving
security. In the following section we present the design of the AES algorithm, to illustrate
the construction of a block cipher.
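As a toy illustration of how these layers compose into a round, the sketch below builds a 16-bit substitution-permutation round from a hypothetical 4-bit S-Box and a bit permutation; both are arbitrary illustrative choices, not taken from any standard:

```python
SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
        0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]  # hypothetical 4-bit S-Box

def round_fn(state, rkey):
    state ^= rkey                          # addition with the round key
    subbed = 0
    for i in range(4):                     # confusion: nibble-wise S-Box
        subbed |= SBOX[(state >> (4 * i)) & 0xF] << (4 * i)
    mixed = 0
    for i in range(16):                    # diffusion: XOR-linear bit permutation
        dest = (4 * i) % 15 if i != 15 else 15
        mixed |= ((subbed >> i) & 1) << dest
    return mixed
```

Since the S-Box and the permutation are bijections, the round is invertible for any round key; a real cipher iterates such rounds under a key schedule.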
The state matrices of AES undergo transformations through the rounds of the cipher.
The plaintext is of 128 bits, arranged in the state matrix so that each of the 16
bytes is an element of the state matrix. The AES key can also be arranged in a similar
fashion, comprising Nk words of length 4 bytes each. The input key is expanded by a
Key-Scheduling algorithm to an expanded key w. The plaintext state matrix (denoted
by in) is transformed by the round keys, which are extracted from the expanded key w.
The final cipher (denoted by out) is the result of applying the encryption algorithm, Cipher,
on the plaintext, in. In the next two sections, we present the round functions and the key
scheduling algorithm respectively.
m(X) = X^8 + X^4 + X^3 + X + 1
Thus the extension field GF(2^8) is created, and the elements of the field are expressible
as polynomials in GF(2)[X]/⟨m(X)⟩. Each non-zero element has a multiplicative inverse,
which can be computed by the extended Euclidean algorithm. This forms the basis of what
is known as the SubBytes step of the algorithm.
2.3.3.1 SubBytes
The SubBytes step is a non-linear byte-wise function. It acts on the bytes of the state
and subsequently applies an affine transformation (Fig. 2.7). The step is
based on the computation of the finite field inverse, which is as follows:

x′ = x^(−1), if x ≠ 0
x′ = 0, otherwise
The final output is computed as y = A(x′ ) + B, where A and B are fixed matrices defined
as follows:
        [ 1 0 0 0 1 1 1 1 ]
        [ 1 1 0 0 0 1 1 1 ]
        [ 1 1 1 0 0 0 1 1 ]
    A = [ 1 1 1 1 0 0 0 1 ]        (2.2)
        [ 1 1 1 1 1 0 0 0 ]
        [ 0 1 1 1 1 1 0 0 ]
        [ 0 0 1 1 1 1 1 0 ]
        [ 0 0 0 1 1 1 1 1 ]
Here, (B)t represents the transpose of B, and the left most bit is the LSB.
The InvSubBytes step operates upon the bytes in the reverse order. It is defined as:
X = Y^(−1) A^(−1) + D        (2.4)
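The SubBytes mapping can be reproduced directly from this definition. The sketch below computes the inverse in GF(2^8) naively, by repeated multiplication (x^254 = x^(−1)), and then applies the affine transformation of Equation 2.2 with the constant byte {63}:

```python
def gf_mul(a, b):
    # multiplication in GF(2^8) modulo m(X) = X^8 + X^4 + X^3 + X + 1
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def gf_inv(x):
    # x^254 = x^(-1); a lookup table or extended Euclid is used in practice
    r = 1
    for _ in range(254):
        r = gf_mul(r, x)
    return r if x else 0

def sub_byte(x):
    # affine step: y_i = x_i + x_{i+4} + x_{i+5} + x_{i+6} + x_{i+7} + c_i
    # (indices mod 8), applied to the field inverse, with c = {63}
    x = gf_inv(x)
    y = 0
    for i in range(8):
        bit = ((x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
               ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        y |= bit << i
    return y
```

This reproduces the standard AES S-Box entries, e.g. the well-known mapping of {53} to {ED}.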
2.3.3.2 ShiftRows
In the operation ShiftRows, the rows of the State are cyclically left shifted over different
offsets. We denote the number of shifts of the 4 rows by c0, c1, c2 and c3. The shift offsets
c0, c1, c2 and c3 depend on Nb. The different values of the shift offsets are specified in
Table 2.1 [103].
The InvShiftRows operation performs the circular shift in the opposite direction. The offset
values for InvShiftRows are the same as for ShiftRows (Table 2.1). ShiftRows implementations
do not require any resources, as they can be realized by rewiring.
2.3.3.3 MixColumns
The MixColumns transformation operates on each column of State (X) individually. Each
column of the state matrix can be imagined as the extension field GF (28 )4 . For 0 ≤ j ≤ N b
a column of the state matrix S is denoted by the polynomial:
sj (X) = s3,j X 3 + s2,j X 2 + s1,j X + s0,j ∈ GF (28 )[X]
The transformation for MixColumns is denoted by the polynomial:
m(X) = {03}X 3 + {01}X 2 + {01}X + {02} ∈ GF (28 )[X]
The output of the MixColumns operation is obtained by taking the product of the above
two polynomials, sj (X) and m(X) over the field GF (28 )4 , with the reduction polynomial
being X 4 + 1.
Thus the output can be expressed as a modified column, computed as follows:
s′j (X) = (sj (X) ∗ m(X)) mod (X 4 + 1), 0 ≤ j < N b
The transformation can also be viewed as a linear transformation over GF(2^8)^4 as follows:

    [ s′0,j ]   [ {02} {03} {01} {01} ] [ s0,j ]
    [ s′1,j ] = [ {01} {02} {03} {01} ] [ s1,j ]        (2.7)
    [ s′2,j ]   [ {01} {01} {02} {03} ] [ s2,j ]
    [ s′3,j ]   [ {03} {01} {01} {02} ] [ s3,j ]
In case of InvMixColumns, the inverse of the same polynomial is used. If m^(−1)(X) denotes
the polynomial of the InvMixColumns transformation that operates on State X,
then
s′′j(X) = (s′j(X) ∗ m^(−1)(X)) mod (X^4 + 1), 0 ≤ j < Nb
In matrix form the InvMixColumns transformation can be expressed as:

    [ s′′0,j ]   [ {0E} {0B} {0D} {09} ] [ s0,j ]
    [ s′′1,j ] = [ {09} {0E} {0B} {0D} ] [ s1,j ]        (2.8)
    [ s′′2,j ]   [ {0D} {09} {0E} {0B} ] [ s2,j ]
    [ s′′3,j ]   [ {0B} {0D} {09} {0E} ] [ s3,j ]
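Because the forward matrix entries are only {01}, {02} and {03}, a column can be mixed with nothing more than the conditional-XOR multiplication by X in GF(2^8) (often called xtime). A sketch:

```python
def xtime(a):
    # multiply by {02} in GF(2^8) modulo X^8 + X^4 + X^3 + X + 1
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a

def gmul(a, c):
    # multiply a by a small constant c ({01}, {02} or {03} here)
    r = 0
    while c:
        if c & 1:
            r ^= a
        a = xtime(a)
        c >>= 1
    return r

MIX = [[0x02, 0x03, 0x01, 0x01],
       [0x01, 0x02, 0x03, 0x01],
       [0x01, 0x01, 0x02, 0x03],
       [0x03, 0x01, 0x01, 0x02]]

def mix_column(col):
    # Equation 2.7 applied to one 4-byte column of the state
    return [gmul(col[0], row[0]) ^ gmul(col[1], row[1])
            ^ gmul(col[2], row[2]) ^ gmul(col[3], row[3]) for row in MIX]
```

This low constant-multiplier cost is what makes MixColumns cheap in hardware, in contrast to the larger {09}–{0E} constants of InvMixColumns.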
2.3.3.4 AddRoundKey
Let the input state of a particular round of the cipher be denoted by s. The
columns of the state are denoted by s0, s1, . . . , sNb−1. The function AddRoundKey(state,
w[round*Nb,(round+1)*Nb-1]) is denoted as:
s′j = sj ⊕ w[round · Nb + j], 0 ≤ j < Nb
Here ⊕ is the bit-wise XOR operation. Thus the words of the round key are combined with the
state through a mod 2 addition (bitwise XOR). The objective of the key mixing step is to
make every round state after the key mixing independent of the previous rounds, assuming
that the round keys are generated by an efficient key-scheduling algorithm, which is detailed
next.
The input to RotWord is also a word (b0, b1, b2, b3). The output is (b1, b2, b3, b0),
which is nothing but a bytewise left cyclic rotation applied on the input word.
Finally, the round constant is defined as Rcon[n] = ({02}^n, {00}, {00}, {00}). The
round constants are added to the round keys to provide asymmetry to the key expansion
algorithm and protect against certain classes of attacks.
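RotWord and the round constants can be sketched as follows; {02}^n is computed by repeated multiplication by X in GF(2^8), following the indexing used in the text:

```python
def rot_word(w):
    # bytewise left cyclic rotation of a 4-byte word
    b0, b1, b2, b3 = w
    return (b1, b2, b3, b0)

def xtime(a):
    # multiply by {02} in GF(2^8) modulo X^8 + X^4 + X^3 + X + 1
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a

def rcon(n):
    # Rcon[n] = ({02}^n, {00}, {00}, {00})
    v = 1
    for _ in range(n):
        v = xtime(v)
    return (v, 0, 0, 0)
```

Note that the reduction is what makes the higher round constants (e.g. {02}^8 = {1B}) non-obvious powers of two.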
Let (γ1 Y + γ0) and (δ1 Y + δ0) be multiplicative inverses of each other, i.e., (γ1 Y + γ0)(δ1 Y +
δ0) ≡ 1 mod (Y^2 + τY + µ). Thus, expanding the product and equating it to 1 by matching the
coefficients, we can write the following simultaneous equations:
γ1 δ0 + γ0 δ1 + γ1 δ1 τ = 0
γ0 δ0 + γ1 δ1 µ = 1
We solve the above equations to compute the values of δ0 and δ1:
δ0 = (γ0 + γ1 τ)(γ0^2 + γ0 γ1 τ + γ1^2 µ)^(−1)
δ1 = γ1 (γ0^2 + γ0 γ1 τ + γ1^2 µ)^(−1)
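These formulas can be checked mechanically. The sketch below uses GF(2^4) = GF(2)[x]/(x^4 + x + 1) with τ = 1 and a µ chosen so that Y^2 + τY + µ is irreducible; these particular polynomials are illustrative choices, not those mandated by any specific AES implementation:

```python
def gf16_mul(a, b):
    # multiplication in GF(2^4) modulo x^4 + x + 1
    p = 0
    for _ in range(4):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return p

def gf16_inv(a):
    # brute force; in hardware this is the small subfield inverse table
    return next(b for b in range(1, 16) if gf16_mul(a, b) == 1)

TAU = 1
# Y^2 + TAU*Y + MU is irreducible iff MU is not of the form y^2 + TAU*y
MU = next(m for m in range(1, 16)
          if all(gf16_mul(y, y) ^ gf16_mul(TAU, y) != m for y in range(16)))

def composite_inv(g0, g1):
    # inverse of (g1*Y + g0) in GF((2^4)^2) via the formulas above:
    # delta0 = (g0 + g1*tau) * D^-1, delta1 = g1 * D^-1,
    # with D = g0^2 + g0*g1*tau + g1^2*mu computed entirely in GF(2^4)
    d = (gf16_mul(g0, g0) ^ gf16_mul(gf16_mul(g0, g1), TAU)
         ^ gf16_mul(gf16_mul(g1, g1), MU))
    dinv = gf16_inv(d)
    return gf16_mul(g0 ^ gf16_mul(g1, TAU), dinv), gf16_mul(g1, dinv)
```

The only inversion performed is gf16_inv, in the subfield, which is exactly the reduction the text describes.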
1 In order to map an element in GF(2^8) to an element in the composite field GF((2^4)^2), as discussed in
Section 1.12.1, the element is multiplied with a transformation matrix, T.
One such matrix T is as follows [4]:
        [ 1 0 1 0 0 0 0 0 ]
        [ 1 0 1 0 1 1 0 0 ]
        [ 1 1 0 1 0 0 1 0 ]
    T = [ 0 1 1 1 0 0 0 0 ]
        [ 1 1 0 0 0 1 1 0 ]
        [ 0 1 0 1 0 0 1 0 ]
        [ 0 0 0 0 1 0 1 0 ]
        [ 1 1 0 1 1 1 0 1 ]
However, other transformations are also possible, depending on the corresponding irreducible polynomials
of the fields GF(2^8), GF(2^4) and GF((2^4)^2).
The computations can also be similarly reworked if the basis is normal. Consider the
normal basis (Y, Y^16). Since both the elements of the basis are roots of the polynomial
Y^2 + τY + µ = 0, we have the following identities, which we use in the equations for
multiplication and inversion of the elements in the composite field:
Y^2 = τY + µ
1 = τ^(−1)(Y^16 + Y)
µ = (Y^16)Y
Thus, if (γ1 Y^16 + γ0 Y) and (δ1 Y^16 + δ0 Y) are inverses of each other, then we can equate
their product to 1 = τ^(−1)(Y^16 + Y). Equating the coefficients we have:
These equations show that the inverse in the field GF(2^8) can be reduced to an inverse
in the smaller field GF(2^4), along with several additional operations, like addition, multipli-
cation, and squaring in the subfield. The inverses in the subfield can be stored in a smaller
table (as compared to a table storing the inverses of GF(2^8)). The operations in GF(2^4) can
in turn be expressed in the sub-subfield GF(2^2), where inversion is the same as squaring.
Depending on the choices of the irreducible polynomials, the level of decomposition,
and the choices of the bases of the fields, the complexity of the computations differs, and is
a subject of significant research. In the subsequent chapter on implementation of the AES
algorithm, we shall discuss this aspect and the performances achieved.
The overhead of the composite field approach lies in the transformations among the different
field representations, which are performed once at the beginning and once at the end.
The Rijndael round transformations in the subfield are defined as follows. Consider the trans-
formation T that maps an element from GF(2^8) to GF((2^4)^2). As discussed before, T is an
8 × 8 binary transformation matrix, which operates on each byte of the
4 × 4 state matrix of AES. Denote the AES state by S, where each element is denoted by
bij, with 0 ≤ i, j ≤ 3. Thus, an element x ∈ GF(2^8) is mapped to T(x) ∈ GF((2^4)^2).
Now let us consider each of the round transformations one by one:
(a) Inverse: b′i,j = (bij )−1 . In the composite field, we have T(b′i,j ) = (T(bij ))−1 .
Note that the inverse on the RHS of the above equation is in GF (24 )2 . The
computation of the inverse is as explained above in section 2.4.1.
(b) Affine: b′′i,j = A(b′i,j) + B. Here A and B are fixed matrices as discussed in Section
2.3.3.1. In the composite field, T(b′′i,j) = T(A(b′i,j)) + T(B) = T A T^(−1)(T(b′i,j)) +
T(B). Thus the matrices of the SubBytes operation need to be changed by
applying the transformation matrix T.
2. ShiftRows: This step remains the same as this is a mere transposition of bytes and
the field transformation to the composite field is localized inside a byte.
3. MixColumns: Since the additions, in either the original field or the composite field,
are in a characteristic-2 field, they are bitwise XORs in both representations.
4. Add Round Key: The operation is s′i,j = si,j + ki,j , where ki,j is a particular byte of the
round key. In the composite field, thus this transformation is T (s′i,j ) = T (si,j )+T (ki,j ).
Again the addition is a bitwise XOR.
This implies that the round keys also need to be computed in the composite field.
Hence, similar transformations also need to be performed on the key-scheduling al-
gorithm.
We shall discuss these transformations, and their effect in realizing an
efficient and compact AES design, in Chapter 4. In the next section, we present an overview
of a popular public key encryption algorithm, known as Elliptic Curve Cryptography
(ECC), which leads to much more efficient implementations compared to older generation
algorithms like RSA, ElGamal, etc.
Some discrete values of y are plotted wrt. x in Fig. 2.9. Curves of this
nature are commonly called Elliptic Curves: these are curves which are quadratic wrt. y and
cubic wrt. x. It may be observed that the curve has two distinct regions or lobes. Also, since
the curve is quadratic wrt. y, the curve is symmetric about the x-axis.
We next present a method by Diophantus of Alexandria, who lived around 200 A.D., to
determine non-trivial points on the curve. This method uses a set of known points to find
an unknown point on the curve. Let us start with two trivial points: (0, 0) and (1, 1). Clearly,
neither of these two points indicates a solution to the puzzle.
Now the equation of the straight line between these two points is: y = x. Since the equation
of the curve is cubic wrt. x, the straight line must intersect the curve at a third point (the
points may not be distinct though!). In order to obtain the third point, we substitute y = x
in the equation of the curve, y^2 = x(x + 1)(2x + 1)/6, and we obtain:
x^3 − (3/2)x^2 + (1/2)x = 0
We know that x = 0 and 1 are two roots of the equation. From the theory of equations,
if the third root of the equation is x = α, we have 0 + 1 + α = 3/2 ⇒ α = 1/2. Since the
point is on the line y = x, we have y = 1/2.
Thus (1/2, 1/2) is a point on the curve. Since the curve is symmetric about the x-axis, (1/2, −1/2)
is also another point on the curve. However, these points also do not provide a solution, as
they are not integral.
Now, consider a straight line through (1/2, −1/2) and (1, 1). The equation of this line is
y = 3x − 2, and intersecting it with the curve we have:
x^3 − (51/2)x^2 + · · · = 0
Thus, again the third root x = β can be obtained from 1 + 1/2 + β = 51/2 ⇒ β = 24. The
corresponding y value is 70, and so we have a non-trivial solution of the puzzle as 4900.
Through this seemingly simple puzzle, we have observed an interesting geometric method
to solve an algebraic problem. This technique forms the base of the geometric techniques
(the chord-and-tangent rule) in Elliptic Curves.
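The chord construction above is easy to verify with exact rational arithmetic. The following sketch recomputes the third intersection point from the sum-of-roots argument: after substituting the line y = λx + c into 6y^2 = 2x^3 + 3x^2 + x, the three roots sum to (6λ^2 − 3)/2.

```python
from fractions import Fraction as F

def on_curve(x, y):
    # the "cannonball" curve y^2 = x(x + 1)(2x + 1)/6
    return y * y == x * (x + 1) * (2 * x + 1) / F(6)

def chord_third_point(p, q):
    # third intersection of the chord through p and q with the curve
    (x1, y1), (x2, y2) = p, q
    lam = (y2 - y1) / (x2 - x1)
    x3 = (6 * lam * lam - 3) / F(2) - x1 - x2  # sum of roots minus known roots
    y3 = y1 + lam * (x3 - x1)
    return x3, y3
```

Starting from (0, 0) and (1, 1) this yields (1/2, 1/2); reflecting and chording again with (1, 1) yields (24, 70), i.e., 70^2 = 4900 cannonballs.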
An Elliptic curve over a field K is a cubic curve in two variables, denoted as f (x, y) = 0,
along with a rational point, which is referred to as the point at infinity. The field K is
usually taken to be the complex numbers, reals, rationals, algebraic extensions of rationals,
p-adic numbers, or a finite field. Elliptic curve groups for cryptography are examined with
the underlying fields Fp (where p > 3 is a prime) and F2m (a binary field with
2^m elements).
A general form of the curve is introduced next. However the curve equation used for im-
plementation is often transformed forms of this curve using the properties of the underlying
field K.
Definition 2.5.1 An elliptic curve E over the field K is given by the Weierstraß equation
mentioned in Equation 2.9. The generalized Weierstraß equation is :
E : y 2 + a1 xy + a3 y = x3 + a2 x2 + a4 x + a6 (2.9)
This equation, known as the generalized Weierstraß equation, defines the Elliptic Curve
E over the field K. It may be noted that if E is defined over K, it is also defined over any
extension of the field K. If L is any extension of K, then the set of L-rational points on E
is defined as:
E(L) = {(x, y) ∈ L × L : y^2 + a1xy + a3y = x^3 + a2x^2 + a4x + a6} ∪ {O}
where O is the point at infinity. Two elliptic curves
E1 : y^2 + a1xy + a3y = x^3 + a2x^2 + a4x + a6
E2 : y^2 + a′1xy + a′3y = x^3 + a′2x^2 + a′4x + a′6
defined over K are isomorphic over K if there exist u, r, s, t ∈ K, with u ≠ 0, such that the
admissible change of variables
(x, y) → (u^2x + r, u^3y + u^2sx + t)
transforms equation E1 into equation E2. We next present those simplifications for different
characteristics of K.
Characteristic of K neither 2 nor 3: An admissible change of variables transforms the
curve E to the form:
y^2 = x^3 + ax + b
where a, b ∈ K.
Characteristic of K is 2: If a1 ≠ 0, the admissible change of variables
(x, y) → (a1^2 x + a3/a1, a1^3 y + (a1^2 a4 + a3^2)/a1^3)
transforms the curve E to the form:
y^2 + xy = x^3 + ax^2 + b
(a non-supersingular curve). If a1 = 0, the admissible change of variables
(x, y) → (x + a2, y)
transforms E to the form:
y^2 + cy = x^3 + ax + b
(a supersingular curve), where a, b, c ∈ K.
Characteristic of K is 3: If a1^2 = −a2, then the admissible change of variables
(x, y) → (x, y + a1x + a3)
transforms E to the form:
y^2 = x^3 + ax + b
otherwise E can be transformed to the form:
y^2 = x^3 + ax^2 + b
where a, b ∈ K. The discriminant of the latter curve is ∆ = −a^3 b.
A point (x0, y0) on the curve F(x, y) = y^2 − f(x) = 0 is singular if both partial derivatives
vanish:
∂F/∂x (x0, y0) = ∂F/∂y (x0, y0) = 0
or, −f′(x0) = 2y0 = 0
or, f(x0) = f′(x0) = 0
Thus f has a double root at the point (x0, y0). Usually we assume that Elliptic Curves do
not have singular points. Let us find the condition for the curve defined as y^2 = x^3 + Ax + B,
defined over a field K of appropriate characteristic.
Thus, we have:
3x^2 + A = 0 ⇒ x^2 = −A/3
Also we have,
x^3 + Ax + B = 0 ⇒ x^4 + Ax^2 + Bx = 0
⇒ (−A/3)^2 + A(−A/3) + Bx = 0 ⇒ x = 2A^2/(9B)
⇒ 3(2A^2/(9B))^2 + A = 0 ⇒ 4A^3 + 27B^2 = 0
Thus, the criterion for non-singularity of the curve is ∆ = 4A^3 + 27B^2 ≠ 0. For elliptic
curve cryptography, the curves used do not have singularities.
a finite field, where a single operation is used): addition when the two points are distinct,
and doubling when the points are the same.
We summarize the properties of the addition operation (doubling is a special case of the
addition operation when the two points are the same). The addition is denoted by the symbol
+ below for an elliptic curve E(K), where K is some underlying field.
Given two points P, Q ∈ E(K), there is a third point, denoted by P + Q ∈ E(K), and
the following relations hold for all P, Q, R ∈ E(K):
• P + Q = Q + P (commutativity)
• (P + Q) + R = P + (Q + R) (associativity)
For cryptography, the points on the elliptic curve are chosen from a large finite field.
The set of points on the elliptic curve form a group under the addition rule. The point at
infinity, denoted by O, is the identity element of the group. The operations on the elliptic
curve, i.e., the group operations are point addition, point doubling and point inverse. Given
a point P = (x, y) on the elliptic curve, and a positive integer n, scalar multiplication is
defined as
nP = P + P + P + · · · P (n times) (2.10)
The order of the point P is the smallest positive integer n such that nP = O. The points
{O, P, 2P, 3P, · · · , (n − 1)P} form a group generated by P. The group is denoted as ⟨P⟩.
The security of ECC is provided by the Elliptic Curve Discrete Logarithm Problem
(ECDLP), which is defined as follows: Given a point P on the elliptic curve and another
point Q ∈ ⟨P⟩, determine an integer k (0 ≤ k < n) such that Q = kP. The difficulty of
ECDLP lies in calculating the value of the scalar k given the points P and Q. k is called the
discrete logarithm of Q to the base P. P is the generator of the elliptic curve group and is
called the basepoint.
The ECDLP forms the basis on which asymmetric key algorithms are built. These algo-
rithms include the elliptic curve Diffie-Hellman key exchange, elliptic curve ElGamal public
key encryption, and the elliptic curve digital signature algorithm.
Next we define the above operations and the underlying computations for elliptic curves
of characteristic 2, which is the object of focus of this textbook.
Definition 2.5.2 An elliptic curve E over the field GF (2m ) is given by the simplified form
of the Weierstraß equation mentioned in Equation 2.9. The simplified Weierstraß equation
is :
y 2 + xy = x3 + ax2 + b (2.11)
with the coefficients a and b in GF(2^m) and b ≠ 0.
Point Inversion: Let P be a point on the curve with coordinates (x1 , y1 ), then the inverse
of P is the point −P with coordinates (x1 , x1 + y1 ). The point −P is obtained by drawing
a vertical line through P . The point at which the line intersects the curve is the inverse of
P.
Let P = (x1, y1) be a point on the elliptic curve of Equation 2.11. To find the inverse of
point P, a vertical line is drawn passing through P. The equation of this line is x = x1. The
point at which this line intersects the curve is the inverse −P. The coordinates of −P are
(x1, y1′). To find y1′, the point of intersection between the line and the curve must be found.
Equation 2.12 is represented in terms of its roots p and q as shown below.
The coefficient of y is the sum of the roots. Equating the coefficients of y in Equations 2.12
and 2.14:
p + q = x1
One of the roots is q = y1 , therefore the other root p is given by
p = x1 + y1
This is the y coordinate of the inverse. The inverse of the point P is therefore given by
(x1 , x1 + y1 ).
Point Addition: Let P and Q be two points on the curve with coordinates (x1 , y1 ) and
(x2 , y2 ). Also, let P 6= ±Q, then adding the two points results in a third point R = (P + Q).
The addition is performed by drawing a line through P and Q as shown in Fig. 2.10. The
point at which the line intersects the curve is −(P + Q). The inverse of this is R = (P + Q).
Let the coordinates of R be (x3, y3); then the equations for x3 and y3 are
x3 = λ^2 + λ + x1 + x2 + a
y3 = λ(x1 + x3) + x3 + y1        (2.15)
[Figure 2.10: Chord-and-tangent rule on an elliptic curve. The line through P and Q meets the curve at −(P+Q), whose reflection is (P+Q); the tangent at P meets the curve at −2P, whose reflection is 2P]
two points, a line (l) is drawn through P and Q. If P ≠ ±Q, the line intersects the curve
of Equation 2.11 at the point −R = (x3, y3′). The inverse of the point −R is R = (P + Q),
having coordinates (x3, y3).
The slope of the line l passing through P and Q is given by
λ = (y2 − y1)/(x2 − x1)
The equation of the line l is
y − y1 = λ(x − x1)
y = λ(x − x1) + y1        (2.16)
(x − p)(x − q)(x − r) = 0
x^3 − (p + q + r)x^2 + · · · = 0        (2.18)
p + q + r = λ^2 + λ + a        (2.19)
Since P = (x1, y1) and Q = (x2, y2) lie on the line l, two roots of Equation 2.17
are x1 and x2. Substituting p = x1 and q = x2 in Equation 2.19, we get the third root; this
is the x coordinate of the third point at which the line intersects the curve (i.e., −R). This
point is denoted by x3, and it also represents the x coordinate of R.
x3 = λ^2 + λ + x1 + x2 + a        (2.20)
y3 = λ(x3 + x1 ) + y1 + x3 (2.22)
Since we are working with binary finite fields, subtraction is the same as addition.
Therefore,
x3 = λ^2 + λ + x1 + x2 + a
y3 = λ(x3 + x1) + y1 + x3        (2.23)
λ = (y2 + y1)/(x2 + x1)
Point Doubling: Let P be a point on the curve with coordinates (x1, y1) and P ≠ −P.
The double of P is the point 2 · P = (x3, y3), obtained by drawing a tangent to the curve
through P. The inverse of the point at which the tangent intersects the curve is the double
of P (Fig. 2.11). The equation for computing 2 · P is given as
x3 = λ^2 + λ + a = x1^2 + b/x1^2
y3 = x1^2 + λx3 + x3        (2.24)
Differentiating the curve equation implicitly,
2y(dy/dx) + x(dy/dx) + y = 3x^2 + 2ax
Since we are using characteristic 2 arithmetic,
x(dy/dx) + y = x^2
The slope dy/dx of the tangent t passing through the point P is given by
λ = (x1^2 + y1)/x1        (2.25)
The equation of the line t can be represented by the following.
y + y1 = λ(x + x1)        (2.26)
This gives,
y = λ(x + x1) + y1
y = λx + c for some constant c
The tangent meets the curve at a double root p = x1 and one further root r:
p + p + r = λ^2 + λ + a
r = λ^2 + λ + a
since p + p = 0 in characteristic 2. The dissimilar root r corresponds to the x coordinate of −2P, i.e., x3. Therefore,
x3 = λ^2 + λ + a
To find the y coordinate of −2P, i.e., y3′, substitute x3 in Equation 2.26. This gives,
y3′ = λx3 + λx1 + y1
y3′ = λx3 + x1^2
To find y3, the y coordinate of 2P, the point y3′ is reflected on the x axis. From the point
inverse equation,
y3 = λx3 + x1^2 + x3
To summarize, the coordinates of the double are given by Equation 2.28:
x3 = λ^2 + λ + a
y3 = x1^2 + λx3 + x3        (2.28)
λ = x1 + y1/x1
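The addition and doubling formulas (Equations 2.15 and 2.24) can be exercised on a deliberately tiny binary field. The sketch below uses GF(2^4) with reduction polynomial x^4 + x + 1 and the hypothetical curve y^2 + xy = x^3 + x^2 + 1, chosen only so the arithmetic is easy to inspect:

```python
A, B = 1, 1  # curve y^2 + xy = x^3 + A*x^2 + B over GF(2^4), with B != 0

def fmul(a, b):
    # multiplication in GF(2^4) modulo x^4 + x + 1
    p = 0
    for _ in range(4):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return p

def finv(a):
    return next(b for b in range(1, 16) if fmul(a, b) == 1)

def on_curve(P):
    x, y = P
    return fmul(y, y) ^ fmul(x, y) == fmul(fmul(x, x), x) ^ fmul(A, fmul(x, x)) ^ B

def point_add(P, Q):
    # Equation 2.15, valid for P != +-Q
    (x1, y1), (x2, y2) = P, Q
    lam = fmul(y1 ^ y2, finv(x1 ^ x2))
    x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    return x3, fmul(lam, x1 ^ x3) ^ x3 ^ y1

def point_double(P):
    # Equation 2.24, valid for x1 != 0
    x1, y1 = P
    lam = x1 ^ fmul(y1, finv(x1))
    x3 = fmul(lam, lam) ^ lam ^ A
    return x3, fmul(x1, x1) ^ fmul(lam, x3) ^ x3
```

Note that each operation contains exactly one finv call, the inversion counted in the running-time estimates of the next section.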
The fundamental algorithm for ECC is the scalar multiplication, which can be realized
using the basic double and add computations as shown in Algorithm 2.3. The inputs to the
algorithm are a basepoint P and an m-bit scalar k. The result is the scalar product kP, which
is equivalent to adding the point P k times.
As an example of how Algorithm 2.3 works, consider k = 22. The binary equivalent of
this is (10110)2. Table 2.2 below shows how 22P is computed.
Each iteration over i does a doubling on Q if ki is 0, or a doubling followed by an addition if
ki is 1. The underlying operations in the addition and doubling equations use the finite field
arithmetic discussed in the previous section. Both point doubling and point addition have
1 inversion (I) and 2 multiplications (M) each (from Equations 2.15 and 2.24), neglecting
squaring operations, which are free in characteristic 2. From this, the entire scalar multiplication
for the m-bit scalar k will have m doublings and m/2 additions, each costing (1I + 2M) (assuming
k has approximately m/2 ones on average). The overall expected running time of the
scalar multiplier is therefore obtained as
ta ≈ (3M + (3/2)I)m        (2.29)
For this expected running time, finite field addition and squaring operations have been
neglected as they are simple operations and can be considered to have no overhead to the
run time.
TABLE 2.2: Scalar Multiplication using Double and Add to find 22P
i ki Operation Q
3 0 Double only 2P
2 1 Double and Add 5P
1 1 Double and Add 11P
0 0 Double only 22P
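The control flow of this MSB-first algorithm is independent of the group it runs in, so it can be sketched with the group operations passed in as parameters; integer addition serves below as a stand-in group for testing the bit-scanning logic:

```python
def scalar_mult_msb(k, P, add, double):
    # MSB-first double-and-add: scan the scalar from bit m-2 down to 0,
    # doubling every step and adding P when the bit is 1 (k_{m-1} = 1)
    bits = bin(k)[2:]
    Q = P
    for b in bits[1:]:
        Q = double(Q)
        if b == '1':
            Q = add(Q, P)
    return Q
```

With integers, scalar_mult_msb(22, 1, ...) walks exactly the trace of Table 2.2: P, 2P, 4P, 5P, 10P, 11P, 22P.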
Y 2 + XY Z = X 3 + aX 2 Z 2 + bZ 4 (2.30)
Let P = (X1 , Y1 , Z1 ) be an LD projective point on the elliptic curve, then the inverse of
point P is given by −P = (X1 , X1 Z1 + Y1 , Z1 ). Also, P + (−P ) = O, where O is the point
at infinity. In LD projective coordinates O is represented as (1, 0, 0).
The equations for doubling the point P in LD projective coordinates [227] result in the
point 2P = (X3, Y3, Z3):
Z3 = X1^2 · Z1^2
X3 = X1^4 + b · Z1^4        (2.31)
Y3 = b · Z1^4 · Z3 + X3 · (a · Z3 + Y1^2 + b · Z1^4)
The equations for doubling require 5 finite field multiplications and zero inversions.
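Equation 2.31 can be cross-checked against the affine doubling formula on a tiny field. The sketch below doubles a point both ways over GF(2^4) (reduction polynomial x^4 + x + 1, hypothetical curve parameters a = b = 1) and maps the projective result back through x = X/Z, y = Y/Z^2:

```python
A, B = 1, 1  # curve y^2 + xy = x^3 + A*x^2 + B over GF(2^4)

def fmul(a, b):
    # multiplication in GF(2^4) modulo x^4 + x + 1
    p = 0
    for _ in range(4):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return p

def finv(a):
    return next(b for b in range(1, 16) if fmul(a, b) == 1)

def double_affine(x1, y1):
    # Equation 2.24: one inversion, two general multiplications
    lam = x1 ^ fmul(y1, finv(x1))
    x3 = fmul(lam, lam) ^ lam ^ A
    return x3, fmul(x1, x1) ^ fmul(lam, x3) ^ x3

def double_ld(X1, Y1, Z1):
    # Equation 2.31: no inversion (squarings are also done via fmul here)
    X1sq, Y1sq, Z1sq = fmul(X1, X1), fmul(Y1, Y1), fmul(Z1, Z1)
    Z1q = fmul(Z1sq, Z1sq)
    Z3 = fmul(X1sq, Z1sq)
    X3 = fmul(X1sq, X1sq) ^ fmul(B, Z1q)
    Y3 = fmul(fmul(B, Z1q), Z3) ^ fmul(X3, fmul(A, Z3) ^ Y1sq ^ fmul(B, Z1q))
    return X3, Y3, Z3
```

The test below confirms that both routes yield the same affine point, which is the whole point of the coordinate change: inversion is deferred to a single final conversion.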
The equation in LD coordinates for adding the affine point Q = (x2 , y2 ) to P , where
Q 6= ±P , is shown in Equation 2.32. The resulting point is P + Q = (X3 , Y3 , Z3 ).
A = y2 · Z1^2 + Y1
B = x2 · Z1 + X1
C = Z1 · B
D = B^2 · (C + a · Z1^2)
Z3 = C^2
E = A · C        (2.32)
X3 = A^2 + D + E
F = X3 + x2 · Z3
G = (x2 + y2) · Z3^2
Y3 = (E + Z3) · F + G
Point addition in LD coordinates thus requires 9 finite field multiplications and zero
inversions. For an m-bit scalar with approximately half the bits one, the expected running
time is given by Equation 2.33. One inversion and 2 multiplications are required at the
end to convert the result from projective coordinates back into affine.
tld ≈ m(5M + (9/2)M) + 2M + 1I
    = (9.5m + 2)M + 1I        (2.33)
The LD coordinates require several more multiplications, but have the advantage of
requiring just one inversion. To be beneficial, the extra multiplications should cost less
than the inversions removed.
Algorithm 2.4: Double and Add algorithm for scalar multiplication (LSB First)
Input: Basepoint P = (x, y) and Scalar k = (km−1 , km−2 · · · k0 )2 , where km−1 = 1
Output: Point on the curve Q = kP
1 Q = 0, R = P
2 for i = 0 to m − 1 do
3 if ki = 1 then
4 Q=Q+R
5 end
6 R = 2R
7 end
8 return Q
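A sketch of Algorithm 2.4, again with the group operations abstracted out and integer arithmetic as the stand-in group; note that step 4 reads R before step 6 overwrites it, which is precisely the independence a hardware design can exploit:

```python
def scalar_mult_lsb(k, P, add, double, identity):
    # LSB-first double-and-add: R runs through P, 2P, 4P, ... while Q
    # accumulates the terms selected by the bits of k
    Q, R = identity, P
    while k:
        if k & 1:
            Q = add(Q, R)   # uses the current R ...
        R = double(R)       # ... so this doubling can proceed in parallel
        k >>= 1
    return Q
```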
The working of the algorithm is self evident. However, we can observe that, compared
to the MSB first algorithm, the LSB first algorithm (Algorithm 2.4) offers an opportunity for
parallelism, since the doubling and the conditional addition in an iteration are independent
of each other. However, it requires two point variables, R and Q. In the following section, we
present another technique, called the Montgomery ladder, for efficient implementation of
scalar multiplications. The algorithm also has consequences for the side channel analysis
of the hardware implementations derived from these algorithms.
x3 = ((y1 + y2)/(x1 + x2))^2 + (y1 + y2)/(x1 + x2) + x1 + x2 + a,   if P ≠ Q;
   = x1^2 + b/x1^2,                                                 if P = Q.

y3 = ((y1 + y2)/(x1 + x2))(x1 + x3) + x3 + y1,   if P ≠ Q;
   = x1^2 + (x1 + y1/x1)x3 + x3,                 if P = Q.
Neglecting squaring and addition operations, as they are cheap, point addition and
doubling each has one inversion and two multiplication operations. It is interesting to note
that the x-coordinate of the doubling operation is devoid of any y-coordinate; it works
using x-coordinates only. However, the x-coordinate of the addition operation naïvely needs
the y-coordinate. If both the operations, namely addition and doubling, could be performed
with only one coordinate, say the x-coordinate, then the entire scalar multiplication could
be performed without storing one of the coordinates. This can lead to a compact hardware
implementation, as each of these coordinates is quite a large value and is typically stored in
a register.
Before explaining how we can perform the addition without the y-coordinate, we present a
technique for performing the scalar multiplication, which is referred to as the Montgomery
ladder.
x3 = (x1y2 + x2y1 + x1x2^2 + x2x1^2)/(x1 + x2)^2
The result is based on the fact that the characteristic of the underlying field is 2, and that
the points P1 and P2 are on the curve.
The next theorem expresses the x-coordinate of P1 + P2 in terms of only the x-
coordinates of P1, P2 and that of P = P2 − P1.
Theorem 14 Let P = (x, y), P1 = (x1, y1) and P2 = (x2, y2) be elliptic points, with
P = P2 − P1. Then the y-coordinate of P1 can be expressed in terms of P and the x-coordinates
as:
y1 = (x1 + x)[(x1 + x)(x2 + x) + x^2 + y]/x + y
Using these theorems, one can develop Algorithm 2.6 for performing scalar multiplica-
tions. Note that the algorithm uses only the x-coordinates of the points P1 and P2 , and the
coordinate of the point P , which is an invariant.
7    if ki = 1 then
8        x1 = x + t^2 + t,  x2 = x2^2 + b/x2^2
9    end
10   else
11       x2 = x + t^2 + t,  x1 = x1^2 + b/x1^2
12   end
13 end
14 r1 = x1 + x;  r2 = x2 + x
15 y1 = r1(r1r2 + x^2 + y)/x + y
16 return (x1, y1)
we compute the inverse of x, and then evaluate the inverse of x^2 by squaring x^(−1), rather
than paying the cost of another field inversion. This simple trick helps to obtain efficient
architectures!
x3 = X1/Z1
y3 = (x + X1/Z1)[(X1 + xZ1)(X2 + xZ2) + (x^2 + y)(Z1Z2)](xZ1Z2)^(−1) + y
This step reduces the number of inversions by computing (xZ1Z2)^(−1), and then obtaining
Z1^(−1) by multiplying with xZ2. The required number of multiplications is thus ten;
however, only one inversion is required.
In Table 2.3, we present a summary of the two design techniques that we have studied:
namely, affine vs projective coordinates for implementation of the scalar multiplication using
the Montgomery ladder. In Chapter 6, we present a description of a high performance pipelined
design of an ECC processor for random curves on FPGA technology. However, before going
into the design, we shall develop some other ideas on hardware designs on FPGAs.
TABLE 2.3: Computations for performing ECC scalar multiplication (projective vs affine
coordinates)
Computations Affine Coordinates Projective Coordinates
Addition 4k + 6 3k + 7
Squaring 2k + 2 5k + 3
Multiplication 2k + 4 6k + 10
Inversion 2k + 1 1
2.8 Conclusions
The chapter presents an overview of modern cryptography. It starts with the classifica-
tion of ciphers, presenting the concepts of symmetric and asymmetric key cryptosystems.
The chapter details the inner composition of block ciphers, with special attention to the
AES algorithm. The composite field representation of the Advanced Encryption Standard
(AES) algorithm is used in several efficient implementations of the block cipher; hence the
present chapter develops the background theory. The chapter subsequently develops the
underlying mathematics of the increasingly popular asymmetric key algorithm, the Elliptic
Curve Cryptosystem (ECC). The chapter discusses several concepts for efficient imple-
mentation of the ECC scalar multiplication, like the LSB first and MSB first algorithms and the
Montgomery ladder. These techniques are used in the subsequent chapters for developing
efficient hardware designs for both AES and ECC.
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.1.1 FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.1.2 The FPGA Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2 Mapping an Algorithm to Hardware: Components of a Hardware Architecture . . . . . . 72
3.3 Case study: Binary gcd Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.4 Enhancing the Performance of a Hardware Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.5 Modelling of the Computational Elements of the gcd Processor . . . . . . . . . . . . . . . . . . . . . . 78
3.5.1 Modeling of an Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.5.2 Modeling of a Multiplexer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.5.3 Total LUT Estimate of the gcd Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.5.4 Delay Estimate of the gcd Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.1 LUT utilization of the gcd processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.2 Delay Estimates for the gcd Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.1 Introduction
With the growth of electronics, there has been tremendous growth in the applications
requiring security. Mobile communications, automatic teller machines (ATMs), digital signa-
tures, and online banking are some typical applications requiring cryptographic principles.
The real-time processing required in these applications necessitates
optimized and high performance implementations of the ciphers.
There has been a lot of research on the design of cryptographic algorithms, both on soft-
ware platforms as well as in dedicated hardware environments. While conventional software
platforms are limited in the parallelism they can exploit, dedicated hardware provides significant oppor-
tunities for speed-up due to its parallelism. However, dedicated hardware is costly, and often requires
off-shore fabrication facilities. Further, the design cycle for such Application Specific Inte-
grated Circuits (ASICs) is lengthy and complex. On the contrary, Field Programmable
Gate Arrays (FPGAs) are reconfigurable platforms for building hardware. They combine the
advantages of both hardware (in extracting parallelism and achieving better performance)
and software (in terms of programmability). Thus these resources are excellent low cost,
high performance devices for performing design exploration, and even for the final proto-
typing of many applications. However, designing on FPGAs is tricky, as what works
for ASIC libraries does not necessarily work for FPGAs. FPGAs have a different architec-
ture, with fixed units in the form of Look-up Tables (LUTs) to realize the basic operations,
along with larger interconnect delays. Thus, designs need to be carefully analyzed to
ensure that the utilization of the FPGAs is enhanced and the timing constraints are met.
In the next section, we provide an outline of the FPGA architecture.
(Figure: FPGA architecture, showing programmable logic blocks connected through programmable routing switches and connection switches. Each logic block contains a LUT, carry logic, and a D flip-flop, with carry-in/carry-out (CIN/COUT), clock (CLK), clock-enable (CE), and set/reset (PRE/CLR/SR) signals.)
increasing number of inputs, namely 4 and 6. However, for a given device the number of
inputs is fixed.
(b) Hold Time: It is the minimum time that the synchronous data should be
stable after the active clock edge.
(c) False Path: The analyzer considers all combinational paths that are to be
performed in a single clock cycle. However, in a circuit there may be paths
that are never activated. Consider Fig. 3.3, consisting of two sub-circuits
separated by the dashed line. First consider the portion on the right side of
the dashed line and the signal transitions showing how the path (the critical
path) g1 → g2 → g3 → g4 → g5 can get sensitized. Next consider the
other portion of the circuit and note that due to the presence of the gates
g6 and g7, this path becomes a false path, as no input condition can trigger
it. In the example shown, the inverter (g6) and the NAND gate (g7)
ensure that an input of logic one to the inverter results in the NAND gate
producing an output of logic one, thus making the output of the circuit logic
one much earlier. When the input to the inverter is logic zero, the mentioned
path is again false. Thus, to obtain a proper estimate of fclk, the designer or
the CAD tool should properly identify these false paths.
(d) Multi-cycle Path: There are some paths in a design which are intentionally
designed to require more than one clock cycle to become stable. Thus the
set-up and hold-time violation analysis for the overall circuit should be done
by taking care of such paths, else the timing reports will be wrongly generated.
Consider Fig. 3.4, showing an encryption hardware circuit. The
select inputs of the two multiplexers, MUX-A and MUX-B, are outputs of a 3-
stage circular shifter made of 3 DFFs (D flip-flops) as shown in the diagram.
The shifter is initially loaded with the value (1, 0, 0), which in 3 clock cycles
makes the following transitions: (1, 0, 0) → (0, 1, 0) → (0, 0, 1) → (1, 0, 0).
Thus at the start the input multiplexer (MUX-A) selects the plaintext input
and passes the result to the DFF. The DFF subsequently latches the data,
while the encryption hardware performs the transformation on the data. The
output multiplexer (MUX-B) passes the output of the encryption hardware
to the DFF as the ciphertext when the select becomes one in the third clock
cycle, i.e., two clock cycles after the encryption starts. Meanwhile it latches
the previous ciphertext, which gets updated every three clock cycles. Thus
the encryption circuit has two clock cycles to finish its encryption operation.
This is an example of a multi-cycle path, as the combinational delay of
the circuit is allowed to span more than one clock cycle. This
constraint should also be detected and properly kept in mind for a proper
estimation of the clock frequency.
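The select-signal schedule above can be checked with a small simulation. The sketch below assumes (our choice, not stated in the figure) that shifter bit 0 drives the MUX-A select and bit 2 drives the MUX-B select:

```python
# Simulate the 3-stage circular shifter that paces the multi-cycle encryption path.
def shifter_states(n_cycles, state=(1, 0, 0)):
    """Return the shifter state for each of n_cycles clock cycles."""
    states = []
    for _ in range(n_cycles):
        states.append(state)
        # circular shift: (1,0,0) -> (0,1,0) -> (0,0,1) -> (1,0,0)
        state = (state[2], state[0], state[1])
    return states

states = shifter_states(9)
mux_a_sel = [s[0] for s in states]  # assumed: bit 0 selects the plaintext into MUX-A
mux_b_sel = [s[2] for s in states]  # assumed: bit 2 latches the ciphertext via MUX-B

assert mux_a_sel == [1, 0, 0, 1, 0, 0, 1, 0, 0]
assert mux_b_sel == [0, 0, 1, 0, 0, 1, 0, 0, 1]
# The encryption hardware gets the two cycles between load and latch.
assert mux_b_sel.index(1) - mux_a_sel.index(1) == 2
print("multi-cycle schedule verified")
```

The assertions confirm the text: a plaintext is loaded every third cycle, and its ciphertext is latched two cycles later.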
The other important design choice is the type of FPGA device, as different devices have
different cost, performance and power consumption. Generally, the designer starts with a
lower-end FPGA and, depending on the complexity of the design, iteratively moves to
higher-end platforms.
Fig. 3.5 depicts the typical design flow for FPGA design. As may be observed, the
design flow is top-down, starting from the RTL design and proceeding through RTL elaboration,
Architecture Independent Optimizations, Technology Mapping (Architecture Dependent Optimizations),
Placement, Placement Driven Optimizations, Routing and Bit Stream Generation. In the
design world, the verification flow goes hand in hand with this flow. However, it
may be noted that the verification flow proceeds in the opposite direction, answering queries
like: is the RTL elaboration equivalent to the RTL design? In the following description we
describe the individual steps in the flow in more detail.
(Figure 3.3: The false-path example: signal transitions along the apparent critical path g1 → g2 → g3 → g4 → g5, which the inverter g6 and the NAND gate g7 prevent from ever being sensitized.)

(Figure 3.4: The multi-cycle path example: MUX-A feeds the plaintext or the fed-back data into a DFF, the encryption hardware transforms it, and MUX-B latches the ciphertext; the multiplexer selects are driven by a 3-stage circular shifter of DFFs.)
1. RTL Design: This step involves the description of the design in an HDL, like
Verilog. The step involves planning the architecture of the design as sub-modules,
understanding the data path and control path of the design, and developing the RTL
code for the sub-modules. This step also involves the integration of the sub-modules
to realize the complete design. Testing the individual sub-modules and the complete design
via test benches, also written in a high-level language (often Verilog, SystemVerilog,
etc.), is also an integral part of this step.
(Figure 3.5: The FPGA design flow: RTL Design → RTL Elaboration → Architecture Independent Optimization → Technology Mapping → Placement → Placement Driven Optimization → Routing → Bit Stream Generation, with Design Verification running alongside each step.)
4. Technology Mapping: In this step, the various elements of the design are optimally
assigned to the resources of the FPGA. Hence this step is specific to the FPGA device,
and depends on the underlying architecture. Depending on the platform, the data-path
elements get inferred to adders, multipliers, and memory elements embedded in the device.
The control-path elements, and those data-path elements which are not inferred
to special embedded elements, are realized in the FPGA logic blocks. The performance
of the implemented design, both area and delay, depends on the architecture of the
LUTs of the FPGA logic block. We shall discuss later that the number of inputs to the
LUTs can be used to advantage to obtain high-performance implementations.
Thus these optimizations are specific to the underlying architecture and depend on
the type of FPGA being used.
5. Placement: Placement in an FPGA decides the physical location and interconnection
of each logic block in the circuit design, which often becomes the bottleneck of circuit
performance. A bad placement can lengthen the interconnects, which leads to a significant
reduction in performance.
7. Routing: Global and detailed routing are performed to connect the signal nets using
restricted, predesigned routing resources. The routing resources used are
programmable switches, wire segments available for routing, and multiplexers.
8. Bit-stream Generation: This is the final step of the design flow. It takes the routed
design as input, and produces the bit-stream to program the logic and interconnects
to implement the design on the FPGA device.
(Figure: A generic hardware design partitioned into a data path and a control path exchanging control signals, together with memory and input/output.)
• The data-path elements are the computational units of the design. The data path is
central to the performance of a given circuit, and has a dominating effect on
the overall performance. Thus the data-path elements need to be properly optimized
and carefully designed. However, this is not trivial, as there are numerous equivalent
circuit topologies, and different designs have different effects on the delay, area and power
consumption of the device. One also has to decide whether the data-path elements
will be combinational or sequential units, depending on the underlying application
and its constraints. Examples of common data-path elements are registers, adders,
shifters, etc. These data-path elements often form the components of the Arithmetic
Logic Unit (ALU) of a given design.
• The control-path elements, on the other hand, sequence the data flow through the
data-path elements. The input data is processed or transformed by the data-path
elements, which are typically combinational, while the data is switched and cycled
through the data-path elements by the control unit, which is typically a
sequential design. The control signals generated by the sequential controller depend
either on the states alone, or on both the state and partial outputs from the
data path. The former form of controller is known as a Moore machine, while the
latter is a Mealy machine.
Whatever the design type of the controller, a key to a good design is to comprehend the
effective split between the data-path and control-path elements. We illustrate this concept
with the help of a case study in the next section.
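The Moore/Mealy distinction above can be made concrete with a small sketch. The example (a hypothetical rising-edge detector, ours rather than the text's) shows the Mealy output depending on state and input, and the Moore output on state alone:

```python
# Hypothetical rising-edge detector written as a Mealy and as a Moore machine.
def mealy_step(state, x):
    # Output is a function of the current state AND the current input.
    out = 1 if (state == 0 and x == 1) else 0
    return x, out                      # next state simply remembers the input

def moore_step(state, x):
    # Output is a function of the current state only; the extra state 1
    # ("edge just seen") is the price of removing the input from the output logic.
    out = 1 if state == 1 else 0
    nxt = (1 if state == 0 else 2) if x == 1 else 0
    return nxt, out

def run(step, xs, state=0):
    outs = []
    for x in xs:
        state, out = step(state, x)
        outs.append(out)
    return outs

xs = [0, 1, 1, 0, 1]
print(run(mealy_step, xs))   # Mealy flags each rising edge in the same cycle
print(run(moore_step, xs))   # Moore flags it one cycle later
```

Running both on the same input stream shows the classic trade-off: the Mealy machine reacts a cycle earlier, while the Moore machine's outputs are glitch-free functions of the registered state.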
the essential states or simple stages of the algorithm. It may be noted that in each
state, the intended hardware is expected to perform certain computations, which are
realized by the computation or data path elements. The pseudo-code shows that there
are six states of the design, denoted by S0 to S5 .
2. Identification of the data path elements: As evident from the pseudo-code, the
data path elements required for the gcd computation are a subtracter, a complementer,
a right shifter, a left shifter and a counter. The other very common data path elements are
the multiplexers, which are required in large numbers for the switching necessary for
the computations done in the data path. The selection lines of the multiplexers are
configured by the control circuitry, which is essentially a state machine.
3. Identification of the state machine of the control path: The control path is
a sequential design, which comprises the state machine. In this example, there is
a six-state machine, which receives inputs from the computations performed in the
data path elements and accordingly performs the state transitions. It also produces
output signals which configure or switch the data path elements.
4. Design of the data path architecture: The data path of the gcd processor is
depicted in Fig. 3.7. The diagram shows the two distinct parts of the design: the data path
and the control path. The data path stores the values of XR and YR (as mentioned in
the HDL-like code) in two registers. The registers are loadable, which means they are
updated by an input when they are enabled by an appropriate control signal (e.g.,
load_XR for the register XR). The values of the inputs u and v are initially loaded
into the registers XR and YR through the input multiplexer, using the control signal
load_uv. The least significant bits of XR and YR are passed to the controller to indicate whether
the present values of XR and YR are even or not. The next-iteration values of XR and
YR are updated by feeding back the register values (after the necessary computations)
through the input multiplexer, controlled by the signals update_XR and
update_YR. The computations on the registers XR and YR are division by 2, which
is performed easily by the two right-shifters, and subtraction and comparison for
equality, both of which are performed by a subtracter. The values stored in XR and YR
are compared using the subtracter, which indicates to the controller the events (XR !=
YR) and (XR ≥ YR) by raising appropriate flag signals. In the case when XR < YR
and the subtraction YR − XR is to be performed, the result is complemented. The
next-iteration values of XR and YR are loaded either after the subtraction or directly,
which is controlled by the signals load_XR_after_sub and load_YR_after_sub. The
circuit also includes an up-down counter, which is incremented whenever both XR
and YR are even. Finally, when XR = YR the result is obtained by computing
2^count · XR, which is done by using a left-shifter and shifting the value of XR
until the value of count becomes zero.
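The register-transfer behaviour described above is the binary gcd algorithm. A software sketch mirroring the registers XR, YR and the counter (the variable names are ours) is:

```python
def binary_gcd(u, v):
    """Binary gcd, mirroring the XR/YR registers and counter of Fig. 3.7.
    Assumes u, v > 0, as the hardware does."""
    xr, yr, count = u, v, 0
    while xr != yr:                        # flag (XR != YR) from the subtracter
        if xr % 2 == 0 and yr % 2 == 0:    # both even: right-shift, count up
            xr //= 2
            yr //= 2
            count += 1
        elif xr % 2 == 0:                  # XR even: right-shift XR
            xr //= 2
        elif yr % 2 == 0:                  # YR even: right-shift YR
            yr //= 2
        elif xr >= yr:                     # flag (XR >= YR): subtract
            xr = xr - yr
        else:                              # complementer path: YR - XR
            yr = yr - xr
    return xr << count                     # left-shift XR by count

print(binary_gcd(48, 36))  # -> 12
```

Each loop iteration corresponds to one pass through the controller's states; the final left-shift is the count-down phase performed by the left-shifter.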
5. Design of the state machine for the controller: The state machine of the controller
is depicted in Table 3.3. As discussed, there are six states of the controller,
and the controller receives four inputs from the data path computations, namely
(XR != YR), XR[0], YR[0], and (XR ≥ YR). The state transitions are self-explanatory
and can be easily followed by relating the table to the data path diagram
of Fig. 3.7. The state machine is an example of a Mealy machine.
(Figure 3.7: The gcd processor: the input multiplexer MUXA (controlled by load_uv, update_XR and update_YR), the registers XR and YR (controlled by load_XR and load_YR), two right-shifters and a left-shifter, the subtracter and complementer, an up/down counter with its count_zero flag, and the controller receiving XR[0], YR[0], XR != YR and XR ≥ YR.)
certain applications, speed may be of utmost importance, while for others the
area budget of the design is of primary concern.
In general, if we refer to any standard book on computer architecture, we find several
definitions of performance. We revise certain definitions here and consider some variants
of them. To start with, the performance of a hardware design is often stated through
its critical path, as that limits the clock frequency. In a combinational circuit, the critical
path is of primary concern, and a circuit which has a better optimized critical path, i.e., a
smaller critical delay, is faster. For a sequential circuit, on the other hand, it is also important
to know the number of clock cycles necessary to complete a computation. In the
previous example of the gcd processor, the number of clock cycles needed is proportional
to the number of bits in the larger argument. However, the number of clock cycles required
is not a constant, and varies with the inputs. Thus one may consider the average number
of clock cycles needed to perform the computation. Let the fastest clock frequency be denoted
by fmax and the average number of clock cycles by ccavg; then the
total computation time for the gcd processor is tc = ccavg/fmax. Another important
metric is the throughput of the hardware, denoted by τ = Nb/tc = (Nb · fmax)/ccavg, where Nb is the
number of bytes of data being simultaneously processed.
The other important aspect of hardware design is the resources consumed. In the context of
FPGAs, the resources largely comprise slices, which are made of LUTs and flip-flops. As
discussed, the LUTs typically have a fixed number of inputs. In order to improve the performance
of a hardware design, one needs to customize the design for the target architecture
to ensure that the resources used are minimized. The smallest programmable entity on an
FPGA is the lookup table (Section 3.1.1). As an example, Virtex-4 FPGAs have LUTs with
four inputs, which can be configured for any logic function having a maximum of four inputs.
The LUT can also be used to implement logic functions having fewer than four inputs, two
for example. In this case, only half the LUT is utilized; the remaining part is wasted.
Such a LUT having fewer than four inputs is an under-utilized LUT. For example, the logic
function y = x1 + x2 under-utilizes the LUT, as it has only two inputs. The most compact
implementations are obtained when the utilization of each LUT is maximized. From the above
fact it may be derived that the minimum number of LUTs required for a q-bit combinational
circuit is given by Equation 3.1.
#LUT(q) = 0          if q = 1
        = 1          if 1 < q ≤ 4
        = ⌈q/3⌉      if q > 4 and q mod 3 = 2        (3.1)
        = ⌊q/3⌋      if q > 4 and q mod 3 ≠ 2
The delay of the q bit combinational circuit in terms of LUTs is given by Equation 3.2,
where DLU T is the delay of one LUT.
%UnderUtilizedLUTs = (LUT2 + LUT3) / (LUT2 + LUT3 + LUT4) × 100        (3.3)
It may be stressed that the above formulation provides the minimum number of LUTs required,
and not an exact count. As an example, consider y = x5x6x1 + x5x6x2 + x1x2x3 +
x2x3x4 + x1x3x5. Observe that the number of LUTs actually required is 3, whereas the formula says that
the minimum is 2 LUTs. Our analysis and experiments show that the above formulation,
although it provides only a lower bound, matches the actual results quite closely. Most importantly,
the formulation helps us to perform design exploration much faster, which is the
prime objective of such a formulation.
The number of LUTs required to implement a Boolean function is a measure of the
area of the function. The above formulation can also be generalized for any k-input LUT. A
k-input LUT (k-LUT) can be considered a black box that can realize any function of
a maximum of k variables. If there is a single variable, then no LUT is required. If there
are more than k variables, then more than one k-LUT is required to implement the
functionality. The lower bound on the total number of k-input LUTs for a function with x
variables can thus be similarly expressed as,
lut(x) = 0                          if x ≤ 1
       = 1                          if 1 < x ≤ k
       = ⌊(x − k)/(k − 1)⌋ + 2      if x > k and (k − 1) ∤ (x − k)        (3.4)
       = (x − k)/(k − 1) + 1        if x > k and (k − 1) | (x − k)
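Equation 3.4 translates directly into code. The function below is a sketch of the lower bound; with k = 4 it reproduces the cases of Equation 3.1:

```python
import math

def lut_lower_bound(x, k=4):
    """Minimum number of k-input LUTs for an x-variable Boolean function (Eq. 3.4)."""
    if x <= 1:
        return 0
    if x <= k:
        return 1
    q, r = divmod(x - k, k - 1)
    return q + 1 if r == 0 else q + 2

# With k = 4 this agrees with Equation 3.1 for q > 4:
for q in range(5, 30):
    expected = math.ceil(q / 3) if q % 3 == 2 else q // 3
    assert lut_lower_bound(q, 4) == expected

# The 6-variable example above: lower bound 2, while the actual circuit used 3 LUTs.
print(lut_lower_bound(6))
```

As the text stresses, this is only a lower bound: function structure (shared subexpressions, non-decomposable terms) can force more LUTs than the count of variables alone suggests.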
Delay in FPGAs comprises LUT delays and routing delays. Analyzing the delay of
a circuit on an FPGA platform is much more complex than the area analysis. By experimentation
we have found that for designs having combinational components, the delay of the
design varies linearly with the number of LUTs present in the critical path. Fig. 3.8 shows
this linear relationship between the number of LUTs in the critical path and the delay in
multipliers of different sizes. Due to this linear relationship, we can consider the number
of LUTs in the critical path to be a measure of the actual delay. From now onwards, we use
the term LUT delay to mean the number of k-LUTs present in the critical path.

(FIGURE 3.8: LUTs in Critical Path vs. Delay for a Combinational Multiplier)
For an x-variable Boolean function, the number of k-LUTs in the critical path is denoted
by the function maxlutpath(x).
We will use Equations 3.4 and 3.5 for estimating the area and delay of the architecture
proposed in Fig. 3.7. In the following section, we gradually present the estimation of the
hardware blocks required in the data path.
through m cascaded MUXCYs. Dedicated carry chains are much faster than the
generic LUT-based fabric in an FPGA, hence the carry propagation delay is small. Since these
MUXCYs are used only for fast carry propagation, while the other blocks present are constructed
of LUTs, we need to scale the delay of MUXCY circuits when comparing the delay of an adder
with any other primitive. Let us consider that the delay of a MUXCY is s times smaller than
that of a LUT. This scaling factor s depends on the device technology. For Xilinx Virtex-4
FPGAs, s ≈ 17. So, for an m-bit adder, we can say that the LUT delay of the carry chain
is ⌈m/s⌉. Since the delay of the adder is determined by the delay of the carry chain, we can
consider that the delay of the adder is likewise ⌈m/s⌉.
The architecture of the gcd circuit also requires a complementer, which is implemented in
the usual 2's-complement sense. This also requires a subtracter, and hence has similar area
and delay requirements as above.
The delay of the MUX in terms of LUTs is equal to the maxlutpath of 2^t + t variables and is
given by,
If 2^(t−1) < number of inputs < 2^t, then the estimations in Equations (3.8) and (3.9) for 2^t
inputs give an upper bound. In practice, the values in this case are slightly less than the
values for 2^t inputs in Equations (3.8) and (3.9), and the difference can be neglected.
DPATH = 2Dsub + DMUXD + DMUXB + DMUXA
      ≈ 2⌈m/s⌉ + 1 + 1 + 1
      = 3 + 2⌈m/s⌉
Note that the last term of the equation, namely the delay of MUXA, comes from the
fact that the multiplexer is made of two smaller 2-input multiplexers in parallel: one
writing into the register XR and the other into the register YR.
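Putting the pieces together, the data-path delay estimate above can be computed as follows (a sketch; the scaling factor s ≈ 17 is the Virtex-4 value quoted above):

```python
import math

def adder_lut_delay(m, s=17):
    """LUT-equivalent delay of an m-bit MUXCY carry chain."""
    return math.ceil(m / s)

def dpath_lut_delay(m, s=17):
    """DPATH = 2*Dsub + DMUXD + DMUXB + DMUXA ~ 3 + 2*ceil(m/s)."""
    return 2 * adder_lut_delay(m, s) + 1 + 1 + 1

print(dpath_lut_delay(16))  # -> 5
print(dpath_lut_delay(32))  # -> 7
```

The jump from 5 to 7 LUT delays between 16- and 32-bit operands shows how the carry chain, not the multiplexers, dominates as the word size grows.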
FIGURE 3.10: LUT utilization of the gcd processor (both theoretical and actual): (a) hierarchy on; (b) hierarchy flattened.
3.7 Conclusions
Both the estimates for the LUTs and for the critical path show that the designer can
estimate the performance ahead of the design. Thus, the theoretical model may be used
as a guideline for design exploration. Again, it may be noted that the estimates are not
exact and provide only approximations to the exact requirements, but they can nevertheless be used
for design exploration in an analytic way. We show subsequently, in the design of finite field
circuits and an Elliptic Curve Crypto-processor, how to leverage these models and this framework
for designing efficient architectures.
FIGURE 3.11: Delay Modeling of the gcd processor (both theoretical and actual): (a) hierarchy on; (b) hierarchy flattened.
Hardware Design of
Cryptographic Algorithms
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Algorithmic and Architectural Optimizations for AES Design . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3 Circuit for the AES S-Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3.1 Subfield Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3.2 Circuit Optimizations in the Polynomial Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3.3 Circuit Optimizations in the Normal Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3.4 Most Compact Realization: Polynomial vs. Normal basis . . . . . . . . . . . . . . . . . . . . . 94
4.3.5 Common Subexpressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.3.6 Merging of Basis Change Matrix and Affine Mapping . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3.7 Merged S-Box and Inverse S-Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4 Implementation of the MixColumns Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.4.1 The AES S-Box and MixColumns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4.2 Implementing the Inverse MixColumns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.5 An Example Reconfigurable Design for the Rijndael Cryptosystem . . . . . . . . . . . . . . . . . . 101
4.5.1 Overview of the Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5.1.1 Encryption Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.5.1.2 Pipelining and Serializing in FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.5.1.3 Back to the Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.5.2 Key-Schedule Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5.3 The Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.7 Single Chip Encryptor/Decryptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.1 Introduction
The Advanced Encryption Standard (AES) is the de facto worldwide standard for symmetric-key
encryption. In 2001, the National Institute of Standards and Technology (NIST) standardized the Rijndael
block cipher as the AES. This led to several applications requiring encryption
adopting this algorithm into their products. Due to the high-performance requirements of
these applications, namely in the form of high speed, low latency, small area footprint, and
even low power, several approaches to software and hardware designs for AES have been
studied ever since its inception. Requirements of throughput, power and compactness have
made the design of Rijndael quite challenging. Various approaches have been developed, and
these approaches have often been found to be conflicting, implying that a proper understanding of the design
rationales is needed.
In this chapter, we aim at understanding the key principles of developing a
hardware design of AES, and the scope for optimization that the algorithm provides. We shall use
FPGAs as a validating platform, though most of the discussion is valid for ASICs as well.
is to be pipelined and unrolled for high-throughput applications, like the counter mode
of ciphers, a separate copy is required for each of the 10 rounds of AES-128. For the
other versions of AES, the cost is even more. To reduce this large resource requirement of a
combinational logic realization of the SubBytes, composite field representations have
been found to lead to compact and efficient designs.
(δ1·Y + δ0) mod (Y^2 + τ·Y + µ). Thus, the inverse of the element is expressed by the following
equations:
The above equations can be used to realize a hardware circuit. As τ appears in two
equations, we set τ = 1 for a compact realization. Note that, as stated earlier, we cannot make
µ = 1. Thus with the above choices, Equation 4.1 can be rewritten as:
(Figure 4.1: The GF(2^8) inverter decomposed into GF(2^4) operations: inputs γ1 and γ0, a square-and-scale block computing µγ^2, a GF(2^4) inverter, and multipliers producing δ1 and δ0.)
The corresponding circuit is depicted in Fig. 4.1. The circuit shows that the 8-bit
computation is performed in terms of 4-bit computations: namely additions, multiplications,
squarings and inverses. The addition is a simple XOR, while the multiplication is
performed in GF(2^4). When one of the inputs to the multiplier is a constant, or when a
squaring is performed, a specialized circuit is used rather than a general multiplier.
In [71], multiplication by a constant is given the special name scaling.
The polynomial GF(2^4) multiplier can also be obtained by similarly expressing the
elements as polynomials with coefficients in GF(2^2). Consider the product of two elements
of GF(2^4), mapped to GF((2^2)^2) and denoted by γ = Γ1·Z + Γ0 and δ = ∆1·Z + ∆0.
Using the irreducible polynomial s(Z) = Z^2 + Z + N, we have:
For scaling, proper choices of the constant µ can lead to some scope for optimization.
For instance, for µ = ∆1·Z + ∆0, we set ∆0 = 0. Note that ∆1 ≠ 0, as otherwise µ cannot make
the polynomial r(Y) irreducible over GF(2^4). Further, choosing N = ∆1^(−1) we can further
simplify the scaling operation.
As discussed, the choice of N makes the polynomial s(Z) = Z^2 + Z + N irreducible over
GF(2^2). Thus N is a root of t(W) = W^2 + W + 1, and the roots cannot be 0 or 1. They
are denoted by N and N + 1 (note that the sum of the roots is 1). Depending on the root
chosen for the polynomial basis (W, 1), either N = W, or N^2 = N + 1 = W. Also note
that N^(−1) = N^2 = N + 1.
This leads to a further improvement by combining the squaring operation with the scaling
operation. For completeness we state the following equation, leaving the details to the
reader:

µγ^2 = µ(Γ1·Z + Γ0)^2
     = µ(Γ1^2·Z + (Γ0^2 + N·Γ1^2))

Using the choice µ = N^2·Z to suitably optimize the scaling operation, we
have:
Similarly, an element of GF(2^4) can also be converted to an element of GF((2^2)^2) and
the inverse can be computed in the subfield GF(2^2). The irreducible polynomial of GF((2^2)^2)
is denoted s(Z) = Z^2 + T·Z + N, where all coefficients are in GF(2^2).
Let Γ = Γ1·Z + Γ0 be an element of GF((2^2)^2), and let its inverse be ∆ =
(Γ1·Z + Γ0)^(−1) = (∆1·Z + ∆0) mod (Z^2 + T·Z + N).
We likewise have the following equations:
(Figure 4.2: The GF(2^4) inverter decomposed into GF(2^2) operations: inputs Γ1 and Γ0, a square-and-scale block computing N·Γ^2, a GF(2^2) inverter, and multipliers producing ∆1 and ∆0.)
The corresponding circuit is shown in Fig. 4.2. The underlying operations are analogous
to those of Fig. 4.1: scaling and squaring in GF(2^2), and multiplication and inversion in the subfield.
For multiplication in GF(2^2), the product Γ∆ can similarly be obtained by reducing
with the polynomial t(W) = W^2 + W + 1. Thus we have, for Γ = g1·W + g0 and ∆ = d1·W + d0,
Here the multiplications and additions are in GF(2) and are thus equivalent to AND
and XOR gates, respectively.
The scaling operation, computing N·Γ, can use the fact that N = W or
N^2 = W. In the combined squaring and scaling operation, we need scaling in GF(2^2) by
both N and N^2. Assuming N = W, we thus need a W-scaler and a W^2-scaler circuit.
Thus we have:

W·(g1·W + g0) = (g1 + g0)·W + g1
W^2·(g1·W + g0) = g0·W + (g1 + g0)
As before, we can also combine the squaring and scaling operations for efficiency.
Thus, assuming N = W, squaring followed by scaling with N can be computed as:

W·Γ^2 = W·(g1·W + g0)^2
      = W·(g1·W + (g1 + g0))
      = (g1 + (g1 + g0))·W + g1
      = g0·W + g1

Thus we see that the combined squaring and scaling is free! One can obtain it by simply
swapping the bits of the input.
Using this fact, one can further optimize the square-and-scale architecture in GF(2^4).
When N = W we can further simplify µγ^2 as follows:
Observe that the entire operation can now be performed with one addition and two
scaling operations, as the square-and-scale operations in GF(2^2), denoted in { }, are free. Note
that we have used the fact that for any non-zero element Γ = (g1·W + g0) ∈ GF(2^2), Γ^3 = 1.
Thus Γ^(−1) = Γ^2. Hence, an inverter in GF(2^2) is the same as a squarer: Γ^2 = g1·W + (g1 + g0).
We can also verify the inversion in GF(2^2) in a similar fashion as for the fields GF(2^8)
and GF(2^4). Let the inverse of an element G = g1·W + g0 of GF(2^2) be D = d1·W + d0,
where the coefficients belong to GF(2). The irreducible polynomial is t(W) = W^2 + W + 1.
Thus, we have:
Since, for an element g ∈ GF(2), we have g^2 = g^(−1) = g, one can further simplify the
above equations to:
It should be observed that in all the above equations, an all-zero input is mapped to an
all-zero output. Thus the special case of inverting zero is handled by the equations implicitly.
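The GF(2^2) identities above (the scaling formulas, the free square-and-scale, and Γ^(−1) = Γ^2) can be checked exhaustively with a short script; elements are encoded as 2-bit integers holding (g1, g0):

```python
def gf4_mul(a, b):
    """Multiply g1*W + g0 elements of GF(2^2) modulo t(W) = W^2 + W + 1."""
    a1, a0, b1, b0 = a >> 1, a & 1, b >> 1, b & 1
    h = a1 & b1                          # W^2 coefficient, folded via W^2 = W + 1
    return ((h ^ (a1 & b0) ^ (a0 & b1)) << 1) | (h ^ (a0 & b0))

W = 0b10
for g in range(4):
    g1, g0 = g >> 1, g & 1
    # scaling: W*(g1 W + g0) = (g1+g0) W + g1,  W^2*(g1 W + g0) = g0 W + (g1+g0)
    assert gf4_mul(W, g) == (((g1 ^ g0) << 1) | g1)
    assert gf4_mul(gf4_mul(W, W), g) == ((g0 << 1) | (g1 ^ g0))
    # combined square-and-scale is just a bit swap: W * Gamma^2 = g0 W + g1
    assert gf4_mul(W, gf4_mul(g, g)) == ((g0 << 1) | g1)
    # squaring: Gamma^2 = g1 W + (g1+g0); it is the inverse for non-zero Gamma
    sq = gf4_mul(g, g)
    assert sq == ((g1 << 1) | (g1 ^ g0))
    if g:
        assert gf4_mul(g, sq) == 1       # Gamma^3 = 1
print("all GF(2^2) identities hold")
```

The loop also exercises the zero input, confirming that every operation maps zero to zero, as the text notes.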
All the above equations can be used together to derive a compact inversion circuit for
an element of GF(2^8). The steps are summarized as follows:
3. We apply Equations 4.1, 4.3, and 4.4 to obtain the complete inverse.
The field isomorphism can be obtained using Algorithm 1.3, presented in Section 1.12.1.
Another way of obtaining the mappings is explained below.
Let an element g ∈ GF(2^8), in the standard representation of an element of the
state matrix of AES, be denoted by the byte (g7 g6 g5 g4 g3 g2 g1 g0). The polynomial representing
the byte is g7·X^7 + g6·X^6 + g5·X^5 + g4·X^4 + g3·X^3 + g2·X^2 + g1·X + g0. We map the
element to a new element (b7 b6 b5 b4 b3 b2 b1 b0) in a new basis. In the polynomial basis, for
g ∈ GF(2^8)/GF(2^4) we have g = γ1·Y + γ0, where each element γ ∈ GF(2^4)/GF(2^2) is
γ = Γ1·Z + Γ0. Further, each element Γ ∈ GF(2^2) can be viewed as (b1·W + b0), and can be
represented as a pair of bits (b1 b0). Thus the relation between the two byte representations
of g is as follows:

g7·X^7 + g6·X^6 + g5·X^5 + g4·X^4 + g3·X^3 + g2·X^2 + g1·X + g0
  = [(b7·W + b6)·Z + (b5·W + b4)]·Y + [(b3·W + b2)·Z + (b1·W + b0)]
  = b7·(WZY) + b6·(ZY) + b5·(WY) + b4·(Y) + b3·(WZ) + b2·(Z) + b1·(W) + b0
Thus the map is decided by the choice of basis, denoted by (Y, Z, W). As discussed,
these values are fixed by the choices of the parameters µ and N, which fix the polynomials
r(Y) and s(Z). As an example, consider µ = 0xEC and N = 0xBC; then the basis choices are
Y = 0xFF, Z = 0x5C, W = 0xBD. Using this we have WZY = 0x60, ZY = 0xBE, WY = 0x49,
Y = 0xFF, WZ = 0xEC, Z = 0x5C, W = 0xBD. The mapping can be represented in the form
of a matrix:
[g7]   [0 1 0 1 1 0 1 0] [b7]
[g6]   [1 0 1 1 1 1 0 0] [b6]
[g5]   [1 1 0 1 1 0 1 0] [b5]
[g4] = [0 1 0 1 0 1 1 0] [b4]
[g3]   [0 1 1 1 1 1 1 0] [b3]
[g2]   [0 1 0 1 1 1 1 0] [b2]
[g1]   [0 1 0 1 0 0 0 0] [b1]
[g0]   [0 0 1 1 0 0 1 1] [b0]
The above matrix, denoted by X, thus converts an element from GF(((2^2)^2)^2) to GF(2^8).
The inverse mapping X^(-1) is obtained by computing the inverse of the above matrix
modulo 2.
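The columns of the matrix X are simply the GF(2^8) representations of the basis products WZY, ZY, WY, Y, WZ, Z, W, 1 listed above, which can be checked with a short sketch (the helper `apply_matrix` is a name introduced here for illustration):

```python
def apply_matrix(cols, b):
    """Multiply the 0-1 matrix whose columns are `cols` by the bit-vector b (a byte)."""
    g = 0
    for i, col in enumerate(cols):      # i = 0 selects bit b7, i = 7 selects b0
        if (b >> (7 - i)) & 1:
            g ^= col                    # XOR in the column (addition over GF(2))
    return g

# Basis products from the text: WZY, ZY, WY, Y, WZ, Z, W, 1
COLS = [0x60, 0xBE, 0x49, 0xFF, 0xEC, 0x5C, 0xBD, 0x01]

# A composite-field element with a single bit set maps to the matching column.
assert apply_matrix(COLS, 0x80) == 0x60   # b7 alone -> WZY
assert apply_matrix(COLS, 0x01) == 0x01   # b0 alone -> 1
```

Reading the i-th bit of each constant then reproduces row i of the matrix above.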
Thus, to compute the S-Box output for a given byte, we apply the transformation X^(-1),
use the circuitry discussed above to compute the inverse in the composite field, and then
apply the transformation X to map the inverse back to GF(2^8). The affine mapping of
the AES S-Box is then applied to obtain the result. One can further experiment by
combining the fixed linear transformation A of the affine map of the AES SubBytes with
the transformation X; that is, we apply AX to the inverse computed in the composite field.
We leave the details of obtaining the corresponding matrices as an exercise to the reader.
As mentioned, the inversion circuit can also be developed in normal basis, which we
discuss next.
Analogously, an element in GF (24 )/GF (22 ) is expressed using the normal basis (Z 4 , Z).
d0 = [g1 g0 + g1 + g0 ]g1 = g1
d1 = [g1 g0 + g1 + g0 ]g0 = g0
The above equations admit several optimizations and, as in the polynomial basis, can
be simplified with prudent choices of the coefficients in the irreducible polynomials r(Y)
and s(Z). Setting T = τ = 1 turns out to give efficient circuits for the normal basis as well.
Likewise, the values of µ and N are selected as discussed before.
[Figure 4.3: GF(2^8) inverter in normal basis, with inputs γ1, γ0, the scaled square µγ^2, the GF(2^4) inverse γ^(-1), and outputs δ1, δ0.]
With the choice of τ = 1, the circuit for the GF(2^8) inverter in normal basis is depicted in
Fig. 4.3. The important blocks in the architecture are the scaling-and-squaring block, the
GF(2^4) inverse, and the GF(2^4) multipliers, all implemented in the normal basis.
Consider the multiplication of two elements γ = Γ1 Z^4 + Γ0 Z and δ = ∆1 Z^4 + ∆0 Z
in GF(2^4). Since the basis elements Z^4 and Z are both roots of s(Z), we have the following
identities:

Z^2 = Z + N
Z^4 + Z = 1
Z(Z^4) = N

γδ = Γ1 ∆1 (Z^8) + Z^4 Z (Γ1 ∆0 + Γ0 ∆1) + Γ0 ∆0 (Z^2)
   = Γ1 ∆1 (Z^2 + 1) + N (Γ1 ∆0 + Γ0 ∆1) + Γ0 ∆0 Z^2
   = (Z + N)(Γ1 ∆1 + Γ0 ∆0) + (Γ1 ∆1 + N (Γ1 ∆0 + Γ0 ∆1))
   = Z(Γ1 ∆1 + Γ0 ∆0) + (N (Γ1 ∆1 + Γ0 ∆0) + Γ1 ∆1 + N (Γ1 ∆0 + Γ0 ∆1))(Z^4 + Z)
   = Z^4 (Γ1 ∆1 + N (Γ1 + Γ0)(∆1 + ∆0)) + Z (Γ0 ∆0 + N (Γ1 + Γ0)(∆1 + ∆0))
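The final product formula can be checked exhaustively in a few lines, representing GF(2^2) elements as 2-bit values over (W^2, W) and GF(2^4) elements as pairs of GF(2^2) values over (Z^4, Z); the helper names are introduced here, and N is taken as W^2 as in the basis chosen later in the text:

```python
def gf4_mul(a, b):
    """GF(2^2) normal-basis product, using W^2 = W + 1 and W * W^2 = 1."""
    a1, a0 = a >> 1, a & 1
    b1, b0 = b >> 1, b & 1
    f = (a1 ^ a0) & (b1 ^ b0)               # shared term (a1+a0)(b1+b0)
    return (((a1 & b1) ^ f) << 1) | ((a0 & b0) ^ f)

N = 0b10                                    # N = W^2 in the (W^2, W) basis

def gf16_mul(g, d):
    """GF(2^4) normal-basis product per the derivation above."""
    G1, G0 = g >> 2, g & 3
    D1, D0 = d >> 2, d & 3
    t = gf4_mul(N, gf4_mul(G1 ^ G0, D1 ^ D0))
    return ((gf4_mul(G1, D1) ^ t) << 2) | (gf4_mul(G0, D0) ^ t)

# In normal basis, 1 = Z^4 + Z with each coefficient 1 = W^2 + W, i.e. 0b1111.
ONE = 0b1111
assert all(gf16_mul(x, ONE) == x for x in range(16))

def power(x, n):
    r = ONE
    for _ in range(n):
        r = gf16_mul(r, x)
    return r

# Every nonzero element satisfies x^15 = 1, as it must in GF(16).
assert all(power(x, 15) == ONE for x in range(1, 16))
```

That the multiplication has a well-defined identity and that all fifteen nonzero elements have order dividing 15 is strong evidence the derived coefficients are correct.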
The product thus involves several multiplications in the subfield GF(2^2). The subfield
product Γ∆ = (g1 W^2 + g0 W)(d1 W^2 + d0 W) can be obtained analogously: since
W(W^2) = 1 and W^2 + W = 1, the same derivation with N replaced by 1 gives
Γ∆ = W^2 [g1 d1 + (g1 + g0)(d1 + d0)] + W [g0 d0 + (g1 + g0)(d1 + d0)].
The other important component in Fig. 4.3 is the scaled squaring of γ = Γ1 Z^4 + Γ0 Z,
i.e., the computation of µγ^2 with µ = ∆1 Z^4 + ∆0 Z.
Thus the squaring-and-scaling operation can be done by two scalings and one addition.
Note that the squaring operations in the above equations are in GF(2^2) and are free. To
see this, assume the basis (W^2, W) and use the fact that W^4 = W. The squaring operation
in GF(2^2) in normal basis can then be expressed as
(g1 W^2 + g0 W)^2 = g0 W^2 + g1 W, which is just a swap of the input bits.
Likewise, for a choice of T = 1, the inversion circuit in the subfield GF (24 ) is shown
in Fig. 4.4. This consists of operations like multiplication and squaring and scaling in the
subfield GF (22 ).
[Figure 4.4: GF(2^4) inverter in normal basis, with inputs Γ1, Γ0, the scaled square NΓ^2, the GF(2^2) inverse Γ^(-1), and outputs ∆1, ∆0.]
The circuit for squaring and scaling in GF(2^2) can be simplified for the possible choices
of N, which can be W or W^2. In either case, we need both a W-scaler and a W^2-scaler in
the squaring-and-scaling circuit in GF(2^2). Hence we have:
W (g1 W 2 + g0 W ) = (g1 + g0 )W 2 + g1 W
W 2 (g1 W 2 + g0 W ) = g0 W 2 + (g1 + g0 )W
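Both scaler equations, together with the swap-squaring rule above, can be verified over all four GF(2^2) elements with the normal-basis multiplier (the helper `gf4_mul` is a sketch introduced here; elements are 2-bit values (b1 b0) over (W^2, W)):

```python
def gf4_mul(a, b):
    """GF(2^2) normal-basis product, using W^2 = W + 1 and W * W^2 = 1."""
    a1, a0, b1, b0 = a >> 1, a & 1, b >> 1, b & 1
    f = (a1 ^ a0) & (b1 ^ b0)
    return (((a1 & b1) ^ f) << 1) | ((a0 & b0) ^ f)

W, W2 = 0b01, 0b10
for g in range(4):
    g1, g0 = g >> 1, g & 1
    # squaring is free: just a swap of the two bits
    assert gf4_mul(g, g) == (g0 << 1) | g1
    # W-scaler:   W (g1 W^2 + g0 W)   = (g1+g0) W^2 + g1 W
    assert gf4_mul(W, g) == ((g1 ^ g0) << 1) | g1
    # W^2-scaler: W^2 (g1 W^2 + g0 W) = g0 W^2 + (g1+g0) W
    assert gf4_mul(W2, g) == (g0 << 1) | (g1 ^ g0)
```

Each scaler is thus one 1-bit XOR in hardware, matching the gate counts discussed below.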
The change from the polynomial representation of AES to a normal basis representation
is obtained in the same way as the conversion to the polynomial basis above.
Consider an element g ∈ GF (28 ), which is an element of the state matrix of AES, and
represented in the standard way. Let it be denoted by the byte: (g7 g6 g5 g4 g3 g2 g1 g0 ). The
polynomial representing the byte is g7 X 7 +g6 X 6 +g5 X 5 +g4 X 4 +g3 X 3 +g2 X 2 +g1 X +g0 . We
map the element to a new element (b7 b6 b5 b4 b3 b2 b1 b0) in a new basis. In normal basis,
for g ∈ GF(2^8)/GF(2^4) we have g = γ1 Y^16 + γ0 Y, where each element γ ∈ GF(2^4)/GF(2^2)
is γ = Γ1 Z^4 + Γ0 Z. Further, each element Γ ∈ GF(2^2) can be viewed as (b1 W^2 + b0 W),
and can be represented as a pair of bits, (b1 b0 ). Thus the relation between the two byte
representations of g is as follows:
g7 X^7 + g6 X^6 + g5 X^5 + g4 X^4 + g3 X^3 + g2 X^2 + g1 X + g0
= [(b7 W^2 + b6 W)Z^4 + (b5 W^2 + b4 W)Z]Y^16 + [(b3 W^2 + b2 W)Z^4 + (b1 W^2 + b0 W)Z]Y
= b7 (W^2 Z^4 Y^16) + b6 (W Z^4 Y^16) + b5 (W^2 Z Y^16) + b4 (W Z Y^16) + b3 (W^2 Z^4 Y)
  + b2 (W Z^4 Y) + b1 (W^2 Z Y) + b0 (W Z Y)
Thus the map is decided by a choice of the basis (Y, Z, W). As for the polynomial basis,
the basis values are fixed as Y = 0xFF, Z = 0x5C, W = 0xBD (note that here N = W^2).
Using these we have W^2 Z^4 Y^16 = 0x64, W Z^4 Y^16 = 0x78, W^2 Z Y^16 = 0x6E,
W Z Y^16 = 0x8C, W^2 Z^4 Y = 0x68, W Z^4 Y = 0x29, W^2 Z Y = 0xDE, W Z Y = 0x60.
The mapping can be represented in the form of a matrix:
[g7]   [0 0 0 1 0 0 1 0] [b7]
[g6]   [1 1 1 0 1 0 1 1] [b6]
[g5]   [1 1 1 0 1 1 0 1] [b5]
[g4] = [0 1 0 0 0 0 1 0] [b4]
[g3]   [0 1 1 1 1 1 1 0] [b3]
[g2]   [1 0 1 1 0 0 1 0] [b2]
[g1]   [0 0 1 0 0 0 1 0] [b1]
[g0]   [0 0 0 0 0 1 0 0] [b0]
The above transformation, denoted by the basis change matrix T, can be inverted to
obtain the matrix T^(-1), which converts an element from GF(2^8) to the normal-basis
representation of GF(((2^2)^2)^2) at each level of the decomposition. The inverse is then
computed by the above circuits, and the result is finally multiplied by T to obtain the
result in the standard polynomial basis of AES. The affine transformation matrix A can
be combined with T to allow further optimizations.
varies depending on the basis of the underlying field GF(2^2) and the value of W. To explain,
consider the case µ = N^2 Z + 1, assuming a polynomial basis for the field GF(2^4). Since
N ∈ GF(2^2), we have N^3 = 1. Further, Z is a root of Z^2 + Z + N = 0, where
N^2 + N + 1 = 0. Using these facts, µ^4 = N^2 Z^2 + N^2, µ^3 = N^2 Z^2 + Z + 1, and µ^2 = N Z^2 + 1. Thus,
µ^4 + µ^3 + µ^2 + µ + 1 = N^2 + Z(1 + N^2) + N Z^2 = N^2 + N(Z + Z^2) = 0, showing that µ is a
root of the irreducible polynomial x^4 + x^3 + x^2 + x + 1. The square-and-scale operation is then
equal to [(N^4 + 1)A^2 + N^2 B^2]Z + [(N^2 + 1)N A^2 + B^2] = N^2 (A + B)^2 Z + [N^2 (A + B)^2 + N B^2].
The underlying operations are in GF(2^2).
When the underlying field operations are performed in normal basis, squaring is free, and
squaring-and-scaling with N = W or N = W^2 each requires one XOR. To sum up,
the computation of A + B requires 2 XORs, the two scalings with N and N^2 require
one XOR each, and the sum N^2 (A + B)^2 + N B^2 requires 2 XORs, totalling 6 XORs.
When the underlying field uses a polynomial basis and W = N, the underlying operations
are squarings-and-scalings and additions. The squaring-and-scaling is with N = W and
N^2 = W^2. The squaring-and-scaling with N = W is free in the polynomial basis, while that
with N^2 = W^2 requires 1 XOR. The sums A + B and N^2 (A + B)^2 + N B^2 require 2 XORs
each, for 5 XORs in total. Similarly, when the underlying polynomial basis is
such that W = N^2, the squaring-and-scaling with N is free, while that with N^2
requires 1 XOR. This does not change the total number of XOR operations, which stays
at 5.
An important point to address at this juncture is the number of XOR gates required in
the GF(2^4) inverter for the above basis choices. Recall that for the polynomial-basis
GF(2^4) inverter, the main underlying operations are a squaring-and-scaling with N
and an inverse in GF(2^2). When the underlying field GF(2^2) is implemented in normal
basis, the inverse is free (being equivalent to the square in GF(2^2)) and the scaling-and-squaring
always requires 1 XOR, irrespective of whether the scaling is by N
or N^2. However, when the underlying field is in polynomial basis, the inversion requires
1 XOR, but the scaling-and-squaring is free when N = W, while it requires 1 XOR when
N = W^2.
TABLE 4.1: XOR count of Square and Scaling circuit for GF (24 ) Polynomial Basis
From the above tables, we observe that the choice of basis plays a crucial role in
determining the overall gate count of the AES S-Box, which is based on the finite
field inverse in GF(2^8). However, it may also be noted that the difference between the largest
and smallest inversion is only 4 XORs (considering the above optimizations in the square-and-scaling
and the inversion in the field GF(2^2)). Hence, the remaining optimizations,
namely implementing the changes of representation (basis) at the input and output of the inverter
and the affine transformation, need to be done carefully.
TABLE 4.2: XOR count of Square and Scaling circuit for GF (24 ) Normal Basis
The authors in [71] show further optimizations, some of which are generic for any
hardware platform, while others are more specific to ASIC libraries. We highlight two
generic techniques in the following.
[Figure: pairs of GF(2^2) multipliers (M1, M1'), (M2, M2'), (M3, M3') inside two GF(2^4) multipliers; each pair shares an input, and the common input sum is computed once in shared hardware.]
Further, it can be observed that the multipliers consist of three smaller multipliers in GF(2^2)
which share inputs in GF(2^2). For example, the pairs of multipliers which share an input are
labeled (M1, M1'), (M2, M2'), (M3, M3'). Using the same logic, each pair can
share a 2-bit XOR, for a total saving of 5 XORs. The same saving is obtained for
a GF(2^4) multiplier in normal basis sharing a common input.
Now consider the effect of these optimizations on the GF(2^8) and GF(2^4) inverters in
the polynomial and normal basis. Refer to Fig. 4.1 for the polynomial GF(2^8) inversion
circuit. We observe that the three multipliers have two sharings of a common input; however,
there is a pairing of the multipliers for which there is no sharing. The same holds for the
polynomial inversion circuit for GF(2^4). However, in the circuit of Fig. 4.4, the three
multipliers pairwise share a common input, which leads to an additional saving of 5 XORs
for the previously mentioned reason. This is a significant saving, and makes the normal
basis an ideal choice for a compact realization of the AES S-Box.
x' = T^(-1)(x)        (4.5)
x'' = fnorm(x')       (4.6)
x''' = T(x'')         (4.7)
y = A(x''') + B       (4.8)
Thus, we can combine the last two steps by performing the two matrix multiplications
as one merged matrix (AT). Likewise, for the inverse S-Box computation, one can merge
the corresponding steps as (AT)^(-1).
Each such matrix multiplication by a constant 0-1 matrix gives further scope for sharing
hardware. If we can find combinations of input bits that are shared between different
output bits, we can compute the common XOR once and use the result as a temporary
input in generating the other output bits. This can lead to further savings in area. One
can apply search techniques to find a minimal circuit. The greedy method, as the name
suggests, involves finding at each stage the two inputs shared by the maximum number
of common outputs. A temporary signal is derived as the XOR of the chosen inputs; this
signal is then used as a new input and the process is repeated. It must be noted that
though the greedy approach is easy to implement, it may not lead to the minimal circuit.
More sophisticated search techniques can be employed to derive the most compact circuit.
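The greedy method just described can be sketched in a few lines; `greedy_xor_count` is a hypothetical helper, with each output bit given as the tuple of input-bit indices selected by its matrix row:

```python
from itertools import combinations

def greedy_xor_count(rows):
    """Greedy shared-XOR heuristic: returns the total XOR count after sharing."""
    rows = [set(r) for r in rows]
    nxt = max((s for r in rows for s in r), default=0) + 1
    gates = 0
    while True:
        best, best_pair = 1, None       # only pairs appearing in >1 row save gates
        for a, b in combinations(sorted({s for r in rows for s in r}), 2):
            n = sum(1 for r in rows if a in r and b in r)
            if n > best:
                best, best_pair = n, (a, b)
        if best_pair is None:
            break
        a, b = best_pair
        for r in rows:
            if a in r and b in r:
                r -= {a, b}
                r.add(nxt)              # temporary signal t = a XOR b
        gates += 1                      # one XOR computes the shared pair
        nxt += 1
    # remaining XORs: a row with k signals still needs k-1 XORs
    gates += sum(len(r) - 1 for r in rows if r)
    return gates

# Hypothetical 3-row matrix: the straightforward cost is ones-minus-rows
# = 8 - 3 = 5 XORs, but sharing the pair (0,1) across all rows gives 3.
assert greedy_xor_count([(0, 1), (0, 1, 2), (0, 1, 3)]) == 3
```

The sketch illustrates the mechanism only; as the text notes, the greedy choice is not guaranteed to find the true minimum.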
employed for compactness. Thus, given an input x, T^(-1)(x) is computed for encryption and
(AT)^(-1)(x + B) for decryption. The output byte y is then processed as (AT)y + B for
encryption or Ty for decryption. With this merging, one can optimize the two
basis change matrices, namely T^(-1) and (AT), together as a single 16 × 8 matrix. This
bigger matrix provides further scope for optimization, because the larger the number of
rows, the greater the chance of finding commonalities.
In [71], further optimizations are suggested for ASIC designs. For modern technologies,
like 0.13 µm CMOS cell libraries, a NAND gate is smaller than an AND gate. Since the
AND gates in the GF(2^2) multiplier are always combined by XOR gates, in the form
[(a AND b) XOR (c AND d)], the AND gates can be replaced using
[(a NAND b) XOR (c NAND d)]: the two inversions cancel under the XOR. Further, if
the library's XNOR gate is the same size as an XOR, then wherever the vector B of the
affine transformation has a one, using an XNOR saves the extra NOT gate. However, when
the matrix A is such that only one input bit affects the output bit, the NOT gate is still
used. In the inverter circuit, whenever XORs appear in the combination a ⊕ b ⊕ (ab),
they can be replaced by the equivalent OR gate; if NAND gates are used, as mentioned
before, this becomes a NOR gate, which is even more compact than the OR gate.
Multiplexers are used to implement the selection function needed to support both the
S-Box and the inverse S-Box.
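Both gate substitutions rest on simple Boolean identities that can be checked exhaustively:

```python
def nand(x, y):
    return 1 - (x & y)

for a in (0, 1):
    for b in (0, 1):
        # a XOR b XOR (a AND b) is just a OR b
        assert a ^ b ^ (a & b) == a | b
        for c in (0, 1):
            for d in (0, 1):
                # the two inversions cancel under XOR
                assert (a & b) ^ (c & d) == nand(a, b) ^ nand(c, d)
```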
Combining all these techniques, the complete merged S-Box and inverse S-Box circuit is
equivalent to 234 NAND gates. The stand-alone S-Box takes 180 gates, while the inverse
takes 182 gates. According to [71], the optimizations lead to a 20% saving in resources.
This makes the design suitable for area-limited hardware implementations, including smart cards.
In the following section, we describe implementation aspects of the remaining transfor-
mations of the AES hardware. We first consider the MixColumns transformation.
has been combined with the final XORings. Considering that modern FPGAs use
four-input LUTs, this saves additional LUTs.
[Figure: MixColumns/InvMixColumns datapath, combining the rotated input bytes (a b c d), the key XOR, the constant multiplier blocks, and the InvMixColumns constants 0E, 0B, 0D, 09.]
As may be observed, the elements of the InvMixColumns matrix have larger Hamming
weights than the elements of the MixColumns matrix. However, one can obtain an efficient
design using the following observation [332]:
[0E 0B 0D 09]   [02 03 01 01] [05 00 04 00]
[09 0E 0B 0D] = [01 02 03 01] [00 05 00 04]
[0D 09 0E 0B]   [01 01 02 03] [04 00 05 00]
[0B 0D 09 0E]   [03 01 01 02] [00 04 00 05]
The above equation is useful because it lets us reuse the output of the MixColumns
transformation, which is quite handy when the same hardware supports both encryption
and decryption. Also note that the elements of the second matrix are easier to multiply by,
since their Hamming weights are smaller than those of the original InvMixColumns elements.
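The decomposition can be verified directly in GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1 (helper names below are introduced for the sketch):

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def mat_mul(A, B):
    """4x4 matrix product with GF(2^8) element arithmetic (XOR is addition)."""
    return [[gf_mul(A[i][0], B[0][j]) ^ gf_mul(A[i][1], B[1][j]) ^
             gf_mul(A[i][2], B[2][j]) ^ gf_mul(A[i][3], B[3][j])
             for j in range(4)] for i in range(4)]

def circ(row):
    """Build a circulant matrix from its first row."""
    return [row[-i:] + row[:-i] for i in range(4)]

MIX     = circ([0x02, 0x03, 0x01, 0x01])
INVMIX  = circ([0x0E, 0x0B, 0x0D, 0x09])
SQUARED = circ([0x05, 0x00, 0x04, 0x00])

assert mat_mul(MIX, MIX) == SQUARED        # the low-weight factor is MIX^2
assert mat_mul(MIX, SQUARED) == INVMIX     # InvMixColumns = MixColumns . MIX^2
```

The second matrix is in fact the square of the MixColumns matrix, which is why the MixColumns hardware output can be reused for decryption.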
As discussed earlier, for an implementation where the entire operation is performed in
the subfield, the elements are transformed by the conversion matrix T and subsequently
multiplied with the input byte. As an example, the element 0E is transformed to
T(0E) = 24.
[Figure 4.8: top-level architecture. The 8-bit plain text on Datain is mapped through δ into a 256-bit internal representation; the KeyScheduler, Control Unit, and Encryption Unit operate on 256-bit data, and the Data Dispatch unit applies δ^(-1) to produce the 8-bit cipher text on Dataout.]
to save area resources. All computations involving the S-Box and MixColumns have been
performed using logic gates, completely removing any memory elements.
[Figure: pipelining alternatives for the basic architecture. Registers may be inserted inside the encryption combinational logic (inner round pipelining) or around it (outer pipelining).]
as the algorithm is inherently incapable of taking advantage of this feature. Hence, one
can obtain good performance if the above alternatives, design choices, and the application
are considered while architecting the hardware for an encryption unit.
[Figure: encryption unit datapath. SubBytes and ShiftRows feed an inner-pipelining buffer (BUFFER1), followed by MixColumns and AddRoundKey with the 256-bit RoundKey (BUFFER2, gated by Addkey_enable); the Last_round signal selects the 256-bit cipher text through BUFFER3.]
The central blocks of the design are the SubBytes and MixColumns transformations.
The major computation inside the S-Box is finding the multiplicative inverse of an
element in the finite field GF(2^8). As discussed before, several authors have attempted
to reduce the area requirement of the S-Box using composite fields [348, 410, 150].
Several techniques, as presented in the compact designs of [71], were also proposed to
realize a compact AES S-Box.
A simple three-stage strategy is used as shown in Fig. 4.11 to realize the AES S-Box
[25, 4]:
• Map the element X ∈ GF(2^8) to the composite field F = GF((2^4)^2) using an isomorphic
function δ [347].
• Compute the multiplicative inverse over the field F.
• Finally, map the result back to the original field using the inverse function
δ^(-1).
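The end-to-end effect of inversion followed by the affine map can be cross-checked against the standard S-Box values. The sketch below takes a shortcut: it inverts directly in GF(2^8) as x^254 instead of mapping into the composite field, which gives the same result by the isomorphism:

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def sbox(x):
    s = x
    if x:
        for _ in range(253):            # x * x^253 = x^254 = x^(-1) for x != 0
            s = gf_mul(s, x)
    # AES affine transformation: XOR of s with four rotations, plus 0x63
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return s ^ rot(s, 1) ^ rot(s, 2) ^ rot(s, 3) ^ rot(s, 4) ^ 0x63

assert sbox(0x00) == 0x63               # the special case 0 -> 0 -> 0x63
assert sbox(0x53) == 0xED               # FIPS-197 worked example
```

A composite-field implementation replaces the `gf_inv`-style exponentiation with the δ, subfield-inverse, δ^(-1) pipeline described above, but must produce identical outputs.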
[Figure 4.11: three-stage S-Box. Top: X → δ → inverse computation in GF((2^4)^2) → δ^(-1) → affine transformation → Y. Bottom (modified): X' → inverse computation → modified affine transformation → Y', entirely in GF((2^4)^2).]
For processing the entire 128-bit data in the SubBytes operation, one could use 32 isomorphic
functions (δ and δ^(-1)) in parallel: 16 for mapping GF(2^8) elements to GF((2^4)^2) elements
and 16 for the reverse conversion. Instead, we use only two isomorphic functions δ and
δ^(-1): one in the DataScheduler and another in the DataConverter (Fig. 4.8). The idea of
reducing the isomorphic mappings to one at the start of encryption and the reverse at the
end was applied to the designs in [266, 4].
The function δ converts an element x in GF(2^8) to an element in GF((2^4)^2) (refer Section
2.4.1). It is defined as δ(x) = T·x, where T is the transformation matrix; the inverse
mapping is performed by the function δ^(-1). We have not used any transformation
function (δ or δ^(-1)) inside the S-Box operation itself (Fig. 4.11).
The elements of the standard affine matrix of AES are instead transformed and defined
over the composite field GF((2^4)^2).
As discussed in Section 2.3.3.1, Y = AX^(-1) + B. In the composite field,

Y' = δ(Y) = δ(AX^(-1)) + δ(B)              (4.10)
          = δ(A δ^(-1)(δ(X^(-1)))) + δ(B)  (4.11)
          = A'X' + B'                      (4.12)

where A' = δAδ^(-1), B' = δ(B), and X' = δ(X^(-1)).
Now, because of the isomorphism established by the mapping δ, we have the following
for any X ∈ GF(2^8):

δ^(-1)((δ(X))^(-1)) = X^(-1)

Thus, X' = (δ(X))^(-1), and hence the equation Y' = A'X' + B' is exactly like the
affine transformation of the SubBytes operation in the description of AES. Only here the
computations are carried out in the composite field GF((2^4)^2).
[Figure 4.12: three-stage inverse circuit in GF((2^4)^2). The upper nibble X[7:4] and lower nibble X[3:0] feed GF(2^4) multipliers, a squarer (X^2), a scaling by λ = ω^14, and a GF(2^4) inverter, producing X^(-1).]
Fig. 4.12 depicts the corresponding block diagram of the three-stage inverse circuit. The
field polynomial used for the computations in GF(2^4) is x^4 + x + 1. The multiplication
employs modulo arithmetic with the irreducible polynomial x^2 + x + λ, where λ is a
primitive element in GF(2^4). Of the various valid polynomials, we choose the one with
λ = ω^14, where ω = (0010)_2 is an element of GF(2^4) [4]. The above circuit can be
realized in standard cells using 273 gates.
A key point to keep in mind is that the values of the matrix T change depending on the
value of λ. Further, the number of ones in the affine transformation matrix should also be
noted. For example, with the above transformation to the composite field, the matrix A is
transformed to A', where:
     [0 0 1 0 1 0 0 0]
     [0 1 0 1 0 0 0 1]
     [0 0 1 0 0 1 0 0]
A' = [0 0 0 0 0 0 0 1]        (4.13)
     [1 0 1 0 0 1 0 0]
     [0 1 0 0 0 1 0 1]
     [0 0 1 0 1 0 1 0]
     [0 0 0 1 0 0 0 0]
The total number of 1-entries in the fixed affine matrix A is 40 (refer Section 2.3.3.1),
while the total number of 1-entries in the new affine matrix A' is 18. Implementing the
matrices in a straightforward way, the number of XORs equals the number of 1-entries
minus the number of rows. This leads to an XOR gate count of 10, a reduction of
22 XOR gates.
The implementation of the MixColumns is performed as discussed in Section 4.4.1. The
other important component in the design of the AES architecture is the Key-Schedule,
which we discuss next.
TABLE 4.3: Numbers of rounds (Nr ) as a function of the block and key length
Nr Nb =4 Nb =6 Nb =8
Nk =4 10 12 14
Nk =6 12 12 14
Nk =8 14 14 14
Recalling Algorithm 6, suppose that all columns up to W[i-1] have been expanded. The
next column W[i] can be constructed as follows:

W[i] = W[i-Nk] ⊕ T(W[i-1], Rcon)   if i mod Nk = 0
       W[i-Nk] ⊕ T(W[i-1])        if i mod Nk = 4 and Nk = 8
       W[i-Nk] ⊕ W[i-1]           otherwise
[Figure 4.13: Key-Scheduler datapath. The initial key and the Rcon generator feed the transformation T; 3:1 multiplexers (M2, M3) controlled by C1 and the running index i select the 32-bit words routed through the shift registers W[Nk], ..., W[Nk+7].]
The Key-Scheduler architecture is presented in Fig. 4.13. The figure shows the datapath
components in the key expansion block. R is a 256-bit register storing the initial key, and
the W's are 32-bit shift registers storing intermediate round keys. The Data_enable signal
is set when data blocks are ready to be processed (refer Fig. 4.10); it sets the running
index i, generated from an 8-bit counter, to 1. When Data_enable is set, a single word
(32 bits) among the 8 words goes through the shift registers W[Nk-8], ..., W[Nk-1] at
every clock. The word to be shifted in is selected by a 3-bit control signal generated by
the counter. After Nk cycles, all registers W[Nk-8], ..., W[Nk-1] hold the initial key stored
in register R. These intermediate registers are processed by the transformation T as
detailed above. The user input C1 is used to multiplex out one of W[Nk-4], W[Nk-6], or
W[Nk-8] (refer Algorithm 6) to be XORed with W[Nk-1], T(W[Nk-1]), or
T(W[Nk-4], Rcon). At every clock cycle, one word of the round key is generated and
shifted into another set of 32-bit shift registers, namely W[Nk], ..., W[Nk+7], as the round
key. Each round key is generated in Nb clock cycles. The generated word of the round key
is fed back to the register W[Nk-1] through the multiplexer M1 for generating the next
round key.
[Figure: controller state machine. Starting from S0 on reset, the FSM walks through states up to S86; branches on the control inputs C1 and C2 select the key and data lengths, and the output signals First_round, Last_round, Addkey_enable, Data_enable, Key_enable, and Initial_key are asserted in the appropriate states (e.g., S16, S32, S36, S38).]
Example: Let us take 128-bit data and a 128-bit key (i.e., C2 C1 = 0000).
• The controller starts at S0 with the positive edge of reset. For 16 consecutive clock
cycles, the states {S0, S1, ..., S15} have the same status, and all outputs are 0.
• At the next state S16 (the 16th cycle), the Key_enable signal becomes 1 and the rest
are 0, signifying that the 128 data bits are stored in register R as the initial key.
• In the next cycle (state S17), all output signals are 0. Similarly, after 15 cycles, at
state S32, Data_enable and Initial_key are set to 1. For 4 consecutive cycles the
Initial_key signal stays 1. At these stages, mux M1 selects 4 words from R (Fig. 4.13), and
C2 C1        Key = 128  Key = 192  Key = 256
Data = 128     0000       0100       1000
Data = 192     0001       0101       1001
Data = 256     0010       0110       1010
those words are stored in the shift registers W[Nk-4], ..., W[Nk-1], i.e., W[0], ..., W[3].
Now the words of the key are ready to be expanded.
• In the next 4 cycles, the 128 bits of the next round key are generated. At S38 there are
three branches, S39, S40, and S41; the branching depends on the signals
Last_round and First_round.
• All these states except S40 come back to S36 in the next cycle, signifying that the
expansion of the next round key can start. State S40 comes back to S32, signifying that
all 10 round keys have been generated and the initial key is again loaded from register
R into the shift registers W[Nk-4], ..., W[Nk-1], i.e., W[0], ..., W[3], to generate the next 10
round keys.
The proposed design has been implemented on a Xilinx XCV1000 device and simulated with
ModelSim 8.1i. The design is also implemented in 0.18 µm CMOS technology using Synopsys
Design Compiler. The throughput of the design is τ = (β × f)/ψ, where β, f, and
ψ stand for the block length, clock frequency, and number of clock cycles, respectively. In our
approach, Nb × Nr clock cycles are needed to generate a block of cipher text. Thus, τ =
(Nb × 32 × f)/(Nb × Nr) = (32 × f)/Nr. The performance (throughput and frequency) for
FPGAs and ASICs is shown in Table 4.5 and Table 4.6. We present a comparison of the
present design with some other AES cores in Table 4.7. Note that we do not compare with
pipelined AES implementations, which give greater throughput at the expense of larger
area but are not useful for several modes like CBC [266]. The comparisons
show that while the present design is superior in terms of throughput per kilo-gate, other
applications may require designs with throughput or area as the primary concern.
Finally, the present design shows the datapath for only the encryption operation. There
may be applications requiring both encryption and decryption. In the next section, we
provide a snapshot of such a shared datapath.
FIGURE 4.15: Different Data-Flow of (a) Encryption and (b) Decryption Datapath
The functionality is selected by the user-defined control signal Mode. In the figures, REG1
and REG2 are used for sub-pipelined S-Box or InvS-Box implementations (further levels of
inner pipelining).
4.8 Conclusions
In this chapter, we presented several design strategies for efficient implementations of
the AES cipher. The S-Box is the most hardware-intensive component of the AES structure.
However, the isomorphism between the finite field GF(2^8) and various composite fields
helps to express the design in terms of smaller field computations, leading to several
compact designs of the AES S-Box. Further, we presented design strategies for the
MixColumns transformation, the other important round component. A state-machine-based,
area-efficient key schedule is described to support all nine block-length/key-length
combinations of the Rijndael algorithm. Finally, most of these design tricks are combined
into a complete design of AES. The rich properties of the AES structure and its underlying
characteristic-2 computations indeed help designers to develop high-performance and
compact implementations of the cipher.
[Figure: (a) encryption and (b) decryption datapaths with inner-pipelining registers REG1-REG4 around SubBytes/InvSubBytes, ShiftRows/InvShiftRows, and MixColumns/InvMixColumns, followed by the merged encryption/decryption datapath in which Mode-controlled multiplexers select between the forward and inverse transformations.]
5.1 Introduction
We have seen in the previous chapter that efficient designs of cryptographic algorithms
require a proper understanding of the underlying finite field primitives. Like AES, most
ciphers are developed using complex mathematical operations, relying largely on finite fields.
Elliptic curve cryptosystems are the current-generation choice for public key ciphers.
As discussed in Chapter 2, they also rely heavily on finite fields. These are somewhat more
complex, as they operate on larger bit sizes than the primitives required in AES. Further,
unlike AES, the bit sizes of these designs vary, and the designs thus have to be scalable.
The implementation of elliptic curve cryptosystems constitutes a complex interdisciplinary
research field involving mathematics, computer science, and electrical engineering [3].
Elliptic curve cryptosystems have a layered hierarchy, as shown in Fig. 5.1. The bottom
layer, the arithmetic of the underlying finite field, most prominently influences
the area and critical delay of the overall implementation. The group operations on the
elliptic curve and the scalar multiplication influence the number of clock cycles required for
encryption.
To be usable in real-world applications, the cryptosystem implementation must be
efficient, scalable, and reusable. Applications such as smart cards and mobile phones require
implementations where the amount of resources used and the power consumed are critical.
Such implementations should be compact and designed for low power; the computation
speed is a secondary criterion. Also, the degree of reconfigurability of the device can be
kept to a minimum [409], since such devices have a short lifetime and are generally
configured only once. On the other side of the spectrum, high-performance systems such
as network servers and database systems require high-speed implementations of ECC.
The cryptographic algorithm should not be the bottleneck of the application's performance.
These implementations must also be highly flexible: operating parameters such as algorithm
constants should be reconfigurable. Reconfiguration can easily be done in software, but
software implementations do not always scale to the performance demanded by the
application; such systems require dedicated hardware to speed up the computations. When
using such hardware accelerators, the clock cycles required, the frequency of operation, and
the area are important design criteria. The cycle count should be low and the frequency
high, so that the overall latency of the hardware is small. The area is important because a
smaller area implies that more parallelism can be implemented on the same hardware,
increasing the device's throughput.
In the next sections, we gradually develop the underlying finite field operations. We also
focus on ensuring that the architectures utilize the underlying resources, namely the FPGAs,
efficiently. We first start with the architecture of a finite field multiplier.
[Figure 5.1: layered hierarchy of an elliptic curve cryptosystem: scalar multiplication at the top, elliptic curve group primitives in the middle, and finite field arithmetic at the bottom.]
C(x) = A(x) · B(x) mod P(x), where C(x), A(x), and B(x) are in GF(2^m) and P(x) is
the irreducible polynomial that generates the field GF(2^m). Implementing the
multiplication requires two steps: first, the polynomial product C'(x) = A(x) · B(x) is
determined, then the modulo operation is applied to C'(x). This chapter deals with the
polynomial multiplication.
Here, R(x) is of the form x^k and is an element of the field, with gcd(R(x), P(x)) = 1. The
division by R(x) reduces the complexity of the modular operation. For binary finite fields,
R(x) corresponds to the integer 2^k, so division by R(x) is easily accomplished on a computer.
This multiplier is best suited for low-resource environments where speed of operation is not
so important [331].
The Karatsuba multiplier [176] uses a divide-and-conquer approach to multiply A(x)
and B(x). The m-term polynomials are recursively split in two; with each split, the size
of the required multiplications is halved. This leads to a reduction in the number of
AND gates at the cost of an increase in XOR gates, and results in a multiplier with a
space complexity of O(m^(log2 3)) for polynomial representations of finite fields. A
comparison of the available multipliers shows that only the Karatsuba multiplier has a
complexity of sub-quadratic order; all other multipliers have quadratic complexity. Besides
this, it has been shown in [331] and [136] that the Karatsuba multiplier, if designed
properly, is also the fastest.
For a high-performance elliptic curve crypto processor, the finite field multiplier with
the smallest delay and the least number of clock cycles is best suited. The Karatsuba
multiplier, if properly designed, attains these speed requirements and at the same time has
sub-quadratic space complexity. This makes the Karatsuba multiplier the best choice for
high-performance applications.
In the Karatsuba algorithm, the multiplicands are first split into two halves:

A(x) = A_h x^{m/2} + A_l
B(x) = B_h x^{m/2} + B_l        (5.3)

The multiplication is then done using three m/2 bit multiplications, as shown in Equation 5.4.

C′(x) = (A_h x^{m/2} + A_l)(B_h x^{m/2} + B_l)
      = A_h B_h x^m + (A_h B_l + A_l B_h) x^{m/2} + A_l B_l
      = A_h B_h x^m + ((A_h + A_l)(B_h + B_l) + A_h B_h + A_l B_l) x^{m/2} + A_l B_l        (5.4)
The Karatsuba multiplier can be applied recursively to each m/2 bit multiplication in Equation 5.4. Ideally, this multiplier is best suited when m is a power of 2, since this allows the multiplicands to be broken down until they reach 2 bits. The final recursion, consisting of 2 bit multiplications, can be realized with AND gates. Such a multiplier with m a power of 2 is called the basic Karatsuba multiplier.
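As an illustration of the recursion, the following Python sketch (a software model of the arithmetic, not the hardware description; the function names are ours) multiplies two GF(2) polynomials packed into integer bitmasks. The 2-bit base case corresponds to the AND-gate level, and the split mirrors the ⌈m/2⌉/⌊m/2⌋ division used when m is not a power of two.

```python
def karatsuba_gf2(a, b, m):
    """Multiply two GF(2) polynomials of degree < m, bit-packed in ints."""
    if m <= 2:
        # base case: schoolbook multiply (AND for products, XOR for sums)
        r = 0
        for i in range(m):
            if (a >> i) & 1:
                r ^= b << i
        return r
    h = (m + 1) // 2                         # Al/Bl get the lower ceil(m/2) bits
    al, ah = a & ((1 << h) - 1), a >> h
    bl, bh = b & ((1 << h) - 1), b >> h
    low = karatsuba_gf2(al, bl, h)           # Al * Bl
    high = karatsuba_gf2(ah, bh, m - h)      # Ah * Bh
    mid = karatsuba_gf2(al ^ ah, bl ^ bh, h) # (Al + Ah) * (Bl + Bh)
    # C'(x) = Ah*Bh*x^{2h} + (mid + low + high)*x^h + Al*Bl
    return (high << (2 * h)) ^ ((mid ^ low ^ high) << h) ^ low
```

Each level performs two ⌈m/2⌉ bit multiplications and one ⌊m/2⌋ bit multiplication, exactly as the recursion above prescribes.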
There are two broad approaches to implementing the multiplier. The first is a sequential circuit having less hardware and latency but requiring several clock cycles to produce the result. Generally, at every clock cycle the outputs are fed back into the circuit, thus reusing the hardware. The advantage of this approach is that it can be pipelined. Examples of implementations following this approach can be found in [136][115][397][299]. The second approach is a combinational circuit having large area and delay but capable of generating the result in one clock cycle. Examples of this approach can be found in [287][329][253][404]. Our proposed Karatsuba multiplier follows the second approach. Therefore, in the remaining part of this section we analyze combinational circuits for Karatsuba multipliers.
The easiest method to modify the Karatsuba algorithm for elliptic curve fields is by padding. The padded Karatsuba multiplier extends the m bit multiplicands to 2^{⌈log₂ m⌉} bits by padding the most significant bits with zeroes. This allows the use of the basic recursive Karatsuba algorithm. The obvious drawback of this method is the extra arithmetic introduced by the padding.
In [329], a binary Karatsuba multiplier was proposed to handle multiplications in any field of the form GF(2^m), where m = 2^k + d and k is the largest integer such that 2^k < m. The binary Karatsuba multiplier splits the m bit multiplicands (A(x) and B(x)) into two terms. The lower terms (A_l and B_l) have 2^k bits, while the higher terms (A_h and B_h) have d bits. Two 2^k bit multipliers are required to obtain the partial products A_l B_l and (A_h + A_l)(B_h + B_l); for the latter multiplication, the A_h and B_h terms have to be padded with 2^k − d zero bits. The A_h B_h product is determined using a d bit binary Karatsuba multiplier.
The simple Karatsuba multiplier [404] is the basic recursive Karatsuba multiplier with a small modification. If an m bit multiplication needs to be done, m being any integer, the multiplicands are split into two polynomials as in Equation 5.3. The A_l and B_l terms have ⌈m/2⌉ bits and the A_h and B_h terms have ⌊m/2⌋ bits. The Karatsuba multiplication can then be done with two ⌈m/2⌉ bit multiplications and one ⌊m/2⌋ bit multiplication. The upper bound on the number of AND gates and XOR gates required for the simple Karatsuba multiplier is the same as that of a 2^{⌈log₂ m⌉} bit basic recursive Karatsuba multiplier. The maximum number of gates required and the time delay for an m bit simple Karatsuba multiplier are given below.
In the general Karatsuba multiplier [404], the multiplicands are split into more than two terms; for example, an m bit multiplicand is split into m terms. The number of gates required is given below.
Each LUT in the FPGA can implement any logic function having a maximum of four inputs. The LUT can also be used to implement logic functions having fewer than four inputs, two for example. In this case, only half the LUT is utilized and the remaining part is wasted. Such a LUT having fewer than four inputs is an under-utilized LUT. For example, the logic function y = x1 + x2 under-utilizes the LUT as it has only two inputs. The most compact implementations are obtained when the utilization of each LUT is maximized. From the above fact it may be derived that the minimum number of LUTs required for a q bit combinational circuit is given by Equation 5.7.

            0        if q = 1
            1        if 1 < q ≤ 4
#LUT(q) =   ⌈q/3⌉    if q > 4 and q mod 3 = 2        (5.7)
            ⌊q/3⌋    if q > 4 and q mod 3 ≠ 2
The delay of the q bit combinational circuit in terms of LUTs is given by Equation 5.8, where D_LUT is the delay of one LUT.

The percentage of under-utilized LUTs in a design is determined using Equation 5.9. Here, LUT_k signifies that k inputs out of 4 are used by the design block realized by the LUT. So, LUT_2 and LUT_3 are under-utilized LUTs, while LUT_4 is fully utilized.

%UnderUtilizedLUTs = (LUT_2 + LUT_3) / (LUT_2 + LUT_3 + LUT_4) × 100        (5.9)
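Equations 5.7 and 5.9 are easy to model in software. The sketch below (the helper names are ours) assumes 4-input LUTs:

```python
from math import ceil

def lut_count(q):
    """Equation 5.7: minimum 4-input LUTs for a q-input combinational circuit."""
    if q == 1:
        return 0
    if q <= 4:
        return 1
    # q > 4: each extra LUT absorbs three more inputs (one input feeds back)
    return ceil(q / 3) if q % 3 == 2 else q // 3

def percent_underutilized(lut2, lut3, lut4):
    """Equation 5.9: share of LUTs using fewer than four of their inputs."""
    return (lut2 + lut3) / (lut2 + lut3 + lut4) * 100
```

For instance, combining five signals needs one fully used LUT plus one 2-input LUT, so `lut_count(5)` evaluates to 2.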
At the final level of recursion, each 2 bit multiplication produces the output bits

C_0 = A_0 B_0
C_1 = A_0 B_0 + A_1 B_1 + (A_0 + A_1)(B_0 + B_1)        (5.10)
C_2 = A_1 B_1
This requires three LUTs on the FPGA: one for each of the output bits (C0 , C1 , C2 ).
The total number of LUTs required for the m bit recursive Karatsuba multiplication is given by Equation 5.11.

#LUT_R(m) = Σ_{k=0}^{log₂(m)−1} 3^k (2^{log₂(m)−k+1} − 1)        (5.11)
The delay of the recursive Karatsuba multiplier in terms of LUTs is given by Equation 5.12. The first log₂(m) − 1 recursions have a delay of 2 LUTs each; the last recursion has a delay of 1 LUT.

DELAY_R(m) = (2(log₂(m) − 1) + 1) D_LUT
           = (2 log₂(m) − 1) D_LUT        (5.12)
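As a sanity check, Equations 5.11 and 5.12 can be evaluated in a few lines (a sketch; m is assumed to be a power of two, and the function names are ours):

```python
from math import log2

def lut_recursive_karatsuba(m):
    """Equation 5.11: LUTs for an m-bit basic recursive Karatsuba multiplier."""
    n = int(log2(m))
    # level k contributes 3^k sub-multipliers, each needing 2^(n-k+1) - 1 LUTs
    return sum(3**k * (2**(n - k + 1) - 1) for k in range(n))

def delay_recursive_karatsuba(m):
    """Equation 5.12: critical path length in LUT delays."""
    return 2 * int(log2(m)) - 1
```

The values match the "Simple" column of the gate-count table: 63 LUTs for m = 8 and 220 LUTs for m = 16.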
General Karatsuba Multiplier: The m bit general Karatsuba algorithm [404] is shown in Algorithm 5.1. Each iteration of i computes two output bits, C_i and C_{2m−2−i}; computing the two output bits requires the same amount of resources on the FPGA. Lines 6 and 7 of the algorithm are executed once for every even iteration of i and are not executed for odd iterations. The term M_j + M_{i−j} + M_{(j,i−j)} is computed from the four inputs A_j, A_{i−j}, B_j, and B_{i−j}; therefore, on the FPGA, computing the term requires one LUT. For an odd i, C_i has ⌈i/2⌉ such LUTs whose outputs have to be added; the number of LUTs required for this is obtained from Equation 5.7. An even value of i has two additional inputs, corresponding to M_{i/2}, that have to be added. The number of LUTs required for computing C_i (0 ≤ i ≤ m − 1) is given by Equation 5.15.

            1                        if i = 0
#LUT_Ci =   ⌈i/2⌉ + #LUT(⌈i/2⌉)      if i is odd        (5.15)
            i/2 + #LUT(i/2 + 2)      if i is even
The total number of LUTs required for the general Karatsuba multiplier is given by Equation 5.16.

#LUT_G(m) = 2 (Σ_{i=0}^{m−2} #LUT_Ci) + #LUT_{C_{m−1}}        (5.16)
When implemented in hardware, all output bits are computed simultaneously. The delay of the general Karatsuba multiplier (Equation 5.17) is equal to the delay of the output bit with the most terms. This is the output bit C_{m−1} (lines 15 to 22 in Algorithm 5.1). Equation 5.17 is obtained from Equation 5.15 with i = m − 1. The ⌈i/2⌉ computations are done with a delay of one LUT (D_LUT). Equation 5.8 is used to compute the second term of Equation 5.17.
            General                          Simple
m      Gates   LUTs   LUTs Under-      Gates   LUTs   LUTs Under-
                      Utilized                        Utilized
2          7      3      66.6%             7      3      66.6%
4         37     11      45.5%            33     16      68.7%
8        169     53      20.7%           127     63      66.6%
16       721    188      17.0%           441    220      65.0%
29      2437    670      10.7%          1339    669      65.4%
32      2977    799      11.3%          1447    723      63.9%
DELAY_G(m) = D_LUT + DELAY(⌈(m−1)/2⌉)      if m − 1 is odd        (5.17)
DELAY_G(m) = D_LUT + DELAY((m−1)/2 + 2)    if m − 1 is even
[Figure: recursion tree of the hybrid Karatsuba multiplier for 233 bits. Simple Karatsuba recursions split the 233-bit multiplication into 116- and 117-bit, then 58- and 59-bit, then 29- and 30-bit multiplications; the leaves are 14- and 15-bit general Karatsuba multipliers.]

The 58 and 59 bit multiplications are implemented with 29 and 30-bit multipliers, and the 29 and 30-bit multiplications are done using 14 and 15 bit general Karatsuba multipliers.
The number of recursions in the hybrid Karatsuba multiplier is given by

r = ⌈log₂(m/29)⌉ + 1        (5.18)
The i-th recursion (0 < i < r) of the m-bit multiplier has 3^i multiplications. The multipliers in this recursion have bit lengths ⌈m/2^i⌉ and ⌊m/2^i⌋. For simplicity we assume that the number of gates required for the ⌊m/2^i⌋ bit multiplier is equal to that of the ⌈m/2^i⌉ bit multiplier. The total number of AND gates required is the number of AND gates for the multiplier in the final recursion (i.e., the ⌈m/2^{r−1}⌉ bit multiplier) times the number of such multipliers present. Using Equation 5.6,

#AND = (3^{r−1} / 2) ⌈m/2^{r−1}⌉ (⌈m/2^{r−1}⌉ + 1)        (5.19)
The number of XOR gates required for the i-th recursion is 4⌈m/2^i⌉ − 4. The total number of two-input XORs is the sum of the XORs required for the last recursion, #XOR_{g_{r−1}}, and the XORs required for the other recursions, #XOR_{s_i}. Using Equations 5.5 and 5.6,

#XOR = 3^{r−1} #XOR_{g_{r−1}} + Σ_{i=1}^{r−2} 3^i #XOR_{s_i}
     = 3^{r−1} (10⌈m/2^r⌉² − 7⌈m/2^r⌉ + 1) + Σ_{i=1}^{r−2} 3^i (4⌈m/2^i⌉ − 4)        (5.20)
The delay of the hybrid Karatsuba multiplier (Equation 5.21) is obtained by subtracting
the delay of a ⌈m/2r−1 ⌉ bit simple Karatsuba multiplier from the delay of an m bit simple
Karatsuba multiplier, and adding the delay of a ⌈m/2r−1 ⌉ bit general Karatsuba multiplier.
[Figure: area × delay product of the multipliers versus the number of bits.]
TABLE 5.2: Comparison of the Hybrid Karatsuba Multiplier with Reported FPGA Implementations

Multiplier        Platform    Field   Slices   Delay   Clock    Computation   Performance
                                               (ns)    Cycles   Time (ns)     AT (µs)
Grabbe [136]      XC2V6000    240     1660     12.12   54       655           1087
Gathen [397]      XC2V6000    240     1480     12.6    30       378           559
This work [315]   XC4V140     233     10434    16      1        16            154
This work [315]   XC2VP100    233     12107    19.9    1        19.9          241
A theoretical model is used for the ITA design to analyze how various design strategies affect the area, delay, and clock cycle requirements on FPGA platforms. For a given field GF(2^m), size of an LUT in the FPGA (k), and addition chain, our model predicts the best exponentiation circuit 2^n and the ideal number of replicated circuits which would give peak performance. The theoretical results are experimentally validated on 4 and 6 input LUT-based FPGAs over the binary fields specified in NIST's Digital Signature Standard [393].
By Fermat's little theorem, for any nonzero a ∈ GF(2^m),

a^{-1} = a^{2^m − 2} = (a^{2^{m−1} − 1})²        (5.22)

The naive method for computing the inverse using the above equation requires m − 1 squarings and m − 2 field multiplications. Since field multiplications are costly, the naive approach is not efficient. The Itoh-Tsujii algorithm [160] reduces the number of field multiplications required by using an addition chain for m − 1. An addition chain for a positive integer n is a sequence of natural numbers U = (u_0, u_1, ..., u_l) such that the following properties are satisfied.

• u_0 = 1
• u_l = n
• u_i = u_j + u_k for some k ≤ j < i

An example of an addition chain for 162 is U = (1, 2, 4, 5, 10, 20, 40, 80, 81, 162).

Brauer chains are a special class of addition chains in which j = i − 1. An optimal chain for n is the smallest addition chain for n.
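The chain properties can be verified mechanically; the sketch below (helper names are ours) checks the example chain for 162:

```python
def is_addition_chain(U, n):
    """Check u0 = 1, ul = n, and that each element is a sum of two earlier ones."""
    if U[0] != 1 or U[-1] != n:
        return False
    return all(any(U[i] - U[j] in U[:i] for j in range(i))
               for i in range(1, len(U)))

def is_brauer_chain(U):
    """Brauer chain: every step adds some earlier element to the previous one."""
    return all(U[i] - U[i - 1] in U[:i] for i in range(1, len(U)))

chain_162 = (1, 2, 4, 5, 10, 20, 40, 80, 81, 162)
```

The example chain is also a Brauer chain, since each step reuses the immediately preceding element.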
The Itoh-Tsujii inversion algorithm was initially proposed for computing the inverse using normal basis representations. Later in [141], ITA was modified for polynomial basis representations.

The ITA works as follows. For a ∈ GF(2^m), let β_k(a) = a^{2^k − 1}, with k ∈ N. Using Equation 5.22 we get a^{-1} = β_{m−1}(a)². For simplicity, we denote β_k(a) by β_k. In [333], the following recursive sequence is used with an addition chain for m − 1 to compute β_{m−1}.

β_{k+j}(a) = (β_j)^{2^k} β_k = (β_k)^{2^j} β_j        (5.23)

Here k, j, and k + j are integers and are members of an addition chain for m − 1. As an example, computing the inverse of a nonzero element a ∈ GF(2^163) requires computation of β_162(a) = a^{2^162 − 1}. Finally a squaring is performed to get a^{-1} = [β_162(a)]². The recursive steps followed in computing β_162 are shown in Table 5.4.
As a further example, consider finding the inverse of an element a ∈ GF(2^233). This requires computing β_232(a) = a^{2^232 − 1} and then doing a squaring (i.e., [β_232(a)]² = a^{-1}). A Brauer chain for 232 is as shown below.

Computing β_232(a) is done in 10 steps with 231 squarings and 10 multiplications as shown in Table 5.4.
In general, if l is the length of the addition chain, finding the inverse of an element in GF(2^m) requires l − 1 multiplications and m − 1 squarings. The length of the addition chain grows only logarithmically in m [201]; therefore the number of multiplications required by the ITA is far smaller than the m − 2 multiplications of the naive method.
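To see the recursion of Equation 5.23 at work, the following sketch runs the ITA in the small field GF(2^9) with p(x) = x^9 + x + 1 (the field used later in Table 5.5) and the Brauer chain (1, 2, 4, 8) for m − 1 = 8. It is a software model of the arithmetic, not the hardware; the helper names are ours.

```python
M, POLY = 9, (1 << 9) | 0b11          # p(x) = x^9 + x + 1

def gf_mul(a, b):
    """Polynomial multiplication followed by reduction modulo p(x)."""
    r = 0
    for i in range(M):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * M - 2, M - 1, -1):   # clear high bits degree by degree
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r

def ita_inverse(a):
    """Itoh-Tsujii: beta_{2k} = (beta_k)^(2^k) * beta_k along the chain (1,2,4,8)."""
    beta, k = a, 1                     # beta_1 = a^(2^1 - 1) = a
    for _ in range(3):                 # chain steps 1->2, 2->4, 4->8
        t = beta
        for _ in range(k):             # raise beta_k to 2^k by k squarings
            t = gf_mul(t, t)
        beta = gf_mul(t, beta)
        k *= 2
    return gf_mul(beta, beta)          # a^{-1} = (beta_{m-1})^2
```

The run uses l − 1 = 3 multiplications and m − 1 = 8 squarings, as the analysis above predicts.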
[Figure: the squarer cascade of the ITA architecture. The input feeds a chain of squarers (Squarer-1 through Squarer-u_s); a multiplexer, driven by a control unit, taps the required intermediate power.]
We estimate the clock cycles required in an ITA architecture, and subsequently show the gains of performing the ITA computation in higher powers on a LUT based FPGA.

#ClockCycles = 1 + (l − 1) + Σ_{i=2}^{l} ⌈(u_i − u_{i−1}) / u_s⌉
             = l + Σ_{i=2}^{l} ⌈(u_i − u_{i−1}) / u_s⌉        (5.25)
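Equation 5.25 can be evaluated for a candidate chain and squarer-cascade depth u_s. The sketch below reuses the addition chain for 162 as an illustration (the chain for 232 from Equation 5.24 is not reproduced in this excerpt):

```python
from math import ceil

def ita_clock_cycles(chain, us):
    """Equation 5.25: l multiplier cycles plus the squarer-cascade cycles."""
    l = len(chain)
    return l + sum(ceil((chain[i] - chain[i - 1]) / us) for i in range(1, l))
```

With u_s = 1 every squaring costs a cycle, so the sum degenerates to u_l − u_1 = 161 for the 162 chain; deeper cascades divide each step's squarings by u_s.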
In order to reduce the clock cycles, a parallel architecture was proposed in [330]. The reduction in clock cycles is achieved at the cost of increased hardware. In the remaining part of this section we propose a novel ITA designed for the FPGA architecture. The proposed design, though sequential, requires the same number of clock cycles as the parallel architecture of [330] but has a better area × time product.
Squaring an element a(x) ∈ GF(2^m) is given by

a(x)² = Σ_{i=0}^{m−1} a_i x^{2i} mod p(x)        (5.26)

This is a linear operation and hence can be represented in the form of a matrix T, as shown in the equation below.

a² = T · a

The matrix depends on the finite field GF(2^m) and the irreducible polynomial of the field.
Exponentiation in the ITA is done with squarer circuits. We extend the ITA so that the exponentiation can be done with any 2^n circuit and not just squarers. Raising a to the power 2^n is also linear and can be represented in the form of a matrix as shown below.

a^{2^n} = T^n a = T′ a
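The matrix view of squaring is easy to verify in software. In the sketch below (for GF(2^9) with p(x) = x^9 + x + 1; helper names are ours), column j of T holds x^{2j} mod p(x), and multiplying T by the coefficient vector of a reproduces a²:

```python
M, POLY = 9, (1 << 9) | 0b11          # p(x) = x^9 + x + 1

def reduce_poly(v):
    """Reduce a polynomial of degree <= 2M-2 modulo p(x)."""
    for i in range(2 * M - 2, M - 1, -1):
        if (v >> i) & 1:
            v ^= POLY << (i - M)
    return v

# column j of T is x^(2j) mod p(x), packed as an integer bitmask
T = [reduce_poly(1 << (2 * j)) for j in range(M)]

def square_by_matrix(a):
    """a^2 = T . a over GF(2): XOR the columns selected by the bits of a."""
    r = 0
    for j, col in enumerate(T):
        if (a >> j) & 1:
            r ^= col
    return r

def square_direct(a):
    """(sum a_i x^i)^2 = sum a_i x^(2i) over GF(2), then reduce."""
    v = 0
    for i in range(M):
        if (a >> i) & 1:
            v ^= 1 << (2 * i)
    return reduce_poly(v)
```

Replacing T by T^n yields the matrix T′ of the 2^n exponentiation circuit.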
α_{k1+k2}(a) = [α_{k1}(a)]^{2^{nk2}} · α_{k2}(a)

where k1, k2, and n ∈ N.

Proof 1

RHS = (α_{k1}(a))^{2^{nk2}} α_{k2}(a)
    = (a^{2^{nk1} − 1})^{2^{nk2}} · (a^{2^{nk2} − 1})
    = a^{2^{n(k1+k2)} − 2^{nk2} + 2^{nk2} − 1}
    = a^{2^{n(k1+k2)} − 1}
    = α_{k1+k2}(a)
    = LHS
TABLE 5.5: Comparison of LUTs Required for a Squarer and Quad Circuit for GF (29 )
Proof 2 When n | (m − 1),

[α_{(m−1)/n}(a)]² = [a^{2^{n(m−1)/n} − 1}]²
                  = [a^{2^{m−1} − 1}]²
                  = a^{-1}

When n ∤ (m − 1),

[(α_q(a))^{2^r} β_r(a)]² = [(a^{2^{nq} − 1})^{2^r} (a^{2^r − 1})]²
                         = [a^{2^{nq+r} − 1}]²
                         = [a^{2^{m−1} − 1}]²
                         = a^{-1}
We note that elliptic curves over the field GF (2m ) used for cryptographic purposes [393]
have an odd m, therefore we discuss with respect to such values of m, although the results
are valid for all m. In particular, we consider the case when n = 2, such that

α_k(a) = a^{4^k − 1}
To implement this we require quad circuits. To show the benefits of using a quad circuit on an FPGA instead of the conventional squarer, consider the equations for a squarer and a quad for an element b(x) ∈ GF(2^9) (Table 5.5). The irreducible polynomial for the field is x^9 + x + 1. In the table, b_0 ··· b_8 are the coefficients of b(x). The #LUTs column shows the number of LUTs required for obtaining the particular output bit.
We would expect the LUTs required by the quad circuit to be twice that of the squarer.
However this is not the case. The quad circuit’s LUT requirement is only 1.5 times that of
the squarer. This is because the quad circuit has a lower percentage of under utilized LUTs
(Equation 5.9). For example, from Table 5.5 we note that output bit 4 requires three XOR gates in the quad circuit and only one in the squarer. However, both circuits require only one LUT. This is also the case with output bit 8. This shows that the quad circuit is better at utilizing FPGA resources compared to the squarer. Moreover, both circuits have the same delay of one LUT. If we generate the fourth power by cascading two squarer circuits (i.e., (b(x)²)²), the resulting circuit would have twice the delay and require 25% more hardware resources than a single quad circuit.

TABLE 5.6: Comparison of Squarer and Quad Circuits on Xilinx Virtex 4 FPGA

These observations are scalable to larger fields, as shown in Table 5.6. The circuits for the finite fields GF(2^233) and GF(2^193) use the irreducible polynomials x^233 + x^74 + 1 and x^193 + x^15 + 1, respectively. They were synthesized for a Xilinx Virtex 4 FPGA. The table shows that the area saved even for large fields is about 25%, while the combinational delay of a single squarer is equal to that of the quad.
Based on this observation we propose a quad-ITA using quad exponentiation circuits
instead of squarers. The procedure for obtaining the inverse for an odd m using the quad-
ITA is shown in Algorithm 5.3. The algorithm assumes a Brauer addition chain.
The overhead of the quad-ITA is the need to precompute a3 . Since we do not have a
squarer this has to be done by the multiplication block, which is present in the architecture.
Using the multiplication unit, cubing is accomplished in two clock cycles without any ad-
ditional hardware requirements. Similarly, the final squaring can be done in one clock cycle
by the multiplier with no additional hardware required.
Consider the example of finding the multiplicative inverse of an element a ∈ GF(2^233) using the quad-ITA. From Theorem 5.11.2, setting n = 2 and m = 233, a^{-1} = [α_{232/2}(a)]² = [α_116(a)]². This requires computation of α_116(a) = a^{2^{2·116} − 1} = a^{4^{116} − 1} and then a squaring, a^{-1} = (α_116(a))². We use the same Brauer chain (Equation 5.24) as in the previous example. Excluding the precomputation step, computing α_116(a) requires 9 steps. The total number of quad operations to compute α_116(a) is 115 and the number of multiplications is 9. The precomputation step requires 2 clock cycles and the final squaring takes one clock cycle. In all, 12 multiplications are required for the inverse operation. In general, for an addition chain for m − 1 of length l, the quad-ITA requires two additional multiplications compared to the ITA implementation of [330].
#Multiplications : l + 1        (5.28)

#QuadPowers : (m − 1)/2 − 1        (5.29)
The number of clock cycles required is given by Equation 5.30. The summation in the equation is the clock cycles required for the quadblock, while l + 1 is the clock cycles of the multiplier.

#ClockCycles = (l + 1) + Σ_{i=2}^{l−1} ⌈(u_i − u_{i−1}) / u_s⌉        (5.30)
The difference in the clock cycles between the ITA of [330] (Equation 5.25) and the quad-ITA (Equation 5.30) is

⌈(u_l − u_{l−1}) / u_s⌉ − 1        (5.31)

In general, for addition chains used in ECC, the value of u_l − u_{l−1} is as large as (m − 1)/2 and much greater than u_s; therefore the clock cycles saved are significant.
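The saving can be checked numerically. The sketch below evaluates Equations 5.25 and 5.30 for the addition chain for 162 (where u_{l−1} = 81 = (m − 1)/2 for m = 163, so the chain suits the quad-ITA) and confirms that their difference matches Equation 5.31:

```python
from math import ceil

def cycles_squarer_ita(U, us):
    """Equation 5.25: sequential squarer-based ITA."""
    l = len(U)
    return l + sum(ceil((U[i] - U[i - 1]) / us) for i in range(1, l))

def cycles_quad_ita(U, us):
    """Equation 5.30: the last chain step is absorbed by the halved exponent;
    two extra multiplier cycles cover the cube and the final squaring."""
    l = len(U)
    return (l + 1) + sum(ceil((U[i] - U[i - 1]) / us) for i in range(1, l - 1))
```

The difference between the two counts is ⌈(u_l − u_{l−1})/u_s⌉ − 1, as stated in Equation 5.31.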
[Figure: the quad-ITA datapath. A controller (with Clk and Reset) drives the select lines of the input multiplexers, the multiplier, the quadblock (select lines qsel, output buffer QOUT), and the register bank Regbank (control rcntl).]
FIGURE 5.6: Quad-ITA Architecture for GF (2233 ) with the Addition Chain 5.24
The quadblock is a cascade of quad circuits, each generating the fourth power of its input. If q_in is the input to the quadblock, the powers of q_in generated are q_in^4, q_in^{4²}, q_in^{4³}, ..., q_in^{4^{14}}. A multiplexer in the quadblock, controlled by the select lines qsel, determines which of the 14 powers gets passed on to the output. The output of the quadblock can be represented as q_in^{4^{qsel}}.

TABLE 5.8: Control Word for GF(2^233) Quad-ITA for Table 5.7
Two buffers, MOUT and QOUT, store the output of the multiplier and the quadblock, respectively. At every clock cycle, either the multiplier or the quadblock (but not both) is active; the en signal, when 1, enables the MOUT buffer and otherwise the QOUT buffer. A register bank is used to store the result of each step (α_{ui}) of Algorithm 5.3. A result is stored only if it is required for later computations.
The controller is a state machine designed based on the addition chain and the number of cascaded quad circuits in the quadblock. At every clock cycle, control signals are generated for the multiplexer selection lines, the enables of the buffers, and the access signals of the register bank. As an example, consider the computations of Table 5.7. The corresponding control signals generated by the controller are shown in Table 5.8. The first step in the computation of a^{-1} is the determination of a³. This takes two clock cycles. In the first clock, a is fed to both inputs of the multiplier; this is done by controlling the appropriate select lines of the multiplexers. The result, a², is used in the following clock along with a to produce a³, which is stored in the register bank. The second step is the computation of α_2(a). This too requires two clock cycles: the first clock uses a³ as the input to the quadblock to compute (a³)⁴ = a¹², and the second clock multiplies a¹² by a³ to obtain a¹⁵ = α_2(a).

Computing α_5(a) = α_{2+3}(a) requires α_2(a); therefore α_2(a) needs to be stored. Similarly, α_1(a), α_5(a) and α_12(a) need to be stored to compute α_3(a), α_17(a) and α_29(a), respectively. In all, four registers are required. Minimizing the number of registers is important because, for cryptographic applications, m is generally large, and each register's size is therefore significant.
Using Brauer chains has the advantage that for every step (except the first), at least
one input is read from the output of the previous step. The output of the previous step is
stored in M OU T , therefore need not be read from any register and no storage is required.
The second input to the step would ideally be a doubling. For example, computing α116 (a)
requires only α58 (a). Since α58 (a) is the result from the previous step, it is stored in M OU T .
Therefore, computing α116 (a) does not require any stored values.
The number of cascaded quad circuits u_s is chosen so that the delay of the quadblock does not exceed that of the multiplier, i.e.,

u_s · t_p ≤ delay of the multiplier

where t_p is the delay of one quad circuit. However, reducing u_s would increase the clock cycles required. Therefore, we select u_s so that the quadblock delay is close to the multiplier delay.
FIGURE 5.8: Clock Cycles of Computation Time versus Number of Quads in Quadblock on a Xilinx Virtex 4 FPGA for GF(2^233)

The graph in Fig. 5.8 plots the computation delay (clock period in nanoseconds × the clock cycles) versus the number of quads in the quad-ITA for the field GF(2^233). For small values of u_s, the delay is mainly decided by the multiplier, while the clock cycles required are large. For a large number of cascades, the delay of the quadblock exceeds that of the multiplier, and the delay of the circuit is then decided by the quadblock. The lowest computation time is obtained with around 11 cascaded quads. For this value, the delay of the quadblock is slightly lower than that of the multiplier; the critical path is therefore through the multiplier, while the clock cycles required are around 30. Thus, for the quad-ITA in the field GF(2^233), 11 cascaded quads result in the least computation time. However, in order to make the clock cycles required to compute the finite field inverse in GF(2^233) equal to the parallel implementation of [330], 14 cascaded quads are used, even though this causes a marginal increase in the computation time (which is still less than the parallel implementation's 0.55 µs).
[Figure: computation time of the quad-ITA versus the squarer-ITA for the fields GF(2^100) through GF(2^300).]

Both architectures were implemented with the same synthesis tool. The graph shows that the quad-ITA has better performance compared to the squarer-ITA for most fields.
Table 5.9 compares the quad-ITA with the best reported ITA and Montgomery inverse
algorithms available. The FPGA used in all designs is the Xilinx Virtex E. The quad-ITA
has the best computation time and performance compared to the other implementations. It
may be noted that the larger area compared to [333] and [330] of the quad-ITA is because
it uses distributed RAM [418] for registers, while [333] and [330] use block RAM [417]. The
distributed RAM requires additional CLB resources while block RAM does not.
Other higher order exponentiation circuits can also be used in ITA [321]. In the next
section, we show a generalization of ITA that uses exponentiation by 2n to compute inverse.
Theorem 16 If a ∈ GF(2^m), α_{k1}(a) = a^{2^{nk1} − 1} and α_{k2}(a) = a^{2^{nk2} − 1}, then

α_{k1+k2}(a) = [α_{k1}(a)]^{2^{nk2}} · α_{k2}(a)

where k1, k2 and n ∈ N.
Addition chain for (m − 1)/n is used when n|(m − 1), while addition chain for q is used for
the other case. The working principle of ITA using 2n exponentiation circuit is presented
in Algorithm 5.4.
Input: a ∈ GF(2^m); an addition chain U = (u_1, u_2, ..., u_l) for q, where q = ⌊(m−1)/n⌋
Output: a^{-1} ∈ GF(2^m) such that a^{-1} · a = 1
1   begin
2       l = length of U
3       α_{u_1} = a^{2^n − 1}
4       β_r = a^{2^r − 1}
5       for i = 2 to l do
6           k_1 = u_{i−1}
7           k_2 = u_i − u_{i−1}
8           α_{u_i} = (α_{k_1})^{2^{n k_2}} · α_{k_2}
9       end
10      α_{u_l} = (α_{u_l})^{2^r} · β_r
11      a^{-1} = α_{u_l} · α_{u_l}
12  end
Algorithm 5.4 starts with the precomputation of α_1 = a^{2^n − 1} = a · a^2 · a^{2^2} ··· a^{2^{n−1}}. The precomputation can be done efficiently using an addition chain for 2^n − 1. For the case n ∤ (m − 1), computation of β_r(a) = a^{2^r − 1} is required. Since 1 ≤ r ≤ n − 1, the computation cost of β_r(a) can be reduced by using an addition chain for 2^n − 1 which contains r; when such an addition chain is used, β_r(a) becomes an intermediate value during the computation of α_1. The iteration in line 5 of the algorithm computes the recursive relation shown in Theorem 16. After computation of α_{u_l}, a final squaring is required to get the inverse for the case n | (m − 1), while a small amount of extra computation (line 10 of Algorithm 5.4) is required for the case n ∤ (m − 1).
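Algorithm 5.4 can be traced in a small field. The sketch below (a software model; helper names are ours) runs the generalized ITA with n = 2 in GF(2^9), where n | (m − 1), q = (m − 1)/n = 4, and the Brauer chain U = (1, 2, 4) is used. Since the chain is a doubling chain, α_{k2} in line 8 is always the value computed in the previous step:

```python
M, POLY = 9, (1 << 9) | 0b11          # p(x) = x^9 + x + 1

def gf_mul(a, b):
    """Polynomial multiplication followed by reduction modulo p(x)."""
    r = 0
    for i in range(M):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * M - 2, M - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r

def gen_ita_inverse(a, n=2):
    U = (1, 2, 4)                      # Brauer chain for q = (M - 1) // n = 4
    alpha = gf_mul(a, gf_mul(a, a))    # precompute alpha_1 = a^(2^n - 1) = a^3
    for i in range(1, len(U)):
        k2 = U[i] - U[i - 1]
        t = alpha                      # alpha_{k2} = alpha_{u_{i-1}} (doubling chain)
        for _ in range(n * k2):        # raise to 2^(n*k2) by n*k2 squarings
            t = gf_mul(t, t)
        alpha = gf_mul(t, alpha)       # line 8 of Algorithm 5.4
    return gf_mul(alpha, alpha)        # line 11: final squaring via the multiplier
```

Because n divides m − 1 here, line 10 of Algorithm 5.4 is not needed and the final squaring alone yields a^{-1}.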
The next section describes a generic hardware architecture for the generalized ITA.
A hardware architecture for generalized ITA is presented in Fig. 5.10. The finite field
multiplier is used to perform field multiplications required in Algorithm 5.4. The multiplier
is of combinational type and follows the hybrid Karatsuba algorithm. Buffer M OU T is used
to latch output from the multiplier.
Computing α_{ui}, 2 ≤ i ≤ l, in line 8 of Algorithm 5.4 requires exponentiation of α_{k1} by 2^{nk2}. Exponentiations are performed by the Powerblock in Fig. 5.10, which contains 2^n exponentiation circuits. Output from the Powerblock is latched in buffer QOUT. If a single 2^n circuit is used, it takes k_2 clock cycles to compute (α_{k1})^{2^{nk2}}. To reduce the number of clock cycles required for repeated exponentiations, u_s cascaded 2^n circuits are kept in the Powerblock. An exploded diagram of the Powerblock is shown in Fig. 5.11. In a single clock cycle, the Powerblock can raise an element a to a maximum of a^{2^{nu_s}}. The multiplexer in the Powerblock is used to tap out interim outputs when the number of repeated exponentiations k_2 is less than u_s. The control signal for the Powerblock is qsel. For the case k_2 > u_s, more than one clock cycle is required and buffer QOUT works as a feedback for the Powerblock.
Control signal en is used in each clock cycle to latch the output of either the multiplier or the Powerblock (but not both). Any intermediate result that is required in a later step of the recursive sequence is stored in a register bank, Regbank. The Control block is a finite state machine (FSM) that generates all the control signals.
The performance of the architecture as a function of the area, delay, and number of clock
cycles required, mainly depends on the multiplier and the Powerblock. For a given field, the
multiplier architecture is fixed, while the exponentiation circuit (n) and the number of such
circuits (us ) are variable. In order to achieve optimal performance, these design parameters
have to be tuned. Later in Section 5.15, a formal approach is presented which models the
area and delay requirement of the generalized ITA architecture for various values of n and
us , and tries to estimate the ideal values for these two design parameters for any given field
GF(2^m) on k-LUT based FPGAs. The next few lines of this section present an analysis of the number of clock cycles required to compute the inverse using the 2^n ITA architecture over the field GF(2^m).
Clock Cycle Requirement for the 2^n ITA Architecture: Computing (α_{k1})^{2^{nk2}} in line 8 of Algorithm 5.4 requires k_2 = u_i − u_{i−1} repeated exponentiations, where u_i and u_{i−1} are elements in the addition chain for q. With u_s cascaded 2^n circuits, ⌈(u_i − u_{i−1})/u_s⌉ clock cycles are required for computing (α_{k1})^{2^{nk2}}. For the multiplication with α_{k2}, a single clock cycle is required (assuming a combinational multiplier). Additionally, l′ − 1 clock cycles are required to compute α_1, where l′ is the length of the addition chain for 2^n − 1. The final squaring requires a clock cycle, and an extra clock cycle is used to indicate completion of the field inversion. In all, the total number of clock cycles required when n | (m − 1) is

#ClockCycles = l + l′ + Σ_{i=2}^{l} ⌈(u_i − u_{i−1}) / u_s⌉        (5.34)

When n ∤ (m − 1), r additional clock cycles are required for the (α_q(a))^{2^r} calculation and one clock cycle for the (α_q(a))^{2^r} β_r(a) calculation (line 10 in Algorithm 5.4). Thus, the total clock-cycle requirement for n ∤ (m − 1) is

#ClockCycles = l + l′ + 1 + r + Σ_{i=2}^{l} ⌈(u_i − u_{i−1}) / u_s⌉        (5.35)
[Figure 5.12: number of LUTs in the critical path versus delay of the circuit for multipliers of different sizes.]
Delay in FPGAs is comprised of LUT delays and routing delays. Analyzing the delay of a
circuit on FPGA platform is much more complex than area analysis. By experimentation we
have found that for designs having combinational components, the delay of the design varies
linearly with the number of LUTs present in the critical path. Fig. 5.12 shows this linear
relationship between the number of LUTs in the critical path and the delay in multipliers of
different sizes. Due to such linear relationship, we can consider that the number of LUTs in
the critical path is a measure of actual delay. From now onwards we use the term lut delay
to mean the number of k-LUTs present in the critical path.
For an x-variable Boolean function, the number of k-LUTs in the critical path is denoted by the function maxlutpath(x) and is given by Equation 5.37.
We will use Equation 5.36 and 5.37 for estimating area and delay of different finite field
primitives used in the generalized ITA architecture.
5.15.2 Area and LUT Delay Estimates for the Field Multiplier
Multiplication in binary finite fields involves a polynomial multiplication followed by a modular reduction. Multipliers are categorized into different configurations based on the scope of the parallelism in them. Three common architectures are serial multipliers, word-parallel multipliers, and bit-parallel multipliers. Serial and word-parallel multipliers use less area and have smaller combinational delays but require several clock cycles. A bit-parallel multiplier is one in which all the computed output bits are available in the same clock cycle. Bit-parallel multipliers are fast, but come with an enormous area overhead and have large combinational delays. For fast ECC architectures, the number of clock cycles is an important factor affecting performance, and thus bit-parallel multipliers are preferred.
Several algorithms for bit-parallel finite field multiplication exist in the literature. Of these, the Karatsuba multiplier is the only one with sub-quadratic space and time complexity. In a Karatsuba multiplier, the degree-m polynomial operands a and b are split in half as shown:

a = a_h x^{⌊m/2⌋} + a_l        b = b_h x^{⌊m/2⌋} + b_l

If m is odd, a_h and b_h are padded with a bit to make all terms of equal size. The m bit multiplication is given by

a · b = a_h b_h x^{2⌊m/2⌋} + ((a_h + a_l)(b_h + b_l) + a_h b_h + a_l b_l) x^{⌊m/2⌋} + a_l b_l        (5.38)

The Karatsuba algorithm is applied recursively for the three m/2 bit multiplications: (a_h · b_h), (a_l · b_l), and (a_h + a_l) · (b_h + b_l). Each recursion halves the size of the operands while tripling the number of multiplications required.
A fully recursive Karatsuba multiplier is inefficient for two reasons. First, for small multiplications, the percentage of under-utilized LUTs is high, causing a bloated area requirement. Second, in the Karatsuba algorithm the number of multiplications triples with every recursion, so there are many instances of small multipliers; the bloated size of the small multipliers affects the overall area.

For small operands, other quadratic-complexity multiplication algorithms out-perform the Karatsuba algorithm. There exists a threshold (τ) in the operand size above which the Karatsuba algorithm performs better. The multiplier uses the simple Karatsuba algorithm when the size of the multiplicands is above the threshold and a general Karatsuba algorithm [404] for multiplications smaller than the threshold. Our field multiplier is designed in a similar way to that described in Section 5.7.1, but uses the classical polynomial multiplier instead of the general Karatsuba multiplier for multiplications below the threshold τ. We have found that classical multipliers deliver better results than general Karatsuba multipliers when used at the threshold level.
For the multiplication a · b, the operands a and b are recursively split into smaller
operands until the size of each operand reaches the threshold. Classical multiplications are
performed at the threshold level and then output from threshold multipliers are recursively
combined and the final multiplication result is obtained after modular reduction. Thus,
the entire bit-parallel Karatsuba multiplier consists of four stages (Fig. 5.13): the first stage splits the input operands to produce threshold-sized operands; the second stage consists of threshold-level multipliers; the third stage recursively combines the outputs of the threshold-level multipliers; and the final stage performs modular reduction.
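The split-and-combine identity and the threshold fallback can be sketched in software. The following Python sketch (an illustration of the algorithm, not of the hardware) packs GF(2) polynomials into integers, uses XOR for the carry-free additions, and switches to a school-book multiplier once the operand width drops to a threshold τ, mirroring the hybrid scheme; the modular-reduction stage of Fig. 5.13 is omitted, and the default τ = 8 is an arbitrary choice.

```python
def classical_gf2(a: int, b: int) -> int:
    """School-book carry-less multiplication of GF(2)[x] polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def hybrid_karatsuba(a: int, b: int, m: int, tau: int = 8) -> int:
    """Karatsuba recursion on m-bit operands, switching to the classical
    multiplier below the threshold tau (additions in GF(2) are XOR)."""
    if m <= tau:
        return classical_gf2(a, b)
    half = m // 2                        # split position floor(m/2)
    mask = (1 << half) - 1
    al, ah = a & mask, a >> half         # a = ah*x^half + al
    bl, bh = b & mask, b >> half         # b = bh*x^half + bl
    lo = hybrid_karatsuba(al, bl, half, tau)
    hi = hybrid_karatsuba(ah, bh, m - half, tau)
    mid = hybrid_karatsuba(ah ^ al, bh ^ bl, m - half, tau)
    # a*b = hi*x^(2*half) + (mid + hi + lo)*x^half + lo
    return (hi << (2 * half)) ^ ((mid ^ hi ^ lo) << half) ^ lo
```

Each recursion level performs three roughly half-size multiplications, so the number of threshold-level multipliers triples per level, as noted above.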
Here we present an analysis for estimating the number of LUTs required by the field multiplier. In Equation 5.38, it can be seen that the splitting of a and b introduces several smaller operands, of which (ah + al) and (bh + bl) require addition operations (bitwise exclusive-or) and thus require LUTs. We call such operands overlapped operands. We assume the LUT size k is at least 4. For an m-bit Karatsuba multiplier with m > τ, splitting the m-bit operands requires 2·⌈m/2⌉ LUTs, while the combination stage has an area requirement of 2m − 1 LUTs. The total LUT requirement for the recursive Karatsuba multiplier of size m > τ is given by the recursive formula:

#LUT_hkmul(m) = 2·LUT_hkmul(⌈m/2⌉) + LUT_hkmul(⌊m/2⌋) + 2·⌈m/2⌉ + 2m − 1        (5.39)
When the multiplication size reaches τ, classical multipliers are used. The total number of LUTs for a τ-bit classical multiplier is given by the following formula:

#LUT_sbmul = 2·Σ_{i=1}^{τ−1} lut(2i) + lut(2τ)        (5.40)
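Equations 5.39 and 5.40 can be evaluated directly. In the sketch below, the helper lut(n) — the k-LUT count of an n-input XOR tree — is modeled as ⌈(n−1)/(k−1)⌉; this model, and the values K = 4 and TAU = 8, are assumptions of the sketch rather than figures from the text.

```python
import math
from functools import lru_cache

K = 4      # LUT input size (k >= 4 assumed)
TAU = 8    # threshold below which classical multipliers are used

def lut(n: int) -> int:
    """k-LUTs for an n-input XOR tree: ceil((n-1)/(k-1)). (Assumed model.)"""
    return 0 if n <= 1 else math.ceil((n - 1) / (K - 1))

def lut_classical(tau: int) -> int:
    """Equation 5.40: LUTs of a tau-bit school-book multiplier."""
    return 2 * sum(lut(2 * i) for i in range(1, tau)) + lut(2 * tau)

@lru_cache(maxsize=None)
def lut_hkmul(m: int) -> int:
    """Equation 5.39: two ceil(m/2)-bit and one floor(m/2)-bit sub-multiplier,
    plus splitting (2*ceil(m/2) LUTs) and combining (2m - 1 LUTs)."""
    if m <= TAU:
        return lut_classical(m)
    return (2 * lut_hkmul(math.ceil(m / 2)) + lut_hkmul(m // 2)
            + 2 * math.ceil(m / 2) + 2 * m - 1)
```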
Now we present an analysis of the delay of the combinational bit-parallel hybrid Karatsuba multiplier. As the splitting is recursive, a chain of overlapped operands is generated, and the critical delay path of the splitting stage runs through the longest such chain. Fig. 5.14 shows the first two steps of the recursive splitting of overlapped operands for a 256-bit input operand. After the first round, the overlapped operand size reduces to 128 bits and each bit is the sum of two input bits. Similarly, after the second round of splitting, the overlapped operands are 64 bits wide and each bit of the new overlapped operand is the sum of four input bits. If the threshold operand size is τ bits, then after r = log2⌈256/τ⌉ rounds the operand reduces to a size of τ, and each bit of the threshold overlapped operand is the sum of 2^r input bits. Thus, from Equation 5.37, the maximum LUT delay of the splitting stage is Dsp = ⌈logk(2^r)⌉ = ⌈logk(m/τ)⌉.
The threshold multipliers follow the classical multiplication algorithm and have the LUT delay

Dth = maxlutpath(2τ) = ⌈logk(2τ)⌉        (5.41)

The Combine Multiplication Outputs block in Fig. 5.13 has ⌈log2(m/τ)⌉ levels, and each level has a delay of one LUT for k ≥ 4. So, the delay of this stage is Dcom = ⌈log2(m/τ)⌉.
After the polynomial multiplication, the modular reduction circuit is used to reduce the 2m − 2 bit output of the multiplier to m bits. The delay of the reduction circuit Dmod depends on the irreducible polynomial: Dmod is 1 for trinomials and 2 for pentanomials, considering 4 ≤ k ≤ 6.
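The four stage delays combine into a simple estimator. The sketch below implements Dsp, Dth, and Dcom with an exact integer ceiling-log (avoiding floating-point error) and adds Dmod = 1 or 2 for a trinomial or pentanomial, as stated above; it assumes 4 ≤ k ≤ 6.

```python
def ilog_ceil(x: float, base: int) -> int:
    """Smallest integer d with base**d >= x (exact integer ceil-log)."""
    d, v = 0, 1
    while v < x:
        v *= base
        d += 1
    return d

def hkm_lut_delay(m: int, tau: int, k: int = 4, pentanomial: bool = False) -> int:
    """LUT-delay estimate of the hybrid Karatsuba multiplier."""
    d_sp = ilog_ceil(m / tau, k)        # splitting stage:  ceil(log_k(m/tau))
    d_th = ilog_ceil(2 * tau, k)        # threshold stage:  ceil(log_k(2*tau))
    d_com = ilog_ceil(m / tau, 2)       # combine stage:    ceil(log2(m/tau))
    d_mod = 2 if pentanomial else 1     # modular reduction, 4 <= k <= 6
    return d_sp + d_th + d_com + d_mod
```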
The delay of the entire hybrid Karatsuba multiplier is the sum of these stage delays: Dhkm = Dsp + Dth + Dcom + Dmod.
The number of LUTs required by the 2^n exponentiation circuit is almost twice the field size m for 4 ≤ k < 6 and almost equal to m for k ≥ 6, while its maxlutpath is 2. Similarly, Equation 5.37 gives the delay per output bit di in terms of LUTs. Since all output bits are computed in parallel, the delay of the 2^n circuit is the maximum of the LUT delays of all the di output bits and is given by,
The delay of the MUX in terms of LUTs is equal to the maxlutpath of 2^s + s variables and is given by,
If 2^(s−1) < (number of inputs) < 2^s, then the estimates in (5.43) and (5.44) for 2^s inputs give an upper bound. In practice, the values in this case are slightly smaller than the values for 2^s inputs in (5.43) and (5.44), and the difference can be neglected.

#LUT_MUXP = m × lut(2^r + r)
Since there are us cascaded 2^n circuits, the delay of the entire cascade is us times the delay of a single 2^n circuit. The delay of the multiplexer in the Powerblock is added to this to obtain the total delay of the Powerblock:

DPATH2 = DMUXC + DPowblk ≈ 1 + us × D2n + ⌈logk(us + log2(us))⌉
A finite state machine drives the control signals. Each control signal depends only on the value of the state variable, so it can be assumed that the delay in generating the control signals depends on the number of bits in the state register. The state variable counts up to #ClockCycles; assuming a binary counter, the state variable has log2(#ClockCycles) bits. The delay in the control signals is therefore

DCntrl = ⌈logk(log2(#ClockCycles))⌉

This delay is added to the path delay. The total delay of the ITA in terms of LUTs is

DITA = max(DPATH1, DPATH2) + DCntrl        (5.47)
For a given field GF(2^m) and k-LUT based FPGA, the Powerblock can be configured with different exponentiation circuits (2^n) and numbers of cascades (us). An increase in us decreases the clock cycles required at the cost of an increase in area and in delay through the Powerblock. To get maximum performance, us should be chosen such that it minimizes clock cycles without increasing the delay (DITA) or area (LUTITA) significantly. DPATH1 does not depend on n or us and is constant. From Equation 5.47, it follows that DITA is minimum when

DPATH1 ≈ DPATH2.        (5.48)

When Equation 5.48 is satisfied, we can assume that DITA in Equation 5.47 is DITA = DPATH1 + DCntrl. DCntrl has a small value and depends only weakly on n and us. Since DPATH1 is constant, DITA can be considered constant when Equation 5.48 is satisfied.
In the equation for LUTITA, the multiplier consumes a significant portion of the total area, while the area of a 2^n circuit is small. So, varying n and us subject to Equation 5.48 causes negligible variation in the total area, and we can assume that LUTITA remains almost unaffected when Equation 5.48 is satisfied.
The #ClockCycles changes significantly with n and us. The clock cycle requirement in Equations 5.34 and 6.7 can be approximated as

ClockCycles ≈ (2n − 1) + l + r + ⌈(m − 1)/(n·us)⌉        (5.49)

where 2n − 1 is the clock cycle requirement for the α1 computation and l is the length of the addition chain for ⌊(m − 1)/n⌋. From Equation 5.49, it can be seen that with increasing n the α1 computation grows linearly in n, while the term ⌈(m − 1)/(n·us)⌉ shrinks. When n is small, ⌈(m − 1)/(n·us)⌉ is significant compared to 2n − 1, so ClockCycles reduces significantly with increasing n. For large values of n, the term (2n − 1) dominates ⌈(m − 1)/(n·us)⌉, so ClockCycles increases monotonically with n. This implies that we can restrict n to those values for which we get savings in #ClockCycles. The effect of increasing n on the clock cycles for the α1 computation and on #ClockCycles is shown in Fig. 5.16 for the field GF(2^233).
Minimizing #ClockCycles without increasing DITA leads to maximum performance. The following procedure is used to obtain the best design parameters.

1. For n1 = 1, the number of cascades us1 is determined so that Equation 5.48 is closely satisfied, and the corresponding value of ClockCycles (say ClockCycles_1) is obtained.

2. For a higher value of n, say ni = i with i > 1, the number of cascades usi is determined as in Step 1 and the corresponding value ClockCycles_i is obtained.

3. The second step is repeated for increasing values of ni until ClockCycles_i increases monotonically (as per the previous discussion) with ni.

The best design for the Powerblock is the configuration (n̂, ûs) which gives the minimum ClockCycles. For example, when m = 233 and k = 4, (2, 8) and (4, 5) give the minimum value of ClockCycles and satisfy Equation 5.48 closely. In the next section we support our theoretical estimates with experimental results.
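The trade-off of Equation 5.49 can be explored numerically. In the sketch below, the addition-chain length l is approximated by the binary (square-and-multiply) chain length and the small term r of Equation 5.49 is omitted; both are simplifications for illustration, so the sketch reproduces the trend of Fig. 5.16 rather than exact cycle counts.

```python
def chain_len_binary(e: int) -> int:
    """Upper bound on the addition-chain length for e via the binary method."""
    return (e.bit_length() - 1) + bin(e).count("1") - 1

def clock_cycles(m: int, n: int, us: int) -> int:
    """Equation 5.49 without the small constant r (assumed negligible here)."""
    l = chain_len_binary((m - 1) // n)
    return (2 * n - 1) + l + -(-(m - 1) // (n * us))   # last term: ceil division

def best_n(m: int, us: int, n_max: int = 16) -> int:
    """Sweep n and return the value minimizing the approximate cycle count."""
    costs = {n: clock_cycles(m, n, us) for n in range(1, n_max + 1)}
    return min(costs, key=costs.get)
```

For m = 233 the cycle count first falls and then rises with n, matching the behavior described above.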
Here C is a constant which depends on the device technology. In our experiments we used the 4vlx80ff1148-11 and xc5vlx220-2-ff1760 FPGAs, which have 4-LUTs and 6-LUTs, respectively. For the 4-LUT FPGA we experimentally found C = 0.2, while for the 6-LUT FPGA, C = 0.1.
Table 5.10 shows comparisons between the theoretical and experimental results for quad-based ITA in GF(2^233) with different numbers of cascaded quad blocks on 4-LUT based FPGAs. The table shows that our estimated number of cascades ûs = 8 gives the best performance for n̂ = 2. The graphs in Fig. 5.17 and Fig. 5.18 show performance variation with exponentiations on 4- and 6-LUT based FPGAs for the trinomial field GF(2^233) and the pentanomial fields GF(2^163) and GF(2^283) [393]. In the graphs, Ex denotes performance obtained by experiments and Th denotes theoretical estimates. From the graphs, it can be seen that the experimental and theoretical performances follow the same trend; the small differences occur due to the unpredictability of FPGA routing.

TABLE 5.10: Theoretical and Experimental Comparison for Different Numbers of Cascades of a 2^2 based ITA in the field GF(2^233)
Table 5.11 shows the best-performing configurations for some of the NIST recommended binary fields [393] and GF(2^193). The exponentiation circuit (ITA type) and the number of cascades listed in the table give the best performance for the respective field. It can be seen that in most cases exponentiations higher than quad give the best results.
Experimental and theoretical results show that the performance of the ITA architecture improves significantly on 6-LUT based FPGAs compared to 4-LUT based FPGAs. This is due to the lower LUT requirement and better routing on 6-LUT based FPGAs.
Table 5.12 shows a comparison of performance with existing ITA architectures for the field GF(2^193). All results are taken on the same Xilinx Virtex E platform. Our design is a 2^3 (octet) based ITA with 8 exponentiation circuits, since for GF(2^193) on Virtex E our estimation model found (n̂ = 3, ûs = 8). It is clear from Table 5.12 that the 2^3 (octet) based ITA gives the best performance.
5.18 Conclusions
In this chapter, we study some design approaches for implementing the basic finite field operations, namely multiplication and inversion. We focus on the Karatsuba multiplier and the Itoh-Tsujii inversion algorithm for their speed. Various design challenges are discussed, keeping in mind the underlying LUT structure of FPGAs. We observe that for Karatsuba multipliers, proper thresholding and the selection of suitable algorithms help in obtaining efficient designs. In the context of the Itoh-Tsujii inversion circuit, we observe that higher powers often give circuits which utilize the LUTs better; however, there is a limiting condition beyond which the advantage is lost. In the next chapter, we discuss the design of an efficient elliptic curve cryptosystem using the above primitives.
6.1 Introduction
This chapter presents the construction of an elliptic curve crypto processor (ECCP) for the NIST specified curve [393] given in Equation 6.1 over the binary finite field GF(2^233).

y^2 + xy = x^3 + ax^2 + b        (6.1)

The processor implements the double-and-add scalar multiplication algorithm described in Algorithm 2.3. The processor (Fig. 6.1) is capable of performing the elliptic curve operations of
point addition and point doubling. Point doubling is done at every iteration of the loop in
Algorithm 2.3, while point addition is done for every bit set to one in the binary expansion
of the scalar input k. The output produced as a result of the scalar multiplication is the
product kP . Here, P is the basepoint of the curve and is stored in the ROM in its affine form.
At every clock cycle, the register bank (regbank) containing dual ported registers feed the
arithmetic unit (AU) through five buses (A0, A1, A2, A3, and Qin). At the end of the clock
cycle, results of the computation are stored in registers through buses C0, C1 and Qout.
There can be at most two results produced every clock. The control signals (c[0] · · · c[32]), generated every clock cycle depending on the elliptic curve operation, control the data flow and the computation performed. Details about the processor, the flow of data on the buses, the computations performed, etc., are elaborated in the following sections.
The scalar multiplication implemented in the processor of Fig. 6.1 is done using the López-Dahab (LD) projective coordinate system. The LD coordinate form of the elliptic curve over binary finite fields is

Y^2 + XYZ = X^3 + aX^2·Z^2 + bZ^4        (6.2)
In the ECCP, a is taken as 1, while b is stored in the ROM along with the basepoint P .
Equations for point doubling and point addition in LD coordinates are shown in Equations
2.31 and 2.32, respectively.
During the initialization phase, the curve constant b and the basepoint P are loaded from the ROM into the registers, after which there are two computational phases. The first phase multiplies the basepoint P by the scalar k; the result it produces is in projective coordinates. The second phase converts the projective result of the first phase into the affine point kP, and mainly involves an inverse computation. The inverse is computed using the quad Itoh-Tsujii inversion algorithm, which is the special case n = 2 of the generalized Itoh-Tsujii inversion proposed in Algorithm 5.4.
The next section describes the ECCP in detail. Section 6.3 describes the implementation of the elliptic curve operations in the processor. Section 6.4 presents the finite state machine that implements Algorithm 2.3. Section 6.5 has the performance results, while the final section has the conclusion.

FIGURE 6.1: Block diagram of the ECCP: the register bank (regbank) feeds the arithmetic unit over the buses A0, A1, A2, A3, and Qin, and results return over C0, C1, and Qout, under the control signals c[0] to c[32].

(Figure: detail of the register bank, showing registers RA1, RA2, and RB1–RB4 with their input multiplexers, driven by C0, C1, and Qin, and output multiplexers, driving A0, A2, and Qin, selected by control bits c[10] to c[32].)
TABLE 6.1: Utilization of the Registers in the Register Bank

Register  Description
RA1       1. During initialization it is loaded with Px.
          2. Stores the x coordinate of the result.
          3. Also used for temporary storage.
RA2       Stores Px.
RB1       1. During initialization it is loaded with Py.
          2. Stores the y coordinate of the result.
          3. Also used for temporary storage.
RB2       Stores Py.
RB3       Used for temporary storage.
RB4       Stores the curve constant b.
RC1       1. During initialization it is set to 1.
          2. Stores the z coordinate of the projective result.
          3. Also used for temporary storage.
RC2       Used for temporary storage.
The input to the register file is either the arithmetic unit outputs, the curve constant (b of Equation 6.2), or the basepoint P = (Px, Py). Multiplexers MUXIN1, MUXIN2, and MUXIN3 determine which of the three inputs gets stored into the register banks. Further, bits in the control word select a register, and enable or disable a write operation to a particular register bank. Multiplexers MUXOUT1, MUXOUT2, MUXOUT3, and MUXOUT4 determine which output of a register bank gets driven onto the output buses. Table 6.1 shows how each register in the bank is utilized.
(Figure: datapath of the arithmetic unit (AU), built around the Karatsuba multiplier, squarer blocks, the quadblock, and the multiplexers MUX A, MUX B, MUX C, and MUX D that drive the outputs C0 and C1.)

The control lines c[26] to c[29] are used as the select lines of the multiplexer in the quadblock (Fig. 5.11). The remaining control bits are used in the register file to read and write data to the registers. Section 6.4 has the detailed list of all control words generated.
Z3 = X1^2 · Z1^2
X3 = X1^4 + b · Z1^4        (6.3)
Y3 = b · Z1^4 · Z3 + X3 · (a · Z3 + Y1^2 + b · Z1^4)

This doubling operation is mapped to the elliptic curve hardware using Algorithm 6.1.
TABLE 6.3: Inputs and Outputs of the Register File for Point Doubling
Clock A0 A1 A2 A3 C0 C1
1 RA1 RC1 - - RC1 RB3
2 - RB4 RB3 - RB3
3 RA1 RB3 RB1 RC1 RC2 RA1
4 RB3 RC1 - RC2 RB1 -
On the ECCP, the LD doubling algorithm can be parallelized to complete in four clock
cycles as shown in Table 6.2 [328]. The parallelization is based on the fact that the multiplier
is several times more complex than the squarer and adder circuits used. So, in every clock
cycle the multiplier is used and it produces one of the outputs of the AU. The other AU
output is produced by additions or squaring operations alone.
Table 6.3 shows the data held on the buses at every clock cycle, as well as where the results are stored. For example, in clock cycle 1, the contents of registers RA1 and RC1 are placed on buses A0 and A1, respectively. The control lines of MUXA and MUXB in the AU are set such that A0^2 and A1 are fed to the multiplier. The output multiplexers MUXC and MUXD are set such that M and A1^4 are sent on buses C0 and C1; these are stored in registers RC1 and RB3, respectively. Effectively, the computation done by the AU is RC1 = RA1^2 · RC1^2 and RB3 = RC1^4. The subsequent operations required for doubling, as listed in Table 6.2, are performed similarly.
A = y2 · Z1^2 + Y1
B = x2 · Z1 + X1
C = Z1 · B
D = B^2 · (C + a · Z1^2)
Z3 = C^2
E = A · C        (6.4)
X3 = A^2 + D + E
F = X3 + x2 · Z3
G = (x2 + y2) · Z3^2
Y3 = (E + Z3) · F + G
The equation for adding an affine point to a point in LD projective coordinates was shown in Equation 2.32 and is repeated here as Equation 6.4. It adds two points P = (X1, Y1, Z1) and Q = (x2, y2), where Q ≠ ±P. The resulting point is P + Q = (X3, Y3, Z3).
The addition operation is mapped to the elliptic curve hardware using Algorithm 6.2; note that a is taken as 1. On the ECCP, the operations in Algorithm 6.2 are scheduled efficiently to complete in eight clock cycles [328]. The scheduled operations for point addition are shown in Table 6.4, and the inputs and outputs of the registers at each clock cycle are shown in Table 6.5.
TABLE 6.5: Inputs and Outputs of the Register Bank for Point Addition
Clock A0 A1 A2 A3 C0 C1
1 RB2 RC1 RB1 - RB1 -
2 RA1 RC1 RA2 - RA1 -
3 RA1 - - RC1 RB3 -
4 RA1 RC1 RB3 - RA1 -
5 RA1 RB3 RB1 - RC2 RA1
6 RA1 RB3 RA2 - RC1 RB3
7 RB2 RC1 RA2 - RB1 -
8 RB3 RC1 RB1 RC2 RB1 -
(Figure: state transition diagram of the ECCP control unit: the initialization states Init1–Init3 detect the leading 1 of the scalar; the doubling states D1–D4 are followed by the addition states A1–A8 when ki = 1; the inversion and conversion states I1–I24 run after the complete signal is raised.)

TABLE 6.7: Control Words c[0] to c[32] Generated at Each State (x denotes don't care)
D1 x x x x 0 0 1 x 0 0 x 0 1 x 0 1 x x 1 0 0 x 0 1 0 0 0 0 0 1 0 0 1
D2 x x x x 0 0 0 x x 1 0 x 0 x x 1 1 1 1 0 0 x x x x 0 0 0 0 0 0 1 0
D3 x x x x 0 0 x 1 0 1 0 0 1 0 1 0 1 0 0 0 1 x 0 1 1 0 0 1 0 0 1 0 0
D4 x x x x 0 0 0 x 0 0 x 1 0 1 0 1 1 0 0 0 0 x x x x 1 1 0 0 0 0 0 0
A1 x x x x 0 0 0 x 0 0 0 1 0 x 0 1 0 1 0 0 0 x x x x 0 1 0 0 1 0 0 0
A2 x x x x 0 0 x 1 0 0 1 0 0 x 0 0 x x 0 0 1 1 0 0 0 x x 0 0 0 0 1 0
A3 x x x x 0 0 x x 0 0 x 0 0 0 x 1 x x 1 0 0 x 0 x x 0 0 1 0 1 0 0 0
A4 x x x x 0 0 x 0 0 0 0 0 0 1 0 0 x x 1 0 1 x 0 x x 0 0 0 1 0 0 0 1
A5 x x x x 0 0 x 1 0 1 0 0 1 x 1 0 1 0 0 0 1 x 0 0 1 0 0 0 0 0 0 1 0
A6 x x x x 0 0 1 x 0 1 1 0 1 x 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0
A7 x x x x 0 0 0 x 0 0 1 1 0 x 0 1 0 1 0 0 0 1 x x x 0 0 0 0 1 0 1 1
A8 x x x x 0 0 0 x 0 0 0 1 0 1 0 1 1 0 0 0 0 x x x x 0 1 0 1 1 0 0 0
I1 x x x x 0 0 x x 0 0 x x 1 x 0 x x x x x 0 x x x x 0 0 0 0 1 1 0 1
I2 x x x x 0 0 0 x 0 0 0 x 0 x 0 1 x x 1 0 0 x x x x 0 0 0 0 0 1 1 0
I3 x x x x 0 0 0 x 0 0 0 x x x 0 1 x x 1 0 0 x x x x 0 0 1 1 0 1 0 1
I4 0 0 1 1 0 1 x x 0 0 0 x 1 x 1 0 x x 1 0 0 x x x x x x x x x x x x
I5 x x x x 0 0 0 x 0 0 0 x 0 x 1 1 x x 1 0 0 x x x x 0 0 0 0 0 0 1 0
I6 x x x x 0 0 0 x 0 0 0 x 0 x 0 1 x x 1 0 0 x x x x 0 0 1 1 0 1 0 1
I7 0 1 1 1 0 1 x x 0 0 0 x 1 x 1 0 x x 1 0 0 x x x x x x x x x x x x
I8 x x x x 0 0 0 x 0 0 x x 0 x 1 1 x x 1 0 0 x x x x 0 0 0 0 0 0 1 0
I9 1 1 1 0 0 1 x x 0 0 0 x 1 x 1 0 x x 1 0 0 x x x x x x x x x x x x
I10 x x x x 0 0 0 x 0 0 x x 0 x 1 1 x x 1 0 0 x x x x 0 0 0 0 0 0 1 0
I11 x x x x 0 0 0 x 0 0 0 x 0 x 0 1 x x 1 0 0 x x x x 0 0 1 1 0 1 0 1
I12 1 1 1 0 0 1 x x 0 0 0 x 1 x 1 0 x x 1 0 0 x x x x x x x x x x x x
I13 1 1 1 0 0 1 x x 1 0 0 x 1 x 1 0 x x x x 0 x x x x x x x x x x x x
I14 x x x x 0 0 0 x 0 0 0 x 0 x 1 1 x x 1 0 0 x x x x 0 0 1 1 1 0 1 0
I15 1 1 1 0 0 1 x x 0 0 0 x 1 x 1 0 x x 1 0 0 x x x x x x x x x x x x
I16 1 1 1 0 0 1 x x 1 0 0 x 1 x 1 0 x x x x 0 x x x x x x x x x x x x
I17 1 1 1 0 0 1 x x 1 0 0 x 1 x 1 0 x x x x 0 x x x x x x x x x x x x
I18 1 1 1 0 0 1 x x 1 0 0 x 1 x 1 0 x x x x 0 x x x x x x x x x x x x
I19 0 0 1 0 0 1 x x 1 0 0 x 1 x 1 0 x x x x 0 x x x x x x x x x x x x
I20 x x x x 0 0 0 x 0 0 0 x 0 x 1 1 x x 1 0 0 x x x x 0 0 0 0 0 0 1 0
I21 x x x x 0 0 0 x 0 1 x x 1 x 0 0 1 0 x x 0 x x x x 1 0 x x x x x x
I22 x x x x 0 0 x 0 0 0 x 0 0 x 0 0 x x x x 1 x 0 x x 0 0 0 0 0 0 0 0
I23 x x x x 0 0 0 x 0 0 x 1 0 x 0 1 0 0 x x 0 x x x x 0 0 0 0 1 0 0 0
I24 x x x x 0 0 0 x 0 0 0 0 0 x x 0 x x 0 0 0 x 0 x x x x x x x x x x
During the initialization states, the curve constant and basepoint coordinates are loaded from ROM into the registers (Table 6.6). These states also detect the leading 1 in the scalar key k.
After initialization, the scalar multiplication is done. This consists of 4 states for doubling
and 8 for the point addition. The states that do the doubling are D1 · · · D4. In state D4,
a decision is made depending on the key bit ki (i is a loop counter initially set to the
position of the leading one in the key, and ki is the ith bit of the key k). If ki = 1 then
a point addition is done and state A1 is entered. If ki = 0, the addition is not done and
the next key bit (corresponding to i − 1) is considered. If ki = 0 and there are no more
key bits to be considered then the complete signal is issued and it marks the end of the
scalar multiplication phase. The states that do the addition are A1 · · · A8. At the end of the
addition (state A8) state D1 is entered and the key bit ki−1 is considered. If there are no
more key bits remaining the complete signal is asserted. Table 6.7 shows the control words
generated at every state.
At the end of the scalar multiplication phase, the result obtained is in projective coordinates, with the X, Y, and Z coordinates stored in the registers RA1, RB1, and RC1, respectively. To convert the projective point to affine form, the following equations are used:

x = X · Z^(−1)
y = Y · (Z^(−1))^2        (6.5)
The inverse of Z is obtained using the quad-ITA discussed in Algorithm 5.3. The addition
chain used is the Brauer chain in Equation 5.24. The processor implements the steps given
in Table 5.7. Each step in Table 5.7 gets mapped into one or more states from I1 to I21. The
number of clock cycles required to find the inverse is 21. This is fewer than the clock cycles estimated by Equation 5.30, because the inverse can be implemented more efficiently in the ECCP by utilizing the squarers present in the AU.
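The structure of the Itoh-Tsujii inversion can be seen at toy scale. The sketch below works in GF(2^8) with the polynomial x^8 + x^4 + x^3 + x + 1 and the Brauer chain (1, 2, 3, 6, 7) for m − 1 = 7 — an illustrative field and chain, not the GF(2^233) quad-ITA of the processor or the chain of Equation 5.24.

```python
M, POLY = 8, 0x11B   # GF(2^8) with x^8 + x^4 + x^3 + x + 1 (illustrative field)

def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^M) with interleaved modular reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_sq_n(a: int, n: int) -> int:
    """Apply the squaring (Frobenius) map n times: a -> a^(2^n)."""
    for _ in range(n):
        a = gf_mul(a, a)
    return a

def ita_inverse(a: int) -> int:
    """Itoh-Tsujii: a^(-1) = (beta_{M-1})^2, where beta_i = a^(2^i - 1) is
    built along the addition chain (1, 2, 3, 6, 7) for M - 1 = 7."""
    beta = {1: a}
    for i, j in [(1, 1), (2, 1), (3, 3), (6, 1)]:
        beta[i + j] = gf_mul(gf_sq_n(beta[i], j), beta[j])  # beta_{i+j} = beta_i^(2^j) * beta_j
    return gf_sq_n(beta[M - 1], 1)       # final squaring: a^(2^M - 2) = a^(-1)
```

Each chain step costs one multiplication plus a run of squarings, which is exactly the work the Powerblock cascade parallelizes in the processor.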
At the end of state I21, the inverse of Z is present in the register RC1 . The states I22
and I23 compute the affine coordinates x and y, respectively.
The number of clock cycles required for the ECCP to produce the output is computed as follows. Let the scalar k have length l and Hamming weight h. Then the clock cycles required to produce the output are given by

#ClockCycles = 3 + 12(h − 1) + 4(l − h) + 24 = 15 + 8h + 4l        (6.6)

Three clock cycles are added for the initial states, and 24 clock cycles are required for the final projective-to-affine conversion. 12(h − 1) cycles are required to handle the 1's in k; note that the MSB of k does not need to be considered. 4(l − h) cycles are required for the 0's in k.
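Equation 6.6 is easy to check in software; the sketch below computes l and h from the scalar and applies the cycle breakdown described above.

```python
def eccp_clock_cycles(k: int) -> int:
    """Clock cycles for computing kP on the ECCP (Equation 6.6)."""
    l = k.bit_length()             # length of the binary expansion of k
    h = bin(k).count("1")          # Hamming weight of k
    # 3 initialization cycles, 12 cycles (double + add) per 1-bit after the MSB,
    # 4 cycles (double only) per 0-bit, and 24 cycles for projective-to-affine.
    return 3 + 12 * (h - 1) + 4 * (l - h) + 24
```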
FIGURE 6.5: Block Diagram of the Processor Organization showing the critical path
The FPGA slices contain flip-flops alongside the LUTs; by modifying the design to use these flip-flops, no additional area (in terms of the number of slices) is required. These available flip-flops provide an opportunity for improving the performance of the design through pipelining.
In this section, we discuss how to pipeline this datapath properly to increase the clock frequency. Ideally, an L-stage pipeline can boost the clock frequency up to L times. To achieve the maximum effectiveness of the pipeline, the design should be partitioned into L equal stages; that is, each stage of the pipeline should have the same delay. To date, however, the usual means of achieving this is trial and error. Here we use the theoretical model for FPGA designs developed in Chapter 3 to estimate the delay in the critical path and thereby find the ideal pipelining. As L increases, there are likely to be more data dependencies in the computations, resulting in more stalls (bubbles) in the pipeline. Thus, suitable scheduling strategies for the Montgomery scalar multiplication algorithm are required to develop an efficient method for pipelining the processor [249].
All combinational data paths in the ECM start from the register bank output and end
at the register bank input. The maximum operable frequency of the ECM is dictated by the
longest combinational path, known as the critical path. There can be several critical paths,
one such example is highlighted through the dashed line in Fig. 6.5.
Estimating Delays in the ECM: Let t*_cp be the delay of the critical paths and f*_1 = 1/t*_cp the maximum operable frequency of the ECM prior to pipelining. If the ECM is pipelined into L stages, the maximum operable frequency can be increased to at most f*_L = L × f*_1. This ideal frequency can be achieved if and only if the following two conditions are satisfied.

1. Every critical path in the design is split into L stages, with each stage having a delay of exactly t*_cp/L.

2. Every other path in the design is split so that each of its stages has a delay less than or equal to t*_cp/L.
While it is not always possible to exactly obtain f*_L, we can achieve close to the ideal clock frequency by making a theoretical estimate of t*_cp and then identifying the locations in the architecture where the pipeline stages have to be inserted. We denote this theoretical estimate of the delay by t#_cp. The theoretical analysis is based on the following propositions [338], which were stated earlier in the discussion of high-speed inversion circuits.
Fact 1 For circuits which are implemented using LUTs, the delay of a path in the circuit
is proportional to the number of LUTs in the path.
Fact 2 The number of LUTs in the critical path of an n-variable Boolean function of the form y = gn(x1, x2, · · · , xn) is given by ⌈logk(n)⌉, where k is the number of inputs to the LUTs (k-LUT).
Using these two facts it is possible to analyze the delay of various combinational circuit
components in terms of LUTs: namely the adder, multiplexers, exponentiation circuit,
Powerblock, Modular Reduction, and the hybrid bit-parallel Karatsuba multiplier (denoted
as HBKM). The field multiplier is the central part of the arithmetic unit. We choose to use
a hybrid bit-parallel Karatsuba field multiplier (HBKM), which was first introduced in [316]
and then used in [314]. The advantage of the HBKM is the sub-quadratic complexity of the Karatsuba algorithm coupled with efficient utilization of the FPGA's LUT resources. Further, the bit-parallel scheme requires fewer clock cycles than the digit-level multipliers used in [37]. The HBKM recursively splits the input operands until a threshold (τ) is reached, after which threshold (school-book) multipliers are applied. The outputs of the threshold multipliers are combined and then reduced (Fig. 6.6).
Field inversion is performed by a generalization of the Itoh-Tsujii inversion algorithm for FPGA platforms [338]. The generalization requires a cascade of 2^n exponentiation circuits (implemented as the Powerblock in Fig. 6.5), where 1 ≤ n ≤ m − 1. The ideal choice of n depends on the field and the FPGA platform; for example, in GF(2^163) on FPGAs with 4-input LUTs (such as the Xilinx Virtex 4), the optimal choice is n = 2. More details on choosing n can be found in [164]. The number of cascades, us, depends on the critical delay of the ECM and will be discussed in Section 6.7. Further, an addition chain for ⌊(m − 1)/n⌋ is required; for GF(2^163) and n = 2, this means an addition chain for 81. The number of clock cycles required for inversion, assuming a Brauer chain, is given by Equation 6.7, where the addition chain has the form (u1, u2, · · · , ul) and L is the number of pipeline stages in the ECM [69].
cc_ita = L(l + 1) + Σ_{i=2}^{l} ⌈(ui − ui−1)/us⌉        (6.7)
The LUT delays of the relevant combinational components are summarized in Table 6.8; the reader is referred to [338] for a detailed analysis of the LUT delays. The LUT delays of all components in Fig. 6.5 are shown in parentheses for k = 4. Note that the analysis also considers optimizations by the synthesis tool (such as the merging of the squarer and adder before Mux B in Fig. 6.5, which reduces the delay from 3 to 2).
Pipelining Paths in the ECM: Table 6.8 can be used to determine the LUT delay of any path in the ECM. For the example critical path (the dashed line in Fig. 6.5), the estimate of t*_cp is the sum of the LUT delays of each component in the path, which evaluates to t#_cp = 23. Fig. 6.7 gives a detailed view of this path.

(Figure 6.7: detailed view of the example critical path, from the register bank through the 4:1 output mux, squarer, Mux B (merged squarer and adder), the HBKM, Mux D, squarer, and 4:1 input mux back to the register bank, with the pipeline register positions marked.)

Pipelining the paths in the ECM requires pipeline registers to be introduced between the LUTs. The following fact determines how the pipeline registers have to be inserted in a path in order to achieve a maximum operable frequency (f#_L) as close to the ideal (f*_L) as possible (note that f#_L ≤ f*_L).
Fact 3 If t#_cp is the LUT delay of the critical paths and L is the desired number of pipeline stages, then the best clock frequency (f#_L) is achieved only if no path has a delay of more than ⌈t#_cp/L⌉.
For example, for L = 4, no path should have a LUT delay of more than ⌈23/4⌉ = 6. This identifies the exact locations in the paths where pipeline registers have to be inserted. Fig. 6.7 shows the positions of the pipeline registers for L = 4 on the critical path.
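Fact 3 determines the register positions mechanically. The sketch below computes the stage bound ⌈t_cp/L⌉ and the LUT-level cut positions for a path; the per-component breakdown of the 23-LUT example path is read off the annotations of Fig. 6.7 and is an assumption of the sketch.

```python
import math

def pipeline_cuts(delays, L):
    """Given per-component LUT delays along a path and a target of L pipeline
    stages, return the stage bound ceil(t_cp / L) and the LUT-level positions
    (counted from the start of the path) where pipeline registers are placed.
    Registers may fall inside a multi-LUT component such as the HBKM."""
    t_cp = sum(delays)
    bound = math.ceil(t_cp / L)
    cuts = list(range(bound, t_cp, bound))
    return bound, cuts

# Assumed breakdown of the 23-LUT example critical path of Fig. 6.7:
# output mux, squarer, merged squarer/adder, HBKM, Mux D, squarer, input muxes.
path = [2, 2, 2, 11, 2, 2, 1, 1]
```

For this path and L = 4 the bound is 6 and registers fall after LUT levels 6, 12, and 18, so one cut lands inside the 11-LUT HBKM, as the text allows.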
On the Pipelining of the Powerblock: The Powerblock is used only once during the computation, at the end of the scalar multiplication. There are two choices for implementing it: either pipeline the Powerblock as per Fact 3, or reduce the number of 2^n circuits in the cascade so that the following LUT delay condition is satisfied (refer to Table 6.8):

D_powerblock(m) ≤ ⌈t#_cp/L⌉ − 1        (6.8)

where the −1 is due to the output mux in the register bank. However, the sequential nature of the Itoh-Tsujii algorithm [160] means that the result of one step is used in the next. Due to the data dependencies that arise, the algorithm is not suited for pipelining, and hence the latter strategy is favored. For k = 4 and m = 163, the optimal exponentiation circuit is n = 2, having an LUT delay of 2 [338]. Thus a cascade of two 2^2 circuits best satisfies the inequality in (6.8).
In the following portion of the chapter, we outline the efficient scheduling of the Mont-
gomery algorithm for performing the point multiplication.
Xi ← Xi · Zj;   Zi ← Xj · Zi;   T ← Xj;   Xj ← Xj^4 + b · Zj^4        (6.9)
Zj ← (T · Zj)^2;   T ← Xi · Zi;   Zi ← (Xi + Zi)^2;   Xi ← x · Zi + T
Depending on the value of the bit sk, the operand and destination registers for the point operations vary: when sk = 1, i = 1 and j = 2; when sk = 0, i = 2 and j = 1. The final step in the algorithm, Projective2Affine(·), converts the 3-coordinate scalar product into the acceptable 2-coordinate affine form. This step involves a finite field inversion along with 9 other multiplications [249].
For each bit of the scalar (sk), the eight operations in Equation 6.9 are computed. In this architecture we consider a situation where only one finite field multiplier is available. This can be compared with architectures in the literature that use two field multipliers, such as [37]; other architectures, such as [87], use a single multiplier, and we follow the latter approach in the discussed design. This restriction makes the field multiplier the most critical resource in the ECM, as Equation 6.9 involves six field multiplications which have to be done sequentially. The remaining operations, comprising additions, squarings, and data transfers, can be done in parallel with the multiplications. Equation 6.9 can be rewritten as in Table 6.8 using 6 instructions, with each instruction capable of executing simultaneously in the ECM.
Proper scheduling of the 6 instructions is required to minimize the impact of data de-
pendencies, thus reducing pipeline stalls. The dependencies between the instructions e_1^k to
e_6^k are shown in Fig. 6.8(a). In the figure, a solid arrow implies that the subsequent instruc-
tion cannot be started unless the previous instruction has completed, while a dashed arrow
implies that the subsequent instruction cannot be started unless the previous instruction
has started. For example, e_6^k uses Z_i, which is updated in e_5^k. Since the update does not
require a multiplication (an addition followed by a squaring here), it is completed in one
clock cycle. Thus e_5^k to e_6^k has a dashed arrow, and e_6^k can start one clock cycle after e_5^k. On
the other hand, dependencies depicted with a solid arrow involve the multiplier output of
the former instruction; this takes L clock cycles, and therefore a longer wait.
The dependency diagram shows that in the longest dependency chain, e_5^k and e_6^k have
dependencies on e_1^k and e_2^k. Thus e_1^k and e_2^k are scheduled before e_3^k and e_4^k. Since the addition
in e_6^k has a dependency on e_5^k, operation e_5^k is triggered just after completion of e_1^k and e_2^k,
and operation e_6^k is triggered in the next clock cycle. When L ≥ 3, the interval between
the starting and completion of e_1^k and e_2^k can be utilized by scheduling e_3^k and e_4^k. Thus, a
possible scheduling of the 6 instructions is

{e_1^k, e_2^k}, e_3^k, e_4^k, e_5^k, e_6^k    (6.10)
where { } implies that there is no strict order in the scheduling (either e_1 or e_2 can be
scheduled first). An example of a scheduling for L = 3 is shown in Fig. 6.8(b). For L ≥ 3,¹
the number of clock cycles required for each bit in the scalar is 2L + 2. In the next part of
this section we show that the clock cycles can be reduced to 2L + 1 (and in some cases 2L)
if two consecutive bits of the scalar are considered.
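The per-bit clock cycle counts above can be sketched in a few lines (a minimal sketch, not the book's RTL; the helper below simply follows the cycle accounting given in the text, where a multiplication result is ready L cycles after issue):

```python
def cycles_per_bit(L, overlap_next_bit=False):
    # e1 is issued first; e2 (the second multiplication) completes L + 1
    # cycles later.  e5 needs only an addition and a squaring (1 cycle),
    # then e6 starts and its multiplication takes L more cycles.
    total = (L + 1) + 1 + L
    # With overlapped scheduling, the last clock cycle is shared with the
    # first instruction of the next scalar bit.
    return total - 1 if overlap_next_bit else total

assert cycles_per_bit(3) == 8                           # 2L + 2
assert cycles_per_bit(4, overlap_next_bit=True) == 9    # 2L + 1
```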
When the Consecutive Key Bits are Equal: Fig. 6.9(a) shows the data dependencies
when the two bits are equal. The last two instructions to complete for the s_k bit are e_5^k and
e_6^k. For the subsequent bit (s_{k−1}), either e_1^{k−1} or e_2^{k−1} has to be scheduled first according
to the sequence in (6.10). We see from Fig. 6.9(a) that e_1^{k−1} depends on e_6^k, while e_2^{k−1}
depends on e_5^k. Further, since e_5^k completes earlier than e_6^k, we schedule e_2^{k−1} before e_1^{k−1}.
Thus the scheduling for 2 consecutive equal bits is
¹ The special case of L ≤ 2 can trivially be analyzed. The number of clock cycles required in this case is six.
[Fig. 6.9(b): scheduling chart over clock cycles 1–15 for two consecutive equal scalar bits.
After e_5^k and e_6^k of bit s_k, the instructions e_2^{k−1}, e_1^{k−1}, e_3^{k−1}, e_4^{k−1},
e_5^{k−1}, and e_6^{k−1} of bit s_{k−1} are issued; s_{k−1} starts before s_k completes, and
s_{k−2} starts in the last cycle shown.]
When the Consecutive Key Bits are Not Equal: Fig. 6.10(a) shows the data
dependency for two consecutive scalar bits that are not equal. Here it can be seen that e_1^{k−1}
and e_2^{k−1} depend on e_5^k and e_6^k, respectively. Since e_5^k completes before e_6^k, we
schedule e_1^{k−1} before e_2^{k−1}. The scheduling for two consecutive bits is as follows
Effective Clock Cycle Requirement: Starting from e_1^k (or e_2^k), completion of e_2^k (or
e_1^k) takes L + 1 clock cycles for an L-stage pipelined ECM. After the completion of e_1^k and
e_2^k, e_5^k starts. This is followed by e_6^k in the next clock cycle. So in all, 2L + 2 clock cycles
are required. The last clock cycle, however, is also used for the next bit of the scalar, so
effectively 2L + 1 clock cycles are required per bit. Compared to the work in [87], this
scheduling strategy saves two clock cycles for each bit of the scalar; for an m-bit scalar, the
saving in clock cycles compared to [87] is 2m. Certain values of L allow data forwarding to
take place. In such cases, the clock cycles per bit reduce to 2L, thus saving 3m clock cycles
compared to [87].
TABLE 6.10: Computation Time Estimates for Various Values of L for an ECM over
GF(2^163) and an FPGA with 4-input LUTs

  L   u_s   Data Forwarding   cc_3scm   cc_ita   cc_conv   cc_2scm      ct
             Feasible
  1    9        No              978       25       16       1019     1019 t#_cp
  2    4        No              978       44       25       1047      524 t#_cp
  3    3        No             1141       61       34       1236      412 t#_cp
  4    2        Yes            1304       82       43       1429      357 t#_cp
  5    1        No             1793      130       52       1975      395 t#_cp
  6    1        Yes            1956      140       61       2157      360 t#_cp
  7    1        Yes            2282      150       70       2502      358 t#_cp
requires cc_conv clock cycles. The value of cc_conv for the ECM was found to be 7 + 9L. Thus,

    cc_2scm = cc_3scm + [ L(l + 1) + Σ_{i=2}^{l} ⌈(u_i − u_{i−1}) / u_s⌉ ] + [ 7 + 9L ]    (6.11)
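Equation 6.11 can be checked against Table 6.10 directly (a minimal sketch; cc_3scm is taken from the table rather than derived, and the addition chain is the one quoted below for GF(2^163)):

```python
import math

# Evaluate the Itoh-Tsujii and conversion clock-cycle terms of Equation
# 6.11 for the addition chain (1, 2, 4, 5, 10, 20, 40, 80, 81).
chain = [1, 2, 4, 5, 10, 20, 40, 80, 81]

def cc_ita(L, u_s):
    l = len(chain)  # l = 9 terms in the chain
    steps = sum(math.ceil((chain[i] - chain[i - 1]) / u_s)
                for i in range(1, l))
    return L * (l + 1) + steps

def cc_2scm(L, u_s, cc_3scm):
    return cc_3scm + cc_ita(L, u_s) + (7 + 9 * L)   # cc_conv = 7 + 9L

# Row L = 4 of Table 6.10: u_s = 2, cc_3scm = 1304.
assert cc_ita(4, 2) == 82
assert cc_2scm(4, 2, 1304) == 1429
```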
1. Determine t#_cp (the LUT delay of the critical path of the combinational circuit) using
Table 6.8.
2. Compute the maximum operable frequency (from the per-stage LUT delay ⌈t#_cp / L⌉)
and determine the locations of the pipeline registers. From this, determine if data
forwarding is possible.
3. Determine u_s, the number of cascades in the power block, using Equation 6.8 and the
delay of a single 2^n block (Table 6.8).
For an ECM over GF(2^163), with the threshold for the HBKM set to 11, an addition chain of
(1, 2, 4, 5, 10, 20, 40, 80, 81), and 2^2 exponentiation circuits in the power block, t#_cp is 23.
The estimated computation times for various values of L are given in Table 6.10. The cases
L = 1 and L = 2 are special, as for these cc_3scm = 6m. The table clearly shows that the
least computation time is obtained when L = 4.
[Fig. 6.12: the ECM architecture. A control unit drives a register bank (registers X1, X2,
Z1, Z2, A, T, the curve constant b, and the base point x, y in ROM) through multiplexer
select lines and write enables. The arithmetic unit is partitioned into 4 pipeline stages and
contains the HBKM field multiplier (split operands, threshold multiplications, combine
outputs), squarer and quad circuits, and the powerblock built as a cascade of u_s 2^n
circuits; a data forwarding path feeds the multiplier output back.

Fig. 6.13: the finite state machine, with initialization states I0 to I5, point
addition/doubling states AD, states AD^eq and AD^neq entered according to whether
s_{k−1} = s_k, and coordinate conversion states C1 to C125.]
again channel the data into the multiplier. The results are written back into the registers
through buses C0, C1, C2, and Qout. Note the placement of the pipeline registers dividing
the circuit into 4 stages and ensuring that each stage has an LUT delay which is less than
or equal to ⌈23/4⌉ = 6. Note also the pipeline register present immediately after the field
multiplier (HBKM), used for data forwarding.
TABLE 6.11: Comparison of the Proposed ECM with FPGA-based Published Results

  Work                        Platform     Field (m)   Slices     LUTs     Freq (MHz)   Comp. Time (µs)
  Orlando [279]               XCV400E        163          -        3002       76.7          210
  Bednara [43]                XCV1000        191          -       48300       36            270
  Gura [144]                  XCV2000E       163          -       19508       66.5          140
  Lutz [228]                  XCV2000E       163          -       10017       66            233
  Saqib [345]                 XCV3200        191       18314         -        10             56
  Pu [306]                    XC2V1000       193          -        3601      115            167
  Ansari [33]                 XC2V2000       163          -        8300      100             42
  Järvinen¹ [163]             Stratix II     163    (11800 ALMs)     -         -             48.9
  Kim² [195]                  XC4VLX80       163       24363         -       143             10.1
  Chelton [87]                XCV2600E       163       15368      26390       91             33
                              XC4V200        163       16209      26364      153.9           19.5
  Azarderakhsh³ [37]          XC4CLX100      163       12834      22815      196             17.2
                              XC5VLX110      163        6536      17305      262             12.9
  ECCP [314]                  XC4V140        233       19674      37073       60             31
  ECM [320] (Virtex 4 FPGA)   XC4VLX80       163        8070      14265      147              9.7
                              XC4V200        163        8095      14507      132             10.7
                              XC4VLX100      233       13620      23147      154             12.5
  ECM [320] (Virtex 5 FPGA)   XC5VLX85t      163        3446      10176      167              8.6
                              XC5VSX240      163        3513      10195      148              9.5
                              XC5VLX85t      233        5644      18097      156             12.3

  ¹ uses 4 field multipliers;  ² uses 3 field multipliers;  ³ uses 2 field multipliers
Fig. 6.13 shows the finite state machine for L = 4. The states I0 to I5 are used for
initialization (line 2 in Algorithm 19). State AD1 represents the first clock cycle for the
scalar bit s_{t−2}. States AD12 to AD19 represent the computations when s_k = 1, while AD02
to AD09 are for s_k = 0. Each state corresponds to a clock cycle in Fig. 6.11(b). Processing
for the next scalar bit (s_{k−1}) begins in the same clock cycle as AD09 and AD19, in states
AD1^eq and AD1^neq. The state AD1^eq or AD1^neq is entered depending on the equality of
s_k and s_{k−1}: if s_k = s_{k−1} then AD1^eq is entered, else AD1^neq is entered. After processing
of all scalar bits is complete, the conversion to affine coordinates (cc_ita + cc_conv clock
cycles) takes place in states C1 to C125.
multiplier instead of the quadratic Mastrovito multiplier results in a 50% reduction in area on
Virtex 4.
In [37], two highly optimized digit-level field multipliers were used. This enabled parallelization
of the instructions and a higher clock frequency. However, the use of digit-level field multipliers
resulted in a large clock-cycle requirement for the scalar multiplication (estimated at 3380). We
use a single fully parallel field multiplier requiring only 1429 clock cycles, with an area which
is 37% smaller on Virtex 4.
In [195], a computation time of 10.1 µs was achieved, while on the same platform our
ECM achieves a computation time of 9.7 µs. Although the speed gain is minimal, it
should be noted that [195] uses three digit-level finite field multipliers compared to one in
ours, and thus has an area requirement about three times ours. The compact area is
especially useful for cryptanalytic applications, where our ECM can test thrice as many keys
compared to [195].
6.12 Conclusion
This chapter integrates the previously developed finite-field arithmetic blocks to com-
pose the arithmetic unit (AU) for an elliptic curve processor. The AU is used in an elliptic
curve crypto processor to compute the scalar product kP for a NIST-specified curve. The
initial design is further augmented by implementing Montgomery's ladder, pipelining
the architecture in an optimized manner, and suitably scheduling the instructions. The
final design demonstrates that these principles provide one of the best timings reported
on FPGA platforms, along with a compact utilization of the available resources.
7.1 Introduction
As we have seen in the previous chapters, the implementation of cryptographic algorithms
is complex. Ciphers are intricate mathematical constructs, often requiring large-field
arithmetic, multiple iterations, and so on. Hence various techniques are required to make
the designs efficient. The ciphers are designed carefully, with chosen parameters, to thwart
cryptanalysts. Depending on the underlying platform, several hardware and software de-
signs of the same ciphers are developed, to ensure that security comes at a minimal cost
to the user. Though conventional attacks are independent of these implementations,
standard ciphers like AES and ECC are designed in such a fashion that those attacks are
rarely practical. However, there is a form of cryptanalysis, known as side channel
analysis (SCA), which exploits not only the mathematical structure of the ciphers but
also the inherent properties of their implementations. This menacing category of attacks
jeopardizes some of the basic assumptions on which the conventional cryptographer designs
ciphers, and thus requires special consideration in the context of hardware and embedded
security.
Although research in the area of side channel analysis started with the seminal work of
[284] around 1996, several forms of cryptanalysis through other unconventional informa-
tion channels existed much earlier. The idea of stealing information through side channels is
far older than the personal computer. In World War I, the intelligence corps of the warring
nations were able to eavesdrop on one another's battle orders because field telephones of the
day had just one wire and used the earth to carry the return current. Spies connected rods
in the ground to amplifiers and picked up the conversations. During World War II, Bell
Labs was aware of attack methods which would use electromagnetic radiation to reveal
75% of the plaintext that was sent in a secured fashion, from a distance of 80 feet. Later,
around 1985, Wim van Eck published the first unclassified technical analysis of the security
risks of emanations from computer monitors. Van Eck successfully eavesdropped on a real
system, at a range of hundreds of metres, using just $15 worth of equipment plus a television
set. This paper showed the feasibility of such attacks to the security community, which had
previously believed that such monitoring was a highly sophisticated attack available only
to governments. American military scientists began studying these radio waves given off
by computer monitors and launched a program, code-named Tempest, to develop shielding
techniques. Without such shielding, the image being scanned line by line onto the screen
of a standard cathode-ray tube monitor can be reconstructed from a nearby room, or even
an adjacent building, by tuning into the monitor's radio transmissions.
Many people assumed that the growing popularity of flat-panel displays would make
Tempest problems obsolete, because flat panels use low voltages and do not scan images
one line at a time. But in 2003 Markus G. Kuhn, a computer scientist at the University
of Cambridge Computer Laboratory, demonstrated that even flat-panel monitors, including
those built into laptops, radiate digital signals from their video cables, emissions that can
be picked up and decoded from many meters away.
Several side channels and corresponding attacks have been subsequently developed. The
challenge in tackling side channel attacks comes from the fact that they violate the classical
notions of cryptography; conventional cryptographic defenses do not work in hindering
them. As none other than Adi Shamir has observed, it is extremely difficult to envisage
new forms of these attacks. According to him, after years of research into provable properties
of RSA, such as reduction proofs, equivalence to factoring, and the bit security of RSA, the
cipher seemed robust, well studied, and fit for applications like banking where security is
needed. However, the work of Boneh, DeMillo and Lipton in 1996 showed that a single faulty
computation can completely break the scheme by factoring the modulus. Hence, the danger
of side channel attacks stems from the fact that they can expose the fragility of a
mathematically sound cipher and make it unsuitable for applications. We therefore need a
deeper, more thorough and theoretical study of the topic. In the following section, we
present an overview and define side channel attacks.
which was not available to the classical cryptanalyst. Hence, the proofs or guarantees of
security in classical cryptography are no longer valid. Examples of such side channels are
timing, power, electromagnetic radiation, visual, acoustic, cache, and the testability features
of hardware devices, and there may be many more. A very closely related class of attacks is
called fault attacks, where the device, under the induction of faults, performs wrong com-
putations. The adversary uses the correct ciphertexts and the faulty ciphertexts to obtain
the keys. This class of attacks is extremely dangerous and shows that fault tolerance is of
utmost importance for cryptographic hardware.
[Figure: a cryptographic system — cryptographic algorithms, protocols, software
implementations, and human users — together with the side channel attacks mounted on
it: 1. Timing Attacks, 2. Power Attacks, 3. Fault Attacks, ..., 5. Cache Attacks.]
These side channel analysis methods can also be combined with each other, and with
conventional cryptanalysis techniques, to develop even stronger forms of attacks. For exam-
ple, fault attacks are often combined with algebraic attacks to yield a new class of attacks
called Algebraic Side Channel Attacks [323].
In the next sections, we provide a brief overview of the above-mentioned side channel
attacks.
is the time required to perform the multiplication and squaring for the bit i and e includes
measurement error, loop overhead and other sources of inaccuracies.
As stated above, we assume that the attacker knows or has correctly evaluated in the
previous iterations the first b − 1 bits, x[0], · · · , x[b − 2], of the secret exponent x. Now the
attacker guesses x[b − 1]. If the guess is correct, subtracting from T yields

    T_r = e + Σ_{i=0}^{w−1} t_i − Σ_{i=0}^{b−1} t_i = e + Σ_{i=b}^{w−1} t_i.
The attacker obtains a distribution by varying the value of y and observing the above
timing T_r, assuming the modular multiplication times are mutually independent of each other
and of the measurement error. If the guess is correct, the variance reduces, and the observed
variance is Var(e) + (w − b)Var(t). For correctly estimated timings (due to correct guesses of
the exponent bits) the variance decreases. However, if the guess for x[b − 1] is wrong, then
the variance increases; the resultant variance becomes Var(e) + [w − (b − 1) + 1]Var(t) =
Var(e) + [w − b + 2]Var(t). This observed increase in variance can be used as an indicator
to distinguish a wrong estimate of a bit from a correct one.
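The variance distinguisher can be illustrated with a small Monte Carlo sketch (the per-iteration timing model below is illustrative, not a real exponentiation; the key point is that the attacker can recompute t_i exactly once y and the processed key bits are known):

```python
import random

# Monte Carlo sketch of the variance distinguisher.
random.seed(1)
w, b = 16, 8
key_bits = [random.randrange(2) for _ in range(w)]

def t(y, i, bit):
    # Deterministic, data-dependent iteration time (toy model).
    return 5 + hash((y, i, bit)) % 7

def residual_variance(guess, n=3000):
    rs = []
    for trial in range(n):
        y = trial                      # vary the input y
        e = random.gauss(0, 1)         # measurement error
        T = e + sum(t(y, i, key_bits[i]) for i in range(w))
        # Subtract the attacker's recomputation of the first b iterations.
        rs.append(T - sum(t(y, i, guess[i]) for i in range(b)))
    m = sum(rs) / n
    return sum((r - m) ** 2 for r in rs) / n

correct = residual_variance(key_bits[:b])
wrong = residual_variance(key_bits[:b - 1] + [1 - key_bits[b - 1]])
# A correct guess cancels t_{b-1} exactly; a wrong guess leaves a random
# mismatch term, so the observed variance is larger.
assert correct < wrong
```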
shows that in every iteration a squaring is performed. However, if the current bit in the ith
iteration is a one, then a multiplication operation is also performed. Hence, it is expected
that the power consumption in the ith cycle would be more if the key bit is one, compared
to when it is zero. This helps one build a distinguisher to determine the key bits from a
few power traces. In the case of an unprotected and naïve square and multiply implementation,
a single trace is sufficient for most purposes.
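The per-bit operation sequence of a left-to-right square and multiply makes this leak concrete (a minimal sketch; the recorded operation list stands in for a power trace):

```python
# Left-to-right square and multiply; the operation sequence is a stand-in
# for a power trace: an 'M' right after an 'S' marks a 1 bit.
def square_and_multiply(y, x, m, trace):
    z = 1
    for i in reversed(range(x.bit_length())):
        z = (z * z) % m
        trace.append('S')
        if (x >> i) & 1:
            z = (z * y) % m
            trace.append('M')
    return z

trace = []
assert square_and_multiply(7, 0b1011, 1000, trace) == pow(7, 0b1011, 1000)

# Recover the exponent bits from the operation sequence alone.
bits, i = [], 0
while i < len(trace):
    if i + 1 < len(trace) and trace[i + 1] == 'M':
        bits.append(1); i += 2
    else:
        bits.append(0); i += 1
assert bits == [1, 0, 1, 1]
```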
• The duration of an iteration depends on the key bit. A key bit of 0 leads to a short
cycle compared to a key bit of 1. Thus measuring the duration of an iteration will
give an attacker knowledge about the key bit.
• Each state in the FSM has a unique power consumption trace. Monitoring the power
consumption trace would reveal if an addition is done thus revealing the key bit.
To demonstrate the attack we used Xilinx's XPower¹ tool. Given a value change dump
(VCD) file generated from a flattened post-map or post-route netlist, XPower is capable of
generating a power trace for a given testbench.
Fig. 7.5 and Fig. 7.6 are partial power traces generated for the keys (FFFFFFFF)16
and (80000000)16, respectively. The graphs plot the power on the Y axis against the time line
on the X axis for a Xilinx Virtex 4 FPGA. The difference between the graphs is easily noticeable.
The spikes in Fig. 7.5 occur in state A6. This state is entered only when a point addition
is done, which in turn is done only when the key bit is 1. The spikes are not present in
1 http://www.xilinx.com/products/design_tools/logic_design/verification/xpower.htm
FIGURE 7.5: Power Trace for a Key with all 1s. FIGURE 7.6: Power Trace for a Key with all 0s.
Fig. 7.6 as the state A6 is never entered. Therefore the spikes in the trace can be used to
identify ones in the key.
The duration between two spikes in Fig. 7.5 is the time taken to do a point doubling
and a point addition. This is 12 clock cycles. If there are two spikes with a distance greater
than 12 clock cycles, it indicates that one or more zeroes are present in the key. The number
of zeroes (n) present can be determined by Equation 7.1. In the equation t is the duration
between the two spikes and T is the time period of the clock.
    n = t/(4T) − 3    (7.1)
The number of zeroes between the leading one in k and the one due to the first spike can
be inferred by the amount of shift in the first spike.
As an example, consider the power trace (Fig. 7.7) for the ECCP obtained when the
key was set to (B9B9)16. There are 9 spikes, indicating 9 ones in the key (excluding the
leading one). Table 7.1 infers the key from the time duration between spikes. The clock has
a period T = 200 ns.
The first spike t1 is obtained at 3506 ns. If there were no zeros before t1, the spike
should have been present at 2706 ns (this is obtained from the first spike of Fig. 7.5).
The shift is 800 ns, equal to four clock cycles. Therefore a 0 is present before the t1 spike.
The key obtained from the attack is (1011100110111001)2, and it matches the actual
key.
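Equation 7.1 can be turned into a small decoder that maps spike-to-spike durations back to runs of key bits (a sketch; the durations below are synthesized to match the 12-cycles-per-adjacent-ones pattern described above, and for simplicity the leading one is treated as a spike too):

```python
# Decode key bits from spike-to-spike durations using Equation 7.1:
# n = t/(4T) - 3 zeroes lie between two consecutive ones in the key.
T = 200e-9  # clock period, 200 ns

def zeros_between(t):
    return round(t / (4 * T)) - 3

def decode(durations):
    bits = [1]                      # each spike corresponds to a 1
    for t in durations:
        bits += [0] * zeros_between(t) + [1]
    return bits

# Synthetic durations consistent with the text: 12 clock cycles between
# adjacent ones, plus 4 cycles for every intervening zero.
key = [1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1]   # (B9B9)_16
ones = [i for i, bit in enumerate(key) if bit]
durations = [(12 + 4 * (b - a - 1)) * T for a, b in zip(ones, ones[1:])]
assert decode(durations) == key
```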
power models treat each bit independently, while in real life the influence of the bits may be
simultaneous. Though such a multibit power model may be more effective, notably the
Hamming weight and Hamming distance power models also give quite effective results
in most cases. In this chapter, we study the Hamming weight model more closely; the same
discussion can be adapted to the Hamming distance model as well.
[Figure: probability of occurrence of the Hamming weights 0–4 of a uniformly random
4-bit value: 1/16, 1/4, 3/8, 1/4, 1/16.]
DOM method, one can split the HWs of the s values into two bins, a zerobin and a onebin,
and compute the means of the values in each bin. We observe from the table that the zerobin
has values which add to 12 (there are 8 values), whereas the values in the onebin add up to
20 (there are 8 values too). Thus the difference of means of these two bins is 20/8 − 12/8 = 1.
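This difference-of-means computation is easy to reproduce (a minimal sketch over all 4-bit values, assuming s is uniformly distributed and taking the LSB as the target bit, then repeating with an uncorrelated random partition):

```python
import random

# Difference of means (DOM) of Hamming weights of all 4-bit values,
# partitioned by the LSB (the target bit).
def hw(v):
    return bin(v).count('1')

values = range(16)
onebin = [hw(v) for v in values if v & 1]
zerobin = [hw(v) for v in values if not v & 1]
dom = sum(onebin) / len(onebin) - sum(zerobin) / len(zerobin)
assert sum(onebin) == 20 and sum(zerobin) == 12
assert dom == 1.0

# Partition by an uncorrelated random bit instead: DOM hovers near zero.
random.seed(0)
runs = []
for _ in range(2 ** 10):
    one, zero = [], []
    for v in values:
        (one if random.randrange(2) else zero).append(hw(v))
    if one and zero:
        runs.append(sum(one) / len(one) - sum(zero) / len(zero))
avg = sum(runs) / len(runs)
assert abs(avg) < 0.2
```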
It is trivial to observe that if the partitioning was done through a random bit sequence
uncorrelated to the Hamming weights, the expected sums in each bin would have been
16, and hence the difference zero. In Fig. 7.9 we show a simulation of 2^10 runs where the
partitioning of the HWs has been done using random bits, simulated by the rand function in C.
One can observe that the DOM is typically close to zero. The same observation holds for
any bit length, indicating that a non-zero DOM value means the Hamming weight
(i.e., the power consumption) is strongly correlated with the target bit. Let us now analyze
FIGURE 7.9: Distribution of DOM when the Hamming weights are partitioned based on
an uncorrelated bit
a DPA with the above hindsight on a simple modular exponentiation:
z ≡ y^x mod 256. We present the following attack scenario to show the effectiveness of DPA
in obtaining the secret key.
In this toy example, the state of the modular exponentiation can be encoded in 8 bits.
Consider an attack scenario where the attacker knows the first four bits of the key and
would like to obtain the next bit: that is, to decide whether it is 0 or 1 with a probability
significantly larger than 1/2. The attacker has access to the hardware, which has the key x
embedded inside in some tamper-proof fashion. The attacker thus generates several y's
randomly and observes the corresponding power traces.
Simultaneously, the attacker also feeds in the same input y under which a trace has been
obtained. Under our assumption of the attack scenario, the attacker subsequently guesses
the next bit, and based on the guess computes the value of the temporary variable s using
the square and multiply algorithm after the 3rd iteration (note that the square and multiply
algorithm starts processing from the MSB of x, denoted the 7th iteration, down to the
LSB, denoted the 0th iteration). The attacker, say, uses the LSB of the register s
after the 3rd iteration to partition the traces into either a zero bin or a one bin. After the
attacker has done so for a large number of inputs y, the attacker computes the difference
of means of the two bins. It is expected that for the correct guess of the next bit of x, there
will be a significant difference of means at the 3rd iteration. However, for the wrong guess,
since the partitioning of the traces is done based on a bit which is not correlated
to the actual state s, it will yield a very small value of the difference of means. This helps to
distinguish between the incorrect key and the correct key.
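The scenario above can be simulated end to end with Hamming weight traces (a sketch: the single "power" sample per trace is the Hamming weight of s after the 3rd iteration, and we use the prime modulus 251 in place of the text's 256 to avoid degenerate subgroups in this toy; the attack flow is unchanged):

```python
# Simulated DPA on a toy exponentiation z = y^x mod p with
# Hamming-weight leakage.
p = 251
x = 0x8F                       # secret key; bits 7..4 = 1000 are known

def hw(v):
    return bin(v).count('1')

def s_after_iter3(y, key):
    # Left-to-right square and multiply, iterations 7 down to 3.
    s = 1
    for i in range(7, 2, -1):
        s = (s * s) % p
        if (key >> i) & 1:
            s = (s * y) % p
    return s

def dom_for_guess(guess_bit, ys):
    hyp_key = (x & 0xF0) | (guess_bit << 3)   # known bits + guessed bit 3
    bins = {0: [], 1: []}
    for y in ys:
        leak = hw(s_after_iter3(y, x))        # "measured" power sample
        bins[s_after_iter3(y, hyp_key) & 1].append(leak)
    means = [sum(b) / len(b) for b in (bins[0], bins[1])]
    return abs(means[1] - means[0])

ys = range(1, p)
dom_right = dom_for_guess(1, ys)   # bit 3 of 0x8F is 1
dom_wrong = dom_for_guess(0, ys)
# The correct guess predicts the LSB of s exactly, so the DOM is large;
# the wrong guess partitions on an uncorrelated bit.
assert dom_right > dom_wrong
```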
FIGURE 7.10: Plot of DOM for a wrong guess of next bit of x compared to a correct
guess with simulated power traces
One can try to experimentally verify the above attack algorithm, either using actual
power traces or through simulated power traces obtained from the Hamming weights of
the intermediate register s. It may be pointed out that if the power consumption follows
the Hamming distance power model, then one needs to adapt the attack strategy accordingly:
rather than partitioning the traces on the target bit (LSB of s) value, one needs to partition
the traces based on the difference of the LSB of the state s between the 4th iteration and
the 3rd iteration. Fig. 7.10 plots the DOMs for a run of the attack after
observing 2^10 samples. The correct key in the simulation is 0x8F, thus the correct next key
bit (the 3rd bit) is 1. It can be observed that using the difference of means technique the
attacker observes a larger DOM (around 0.9) for the correct guess, compared to a smaller
DOM (around 0.21) for the wrong guess. It should be noted that the plot only shows the
absolute value of the DOMs.
We discuss the topic of power analysis in more detail in a subsequent chapter, explaining
some improvements of the DOM method using correlations. We also discuss some
suitable countermeasures against such power attacks, using a well-known technique called
masking.
In the next section, we present an overview of another well known attack technique,
called fault attacks. These are extremely powerful attack techniques which extract
the keys from the faulty computations of an encryption device.
complete verification is ruled out. Hence, these designs have a chance of being fault prone.
Apart from these unintentional faults, faults can also be injected intentionally. The literature
shows several ways of fault injection: accidental variation in operating conditions, like volt-
age or clock frequency, or focused laser beams in hardware. Software programs can also be
subjected to situations, like the skipping of certain instructions, to inflict faults. Apart from
the general issue of fault tolerance in any large design, faults bring a completely new aspect
when dealing with cryptographic algorithms: security.
The first thing that comes to mind is the relation between faults and secrets. In this
section, we first attempt to motivate the impact of faults on the leakage of information.
Motivating Example: Consider a pedagogical example comprising two hardware
devices, as illustrated in Fig. 7.11.
[Fig. 7.11: two devices receiving the same input (a, b); the first stores R1 = (6, 0) and
computes 6a mod m, the second stores R2 = (0, 2) and computes 2b mod m, with m = 8.]
The first device has a register storing the values R1 = (6, 0), and computes the product
y_left = (6, 0) × (a, b)^T = 6a mod m. The value of m is fixed as, say, 8. The second device,
on the other hand, has the register with value R2 = (0, 2) and computes y_right = (0, 2) ×
(a, b)^T = 2b mod m. The users can feed in values of (a, b), s.t. (a, b) ∈ {2, 6} × {2, 6}. For
the rest of the discussion all the computations are mod 8 and this is not explicitly stated.
The user can only input values (a, b) chosen from the 4 elements of {2, 6} × {2, 6}.
On receiving the input (a, b), both devices compute the values of y_left and y_right.
However, the user is given either y_left or y_right, chosen on the basis of a random toss of
an unbiased coin which is hidden from the user. The challenge for the user is to guess the
outcome of the random coin with a probability better than 1/2. The user is allowed to make
multiple queries by providing any of the 4 inputs (a, b). It can be assumed that the random
choice is kept constant for all the inputs.
It may be easily observed that y_left = y_right for all 4 values of (a, b), which implies
that the output y_left or y_right does not reveal which output was chosen by the random toss.
For all the input values of (a, b) the output is 4.
Now consider that one of the devices is subjected to a permanent stress, which creates
a fault in either the register R1 or R2. Hence, either R1 = r ≠ 6 or R2 = r ≠ 2. If the
fault occurs in R1, y′_left = ra, while y_right = 2b. Else, if the fault occurs in R2, y_left = 6a,
while y′_right = rb. WLOG, assume that the fault is in the first device.
Now the attacker provides two inputs, (2, 2) and (6, 6), to the hardware devices. The
attacker observes both of the outputs. If both outputs are the same, then the attacker
concludes that the right output is chosen, while if they are different, the left output is chosen
with probability 1.
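A quick simulation of this experiment (a sketch; the fault value r below is chosen odd so that the two faulty outputs are guaranteed to differ):

```python
# Simulate the two-device experiment: without a fault the outputs agree on
# every allowed input, so the coin stays hidden; a fault in R1 makes the
# two devices distinguishable.
m = 8

def y_left(a, b, r1=6):
    return (r1 * a) % m

def y_right(a, b, r2=2):
    return (r2 * b) % m

# Fault-free: both outputs are 4 for every allowed input.
for a in (2, 6):
    for b in (2, 6):
        assert y_left(a, b) == y_right(a, b) == 4

# Faulty R1 = r (odd, so that r*2 != r*6 mod 8).
r = 3
out_22 = y_left(2, 2, r)                # attacker queries (2, 2) and (6, 6)
out_66 = y_left(6, 6, r)
assert out_22 != out_66                 # left device: outputs differ
assert y_right(2, 2) == y_right(6, 6)   # right device: outputs still equal
```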
Thus this simple example shows that a fault can leak information which seemed to be
perfectly hidden in the original design. Hence, apart from the malfunction of hardware or
software designs, algorithms which hide information (like ciphers) should be analyzed with
respect to faults. Next, we consider a more non-trivial example of fault-based analysis, of
the popular RSA cryptosystem.
cache hit. This difference is manifested through covert channels such as execution time,
power consumption, and electro-magnetic radiation.
Depending on the side channel used, cache attacks are categorized into three: trace,
access, and timing.
1. Trace-driven attacks: In these attacks, the adversary such as in [10, 46, 121, 127,
317, 428] monitors power consumption traces or electro-magnetic radiation to de-
termine if a memory access resulted in a cache hit or miss. These attacks are most
applicable on small embedded devices where such side-channels can be easily moni-
tored.
2. Access-driven attacks: Examples of these attacks, as in [142, 272, 281, 298, 384],
require the use of a spy process running in the same host as the cryptographic algo-
rithm in order to retrieve the memory access patterns made by the cipher. These attacks
require a multi-user environment and are a threat to the security of cloud computing
[327].
3. Time-driven attacks: These attacks, on the other hand, use variations in the en-
cryption time to determine the secret key [44, 58, 318, 312, 384, 386, 387, 388]. They
can be applied to a wide range of platforms ranging from small micro-controllers to
large server blades. Just as in access attacks, timing attacks are also applicable in vir-
tualized environments as was demonstrated in [406]. Further, timing measurements do
not need close proximity to the device. This can result in remote attacks [15, 66, 101].
However, it is considerably more difficult to mount a timing attack compared to trace
and access attacks. Unlike those attacks, timing attacks rely on statistical analysis
and require significantly more side-channel measurements to distinguish between a
cache hit and a miss. The number of measurements depends not only on the system
parameters but also on the implementation of the cipher and its algorithm.
In this book, we stress cache-timing attacks, as they are one of the most practical forms of
cache attacks. In the following section, we provide an overview of the working principle of
cache timing attacks, while in Chapter 9 we present an actual attack on a standard block
cipher.
[Figure 7.12: the sequence of k memory accesses d1, d2, . . . , da, . . . , db, . . . , dk made by a
block cipher to its lookup tables over time.]
The most widely used software implementation of AES is based on Barreto's code [296].
This performance-optimized implementation uses four 1KB lookup tables T0, T1, T2, and
T3 for the first 9 rounds of the algorithm, and an additional 1KB lookup table T4 for the
final round. The structure of each round is shown in Equation 7.2 and encapsulates the four
basic AES operations of SubByte, ShiftRow, MixColumn, and AddRoundKey. The input to
round i (1 ≤ i ≤ 9) is the state S^i comprising 16 bytes (s^i_0, s^i_1, …, s^i_15) and the round
key K^i split into 16 bytes (k^i_0, k^i_1, …, k^i_15). The output of the round is the next state
S^{i+1}. The input to the first round, S^1, comprises (P ⊕ K), and its round key is K^1.

S^{i+1} = {T0[s^i_0] ⊕ T1[s^i_5] ⊕ T2[s^i_10] ⊕ T3[s^i_15] ⊕ {k^i_0, k^i_1, k^i_2, k^i_3},
           T0[s^i_4] ⊕ T1[s^i_9] ⊕ T2[s^i_14] ⊕ T3[s^i_3] ⊕ {k^i_4, k^i_5, k^i_6, k^i_7},
           T0[s^i_8] ⊕ T1[s^i_13] ⊕ T2[s^i_2] ⊕ T3[s^i_7] ⊕ {k^i_8, k^i_9, k^i_10, k^i_11},      (7.2)
           T0[s^i_12] ⊕ T1[s^i_1] ⊕ T2[s^i_6] ⊕ T3[s^i_11] ⊕ {k^i_12, k^i_13, k^i_14, k^i_15}}
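As an illustration of the lookup structure of Equation 7.2, the following Python sketch derives the S-box from its GF(2^8) definition, builds T0 through T3 as 4-byte tuples (a real implementation packs them into 32-bit words), and computes one output column. This is a model for exposition, not Barreto's actual code.

```python
def gmul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return p

def build_sbox():
    """AES S-box: multiplicative inverse in GF(2^8) followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = 0
        for i in range(8):
            bit = ((inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8)) ^
                   (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
            s |= bit << i
        sbox.append(s)
    return sbox

SBOX = build_sbox()

# T0[x] fuses SubByte with one MixColumn column: (2*S[x], S[x], S[x], 3*S[x]).
# T1, T2, T3 are byte rotations of T0, matching the other MixColumn columns.
T0 = [(gmul(2, s), s, s, gmul(3, s)) for s in SBOX]
T1 = [(t[3], t[0], t[1], t[2]) for t in T0]
T2 = [(t[2], t[3], t[0], t[1]) for t in T0]
T3 = [(t[1], t[2], t[3], t[0]) for t in T0]

def round_column(s0, s5, s10, s15, rk):
    """First output column of Equation 7.2: four table lookups and the key XOR."""
    return tuple(a ^ b ^ c ^ d ^ k
                 for a, b, c, d, k in zip(T0[s0], T1[s5], T2[s10], T3[s15], rk))
```

Each round thus reduces to 16 table lookups whose indices are state bytes, and these key-dependent memory accesses are exactly what cache attacks observe.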
Consider a 2^n-element lookup table used in a block cipher implementation. When stored
in the main memory of a processor, these 2^n elements are grouped into blocks, termed
memory blocks. Let 2^v elements of the table map to the same block. When any
element in a block is accessed, the entire block is loaded into the cache memory and
thereafter resides in a cache line until evicted. Thus, the table can occupy at most 2^{n−v}
lines in the cache. These lines require l = n − v bits to be addressed.
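For the 1KB AES T-tables mentioned earlier, the parameters n, v, and l can be worked out as follows; the 64-byte cache line is a typical value assumed here for illustration.

```python
LINE_SIZE = 64         # bytes per cache line (a typical value, assumed here)
ENTRY_SIZE = 4         # bytes per T-table entry
N_ENTRIES = 256        # 2^n entries, so n = 8

entries_per_line = LINE_SIZE // ENTRY_SIZE        # 2^v = 16, so v = 4
lines_for_table = N_ENTRIES // entries_per_line   # 2^(n-v) = 16 cache lines
l = (N_ENTRIES.bit_length() - 1) - (entries_per_line.bit_length() - 1)  # l = n - v

print(entries_per_line, lines_for_table, l)       # 16 16 4
```

A cache observer can therefore learn at most the upper l = 4 bits of each table index, i.e., which of the 16 lines was touched.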
From the cache-attack perspective, a block cipher can be viewed as k memory accesses
made to one or more lookup tables. The index of each access has the form (d_i ⊕ rk_i), where d_i
is the data and rk_i the key material. The block cipher memory accesses can be represented
by a sequence as shown in Figure 7.12.
Consider a pair a, b (1 ≤ a < b ≤ k) in the figure. If the access d_b ⊕ rk_b collides with
d_a ⊕ rk_a, a cache hit occurs and the equality ⟨d_a ⊕ rk_a⟩ = ⟨d_b ⊕ rk_b⟩ can be inferred, where
⟨·⟩ denotes the most significant l bits of the table index. If d_a and d_b can be controlled by the
adversary, the most significant l bits of the XOR of the keys rk_a and rk_b are revealed.
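The inference from a single collision can be sketched as follows; the 8-bit indices, the key values, and the collision oracle are illustrative stand-ins for a real cache side channel.

```python
# How one observed cache collision leaks the high bits of k_a ^ k_b.
# <x> denotes the top l bits of an n-bit table index; here n = 8 and l = 4.

N, L = 8, 4

def hi(x):                     # <x>: the l most significant index bits
    return x >> (N - L)

k_a, k_b = 0x3A, 0x75          # secret key material (unknown to the attacker)

def collides(d_a, d_b):        # side-channel oracle: do both accesses hit one line?
    return hi(d_a ^ k_a) == hi(d_b ^ k_b)

# The attacker fixes d_a and sweeps d_b until a collision is observed.
d_a = 0x00
d_b = next(d for d in range(256) if collides(d_a, d))

# <k_a ^ k_b> = <d_a ^ d_b>: the top l bits of the key XOR are recovered.
assert hi(k_a ^ k_b) == hi(d_a ^ d_b)
print(f"recovered <k_a ^ k_b> = {hi(d_a ^ d_b):#x}")
```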
Depending on the controllability of the memory accesses in Figure 7.12, two categories can
be defined: first-round memory accesses (which depend directly on the plaintext), and
inner-round memory accesses (which depend on the plaintext as well as the cipher algorithm).
The attacker can easily control all the d_i in the first round. However, it becomes
increasingly difficult to control d_i as the rounds progress. For example, by manipulating the
plaintext inputs and with some amount of additional information, the second-round d_i can
be controlled. More information is required to control the d_i of the third round, and so on.
The later rounds of the cipher are considerably more difficult to control. The controllability
of d_i also depends on the cipher structure. For example, CLEFIA has a type-2 generalized Feistel
structure, and its inputs are more easily controllable than those of CAMELLIA, which
has a classical Feistel structure. More details about the structures can be found in [317] and
[301], respectively.
Generally, rk_a and rk_b have entropies of n bits each. If the adversary can identify a
collision such as in Equation 7.3, the joint entropy of the pair of keys reduces from 2n bits to
n + v bits. A collision in the memory accesses of the cipher can be identified by monitoring
side channels such as power consumption, electromagnetic radiation, or the timing of the
encryption.
Details of the attacks are presented in a subsequent chapter. In the next section, we
provide an overview of yet another source of leakage in hardware designs:
design-for-testability techniques. Scan chains are a very common technique for testing hardware
devices. However, attackers have used these structures, originally intended for testability, to
control and observe internal states of the cipher, which leads to fairly straightforward attacks
against ciphers. In the next section, we present the idea behind these attacks.
plex Boolean mappings, and are realized by an array of smaller mappings, called S-boxes.
The outputs of the S-boxes are diffused through the permutation layers to give the output
of one round. The plaintext is XORed with the input key, while the output of a round is
XORed with a round key before being fed as input to the next round. The round keys are
generated one after the other through a key-scheduling algorithm, starting from the input
key. The ciphertext is the output of a certain number of rounds, chosen based on classical
cryptanalysis to provide a sufficient security margin against an adversary. To give some
numbers, DES has 16 rounds, while AES has 10, 12, or 14 rounds, depending on
whether the key size is 128, 192, or 256 bits.
[Figure 7.13: One round of a block cipher with scan access. The plaintext is XORed with the key to form the S-box input a; the outputs of S-boxes S1, …, Sn pass through the diffusion layer into register b, which feeds the next rounds.]
Side-Channel Attacks using Scan Chains: The security of a block cipher derives
from the properties of its round function and the number of rounds in the
cipher. However, when the design is prototyped on a hardware platform and a scan chain is
provided to test the design, the attacker can use the scan chain to control the input patterns
and observe the intermediate values in the output patterns. The security of the block cipher
is thus threatened, as the output after a few rounds is revealed to the adversary. The attacker
then analyzes this data and applies conventional cryptanalytic methods to a much-reduced
cipher.
We next summarize the scan-based attack with respect to Fig. 7.13. Without loss of generality,
let us assume that the S-boxes are bytewise mappings, though the discussion can easily be
adapted to other dimensions. The attack observes the propagation of a disturbance in a
byte through one round of the cipher. If one byte of the plaintext is affected, say p0, then one
byte of a, namely a0, gets changed (see figure). The byte passes through an S-box and produces an
output, which is diffused into the output register b.
The diffusion layer of AES-like ciphers is characterized by a property called the branch
number, which is the minimum total number of disturbed bytes at the input and output of
the layer. For example, the MixColumns step of AES has a branch number of 5: if b1 bytes
are disturbed at the input of MixColumns and b2 bytes are affected at the output, then
b1 + b2 ≥ 5.
Depending upon the branch number of the diffusion layer, the input disturbance spreads
to, say, t output bits in the register b. The attacker first exploits this property
to ascertain the correspondence between the flip-flops of register b and the output bits in
the scan-out pattern. Next, the attacker applies a one-round differential attack to determine
the secret key.
1. The attacker first resets the chip and loads the plaintext p and the key k, and applies
one normal clock cycle. The XOR of p and k is thus transformed by the S-boxes and
the diffusion layers, and is loaded into the register b.
2. The chip is now switched to the test mode and the contents of the flip-flops are scanned
out. The scanned out pattern is denoted by T P1 .
3. Next, the attacker disturbs one byte of the input pattern and repeats the above two
steps. In this case, the output pattern is T P2 .
By observing the difference between T P1 and T P2, the attacker can identify the
positions of the contents of the register b: the ones in the
difference are all due to the contents of register b. In order to observe all the bit
positions of register b, the attacker repeats the process with further differential pairs. There
are at most 255 possible nonzero differences for the changed plaintext byte. However,
the cipher rounds satisfy the avalanche criterion, which states that if one input bit is changed, on
average at least half of the output bits are modified. Thus, in most cases, because of this
avalanche property of the round, far fewer plaintexts are necessary to locate all the bits
of register b.
At this point, the attacker has only ascertained the location of the bits of register b
in the scanned-out patterns. This is, however, already an unintended leakage of information:
for example, the difference of the scanned-out patterns gives away the Hamming distance
after one round of the cipher.
The attacker now studies the properties of the round structure. The S-box is a non-linear
layer with the property that not all input-output difference pairs are possible.
For the present-day standard cipher, the Advanced Encryption Standard
(AES), given a feasible input-output difference pair, on average one value of the input to the
S-box is possible. That is, if the input to the S-box, denoted by S, is x, and the input and
output differentials are α and β, then there is on average one solution x to the equation:
β = S(x) ⊕ S(x ⊕ α)
The adversary can introduce a differential α through the plaintext, and uses the
scanned-out data to infer the value of β. In order to do so, he observes the diffusion of the
unknown β through the diffusion layer. In most ciphers, like AES, the diffusion
layers are realized through linear matrices.
To be specific, in the case of AES a differential 0 < α ≤ 255 in one of the input bytes is
transformed by the S-box to β, and then passes to the output after being transformed by
the diffusion layer as follows:

    α 0 0 0        β 0 0 0        2β 0 0 0
    0 0 0 0   ⇒    0 0 0 0   ⇒     β 0 0 0
    0 0 0 0        0 0 0 0         β 0 0 0
    0 0 0 0        0 0 0 0        3β 0 0 0
Thus the attacker knows that the differential between the scanned-out patterns, T P1 and T P2,
has the above property: there are 4 bytes in the register b, denoted by d0, d1, d2, d3,
such that d0 = 2β, d1 = β, d2 = β, and d3 = 3β for some byte β, matching the MixColumns
column (2, 1, 1, 3).
In the previous step, the attacker ascertained the positions of the 32 bits of
register b in the scanned-out pattern, but not which bit is which. The above
property says that the correct assignment satisfies the relation among d0, d1, d2, d3. If w is the
number of ones in the XOR of T P1 and T P2, then there are C(32, w) possible patterns, out of
which the correct one will satisfy the above property. The probability of a random 32-bit string
satisfying the property is 2^{−24}, since fixing d0 determines the other three bytes. Thus, if
w = 24 as an example, the number of satisfying patterns is C(32, 24) × 2^{−24} ≈ 1; that is,
a single value satisfies the above equations. This gives the attacker the value of β. The
attacker already knows the value of α from the plaintext differential, so the property of the
S-box ensures that there is on average a single value of the corresponding input byte of the
S-box. The attacker thus obtains the corresponding byte of a (see figure), and then computes
one byte of the key by XORing the plaintext byte p0 with the value of the byte a0, that is,
k0 = p0 ⊕ a0.
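The search for β can be sketched as follows; the observed 32-bit difference is simulated, and gmul is GF(2^8) multiplication with the AES polynomial.

```python
# The observed difference of register b must have the form (2b, b, b, 3b)
# for some byte b (the unknown beta); enumerating all 255 nonzero values
# of beta pins it down uniquely. The observed value is illustrative.

def gmul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return p

def pattern(beta):             # (2*beta, beta, beta, 3*beta), as in the text
    return (gmul(2, beta), beta, beta, gmul(3, beta))

observed = pattern(0x7F)       # stands in for the diff of TP1 and TP2

candidates = [b for b in range(1, 256) if pattern(b) == observed]
assert candidates == [0x7F]    # beta is recovered uniquely
```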
The remaining key bytes may be similarly obtained. In the literature, there are several
reported attacks on the standard block ciphers, namely DES and AES [368], but all of them
follow the above general idea of attacking through the controllability and observability
provided by scan chains.
7.8 Conclusions
In this chapter, we have provided a summary of side-channel analysis of implementations
of cryptographic algorithms. We have highlighted the differences between such attack methods
and conventional cryptanalysis. The attack methodologies have been explained with a
focus on timing, power, fault, cache, and scan-based attacks. All these attacks emphasize
that one needs a close look at the underlying attack techniques in order to develop
suitable countermeasures.
8.8.2 Time Redundancy
8.8.3 Information Redundancy
8.8.3.1 Parity-1
8.8.3.2 Parity-16
8.8.3.3 Parity-32
8.8.3.4 Robust Code
8.8.4 Hybrid Redundancy
8.8.5 Other Techniques
8.9 Invariance based DFA Countermeasures
8.9.1 An Infective Countermeasure Scheme
8.9.2 Infection Countermeasure
8.9.3 Attacks on the Infection Countermeasure
8.9.3.1 Further Loop Holes in the Countermeasure: Attacking the Infection Technique
8.9.4 Infection Caused by Compulsory Dummy Round
8.9.4.1 Attacking the Top Row
8.9.5 Piret & Quisquater's Attack on the Countermeasure
8.10 Improved Countermeasure
8.11 Conclusions
“When a secret is revealed, it is the fault of the man who confided it.”
The growing complexity of cryptographic algorithms and the increasing use
of ciphers in real-time applications have led to research into high-speed
hardware designs and optimized cryptographic libraries for these algorithms. The complex
operations performed in these designs and the large state space involved mean that
complete verification is ruled out; hence these designs have a chance of being fault-prone.
Apart from such unintentional faults, faults can also be injected intentionally. The literature
shows several ways of fault injection: accidental variation of operating conditions, like voltage
or clock frequency, or focused laser beams in hardware. Software programs can also be
subjected to situations, like the skipping of certain instructions, to inflict faults. Beyond the
general issue of fault tolerance in any large design, faults bring a completely new aspect when
dealing with cryptographic algorithms: security.
(SPN), as AES belongs to this family. However, similar observations and results can be
obtained for Feistel structures. Thus fault attacks pose a very powerful threat to hardware
implementations of all modern block ciphers.
shown in Fig. 8.1: (1) run the cryptographic algorithm and obtain non-faulty ciphertexts;
(2) inject faults, i.e., unexpected environmental conditions, into the cryptographic
implementation, rerun the algorithm with the same input, and obtain faulty ciphertexts; (3) analyze
the relationship between the non-faulty and faulty ciphertexts to significantly reduce the key
space.
The practicality of DFA depends on the underlying fault model and the number of faulty
ciphertext pairs needed. In the following section we will analyze all the fault models used by
DFA of AES and point out their relationships. In this section, we continue the discussion of
the working principle of DFA with respect to a generalized block cipher model.
DFA works under the assumption of underlying faults. These faults can be caused
by various mechanisms: fluctuation of the operating voltage, varying the clock frequency,
changing the temperature of the device, or, most precisely, the injection of laser beams.
In all of the above techniques the faults are created by a sudden variation of the
operating conditions. It may be noted that apart from these malicious or
intentional fault injections, faults can also be unintentional: with the growing complexity
of crypto devices, the chance of mistakes in the design also increases.
Faults can be categorized according to whether they are permanent or transient. From
the point of view of cryptography, transient faults are of particular concern, as they are
hard to detect: they can be of such short duration that most simulation-based
fault-detection techniques may be unable to detect their occurrence. However, as we shall
soon observe, a few faults are enough to leak the entire key of a standard cipher like AES.
1. Bit model: This fault model assumes that the fault is localized to a single bit. Fault
control is crucial here, as there is a high probability that a random fluctuation of the
operating conditions will affect more than one bit. Hence attacks based on such models
are often unrealistic and may not be practically viable.
2. Single byte: A more practical and the most common fault model is the single-byte model.
It assumes that the fault is confined to one byte, whose faulty value can be any random
non-zero value. This non-specificity of the fault value makes DFAs of this type very
powerful and practical.
3. Multiple byte: In this fault model, it is assumed that the fault spreads to more
than one byte. These models are more practical in the sense that DFAs based on them
work with even less fault control. In the context of DFA of AES, we shall observe a special
multiple-byte fault model, namely the Diagonal fault model, which helps to generalize
the DFA of AES. The fault values are again arbitrary, which makes these attacks very
powerful, as they work even when control over the fault induction is limited.
[Figure: Last rounds of a generalized block cipher under fault injection. S-box layers alternate with diffusion layers D0, …, D_{r−1}; the final layer of S-boxes and the XOR of round-key bytes K^{r−1}_0, …, K^{r−1}_{n−1} produce the ciphertext bytes C0, C1, …, C_{n−1}.]
As the diffusion layer is a linear operation with respect to the key-mixing operation, namely XOR,
the output difference can be expressed as a linear relation of the input differences. The
attacker exploits these properties in the following fashion to obtain the key. Say a single-byte
fault is induced at the input of the (r − 1)th (penultimate) round and the corresponding
difference at the input of D_{r−1} is α ≠ 0. If the branch number of the diffusion layer is
b, the input byte fault will spread to b − 1 bytes (α_{π_0}, …, α_{π_{b−2}}) at the output of D_{r−1},
where π denotes the transformation of the diffusion layer. Each of these active bytes then
passes through an S-box, which transforms it non-linearly. The attacker then represents
these differences in terms of a pair of fault-free and faulty ciphertexts (C, C*) as follows:

α_{π_j} = S^{-1}(C_{π_j} ⊕ K_{π_j}) ⊕ S^{-1}(C*_{π_j} ⊕ K_{π_j})

where j ∈ {0, …, b − 2} and S^{-1} represents the inverse of the S-box operation. Now the
attacker knows C_{π_j} ⊕ C*_{π_j}, the difference at the input of the inverse S-box. From the
difference distribution table the attacker knows, on average, the few values which satisfy a chosen
(α_{π_j}, C_{π_j} ⊕ C*_{π_j}) pair. Further, because of the linear mapping in D_{r−1}, α_{π_j} depends
linearly on α. Therefore, the attacker guesses the value of α and obtains the values of α_{π_j},
i.e., the S-box input differences. Using the input-output difference pair, the attacker retrieves
the value C_{π_j} ⊕ K_{π_j} from the difference distribution table of the S-box. As C_{π_j} and C*_{π_j}
are known, he can retrieve the value of K_{π_j}. The attacker may need to induce faults multiple
times in order to get all the bytes of the round key.
In the next section, we present the fault models used for DFA of AES in the literature
and a summary of all the attacks performed [415]. Subsequently, we present the fault attacks
on AES.
8.2.1.3 Faults Are Injected between the Output of 7th and the Input of 8th
MixColumns
The attacker uses various fault models and analyses in this scenario, including single-byte
and multiple-byte faults.
Single-byte transient fault: The three different attacks using this fault model are
shown in Table 8.1. In the first DFA [300], 2 faulty ciphertexts are needed to obtain the key.
This fault model is experimentally verified in [356, 192]. In [356], underpowering is used to
inject faults into a smart card with an AES ASIC implementation. Although no more than 16%
of the injected faults fall into the single-byte fault category, only 13 faulty ciphertexts are
needed to obtain the key. In [192], the authors underpower an AES FPGA implementation
to inject faults, with a probability of 40% of a single-byte fault injection.
In the second DFA [264], 2 faulty ciphertexts are also needed to reveal the key. Because
this attack exploits the faults in a different way, the remaining key space is 2^32.
The attack in [390] is similar to [264] but further improved: for the same fault model,
the key space is reduced to only 2^8 with a single faulty ciphertext.
Multiple-byte transient fault: [341] proposes a general byte fault model called the diagonal
fault model. The authors divide the AES state matrix into 4 diagonals, each consisting of
4 bytes: the ith diagonal is the set Di = {s_{j,(j+i) mod 4} | 0 ≤ j ≤ 3}, i.e., the set of bytes
that ShiftRows moves into the ith column.
The authors also validate the diagonal fault model with a practical fault attack on an AES
FPGA implementation using overclocking.
FIGURE 8.3: DFA and diagonal fault models. The first state matrix is an example of
DM0: only diagonal D0 is affected by a fault. The second state matrix is an example of
DM1: both D0 and D3 are corrupted. The third state matrix is an example of
DM2: the three diagonals D0, D1, and D2 are corrupted. The last state matrix is
an example of DM3: all 4 diagonals are corrupted.
FIGURE 8.4: Fault propagation of diagonal faults. The upper row shows the diagonals
that faults are injected in. The lower row shows the corresponding columns being affected.
8.2.1.4 Faults Are Injected between the Output of 8th and the Input of 9th
MixColumns
Single-bit transient fault: In [132], the attacker needs only 3 faulty ciphertexts to
succeed with a probability of 97%, and the remaining key space is trivially small. [22]
validates this single-bit attack on a Xilinx 3AN FPGA using overclocking; the reported
success rate of injecting this kind of fault is 90%.
Single-byte transient fault: In [285], the authors use a byte-level fault model. They
are able to obtain the key with 40 faulty ciphertexts, and the key is uniquely revealed.
This model is used in a successful attack by underpowering a 65nm ASIC chip [41]. In this
attack, 39,881 faulty ciphertexts were collected over the 10 experiments; 30,386 of them
were actually the outcome of a single-byte fault, a successful injection rate of 76%.
Multiple-byte transient fault: [256] presents a DFA of AES in which the faults are
injected into a 32-bit word. The authors propose 2 fault models. In the first model, they
assume that at least one of the 4 targeted bytes is non-faulty; that is, the number of faulty
bytes can be 1, 2, or 3, so this fault model includes the single-byte fault model. If only a
single-byte fault is injected, 6 faulty ciphertexts are required to reveal the secret key. The
second fault model requires around 1500 faulty ciphertexts, from which the entire key can
be derived in constant time. Though the second fault model is much more general, the
number of faulty ciphertexts it requires is very large, and it is difficult for the attacker to
collect all the ciphertexts without triggering the CED alarm.
In summary, the attacker can obtain the secret key with as few as 1 or 2 faulty ciphertexts
when single- or multiple-byte transient faults are injected. In the following subsection, we
present a detailed analysis of the inter-relationships of the fault models discussed so far.
8.2.2.3 Faults Are Injected between the Output of 7th and the Input of 8th
MixColumns
Fig. 8.5(a) summarizes the relationships between the DFA-exploitable fault models
when faults are injected between the output of the 7th round MixColumns and the input of
the 8th round MixColumns.
Single-byte faults are a subset of the DM0 faults which, in turn, are a subset
of the DM1 faults, and so on. The relationship is summarized in (8.3).
A more careful look reveals that 2-byte faults can be either DM0 or DM1,
3-byte faults can be DM0, DM1, or DM2, and 4-byte faults can be DM0, DM1, DM2,
or DM3. Similarly, the relationships between faults of 5 to 12 faulty bytes
and the diagonal fault models are summarized in Fig. 8.5(a).
As shown in Fig. 8.5(a), DM3 includes all possible byte transient faults. The attacks
proposed in [341] show that DFA based on DM0, DM1, and DM2 faults leads to successful
retrieval of the key. Remember that DM3 faults are the universe of all possible transient
faults injected in the selected AES round; faults that spread across all 4 diagonals of the
AES state are not exploitable by DFA, as mentioned in Section 8.2.1.3. These
1 In general, the practical faults used in DFA target the 7th , 8th , and 9th rounds.
FIGURE 8.5: Relationships between DFA fault models when faults are injected between
(a) the output of 7th and the input of 8th round MixColumns, (b) output of 8th and the
input of 9th round MixColumns.
fault models are multiple-byte transient faults and thus attacks based on them are
more feasible than those based on single-byte transient faults, which are a subset of the
model DM0. The considered fault models are exploitable by DFA in the following order: (i)
DM0-type faults reduce the key space of AES to 2^32, (ii) DM1-type faults reduce
the key space to 2^64, and (iii) DM2-type faults reduce the key space to 2^96, after a single
fault induction. The more encompassing the fault model, the more realistic the attacks
based on it.
Considering the cardinalities of the identified fault classes, the numbers of possible DM0,
DM1, and DM2 faults are 2^34, 3 × 2^65, and 2^98, respectively, while the number of possible
DM3 faults is 2^128 in the state matrix^2. If all faults were equiprobable during injection (the
perspective of conventional fault injection and detection studies), the probability of injecting
DM0, DM1, and DM2 faults would be negligible: the probability that a randomly injected fault
is a DM0, DM1, or DM2 type fault is 2^{-94}, 3 × 2^{-63}, and 2^{-30}, respectively. However,
we stress that a DFA attacker does not use uniformly distributed fault injection. Rather,
he characterizes the device and uses specific fault injections, which results in high success
rates.
8.2.2.4 Faults Are Injected between the Output of 8th and the Input of 9th
MixColumns
Fig. 8.5(b) summarizes the relationships between the DFA-exploitable fault models when
faults are injected between the output of the 8th and the input of the 9th round MixColumns.
Single-bit transient faults are a subset of single-byte faults. Single-byte faults are again a subset of
2 The number of faults is calculated based on a simple assumption that the faults are injected at the
input to the round. If the faults can be injected anywhere in the AES round, all of these numbers can
be proportionally scaled. Further, this is ignoring all permanent and intermittent faults as they are not
exploitable from a DFA perspective.
DM0 faults. Two- and three-byte faults are also a subset of DM0 faults. Again, attacks based on
multiple-byte faults are more feasible than those based on single-bit and single-byte faults.
In the following section, we detail the above-mentioned fault attacks on AES.
[Figure: Differential across an S-box in the final round. The inputs in and in ⊕ α are XORed with the round key byte K and passed through S, producing outputs out and out ⊕ β.]
round output byte and K is the round key byte. The AES S-box is a non-linear operation;
therefore the input difference α will change to β at the S-box output out. Now if we denote
the value of in ⊕ K by X, we can relate the input and output differences by the following equation:

β = S(X) ⊕ S(X ⊕ α)

According to the properties of the AES S-box, for particular values of α and β the above
equation can have 0, 2, or 4 solutions for X [275]. For a fixed value of α, 126 out of the 256
choices of β give 2 solutions for X, exactly one choice of β gives 4 solutions, and the
remaining choices of β give no solution for X. This implies that only 127 out of 256 choices
of β produce solutions for X, and the average number of solutions for X is 1. It may also be
noted that if we know the values of α, β, and in, we can obtain the value of K from the above
equation. This property is used in most of the advanced DFAs on AES. In the subsequent
parts of the chapter we explain DFA of AES using these properties.
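These counts can be checked directly by tabulating one row of the S-box difference distribution table; the S-box is rebuilt here from its GF(2^8) definition so that the sketch is self-contained.

```python
# For each fixed nonzero input difference alpha, count how many X solve
# beta = S(X) ^ S(X ^ alpha) for every beta, and verify the 0/2/4 pattern,
# the 127 reachable betas, and the average of 1 solution per beta.

def gmul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return p

def build_sbox():
    """AES S-box: multiplicative inverse in GF(2^8) followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = 0
        for i in range(8):
            bit = ((inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8)) ^
                   (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
            s |= bit << i
        sbox.append(s)
    return sbox

S = build_sbox()

def ddt_row(alpha):
    """One row of the difference distribution table: counts per output diff."""
    row = [0] * 256
    for x in range(256):
        row[S[x] ^ S[x ^ alpha]] += 1
    return row

for alpha in (0x01, 0x53, 0xAF):                 # arbitrary nonzero differences
    row = ddt_row(alpha)
    assert set(row) == {0, 2, 4}                 # solution counts are 0, 2, or 4
    assert sum(1 for c in row if c) == 127       # 127 reachable output diffs
    assert sum(1 for c in row if c == 2) == 126  # 126 betas give 2 solutions
    assert sum(1 for c in row if c == 4) == 1    # exactly one beta gives 4
    assert sum(row) / 256 == 1.0                 # average of 1 solution per beta
```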
Note that l = (j − i) mod 4 gives the column index to which the faulty byte
in the jth column and ith row shifts due to the ShiftRows operation. In other words, the
fault location (i, j) in the difference matrix S0 changes to (i, l) at the tenth-round output.
Having obtained the location of the fault, the attacker now sets out to ascertain its
value. Note the similarity of the above equation with equation 8.4. The value of
C_{i,l} ⊕ C*_{i,l} being known to the attacker, in order to get the value of x_{i,j} he guesses the 8
possible values of ε. For each possible value of ε the attacker obtains on average one hypothesis for
[Figure: Flow of a single-bit difference across the last round. The difference state S0 after the addition of K9 is transformed by the tenth-round SubBytes and ShiftRows into S1, and the addition of K10 gives the ciphertext difference S2.]
x_{i,j} which will satisfy the above equation (refer to Section 8.3.1). Thus, over all 8 possible
values of ε, the attacker gets on average 8 candidates for x_{i,j}. In order to identify the
unique value of x_{i,j}, he obtains another faulty ciphertext by injecting a fault in a different
bit of the same byte. A similar analysis yields another set of around 8 values for x_{i,j}; the
intersection of the 2 sets from the 2 different faulty ciphertexts is
expected to determine the exact value of x_{i,j}.
The same technique is repeated for the other bytes in order to get all 16 bytes of x.
On average, 2 · 16 = 32 faulty ciphertexts are thus needed to determine the value of the
state matrix x, giving the attacker the fault-free input of the 10th round. Knowing the
fault-free ciphertext C, one can easily retrieve the 10th-round key K^10 from the relation
C = SR(S(x)) ⊕ K^10; then, as the AES key schedule is invertible, one trivially retrieves the
master key.
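The intersection step for a single byte can be sketched as follows, with illustrative values for the secret state byte, the key byte, and the two faulted bit positions; the S-box is rebuilt so the snippet is self-contained.

```python
# Bit-fault DFA on one byte of the 10th-round input x: each single-bit fault
# epsilon yields a set of candidates for x from the S-box difference equation;
# intersecting the sets from two different bit faults isolates x, after which
# the key byte follows as K = C ^ S(x).

def gmul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return p

def build_sbox():
    """AES S-box: multiplicative inverse in GF(2^8) followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = 0
        for i in range(8):
            bit = ((inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8)) ^
                   (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
            s |= bit << i
        sbox.append(s)
    return sbox

S = build_sbox()

x_true, k_true = 0x5B, 0xC4            # secret 10th-round input byte and key byte
C = S[x_true] ^ k_true                 # fault-free ciphertext byte

def candidates(C, C_star):
    """All x consistent with some single-bit fault epsilon (8 candidates avg)."""
    out = set()
    for eps in (1 << i for i in range(8)):
        for x in range(256):
            if S[x] ^ S[x ^ eps] == C ^ C_star:
                out.add(x)
    return out

def faulty(eps):                       # ciphertext byte after flipping bit eps of x
    return S[x_true ^ eps] ^ k_true

# Two faults in different bits of the same byte; intersect the candidate sets.
inter = candidates(C, faulty(0x01)) & candidates(C, faulty(0x02))
assert x_true in inter                 # the true state byte always survives
key_hypotheses = {C ^ S[x] for x in inter}
assert k_true in key_hypotheses
```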
to the SubBytes operation is x. Therefore, we can write x = P_zero ⊕ K. The attacker
detects the value of the lth key bit by repeating the following simple steps: he compares
the fault-free ciphertext C_zero with the faulty one, C*_zero. If they are equal, it implies that
the lth bit of the (i, j)th byte of x, which was reset by the fault, was already zero, and thus
the effect of the reset fault was inconsequential; hence the corresponding bit of the (i, j)th
key byte was zero (as the plaintext is all zero). On the other hand, different values of C_zero
and C*_zero imply that the induced fault reset the bit x^l_{i,j} with effect: the fault-free value
of x^l_{i,j} was one, and after fault induction it changed to zero, which means the
corresponding bit of K is one. The fault thus reveals whether a particular bit of the
whitening key is 1 or 0. The same technique is repeated for all 128 bits, and thus 128
faulty ciphertexts are needed to get the master key.
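The bit-by-bit recovery can be simulated as follows. Since the rounds after the key whitening are irrelevant to the argument, a hash function stands in for them here, and the key value is illustrative.

```python
# Simulation of the bit-reset attack on the whitening key. With an all-zero
# plaintext, the SubBytes input x equals K, so resetting bit l of x changes
# the ciphertext iff that key bit was 1.

import hashlib

K = 0x0123456789ABCDEF0123456789ABCDEF        # 128-bit secret whitening key

def encrypt_from_state(x):
    """Stand-in for the rounds after key whitening (any fixed permutation)."""
    return hashlib.sha256(x.to_bytes(16, "big")).digest()

C_zero = encrypt_from_state(K ^ 0)             # fault-free run with P = 0

recovered = 0
for l in range(128):
    faulty_state = K & ~(1 << l)               # fault: bit l of x reset to 0
    if encrypt_from_state(faulty_state) != C_zero:
        recovered |= 1 << l                    # ciphertexts differ => bit was 1

assert recovered == K                          # all 128 key bits recovered
```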
The attack is relatively simple in nature, but it is less practical. The main
reason is that the fault model is very strong: it requires very precise control over the fault
location, namely a specific bit at an exact point in the computation, after the key-whitening
operation and before the first SubBytes operation. In a real-life hardware implementation
of the block cipher, it is difficult to achieve such a high level of accuracy
even with costly equipment. Thus fault attacks on AES with more relaxed fault models
and fewer fault inductions are desirable, and are the topic of the following sections.
In this section, we present an overview on some of the more recent fault attacks on AES.
These attacks use the more practical fault model, namely the byte faults.
In byte-level DFA, we assume that certain bits of a byte are corrupted by the induced
fault and that the induced difference is confined within a byte. Because the fault is
induced in the penultimate round, apart from using the differential properties
of the S-box (as in the bit-level DFA on the last round of AES), the attacker also exploits
the diffusion provided by the MixColumns operation of AES.
As already mentioned, diffusion in AES is provided by a 4 × 4 MDS matrix in
MixColumns. Due to this matrix multiplication, if a one-byte difference is induced at the
input of a round function, the difference spreads to 4 bytes at the round output. Fig. 8.8
shows the flow of the fault.
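This spreading can be checked with a small GF(2^8) computation (a sketch; `mix_column` below implements the standard MixColumns matrix with first row (2, 3, 1, 1) on one column, and the column and fault values are arbitrary):

```python
# Sketch: a single-byte input difference f to MixColumns produces the
# output difference (2f, f, f, 3f) over GF(2^8).

def xtime(a):                 # multiply by 2 in GF(2^8), mod x^8+x^4+x^3+x+1
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a

def gmul(a, b):               # general GF(2^8) multiplication
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r

MC = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]

def mix_column(col):
    return [gmul(MC[r][0], col[0]) ^ gmul(MC[r][1], col[1]) ^
            gmul(MC[r][2], col[2]) ^ gmul(MC[r][3], col[3]) for r in range(4)]

col = [0x3A, 0x5B, 0x7C, 0x9D]      # arbitrary state column (assumption)
f = 0xA5                            # single-byte fault in the first row
faulty = [col[0] ^ f] + col[1:]

diff = [a ^ b for a, b in zip(mix_column(col), mix_column(faulty))]
assert diff == [gmul(2, f), f, f, gmul(3, f)]
print("output difference:", [hex(d) for d in diff])
```

By the linearity of MixColumns over XOR, the output difference depends only on f, not on the underlying column values.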
The induced fault generates a single-byte difference at the input of the 9th-round
MixColumns. Let f be the byte value of the difference; the corresponding 4-byte output
difference is (2f, f, f, 3f), where 2, 1, and 3 are the elements of the first row of the
MixColumns matrix. The 4-byte difference is converted to (f0, f1, f2, f3) by the nonlinear
S-box operation in the tenth round. The ShiftRows operation then shifts the differences
to 4 different locations. The attacker has access to the fault-free ciphertext C and the faulty
ciphertext C^*, which differ in only 4 bytes. Now, we can represent the 4-byte difference
(2f, f, f, 3f) in terms of the tenth-round key K^{10} and the fault-free and faulty ciphertexts
by the following equations:
[Figure 8.8 depicts the difference propagation: the single-byte difference f at the input of the 9th-round MixColumns becomes (2f, f, f, 3f) in the first column of S2; after the addition of K^9, the 10th-round SubBytes converts it to (f0, f1, f2, f3), and ShiftRows disperses these four bytes before the addition of K^{10}.]
FIGURE 8.8: Differences across the last 2 rounds
2f = S^{-1}(C_{0,0} ⊕ K^{10}_{0,0}) ⊕ S^{-1}(C^*_{0,0} ⊕ K^{10}_{0,0})
f  = S^{-1}(C_{1,3} ⊕ K^{10}_{1,3}) ⊕ S^{-1}(C^*_{1,3} ⊕ K^{10}_{1,3})
f  = S^{-1}(C_{2,2} ⊕ K^{10}_{2,2}) ⊕ S^{-1}(C^*_{2,2} ⊕ K^{10}_{2,2})          (8.6)
3f = S^{-1}(C_{3,1} ⊕ K^{10}_{3,1}) ⊕ S^{-1}(C^*_{3,1} ⊕ K^{10}_{3,1})
These four equations can be expressed in the form of the basic equation (8.4). Therefore, each can be
represented in the form A = B ⊕ C, where A, B, and C are bytes in F_{2^8}, having 2^8 possible
values each. Now a uniformly random choice of (A, B, C) is expected to satisfy the equation
with probability 1/2^8. Therefore, in this case 2^16 out of 2^24 random choices of (A, B, C) will
satisfy the equation.
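The count can be confirmed by exhaustive enumeration over a reduced word size (a sketch using 4-bit "bytes" so the loop stays small; the ratio is the same for 8-bit values):

```python
# Count the triples (A, B, C) over n-bit values satisfying A = B xor C.
# Exactly one A works for each (B, C), so 2^(2n) of the 2^(3n) triples
# satisfy it -- i.e. a random triple satisfies it with probability 2^-n.

n = 4                                   # 4-bit toy "bytes" (assumption)
size = 1 << n
count = sum(1 for a in range(size) for b in range(size) for c in range(size)
            if a == b ^ c)
assert count == size ** 2               # 2^(2n) satisfying triples
print(count, "of", size ** 3)           # 256 of 4096
```

With n = 8 the same enumeration gives 2^16 satisfying triples out of 2^24, matching the text.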
This fact can be generalized. Suppose we have M such related equations, and these M
equations involve N uniformly random byte variables. The probability that a random
choice of the N variables satisfies all M equations simultaneously is (1/2^8)^M. Therefore, the
reduced search space is given by (1/2^8)^M · (2^8)^N = (2^8)^{N−M}. In our case, we have four
equations in five unknown variables: f, K^{10}_{0,0}, K^{10}_{1,3}, K^{10}_{2,2}, and K^{10}_{3,1}. Therefore,
the four equations will reduce the search space of the variables to (2^8)^{5−4} = 2^8. That
means that out of 2^32 hypotheses for the 4 key bytes, only 2^8 hypotheses will satisfy the above 4
equations. Therefore, using one fault the attacker can reduce the search space of the 4 key
bytes to 2^8. Using two such faulty ciphertexts, one can uniquely determine the key quartet.
This implies that for one key quartet one has to induce 2 faults in the required location. For all
4 key quartets, i.e., for the entire AES key, an attacker thus needs to induce 8 faults.
Therefore, using 8 faulty ciphertexts and a fault-free ciphertext, one expects to uniquely
determine the 128-bit key of AES.
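The quartet reduction can be simulated end-to-end for one column (a hedged sketch with made-up key and state bytes; the S-box is generated from its algebraic definition, and ShiftRows is omitted since for one quartet it only permutes byte positions):

```python
# Sketch: byte-fault DFA on a single key quartet of the AES last round.

def aes_sbox():
    """Generate the AES S-box from GF(2^8) inversion + affine transform."""
    def gmul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a = (a << 1) ^ (0x11B if a & 0x80 else 0)
            b >>= 1
        return r
    box = []
    for v in range(256):
        inv = next((c for c in range(256) if gmul(v, c) == 1), 0) if v else 0
        b, s = inv, inv
        for _ in range(4):
            b = ((b << 1) | (b >> 7)) & 0xFF
            s ^= b
        box.append(s ^ 0x63)
    return box

S = aes_sbox()
inv_S = [0] * 256
for i, v in enumerate(S):
    inv_S[v] = i
xtime = lambda a: ((a << 1) ^ 0x11B) & 0xFF if a & 0x80 else (a << 1)

key = [0x2B, 0x7E, 0x15, 0x16]            # secret quartet of K^10 (made up)
x = [0x32, 0x43, 0xF6, 0xA8]              # last-round S-box inputs (made up)
f = 0x5D                                  # injected byte fault (made up)
spread = [xtime(f), f, f, xtime(f) ^ f]   # (2f, f, f, 3f) after MixColumns

C = [S[a] ^ k for a, k in zip(x, key)]                  # fault-free bytes
Cf = [S[a ^ d] ^ k for a, d, k in zip(x, spread, key)]  # faulty bytes

# Enumerate every fault guess g and the key bytes satisfying eqs. (8.6).
survivors, true_survives = 0, False
for g in range(1, 256):
    want = [xtime(g), g, g, xtime(g) ^ g]
    cands = [[k for k in range(256)
              if inv_S[C[i] ^ k] ^ inv_S[Cf[i] ^ k] == want[i]]
             for i in range(4)]
    survivors += len(cands[0]) * len(cands[1]) * len(cands[2]) * len(cands[3])
    if g == f and all(key[i] in cands[i] for i in range(4)):
        true_survives = True

assert true_survives                       # the real quartet is never lost
print("surviving quartets:", survivors)    # on the order of 2^8
```

Running this for two independent faulty ciphertexts and intersecting the surviving quartets leaves, with high probability, only the true quartet.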
[Figure 8.9 depicts the fault propagation: a single-byte difference p at the input of the 8th-round MixColumns becomes (2p, p, p, 3p) in the first column of S2; the 9th-round SubBytes converts it to (p0, p1, p2, p3), and after ShiftRows and MixColumns the four columns of S4 carry the differences (2p0, p0, p0, 3p0), (p3, p3, 3p3, 2p3), (p2, 3p2, 2p2, p2), and (3p1, 2p1, p1, p1), which the 10th round then disperses.]
FIGURE 8.9: Differences across the Last Three Rounds
2p0 = S^{-1}(C_{0,0} ⊕ K^{10}_{0,0}) ⊕ S^{-1}(C^*_{0,0} ⊕ K^{10}_{0,0})
p0  = S^{-1}(C_{1,3} ⊕ K^{10}_{1,3}) ⊕ S^{-1}(C^*_{1,3} ⊕ K^{10}_{1,3})
p0  = S^{-1}(C_{2,2} ⊕ K^{10}_{2,2}) ⊕ S^{-1}(C^*_{2,2} ⊕ K^{10}_{2,2})          (8.7)
3p0 = S^{-1}(C_{3,1} ⊕ K^{10}_{3,1}) ⊕ S^{-1}(C^*_{3,1} ⊕ K^{10}_{3,1})
In the above 4 differential equations we only guess the 2^8 values of p0 and obtain the corresponding
2^8 hypotheses for the key quartet by applying the S-box difference distribution
table. Therefore, one column of S4 will reduce the search space of one key quartet to 2^8
choices. Similarly, solving the differential equations from all 4 columns, we can reduce
the search space of each of the 4 key quartets to 2^8 values. Hence, if we combine all
4 quartets, we get (2^8)^4 = 2^32 possible hypotheses for the final round key K^{10}. We have
assumed here that the initial fault value was in the (0, 0)-th byte of S1. If we allow the fault
to be in any of the 16 locations, the key space of AES grows to around 2^36 values. This space can
be brute-force searched within practical time, which shows that effectively one fault is
sufficient to reduce the key space to practical limits.
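The difference distribution table used above can be built directly (a sketch; the S-box is generated from its algebraic definition, and the check confirms the well-known 4-uniformity of the AES S-box):

```python
# Build the S-box difference distribution table (DDT): ddt[a][b] counts
# the inputs x with S(x) ^ S(x ^ a) = b. For the AES S-box every nonzero
# entry (a != 0) is 2 or 4, so one input/output difference pair leaves at
# most four candidate S-box inputs -- the fact the attack exploits.

def aes_sbox():
    def gmul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a = (a << 1) ^ (0x11B if a & 0x80 else 0)
            b >>= 1
        return r
    box = []
    for v in range(256):
        inv = next((c for c in range(256) if gmul(v, c) == 1), 0) if v else 0
        b, s = inv, inv
        for _ in range(4):
            b = ((b << 1) | (b >> 7)) & 0xFF
            s ^= b
        box.append(s ^ 0x63)
    return box

S = aes_sbox()
ddt = [[0] * 256 for _ in range(256)]
for a in range(256):
    for x in range(256):
        ddt[a][S[x] ^ S[x ^ a]] += 1

entries = {ddt[a][b] for a in range(1, 256) for b in range(256)}
assert entries == {0, 2, 4}          # the AES S-box is 4-uniform
print("nonzero DDT entries:", sorted(entries - {0}))
```

In the attack, a precomputed DDT-style lookup replaces the inner key-byte loop, so each difference pair immediately yields its (at most four) candidate inputs.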
The search space of the final round key can be further reduced if we consider the relations
between the fault values at the state matrix S2, which were not utilized in the previous
attacks. This step serves as a second phase, coupled with the first phase, on all the
2^32 keys (for an assumed location of the faulty byte). We can represent the fault values in
the first column of S2 in terms of the 9th-round key K^9 and the 9th-round fault-free and
faulty outputs C^9 and C^{*9}, respectively, by the following 4 differential equations:
2p = S^{-1}(14(C^9_{0,0} ⊕ K^9_{0,0}) ⊕ 11(C^9_{1,0} ⊕ K^9_{1,0}) ⊕ 13(C^9_{2,0} ⊕ K^9_{2,0}) ⊕ 9(C^9_{3,0} ⊕ K^9_{3,0})) ⊕
     S^{-1}(14(C^{*9}_{0,0} ⊕ K^9_{0,0}) ⊕ 11(C^{*9}_{1,0} ⊕ K^9_{1,0}) ⊕ 13(C^{*9}_{2,0} ⊕ K^9_{2,0}) ⊕ 9(C^{*9}_{3,0} ⊕ K^9_{3,0}))      (8.8a)

p = S^{-1}(9(C^9_{0,3} ⊕ K^9_{0,3}) ⊕ 14(C^9_{1,3} ⊕ K^9_{1,3}) ⊕ 11(C^9_{2,3} ⊕ K^9_{2,3}) ⊕ 13(C^9_{3,3} ⊕ K^9_{3,3})) ⊕
    S^{-1}(9(C^{*9}_{0,3} ⊕ K^9_{0,3}) ⊕ 14(C^{*9}_{1,3} ⊕ K^9_{1,3}) ⊕ 11(C^{*9}_{2,3} ⊕ K^9_{2,3}) ⊕ 13(C^{*9}_{3,3} ⊕ K^9_{3,3}))      (8.8b)

p = S^{-1}(13(C^9_{0,2} ⊕ K^9_{0,2}) ⊕ 9(C^9_{1,2} ⊕ K^9_{1,2}) ⊕ 14(C^9_{2,2} ⊕ K^9_{2,2}) ⊕ 11(C^9_{3,2} ⊕ K^9_{3,2})) ⊕
    S^{-1}(13(C^{*9}_{0,2} ⊕ K^9_{0,2}) ⊕ 9(C^{*9}_{1,2} ⊕ K^9_{1,2}) ⊕ 14(C^{*9}_{2,2} ⊕ K^9_{2,2}) ⊕ 11(C^{*9}_{3,2} ⊕ K^9_{3,2}))      (8.8c)

3p = S^{-1}(11(C^9_{0,1} ⊕ K^9_{0,1}) ⊕ 13(C^9_{1,1} ⊕ K^9_{1,1}) ⊕ 9(C^9_{2,1} ⊕ K^9_{2,1}) ⊕ 14(C^9_{3,1} ⊕ K^9_{3,1})) ⊕
     S^{-1}(11(C^{*9}_{0,1} ⊕ K^9_{0,1}) ⊕ 13(C^{*9}_{1,1} ⊕ K^9_{1,1}) ⊕ 9(C^{*9}_{2,1} ⊕ K^9_{2,1}) ⊕ 14(C^{*9}_{3,1} ⊕ K^9_{3,1}))      (8.8d)
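The constants 14, 11, 13, and 9 appearing in these equations are the entries of the InverseMixColumns matrix; that the circulant matrix built from (14, 11, 13, 9) really inverts the MixColumns matrix built from (2, 3, 1, 1) can be verified directly in GF(2^8) (a small sketch):

```python
# Verify over GF(2^8) that the circulant matrix with first row (14,11,13,9)
# is the inverse of the MixColumns matrix with first row (2,3,1,1).

def gmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r

def circulant(row):
    # Each successive row is the previous one rotated right by one.
    return [row[-r:] + row[:-r] for r in range(4)]

MC  = circulant([2, 3, 1, 1])
IMC = circulant([14, 11, 13, 9])

product = [[0] * 4 for _ in range(4)]
for i in range(4):
    for j in range(4):
        for k in range(4):
            product[i][j] ^= gmul(IMC[i][k], MC[k][j])

identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
assert product == identity
print("InvMixColumns x MixColumns = identity over GF(2^8)")
```

Each equation above therefore simply applies one row of InverseMixColumns to undo the 9th-round diffusion before peeling off the S-box.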
In order to utilize the above equations, we need the 9th-round key. It can be derived from the final round key by the following conversion (h10 denotes the round constant of the tenth round):
K^9 =
( K^{10}_{0,0} ⊕ S[K^{10}_{1,3} ⊕ K^{10}_{1,2}] ⊕ h10    K^{10}_{0,1} ⊕ K^{10}_{0,0}    K^{10}_{0,2} ⊕ K^{10}_{0,1}    K^{10}_{0,3} ⊕ K^{10}_{0,2} )
( K^{10}_{1,0} ⊕ S[K^{10}_{2,3} ⊕ K^{10}_{2,2}]          K^{10}_{1,1} ⊕ K^{10}_{1,0}    K^{10}_{1,2} ⊕ K^{10}_{1,1}    K^{10}_{1,3} ⊕ K^{10}_{1,2} )
( K^{10}_{2,0} ⊕ S[K^{10}_{3,3} ⊕ K^{10}_{3,2}]          K^{10}_{2,1} ⊕ K^{10}_{2,0}    K^{10}_{2,2} ⊕ K^{10}_{2,1}    K^{10}_{2,3} ⊕ K^{10}_{2,2} )
( K^{10}_{3,0} ⊕ S[K^{10}_{0,3} ⊕ K^{10}_{0,2}]          K^{10}_{3,1} ⊕ K^{10}_{3,0}    K^{10}_{3,2} ⊕ K^{10}_{3,1}    K^{10}_{3,3} ⊕ K^{10}_{3,2} )
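This conversion can be checked end-to-end against a forward key expansion (a sketch, assuming the standard AES-128 key schedule, expressed word-wise rather than in the byte-matrix notation of the text; `0x36` is the round-10 Rcon byte h10):

```python
# Sketch: recover the round-9 key of AES-128 from the round-10 key by
# inverting one key-schedule step, checked against a forward expansion.

def aes_sbox():
    def gmul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a = (a << 1) ^ (0x11B if a & 0x80 else 0)
            b >>= 1
        return r
    box = []
    for v in range(256):
        inv = next((c for c in range(256) if gmul(v, c) == 1), 0) if v else 0
        b, s = inv, inv
        for _ in range(4):
            b = ((b << 1) | (b >> 7)) & 0xFF
            s ^= b
        box.append(s ^ 0x63)
    return box

S = aes_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def expand(key):                       # AES-128 key schedule, 4-byte words
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    for i in range(4, 44):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = [S[t[1]], S[t[2]], S[t[3]], S[t[0]]]   # RotWord + SubWord
            t[0] ^= RCON[i // 4 - 1]
        w.append([a ^ b for a, b in zip(w[i - 4], t)])
    return w

def prev_round_key(rk, rcon):          # invert one expansion step
    w = [rk[4 * i:4 * i + 4] for i in range(4)]
    p3 = [a ^ b for a, b in zip(w[3], w[2])]
    p2 = [a ^ b for a, b in zip(w[2], w[1])]
    p1 = [a ^ b for a, b in zip(w[1], w[0])]
    t = [S[p3[1]], S[p3[2]], S[p3[3]], S[p3[0]]]
    t[0] ^= rcon
    p0 = [a ^ b for a, b in zip(w[0], t)]
    return [b for word in (p0, p1, p2, p3) for b in word]

w = expand(bytes(range(16)))           # example master key 00 01 ... 0f
k9  = [b for word in w[36:40] for b in word]
k10 = [b for word in w[40:44] for b in word]
assert prev_round_key(k10, 0x36) == k9
print("round-9 key recovered from round-10 key")
```

Iterating `prev_round_key` down to round 0 is exactly how the attacker turns a recovered K^{10} into the master key.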
Thus, for each of the possible hypotheses for K^{10} produced by the first phase, and using the
ciphertexts (C, C^*), we obtain the values of (K^9, C^9, C^{*9}). The attacker then tests the above
4 equations; if they are satisfied, the candidate key is accepted, otherwise it is rejected. For completeness, we
state the detailed equations as follows:
2p = S^{-1}( 14(S^{-1}[K^{10}_{0,0} ⊕ C_{0,0}] ⊕ K^{10}_{0,0} ⊕ S[K^{10}_{1,3} ⊕ K^{10}_{1,2}] ⊕ h10) ⊕
             11(S^{-1}[K^{10}_{1,3} ⊕ C_{1,3}] ⊕ K^{10}_{1,0} ⊕ S[K^{10}_{2,3} ⊕ K^{10}_{2,2}]) ⊕
             13(S^{-1}[K^{10}_{2,2} ⊕ C_{2,2}] ⊕ K^{10}_{2,0} ⊕ S[K^{10}_{3,3} ⊕ K^{10}_{3,2}]) ⊕
              9(S^{-1}[K^{10}_{3,1} ⊕ C_{3,1}] ⊕ K^{10}_{3,0} ⊕ S[K^{10}_{0,3} ⊕ K^{10}_{0,2}]) ) ⊕
     S^{-1}( 14(S^{-1}[K^{10}_{0,0} ⊕ C^*_{0,0}] ⊕ K^{10}_{0,0} ⊕ S[K^{10}_{1,3} ⊕ K^{10}_{1,2}] ⊕ h10) ⊕          (8.9)
             11(S^{-1}[K^{10}_{1,3} ⊕ C^*_{1,3}] ⊕ K^{10}_{1,0} ⊕ S[K^{10}_{2,3} ⊕ K^{10}_{2,2}]) ⊕
             13(S^{-1}[K^{10}_{2,2} ⊕ C^*_{2,2}] ⊕ K^{10}_{2,0} ⊕ S[K^{10}_{3,3} ⊕ K^{10}_{3,2}]) ⊕
              9(S^{-1}[K^{10}_{3,1} ⊕ C^*_{3,1}] ⊕ K^{10}_{3,0} ⊕ S[K^{10}_{0,3} ⊕ K^{10}_{0,2}]) )
Similarly, the other 3 faulty bytes can be expressed by the following equations:
p = S^{-1}( 9(S^{-1}[K^{10}_{0,3} ⊕ C_{0,3}] ⊕ K^{10}_{0,3} ⊕ K^{10}_{0,2}) ⊕
           14(S^{-1}[K^{10}_{1,2} ⊕ C_{1,2}] ⊕ K^{10}_{1,3} ⊕ K^{10}_{1,2}) ⊕
           11(S^{-1}[K^{10}_{2,1} ⊕ C_{2,1}] ⊕ K^{10}_{2,3} ⊕ K^{10}_{2,2}) ⊕
           13(S^{-1}[K^{10}_{3,0} ⊕ C_{3,0}] ⊕ K^{10}_{3,3} ⊕ K^{10}_{3,2}) ) ⊕
    S^{-1}( 9(S^{-1}[K^{10}_{0,3} ⊕ C^*_{0,3}] ⊕ K^{10}_{0,3} ⊕ K^{10}_{0,2}) ⊕          (8.10)
           14(S^{-1}[K^{10}_{1,2} ⊕ C^*_{1,2}] ⊕ K^{10}_{1,3} ⊕ K^{10}_{1,2}) ⊕
           11(S^{-1}[K^{10}_{2,1} ⊕ C^*_{2,1}] ⊕ K^{10}_{2,3} ⊕ K^{10}_{2,2}) ⊕
           13(S^{-1}[K^{10}_{3,0} ⊕ C^*_{3,0}] ⊕ K^{10}_{3,3} ⊕ K^{10}_{3,2}) )
p = S^{-1}( 13(S^{-1}[K^{10}_{0,2} ⊕ C_{0,2}] ⊕ K^{10}_{0,2} ⊕ K^{10}_{0,1}) ⊕
             9(S^{-1}[K^{10}_{1,1} ⊕ C_{1,1}] ⊕ K^{10}_{1,2} ⊕ K^{10}_{1,1}) ⊕
            14(S^{-1}[K^{10}_{2,0} ⊕ C_{2,0}] ⊕ K^{10}_{2,2} ⊕ K^{10}_{2,1}) ⊕
            11(S^{-1}[K^{10}_{3,3} ⊕ C_{3,3}] ⊕ K^{10}_{3,2} ⊕ K^{10}_{3,1}) ) ⊕
    S^{-1}( 13(S^{-1}[K^{10}_{0,2} ⊕ C^*_{0,2}] ⊕ K^{10}_{0,2} ⊕ K^{10}_{0,1}) ⊕          (8.11)
             9(S^{-1}[K^{10}_{1,1} ⊕ C^*_{1,1}] ⊕ K^{10}_{1,2} ⊕ K^{10}_{1,1}) ⊕
            14(S^{-1}[K^{10}_{2,0} ⊕ C^*_{2,0}] ⊕ K^{10}_{2,2} ⊕ K^{10}_{2,1}) ⊕
            11(S^{-1}[K^{10}_{3,3} ⊕ C^*_{3,3}] ⊕ K^{10}_{3,2} ⊕ K^{10}_{3,1}) )
3p = S^{-1}( 11(S^{-1}[K^{10}_{0,1} ⊕ C_{0,1}] ⊕ K^{10}_{0,1} ⊕ K^{10}_{0,0}) ⊕
             13(S^{-1}[K^{10}_{1,0} ⊕ C_{1,0}] ⊕ K^{10}_{1,1} ⊕ K^{10}_{1,0}) ⊕
              9(S^{-1}[K^{10}_{2,3} ⊕ C_{2,3}] ⊕ K^{10}_{2,1} ⊕ K^{10}_{2,0}) ⊕
             14(S^{-1}[K^{10}_{3,2} ⊕ C_{3,2}] ⊕ K^{10}_{3,1} ⊕ K^{10}_{3,0}) ) ⊕
     S^{-1}( 11(S^{-1}[K^{10}_{0,1} ⊕ C^*_{0,1}] ⊕ K^{10}_{0,1} ⊕ K^{10}_{0,0}) ⊕          (8.12)
             13(S^{-1}[K^{10}_{1,0} ⊕ C^*_{1,0}] ⊕ K^{10}_{1,1} ⊕ K^{10}_{1,0}) ⊕
              9(S^{-1}[K^{10}_{2,3} ⊕ C^*_{2,3}] ⊕ K^{10}_{2,1} ⊕ K^{10}_{2,0}) ⊕
             14(S^{-1}[K^{10}_{3,2} ⊕ C^*_{3,2}] ⊕ K^{10}_{3,1} ⊕ K^{10}_{3,0}) )
The combined search space of K^{10} and p is 2^32 · 2^8 = 2^40. Therefore, the above 4 equations will reduce this search space of
K^{10} to 2^40/(2^8)^4 = 2^8. Hence, using only one faulty ciphertext, one can reduce the search space
of the AES-128 key to 2^8 = 256 choices. However, the time complexity of the attack is 2^32, as we have
to test all the hypotheses for K^{10} by the above equations: (8.9), (8.10), (8.11), and (8.12).
In the next subsection, we present an improvement that reduces the time complexity of the
attack from 2^32 to 2^30.
Suppose one value of the first quartet is (a1, b1, c1, d1). By the pairing property of the S-box differential equations, there will
be another value of K^{10}_{0,0} which satisfies the system of equations (8.7) with the rest of the key-byte
values remaining the same. Let us assume this second value of K^{10}_{0,0} is a2; then the 4-tuple
(a2, b1, c1, d1) also satisfies the system of equations (8.7).
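The pairing itself is generic: the condition S(x) ⊕ S(x ⊕ δ) = Δ is symmetric under x ↦ x ⊕ δ, so its solution set is a union of such pairs. A quick check on a small 4-bit S-box (here the published PRESENT S-box, used only as a convenient stand-in):

```python
# Solutions of S(x) ^ S(x ^ delta) = Delta always come in pairs
# {x, x ^ delta}: substituting x -> x ^ delta leaves the equation
# unchanged. Hence every key byte admits an even number of candidates.

SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]   # PRESENT 4-bit S-box

for delta in range(1, 16):
    for Delta in range(16):
        sols = [x for x in range(16) if SBOX[x] ^ SBOX[x ^ delta] == Delta]
        assert len(sols) % 2 == 0                       # even count
        assert all((x ^ delta) in sols for x in sols)   # closed under pairing
print("every solution set is a union of pairs {x, x ^ delta}")
```

The same closure holds for the AES S-box, which is what makes the list-splitting below possible.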
[Figure 8.10 depicts this organization: the key hypotheses, listed as tuples such as (a1, b1, c1, d1) and (a2, b1, c1, d1), are split into the lists L1 to L6 of sizes 2^7 and 2^8, which then feed the two tests Test1 and Test2.]
Using this idea, we can divide the list for the quartet {K^{10}_{0,0}, K^{10}_{1,3}, K^{10}_{2,2}, K^{10}_{3,1}} into 2
sublists, L1 and L2, as depicted in Fig. 8.10. The list L1 contains the paired values for the key
byte K^{10}_{0,0} (note that the key byte K^{10}_{0,0} always has an even number of possible choices).
The list L2 contains the distinct values for the remaining part of the quartet, {K^{10}_{1,3}, K^{10}_{2,2}, K^{10}_{3,1}}.
Thus, the expected size of each of the lists L1 and L2 is 2^7, compared to the previous list size
of 2^8 when {K^{10}_{0,0}, K^{10}_{1,3}, K^{10}_{2,2}, K^{10}_{3,1}} were stored together.
Similarly, we store the possible values of the quartet {K^{10}_{0,1}, K^{10}_{1,0}, K^{10}_{2,3}, K^{10}_{3,2}} in 2 lists, L3
and L4. Here L3 stores the paired values for the key byte K^{10}_{0,1}, while the list L4 contains the
distinct values for the remaining three key bytes; the other two quartets are kept in the lists L5 and L6.
Next, we select the key bytes from the 6 lists L1, L2, L3, L4, L5, L6 to solve the equations
of the second phase of the attack such that the time complexity is reduced.
Because of the observations regarding the pairs of equations (8.9) and (8.12), and (8.10)
and (8.11), the second phase can be divided into 2 parts. In part one, we test the keys
generated from the first phase of the attack by the pair of equations (8.10) and (8.11); in
Fig. 8.10 this is denoted as Test1. As the 2 equations for Test1 do not require the key bytes
K^{10}_{0,0} and K^{10}_{0,1}, we only consider the keys generated from the lists L2, L4, L5, L6; there
are 2^30 such possible keys. In the second part, we combine each of the 14-byte keys satisfying
Test1 with one of the 4 possible values arising out of the 4 combinations of the pairs of values
for K^{10}_{0,0} in L1 and K^{10}_{0,1} in L3. These keys are further tested in parallel by the equations
(8.9) and (8.12); in Fig. 8.10, we refer to this test as Test2.
The size of the lists L2 and L4 is 2^7, and the size of the lists L5 and L6 is 2^8. Therefore,
the number of possible keys generated from these 4 lists is 2^7 × 2^7 × 2^8 × 2^8 = 2^30. These
2^30 keys are fed as input to Test1, which is expected to reduce the key hypotheses by a factor of 2^8.
Therefore, each instance of Test2 will receive 2^30/2^8 = 2^22 expected key hypotheses as input.
The chance of each key satisfying Test2 is 2^{−16}, which implies that each instance of Test2 will
result in 2^6 key hypotheses.
The above attack procedure is summarized in Algorithm 8.1.
It may easily be observed that the time required is dominated by step 3, which is
2^30, making the overall attack 4 times faster on average while still reducing the overall
keyspace of AES to around 2^8 values. The summary of the entire attack is presented in
Algorithm 8.2.
The above fault models are based on single-byte fault models, which assume that the
fault is localized in a single byte. However, due to impreciseness in the fault induction, the
fault can spread to more than one byte. Such multiple-byte faults require a revisit of
the DFA methods. In [341], a technique for performing DFA when such faults occur was
presented, which further generalizes the DFA proposed in [28] and later extended in [197].
The underlying fault models assumed in this attack were already introduced in Section 4.1.3
and were called diagonal fault models. In the next section, we outline the idea of these
attacks.
In general, any fault at the input of the 8th round in the i-th diagonal, 0 ≤ i ≤ 3,
leads to the i-th column being affected at the end of the round. There are 4 diagonals, and
faults in each diagonal map to 4 different byte inter-relations at the end of the 9th round.
These relations are depicted in Fig. 8.12, and they remain unchanged for any
combination of faults injected within a particular diagonal. Each of the 4 sets of relations
in Fig. 8.12 is used to form key-dependent equations, and each equation set
comprises 4 equations of a nature similar to equation (8.6).
FIGURE 8.12: Byte Inter-relations at the end of 9th Round corresponding to different
Diagonals being Faulty
As before, these equations reduce the AES key space to an average size of 2^32. If the attacker
is unaware of the exact diagonal, he can repeat the procedure for all 4 sets of equations, and
the key space will still be only 2^32 × 4 = 2^34, which can feasibly be brute-forced with present-day
computation power.
Next we consider briefly the cases when the faults spread to more than one diagonal.
We observe that the nature of the faults in the state matrix at the input of the 9th round
MixColumns and hence at the output remains invariant for all possible faults in these two
diagonals. This property is exploited to develop equations which are used to retrieve the
correct key.
We denote the fault values in the first column of the output of the 9th round MixColumns
by a0 , a1 , a2 , a3 , where each ai is a byte 0 ≤ i ≤ 3. Then using the inter-relationships among
the faulty bytes one can easily show that:
a1 + a3 = a0
2a1 + 3a3 = 7a2
fault-free and faulty ciphertexts, and on AES-256 required three pairs of fault-free and faulty
ciphertexts. Recently, a DFA on AES-256 was proposed in [27], which required two pairs of
fault-free and faulty ciphertexts and a brute-force search of 16 bits with time complexity of
232 .
A DFA on AES-192 has been proposed by Kim [196], which exploits all the available
information. According to our analysis, a single-byte fault should reveal 120 bits of the secret
key. AES-192 has a 192-bit key, and therefore one would expect that the most efficient attack
would need two single-byte faults. Kim's attack requires two faults and uniquely determines
the key.
In this section, we propose a two-phase DFA on AES-256. The analysis says
that using a single-byte fault induction one can reveal a maximum of 120 bits of the secret
key. AES-256 has a 256-bit key; therefore, two fault inductions should be able to reveal
(120 · 2) = 240 bits of the key.
According to the AES-256 key schedule, retrieving one round key is not enough to get
the master key. Algorithm 6 shows that the penultimate round key is not directly related
to the final round key. Therefore, the attack on AES-128 is not directly applicable to
AES-256.
We propose an attack which requires two faulty ciphertexts C1^* and C2^* and a fault-free
ciphertext C. The first faulty ciphertext C1^* is generated by inducing a single-byte
fault between the MixColumns operations of the eleventh and twelfth rounds, whereas C2^* is
generated by inducing a single-byte fault between the MixColumns operations of the tenth
and eleventh rounds. Fig. 8.15(a) shows the flow of faults corresponding to C1^*, whereas
Fig. 8.15(b) shows the flow of faults corresponding to C2^*.
The proposed attack works in two phases. In the first phase, we reduce the
possible choices of the final round key to 2^16 hypotheses; in the second phase, we
deduce 2^16 hypotheses for the penultimate round key, leaving 2^16 hypotheses for the
master key.
In order to get the final round key, we directly apply the first phase of the DFA on AES-128,
described in Section 8.4.1.2, to the faulty ciphertext C1^* (Fig. 8.15(a)). Therefore,
using the relations between the faulty bytes in state matrix S4, we reduce the possible values
of the final round key K^{14} to 2^32 hypotheses. Next, we consider the second faulty ciphertext
C2^* (Fig. 8.15(b)), where in state matrix S3 we have a relationship between the faulty bytes
that is similar to that in the state matrix S4 of C1^* (Fig. 8.15(a)). We define X as the output of the
13th-round SubBytes operation in the computation that produced the fault-free ciphertext.
We also define ρ and ε as the differences at the output of the 13th-round SubBytes operation
corresponding to the two faulty ciphertexts C1^* and C2^*, respectively. These two differences can
be expressed as:
[Figure 8.15 shows the two fault flows: in (a), a single-byte fault p injected before the 12th-round MixColumns produces the column difference (2p, p, p, 3p) in S2; the 13th-round SubBytes converts it to (p0, p1, p2, p3), and after ShiftRows and the 13th-round MixColumns the state S4 carries the column differences (2p0, p0, p0, 3p0), (p3, p3, 3p3, 2p3), (p2, 3p2, 2p2, p2), and (3p1, 2p1, p1, p1), which the 14th round then disperses. In (b), a single-byte fault p′ injected one round earlier produces the analogous pattern with differences 2p′0, p′0, p′0, 3p′0 and their rotated variants already in S3, at the output X of the 13th-round SubBytes.]
ρ = SR^{-1}(MC^{-1}(SR^{-1}(SB^{-1}(C ⊕ K^{14})) ⊕ SR^{-1}(SB^{-1}(C1^* ⊕ K^{14}))))
ε = SR^{-1}(MC^{-1}(SR^{-1}(SB^{-1}(C ⊕ K^{14})) ⊕ SR^{-1}(SB^{-1}(C2^* ⊕ K^{14}))))
Therefore, the fault values in the first column of S3 (Fig. 8.15(b)) can be represented in
terms of X and ε by four equations similar to equation (8.4). In that case ε_{0,0}, ε_{1,0}, ε_{2,0}, and
ε_{3,0} are the values corresponding to β, and 2p′0, p′0, p′0, and 3p′0 are the values corresponding
to α in the four equations, respectively.
Similarly, from the first column of state matrix S2 of Fig. 8.15(a), we get four more
differential equations which correspond to the first column of X and ρ. Therefore, corre-
sponding to first column of X, we get two sets of differential equations. Again each byte of
ε and ρ corresponds to one quartet of K 14 .
For example ρ0,0 can be expressed as:
ρ_{0,0} = 14(SB^{-1}(C_{0,0} ⊕ K^{14}_{0,0}) ⊕ SB^{-1}(C1^*_{0,0} ⊕ K^{14}_{0,0})) ⊕
          11(SB^{-1}(C_{1,3} ⊕ K^{14}_{1,3}) ⊕ SB^{-1}(C1^*_{1,3} ⊕ K^{14}_{1,3})) ⊕
          13(SB^{-1}(C_{2,2} ⊕ K^{14}_{2,2}) ⊕ SB^{-1}(C1^*_{2,2} ⊕ K^{14}_{2,2})) ⊕          (8.13)
           9(SB^{-1}(C_{3,1} ⊕ K^{14}_{3,1}) ⊕ SB^{-1}(C1^*_{3,1} ⊕ K^{14}_{3,1}))
We already know that each of the quartets is independently calculated and produces
2^8 hypotheses. Therefore, the four pairs (ε_{0,0}, ρ_{0,0}), (ε_{1,0}, ρ_{1,0}), (ε_{2,0}, ρ_{2,0}), and (ε_{3,0}, ρ_{3,0})
correspond to four quartets of K^{14}, each having 2^8 values.
In order to solve the two sets of differential equations for the first column of X with minimum
time complexity, we consider them in pairs. First, we choose two equations; for example, from
the second set we choose the equations corresponding to X_{0,0} and X_{1,0}. We guess the values of p
corresponding to each choice of (ρ_{0,0}, ρ_{1,0}) and derive the possible values of X_{0,0}, X_{1,0}, ε_{0,0},
and ε_{1,0}. We test these values by the corresponding equations in the first set: if they satisfy
the relationships they are accepted, otherwise they are rejected. It may be observed that
the mapping between a byte of ρ and the corresponding byte of ε is one-to-one, as both
bytes are derived from the same key quartet.
Therefore, in the two equations of the second set we guess 2^8 · 2^8 · 2^8 = 2^24 hypotheses
for (ρ_{0,0}, ρ_{1,0}, p), which is reduced to 2^16 hypotheses by the corresponding two equations of the
first set. Each of these 2^16 hypotheses is combined with the 2^8 hypotheses for ρ_{2,0} in the third
equation of the second set and tested by the corresponding equation in the first set; again,
the possible hypotheses reduce to 2^16. Then these values are combined with the 2^8 hypotheses
for ρ_{3,0} in the fourth equation of the second set and verified using the corresponding
equation in the first set, which will again reduce the number of possible hypotheses to 2^16.
Therefore, finally we will have 2^16 hypotheses for K^{14}, each corresponding to one value for
(X_{0,0}, X_{1,0}, X_{2,0}, X_{3,0}). Throughout the process, the time-consuming part of the calculation
is where the 2^24 hypotheses are made; the rest is negligible. We therefore consider the time
complexity of this process to be 2^24.
This can also be explained in a straightforward way. There are eight equations, in which p, p′0,
(X_{0,0}, X_{1,0}, X_{2,0}, X_{3,0}), and K^{14} are unknown. The total search space of these variables
would be 2^80. Therefore, the reduced search space produced by these eight equations is
2^80/(2^8)^8 = 2^16.
In the second phase of the attack, we deduce the values of the penultimate round key K^{13}
corresponding to the 2^16 choices of K^{14}.
2p′ = SB^{-1}(14(SB^{-1}(X_{0,0}) ⊕ K^{12}_{0,0}) ⊕ 11(SB^{-1}(X_{1,0}) ⊕ K^{12}_{1,0}) ⊕ 13(SB^{-1}(X_{2,0}) ⊕ K^{12}_{2,0}) ⊕ 9(SB^{-1}(X_{3,0}) ⊕ K^{12}_{3,0})) ⊕
      SB^{-1}(14(SB^{-1}(X_{0,0} ⊕ ε_{0,0}) ⊕ K^{12}_{0,0}) ⊕ 11(SB^{-1}(X_{1,0} ⊕ ε_{1,3}) ⊕ K^{12}_{1,0}) ⊕ 13(SB^{-1}(X_{2,0} ⊕ ε_{2,2}) ⊕ K^{12}_{2,0}) ⊕ 9(SB^{-1}(X_{3,0} ⊕ ε_{3,1}) ⊕ K^{12}_{3,0}))      (8.14)

p′ = SB^{-1}(9(SB^{-1}(X_{0,3}) ⊕ K^{12}_{0,3}) ⊕ 14(SB^{-1}(X_{1,3}) ⊕ K^{12}_{1,3}) ⊕ 11(SB^{-1}(X_{2,3}) ⊕ K^{12}_{2,3}) ⊕ 13(SB^{-1}(X_{3,3}) ⊕ K^{12}_{3,3})) ⊕
     SB^{-1}(9(SB^{-1}(X_{0,3} ⊕ ε_{0,3}) ⊕ K^{12}_{0,3}) ⊕ 14(SB^{-1}(X_{1,3} ⊕ ε_{1,2}) ⊕ K^{12}_{1,3}) ⊕ 11(SB^{-1}(X_{2,3} ⊕ ε_{2,1}) ⊕ K^{12}_{2,3}) ⊕ 13(SB^{-1}(X_{3,3} ⊕ ε_{3,0}) ⊕ K^{12}_{3,3}))      (8.15)

p′ = SB^{-1}(13(SB^{-1}(X_{0,2}) ⊕ K^{12}_{0,2}) ⊕ 9(SB^{-1}(X_{1,2}) ⊕ K^{12}_{1,2}) ⊕ 14(SB^{-1}(X_{2,2}) ⊕ K^{12}_{2,2}) ⊕ 11(SB^{-1}(X_{3,2}) ⊕ K^{12}_{3,2})) ⊕
     SB^{-1}(13(SB^{-1}(X_{0,2} ⊕ ε_{0,2}) ⊕ K^{12}_{0,2}) ⊕ 9(SB^{-1}(X_{1,2} ⊕ ε_{1,1}) ⊕ K^{12}_{1,2}) ⊕ 14(SB^{-1}(X_{2,2} ⊕ ε_{2,0}) ⊕ K^{12}_{2,2}) ⊕ 11(SB^{-1}(X_{3,2} ⊕ ε_{3,3}) ⊕ K^{12}_{3,2}))      (8.16)

3p′ = SB^{-1}(11(SB^{-1}(X_{0,1}) ⊕ K^{12}_{0,1}) ⊕ 13(SB^{-1}(X_{1,1}) ⊕ K^{12}_{1,1}) ⊕ 9(SB^{-1}(X_{2,1}) ⊕ K^{12}_{2,1}) ⊕ 14(SB^{-1}(X_{3,1}) ⊕ K^{12}_{3,1})) ⊕
      SB^{-1}(11(SB^{-1}(X_{0,1} ⊕ ε_{0,1}) ⊕ K^{12}_{0,1}) ⊕ 13(SB^{-1}(X_{1,1} ⊕ ε_{1,0}) ⊕ K^{12}_{1,1}) ⊕ 9(SB^{-1}(X_{2,1} ⊕ ε_{2,3}) ⊕ K^{12}_{2,1}) ⊕ 14(SB^{-1}(X_{3,1} ⊕ ε_{3,2}) ⊕ K^{12}_{3,1}))      (8.17)
Each of the above equations requires one column of X and one column of K^{12}. Therefore, the
last three equations can be solved, as we already know the values of the last three columns of
X and K^{12}. In order to reduce the time complexity, we conduct a pairwise analysis. We first
choose equations (8.15) and (8.16). We have 2^8 hypotheses for both (X_{0,3}, X_{1,3}, X_{2,3}, X_{3,3})
and (X_{0,2}, X_{1,2}, X_{2,2}, X_{3,2}). Each of these hypotheses can be evaluated using these two
equations, which will reduce the possibilities to 2^8 choices. Those which satisfy these equations are
combined with the 2^8 choices for (X_{0,1}, X_{1,1}, X_{2,1}, X_{3,1}) and further tested using (8.17),
which will again reduce the combined hypotheses for the last three columns to 2^8 possibilities.
The values of (X_{0,0}, X_{1,0}, X_{2,0}, X_{3,0}) are already reduced to one possibility for a particular
value of K^{14} in the first phase of the attack. Therefore, this results in 2^8 hypotheses for X.
For each of these hypotheses we get the first column of K^{12} and test it using (8.14); this will
further reduce the number of hypotheses for X to 1. The time complexity here is around
2^16, as we consider two columns of X.
Therefore, one hypothesis for K^{14} will produce one value for X, which in turn produces
one value for K^{13} by the relation K^{13} = MC(SR(X)) ⊕ C^{13}, where C^{13} is the output of
the 13th round, known to the attacker from the ciphertext C and the previously ascertained K^{14}.
Hence one hypothesis for K^{14} will produce one hypothesis for K^{13}, and
the 2^16 hypotheses for K^{14} will produce 2^16 hypotheses for K^{13}. In this case the total time
complexity is 2^16 · 2^16 = 2^32. So, finally, we have 2^16 hypotheses for (K^{13}, K^{14}), which
correspond to 2^16 hypotheses for the 256-bit master key. Two faulty ciphertexts thus reveal
240 bits of the AES-256 key. The summary of the attack is presented in Algorithm 8.3.
The first complete DFA on the AES key schedule was proposed in [88]. The attack
targeted AES-128 and required fewer than thirty pairs of fault-free and faulty ciphertexts.
This attack was improved in [297], based on a multi-byte fault model where the
faults are injected during the execution of the AES key schedule; the attack retrieved the 128-bit
AES key using around 12 pairs of fault-free and faulty ciphertexts. The number of fault inductions
required by these initial attacks shows the difficulty of DFA on the AES key schedule.
An improved attack in [377] showed that a DFA on the AES key schedule is possible using two
pairs of fault-free and faulty ciphertexts and a brute-force search of 48 bits. Subsequently,
two more attacks were proposed in [198] and [194], each using two pairs of fault-free and
faulty ciphertexts. Further optimized attacks on the AES key schedule were proposed
in [26], requiring only one pair of fault-free and faulty ciphertexts.
[A figure illustrates the fault propagation through the AES key schedule: a byte fault p injected into the eighth round key spreads through the SubWord, RotWord, and Rcon8–Rcon10 operations, generating the derived difference q in the ninth round key and the difference r in the tenth round key.]
In the first phase of the attack, we reduce the search space of the final round key to 2^40
hypotheses. In the second phase, we further reduce this search space to 2^8 hypotheses.
[Figure 8.17 shows the fault flow when the fault p is injected into the key schedule: the faulty bytes (p, p, p, p) enter the state S1 before the 9th round; MixColumns turns them into the columns (2p_i, p_i, p_i, 3p_i) of S2; the faulty round keys K9 and K10 contribute the further differences q and r; and the states S3, S4, and S5 carry the combined differences through the 10th round.]
FIGURE 8.17: Flow of faults in the last three rounds of AES-128
The fault is diffused in the MixColumns operation, and the faulty bytes in the state matrix S2 can be represented by
the fault-free and faulty ciphertexts C and C^*. The first column of S2 will produce a set of
four differential equations similar to equations (8.6), which corresponds to the key quartet
(K^{10}_{0,0}, K^{10}_{1,3}, K^{10}_{2,2}, K^{10}_{3,1}). Similarly, from the other three columns we get three more sets of
equations corresponding to the key quartets (K^{10}_{0,1}, K^{10}_{1,0}, K^{10}_{2,3}, K^{10}_{3,2}), (K^{10}_{0,2}, K^{10}_{1,1}, K^{10}_{2,0}, K^{10}_{3,3}),
and (K^{10}_{0,3}, K^{10}_{1,2}, K^{10}_{2,1}, K^{10}_{3,0}). We refer to these four key quartets as Kq0, Kq1, Kq2, and Kq3,
respectively.
It may be observed that, unlike the proposed DFA on AES-128, here the number of
unknown variables is larger: we have p, q, and r as extra unknowns. Therefore, the
existing solving techniques are not directly applicable to these equations. It may be noted that
these three unknown variables are derived from the key schedule operation and related by the
following equations:

q = S[K^8_{0,3}] ⊕ S[K^8_{0,3} ⊕ p]
  = S[K^9_{0,3} ⊕ K^9_{0,2}] ⊕ S[K^9_{0,3} ⊕ K^9_{0,2} ⊕ p]      (8.18)
  = S[K^{10}_{0,3} ⊕ K^{10}_{0,1}] ⊕ S[K^{10}_{0,3} ⊕ K^{10}_{0,1} ⊕ p]

r = S[K^9_{3,3}] ⊕ S[K^9_{3,3} ⊕ q]
  = S[K^{10}_{3,3} ⊕ K^{10}_{3,2}] ⊕ S[K^{10}_{3,3} ⊕ K^{10}_{3,2} ⊕ q]      (8.19)
In the first three sets of equations there are 8 unknown variables, (p, q, r, p_i) and the
corresponding quartet of key bytes Kq_i, where i denotes the i-th quartet. We observe
that the fourth set of equations does not contain p. In order to get the quartets Kq0, Kq1, and Kq2
from the first three sets of equations, we need to test all possible 2^32 values of (p, q, r, p_i).
For each of these hypotheses we get one hypothesis for each of Kq0, Kq1, and Kq2. Therefore,
over all 2^32 choices we get 2^32 hypotheses for each of these quartets. The last set of
equations contains only q, r, and p3; from it we get 2^24 possible
hypotheses for Kq3. Hence, all the possible choices of K^{10} are given by (2^32)^3 · 2^24 = 2^120,
which is not practical.
In order to solve the individual sets of equations in practical time, we apply a divide-and-conquer
technique. We observe that the key bytes K^{10}_{0,3}, K^{10}_{0,1}, K^{10}_{3,2}, K^{10}_{3,3} and the pair (p, q) are also
contained in (8.18) and (8.19). Therefore, we can combine these two equations with the last
three sets of equations corresponding to Kq1, Kq2, and Kq3. This will reduce the possible
choices for the corresponding 12 key bytes.
In the first step we test the possible values of (p, q). For each of these values we guess
the 2^8 values of p1 in the second set of equations. For each (p, q, p1) we get the values of the 3
key bytes K^{10}_{0,1}, K^{10}_{1,0}, and K^{10}_{3,2} from the corresponding equations. Therefore, for one value
of (p, q) we get 2^8 hypotheses for (K^{10}_{0,1}, K^{10}_{1,0}, K^{10}_{3,2}). Similarly, we guess p3 in the fourth set
of equations and get 2^8 hypotheses for (K^{10}_{0,3}, K^{10}_{1,2}, K^{10}_{3,0}). Therefore, for one hypothesis for
(p, q) we get a total of 2^8 · 2^8 = 2^16 hypotheses for the 6 key bytes (K^{10}_{0,1}, K^{10}_{1,0}, K^{10}_{3,2}, K^{10}_{0,3}, K^{10}_{1,2},
K^{10}_{3,0}). These values are tested using (8.18), which will reduce the possible values of these
6 key bytes to 2^16/2^8 = 2^8 hypotheses.
In the second step, for each hypothesis for the six key bytes, we guess the value of p_2 and get the 3 key bytes (K^{10}_{0,2}, K^{10}_{1,1}, K^{10}_{3,3}) from the third set of equations. Therefore, we have a total of 2^8 · 2^8 = 2^16 hypotheses for the nine key bytes (K^{10}_{0,1}, K^{10}_{1,0}, K^{10}_{3,2}, K^{10}_{0,3}, K^{10}_{1,2}, K^{10}_{3,0}, K^{10}_{0,2}, K^{10}_{1,1}, K^{10}_{3,3}). We use these to get the corresponding values of r from (8.19). Now, using the values of r, we can deduce the other 3 key bytes (K^{10}_{2,3}, K^{10}_{2,0}, K^{10}_{2,1}) from the corresponding equations in the last three sets of equations. So, in the second step we deduce 2^16 hypotheses for 12 key bytes from the last 3 sets of equations.
In the third step, we test the 2^8 values of p_0 and get the corresponding choices of the 4 key bytes {K^{10}_{0,0}, K^{10}_{1,3}, K^{10}_{2,2}, K^{10}_{3,1}} from the first set of equations. Therefore, in the third step we deduce a total of 2^16 · 2^8 = 2^24 hypotheses for the 16 key bytes of K^10 corresponding to one hypothesis for (p, q). Hence, over all 2^16 possible hypotheses for (p, q), we get 2^24 · 2^16 = 2^40 hypotheses for K^10.
However, the complexity of this attack is still quite high. In our experiments, a desktop with an Intel Core 2 Duo processor clocked at 3 GHz takes around two and a half days to perform a brute-force search over the 2^40 possible keys.
p = S^{-1}[14(S^{-1}[K^{10}_{0,0} ⊕ C_{0,0}] ⊕ K^9_{0,0}) ⊕ 11(S^{-1}[K^{10}_{1,3} ⊕ C_{1,3}] ⊕ K^9_{1,0}) ⊕
           13(S^{-1}[K^{10}_{2,2} ⊕ C_{2,2}] ⊕ K^9_{2,0}) ⊕ 9(S^{-1}[K^{10}_{3,1} ⊕ C_{3,1}] ⊕ K^9_{3,0})] ⊕
    S^{-1}[14(S^{-1}[K^{10}_{0,0} ⊕ C*_{0,0} ⊕ p] ⊕ (K^9_{0,0} ⊕ p)) ⊕ 11(S^{-1}[K^{10}_{1,3} ⊕ C*_{1,3}] ⊕ K^9_{1,0}) ⊕
           13(S^{-1}[K^{10}_{2,2} ⊕ C*_{2,2} ⊕ r] ⊕ K^9_{2,0}) ⊕ 9(S^{-1}[K^{10}_{3,1} ⊕ C*_{3,1}] ⊕ (K^9_{3,0} ⊕ q))]      (8.20)
Similarly, the other three faulty bytes can be expressed by the following:
p = S^{-1}[14(S^{-1}[K^{10}_{0,1} ⊕ C_{0,1}] ⊕ K^9_{0,1}) ⊕ 11(S^{-1}[K^{10}_{1,0} ⊕ C_{1,0}] ⊕ K^9_{1,1}) ⊕
           13(S^{-1}[K^{10}_{2,3} ⊕ C_{2,3}] ⊕ K^9_{2,1}) ⊕ 9(S^{-1}[K^{10}_{3,2} ⊕ C_{3,2}] ⊕ K^9_{3,1})] ⊕
    S^{-1}[14(S^{-1}[K^{10}_{0,1} ⊕ C*_{0,1} ⊕ p] ⊕ K^9_{0,1}) ⊕ 11(S^{-1}[K^{10}_{1,0} ⊕ C*_{1,0}] ⊕ K^9_{1,1}) ⊕
           13(S^{-1}[K^{10}_{2,3} ⊕ C*_{2,3} ⊕ r] ⊕ K^9_{2,1}) ⊕ 9(S^{-1}[K^{10}_{3,2} ⊕ C*_{3,2} ⊕ q] ⊕ (K^9_{3,1} ⊕ q))]   (8.21)
p = S^{-1}[14(S^{-1}[K^{10}_{0,2} ⊕ C_{0,2}] ⊕ K^9_{0,2}) ⊕ 11(S^{-1}[K^{10}_{1,1} ⊕ C_{1,1}] ⊕ K^9_{1,2}) ⊕
           13(S^{-1}[K^{10}_{2,0} ⊕ C_{2,0}] ⊕ K^9_{2,2}) ⊕ 9(S^{-1}[K^{10}_{3,3} ⊕ C_{3,3}] ⊕ K^9_{3,2})] ⊕
    S^{-1}[14(S^{-1}[K^{10}_{0,2} ⊕ C*_{0,2}] ⊕ (K^9_{0,2} ⊕ p)) ⊕ 11(S^{-1}[K^{10}_{1,1} ⊕ C*_{1,1}] ⊕ K^9_{1,2}) ⊕
           13(S^{-1}[K^{10}_{2,0} ⊕ C*_{2,0} ⊕ r] ⊕ K^9_{2,2}) ⊕ 9(S^{-1}[K^{10}_{3,3} ⊕ C*_{3,3}] ⊕ (K^9_{3,2} ⊕ q))]      (8.22)
p = S^{-1}[14(S^{-1}[K^{10}_{0,3} ⊕ C_{0,3}] ⊕ K^9_{0,3}) ⊕ 11(S^{-1}[K^{10}_{1,2} ⊕ C_{1,2}] ⊕ K^9_{1,3}) ⊕
           13(S^{-1}[K^{10}_{2,1} ⊕ C_{2,1}] ⊕ K^9_{2,3}) ⊕ 9(S^{-1}[K^{10}_{3,0} ⊕ C_{3,0}] ⊕ K^9_{3,3})] ⊕
    S^{-1}[14(S^{-1}[K^{10}_{0,3} ⊕ C*_{0,3}] ⊕ K^9_{0,3}) ⊕ 11(S^{-1}[K^{10}_{1,2} ⊕ C*_{1,2}] ⊕ K^9_{1,3}) ⊕
           13(S^{-1}[K^{10}_{2,1} ⊕ C*_{2,1} ⊕ r] ⊕ K^9_{2,3}) ⊕ 9(S^{-1}[K^{10}_{3,0} ⊕ C*_{3,0} ⊕ q] ⊕ (K^9_{3,3} ⊕ q))]   (8.23)
In the first phase of the attack we have already reduced p, q, r, and K^10 to 2^40 choices. Using these values we can obtain the fault-free and faulty outputs of the 9th round. As in the attack on the AES-128 key scheduling algorithm (Fig. 8.16), we can directly deduce the 9th round key from the 10th round key. Therefore, for each value of K^10 we get the corresponding value of K^9 and can test it using the four equations. There are four equations, and the total search space is 2^40; the four equations therefore reduce the search space to 2^40/(2^8)^4 = 2^8. Hence, in the second phase of the attack we have only 2^8 hypotheses for K^10, which can then be used to derive 2^8 hypotheses for the master key.

Though the final search space is 2^8, the time complexity of the attack is still 2^40, since the second phase of the attack still needs to test each of the 2^40 keys generated by the first phase.
bytes of K^10. Similarly, in the rest of the three equations, each requires ten bytes of K^10. In the first phase of the attack we use (8.18) and (8.19), since their dependencies are between the key bytes K^{10}_{0,3}, K^{10}_{0,1} and K^{10}_{3,3}, K^{10}_{3,2}.
Therefore, in order to reduce the time complexity of the second phase of the attack, we test only one equation at a time. We start with the third equation, as it requires only eleven bytes of K^10 (ten key bytes plus one more for K^{10}_{0,3}, since it depends on K^{10}_{0,1} through (8.18)). The hypotheses which satisfy this equation are accepted, combined with the other five key bytes, and subsequently tested using the rest of the three equations. Those which do not satisfy these equations are simply discarded.
It is clear from the analysis in Section 8.4.1.3 that the number of unique choices of the eleven key bytes required by the third equation is 2^40/2^5 = 2^35. Therefore, we need to test only 2^35 hypotheses out of the 2^40 possibilities for the 16-byte key. Those which satisfy the test are combined with the 2^5 possible hypotheses for the remaining five key bytes and subsequently tested using the rest of the three equations. The first test reduces the possible hypotheses for the 11 key bytes to 2^35/2^8 = 2^27. Therefore, the rest of the three equations are tested using 2^27 · 2^5 = 2^32 hypotheses for the 16-byte key, which reduces the number of hypotheses to 2^32/(2^8)^3 = 2^8.
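The hypothesis-count bookkeeping in this phase can be checked mechanically; the following sketch simply replays the arithmetic of the paragraph above in terms of base-2 exponents:

```python
from math import log2

# Unique choices of the eleven key bytes entering the third equation.
assert log2(2**40 / 2**5) == 35
# One 8-bit differential equation filters those hypotheses down.
assert log2(2**35 / 2**8) == 27
# Recombining with the five remaining key bytes gives full 16-byte keys.
assert log2(2**27 * 2**5) == 32
# The remaining three equations leave the final hypothesis count for K^10.
assert log2(2**32 / (2**8)**3) == 8
```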
So, finally we get 2^8 hypotheses for K^10, and we test a maximum of 2^35 hypotheses for the key. The time complexity of the attack is thus reduced from 2^40 to 2^35. As a single fault in the AES key schedule is also able to reduce the number of key hypotheses of AES-128 to 2^8, we claim that faults in the AES-128 datapath and key schedule are equivalent in terms of leakage of the key (reduction of key space), though the time complexity of the attack is higher for the key schedule. The proposed attack is summarized in Algorithm 8.4.
attack on AES-128, an extra eight bytes need to be derived to get the master key.
We propose a two-phase attack which requires two faulty ciphertexts C1* and C2*. These two faulty ciphertexts are generated by inducing a single-byte fault at two different locations in the first column of the tenth round key. Fig. 8.18 and Fig. 8.19 show how these faults propagate in the key schedule.
FIGURE 8.18: Flow of Faults in AES-192 Key Schedule when Fault is Induced at K^{10}_{0,0}. (The fault p propagates through SubWord, RotWord, and Rcon into the fault values q and r of the later round keys.)
The propagation of these faults in the AES-192 state matrix over the last three rounds is shown in Fig. 8.20(a) and Fig. 8.20(b). At the input of the eleventh round, state matrix S1, there is a difference in only four bytes. However, unlike in AES-128, the fault does not propagate to all the bytes at the output of the penultimate round. In Fig. 8.20(a) the fault propagates to only 14 bytes, whereas in Fig. 8.20(b) the fault affects 13 bytes of the penultimate round output.

In order to get the last two round keys of AES-192 we again follow a two-phase attack strategy. In the first phase of the attack we reduce the final round key to 2^8 choices, and in the second phase we first uniquely determine the final round key and then reduce the penultimate round key to 2^10 possible choices.
FIGURE 8.19: Flow of Faults in AES-192 Key Schedule when Fault is Induced at K^{10}_{1,0}. (The fault p′ propagates through SubWord, RotWord, and Rcon into the fault values q′ and r′ of the later round keys.)
at the output of MixColumns (in S3). Therefore, this relation will produce four equations similar to equations (8.6). In the same way, from the rest of the three columns of S3 we get ⟨2p_1, p_1, p_1, 3p_1⟩, ⟨0, 0, 0, 0⟩, and ⟨q_1, q_1, 3q_1, 2q_1⟩. Using the second and fourth relations we get two more sets of equations. However, from the third relation, which does not have any difference, we get a set of two equations corresponding to the fault values p and q in K^11. It may be observed that the third byte of this relation is zero; therefore, from this value we can get r = C_{2,0} ⊕ C1*_{2,0}.
Similarly, from the four columns of S3 of Fig. 8.20(b), we get the relations ⟨3p′_0, 2p′_0, p′_0, p′_0⟩, ⟨0, 0, 0, 0⟩, ⟨2q_0, q_0, q_0, 3q_0⟩, and ⟨(2q_1 ⊕ 3p_1), (q_1 ⊕ 2p_1), (q_1 ⊕ p_1), (3q_1 ⊕ p_1)⟩. These four relations will produce four more sets of equations. Each of these sets of equations corresponds to one key quartet of the twelfth round key K^12. As in the previous attack, we name these quartets Kq_0, Kq_1, Kq_2, and Kq_3 respectively.
Therefore, each faulty ciphertext produces four sets of equations. These sets of equations are not mutually independent; they are related by two variables. For the faulty ciphertext C1*, the variables are (q, r), whereas for the faulty ciphertext C2* the variables are (q′, r′). Owing to the propagation of faults in the AES-192 key schedule, the variables r and r′ can be deduced from q and q′ respectively (Fig. 8.18 and Fig. 8.19). They are related by the following equations:

r = S(K^{11}_{3,3}) ⊕ S(K^{11}_{3,3} ⊕ q)        (8.24a)

r′ = S(K^{11}_{0,3}) ⊕ S(K^{11}_{0,3} ⊕ q′)      (8.24b)
Then r and r′ can be calculated directly from the ciphertexts C1* and C2* as r = C_{2,0} ⊕ C1*_{2,0} and r′ = C_{2,0} ⊕ C2*_{2,0}. Now, to solve the eight sets of equations, we guess the values of (q, q′). We start with the two sets of equations corresponding to quartet Kq_0. In the second set of equations, for one hypothesis for (q, q′) we get 2^8 hypotheses for the quartet Kq_0
FIGURE 8.20: Propagation of the key-schedule faults through rounds 11 and 12 of AES-192: (a) for the fault induced at K^{10}_{0,0}; (b) for the fault induced at K^{10}_{1,0}.
corresponding to the 2^8 hypotheses for p′_1. Therefore, for all possible values of (q, q′) we get 2^24 hypotheses for Kq_0. Each of these hypotheses is tested using the first set of equations.

There are eight equations in the two sets corresponding to quartet Kq_0, containing nine unknown variables, namely q, q′, p_0, p_3, p′_1, and the quartet Kq_0. Therefore, the reduced search space is given by (2^8)^{9−8} = 2^8. This implies that out of the 2^24 choices of (q, q′, Kq_0), only 2^8 satisfy both sets of equations.
Next we derive the second quartet Kq_1 from its corresponding two sets of equations. We can directly deduce the values of K^{12}_{0,1} corresponding to the values of q′ in the second set of equations. These values can be used in the first set of equations to get the corresponding values of 2p_1 and p_1. Using these values we can derive the three key bytes K^{12}_{1,0}, K^{12}_{2,3}, K^{12}_{3,2} from the remaining three equations of the first set.
This gives an expected 28 hypotheses for (q, q ′ ) from the previous step. Each of these
will produce 28 hypotheses for the quartet Kq1 , giving 28 hypotheses for (q, q ′ , Kq0 , Kq1 ).
For the third quartet, Kq_2, we can apply the same approach and get one hypothesis for K^{12}_{3,3} corresponding to one hypothesis for q from its first set of equations. This value will in turn allow a hypothesis for K^{12}_{0,2} and K^{12}_{2,0} from the first and third equations of the second set. However, p′ is unknown; therefore, we have to consider all 2^8 possible hypotheses for p′ for each hypothesis for q. Therefore, in this step we get 2^16 hypotheses for (q, q′, Kq_0, Kq_1, Kq_2).
In the next step we consider the fourth quartet Kq_3. Its two sets of equations are similar to the two sets of equations corresponding to quartet Kq_0. Therefore, for one hypothesis for q we get 2^8 hypotheses for the quartet Kq_3 from the first set of equations. Each of these is tested using the second set of equations. We have nine variables in the two sets of differential equations, in which the values of q and q′ are chosen from the 5-tuple (q, q′, Kq_0, Kq_1, Kq_2). Therefore, the total number of resulting hypotheses is (2^8)^7 · 2^16 = (2^8)^9. We have eight equations in the two sets, which reduce the hypotheses to (2^8)^{9−8} = 2^8 for the 6-tuple (q, q′, Kq_0, Kq_1, Kq_2, Kq_3). Therefore, in the first phase of the attack, we have 2^8 choices of the final round key K^12.
where 0 ≤ i ≤ 3. There are an expected 2^8 hypotheses for K^12 from the first phase of the attack. We consider the two equations corresponding to the two values of p in S1. With these two equations, the search space of 2^8 can be reduced to 2^8/2^16 = 2^{−8}. One would therefore expect that only one value will satisfy both equations, leaving one hypothesis for K^12.
An attacker can then deduce the fourth column K^{11}_{i,3}. The two bytes K^{11}_{0,3} and K^{11}_{3,3} of the fourth column can be calculated directly using equations (8.24a) and (8.24b). For one hypothesis for (q, r, q′, r′), we get four hypotheses for (K^{11}_{0,3}, K^{11}_{3,3}). The other two key bytes, K^{11}_{1,3} and K^{11}_{2,3}, can be derived from three more differential equations from S1. The faulty byte q in the fourth column of S1 (Fig. 8.20(a)), and p′ in the first column and q′ in the fourth column of S1 (Fig. 8.20(b)), will produce equations which correspond to K^{11}_{i,3}. In these equations only K^{11}_{1,3} and K^{11}_{2,3} are unknown, and the possible values for the key bytes K^{11}_{0,3}, K^{11}_{3,3} have already been reduced to an expected four hypotheses. One would expect these to allow one hypothesis for K^{11}_{i,3} to be determined (two hypotheses will remain with probability 2^{18}/(2^8)^3 = 1/2^6).
For the third column of K^11, we can get the values of the two key bytes K^{11}_{0,2} and K^{11}_{1,2} from (8.25a) and (8.25b). However, for one value of (K^{11}_{0,3}, K^{11}_{1,3}, q, q′) we get two hypotheses for K^{11}_{0,2} from (8.25a) and two hypotheses for K^{11}_{1,2} from (8.25b), giving a total of four hypotheses. For the key bytes K^{11}_{2,2} and K^{11}_{3,2} we can determine only one equation, i.e., from q′ at the third column of S1 (Fig. 8.20(b)). This gives an expected four hypotheses for (K^{11}_{2,2}, K^{11}_{3,2}) and 2^16 hypotheses for (K^{11}_{2,2}, K^{11}_{2,3}). Therefore, the resulting number of expected
Therefore, the two-phase attack on AES-192 using two faulty ciphertexts can reduce the 192-bit key to 2^10 hypotheses. The above attack is one of the most efficient attacks on the AES-192 key schedule to date, and it is summarized in Algorithm 8.5.
17 Get K^{11}_{0,2} and K^{11}_{1,2} from equations (8.25a) and (8.25b);
18 for each candidate of (K^{11}_{0,2}, K^{11}_{1,2}) do
19     Get K^{11}_{2,2} and K^{11}_{3,2} from the equations of q′ of S1;
20     Save (K^11, K^12) to Lk.
21 end
22 end
23 return Lk
FIGURE 8.21: Flow of Faults in AES-256 Key Schedule when the Fault is Induced at K^{12}_{0,0}. (The fault p propagates through SubWord, RotWord, and Rcon into the fault values q and r.)

FIGURE 8.22: Flow of Faults in AES-256 Key Schedule when the Fault is Induced at K^{11}_{0,0}. (The fault p″ propagates through SubWord, RotWord, and Rcon into the fault values q″ and s″.)

FIGURE 8.23: Flow of faults through the last three rounds of the AES-256 state: (a) flow of faults from key byte K^{12}_{0,0}; (b) flow of faults from key byte K^{11}_{0,0}.
(K^{14}_{1,2}, K^{14}_{2,1}), and the corresponding values of p_1, p_2, p_3, p′_1, p′_2, and p′_3 from the second and third equations of the two sets of equations of Kq_1, Kq_2, Kq_3. Next, we guess the values of r and r′. For each hypothesis, we get one hypothesis for K^{14}_{3,1} using the fourth equation of the two sets of equations of Kq_0. Similarly, we get the values of K^{14}_{3,2}, K^{14}_{3,3}, and K^{14}_{3,0} corresponding to the other three key quartets. There are eight equations and six unknown variables (namely r, r′, and the four key bytes), so an attacker should be able to determine these bytes.
An attacker would then only need to solve the first equation of each of the eight sets of equations. In these equations we have eight unknown variables: q, p, q′, p′, and the four key bytes. As per Fig. 8.21, q and q′ can be derived from p and p′ using the following:
q = S(K^{13}_{0,3} ⊕ K^{13}_{0,2}) ⊕ S(K^{13}_{0,3} ⊕ K^{13}_{0,2} ⊕ p)        (8.26)

q′ = S(K^{13}_{0,3} ⊕ K^{13}_{0,2}) ⊕ S(K^{13}_{0,3} ⊕ K^{13}_{0,2} ⊕ p′)      (8.27)

r = S(K^{14}_{3,3} ⊕ K^{14}_{3,2}) ⊕ S(K^{14}_{3,3} ⊕ K^{14}_{3,2} ⊕ q)        (8.28)

r′ = S(K^{14}_{3,3} ⊕ K^{14}_{3,2}) ⊕ S(K^{14}_{3,3} ⊕ K^{14}_{3,2} ⊕ q′)      (8.29)
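Equations (8.26)–(8.29) say that once the relevant key bytes are hypothesized, each later fault value is a deterministic S-box differential of the earlier one, so a guess of p (or p′) fixes the whole chain. A minimal sketch, with hypothetical key-byte and fault values, and a random 8-bit permutation standing in for the AES S-box S:

```python
import random

random.seed(42)
S = list(range(256))
random.shuffle(S)          # stand-in for the AES S-box (any permutation works)

def sdiff(a, d):
    """S(a) ^ S(a ^ d): the S-box differential appearing in (8.26)-(8.29)."""
    return S[a] ^ S[a ^ d]

# Hypothetical key-byte values and induced fault value (illustration only).
K13_03, K13_02 = 0x1F, 0xA4
K14_33, K14_32 = 0x7C, 0x05
p = 0x2B

q = sdiff(K13_03 ^ K13_02, p)   # (8.26): q follows from p
r = sdiff(K14_33 ^ K14_32, q)   # (8.28): r follows from q

# A wrong guess of p necessarily changes q, since S is a permutation:
# S(a ^ p) and S(a ^ p ^ 1) can never coincide.
assert sdiff(K13_03 ^ K13_02, p ^ 0x01) != q
```

This is why guessing a single early fault value, together with a few key-byte hypotheses, pins down all of the derived fault values in the later round keys.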
where K^{14*} and K^{13*} are the faulty 14th and 13th round keys used to generate the faulty ciphertext C3*. K^14 is already known to us. Therefore, in order to get K^{14*} and (K^13 ⊕ K^{13*}) we need to know the values of p″, q″, r″, and s″. However, as per Fig. 8.22, r″ can be deduced directly from K^14 and q″ by the following equation:

r″ = S(K^{12}_{3,3}) ⊕ S(K^{12}_{3,3} ⊕ q″)
   = S(K^{14}_{3,3} ⊕ K^{14}_{3,2}) ⊕ S(K^{14}_{3,3} ⊕ K^{14}_{3,2} ⊕ q″)      (8.30)
Therefore, we now need to guess only p″, q″, and s″ to get the possible hypotheses for ε.
The possible fault values in the first column of S2 (Fig. 8.23(b)) can be represented in terms of the first column of X and ε, which produces four differential equations. Similarly, from the rest of the three columns of S2 we get three more sets of equations. The values of X_{0,0}, X_{0,1}, X_{0,2}, X_{0,3} can also be represented by the faulty ciphertexts C1* and C2*. In Fig. 8.23(a), the first row of S1 can be expressed in terms of (X_{0,0}, X_{0,1}, X_{0,2}, X_{0,3}) and (p_0, p_1, p_2, p_3), which produces a set of four differential equations. Similar equations can also be generated from C2*.

In these eight equations, only X_{0,0}, X_{0,1}, X_{0,2}, X_{0,3} are unknown; the rest of the variables have been determined in the first phase of the attack. Therefore, using these equations we can uniquely determine the values of X_{0,0}, X_{0,1}, X_{0,2}, X_{0,3}. It may be noted that these 4 bytes of X correspond to the first equations of the four sets of equations generated from S2 (Fig. 8.23(b)). We use the four bytes of X to get the corresponding values of 2p″_0, 2p″_1, 2p″_2, 2p″_3. Multiplying these values by the inverse of 2 in GF(2^8) yields the corresponding values of p″_0, p″_1, p″_2, and p″_3.
We have 2^24 choices of ε corresponding to all possible values of p″, q″, and s″. For each possible value of ε we get one hypothesis for the quartet of X from each of the four sets of equations. Therefore, from all four sets of equations we get one hypothesis for X corresponding to one hypothesis for ε, and hence expect 2^24 hypotheses for X corresponding to the 2^24 hypotheses for ε.
In the next step, we deduce four differential equations corresponding to the four faulty bytes p″_0, p″_1, p″_2, p″_3 in S1 (Fig. 8.23(b)), as described in Section 8.6.2.2. Each of these four equations requires one column of the twelfth round key K^12. The last three columns of K^12 can be computed from K^14 as K^{12}_{i,j} = K^{14}_{i,j} ⊕ K^{14}_{i,j−1}, where 0 ≤ i ≤ 3 and 1 ≤ j ≤ 3. Therefore, we can test each value of X using the last three of the four equations, which correspond to the last three columns of K^12. The value of p″ is already known from the choice of ε.
There are 2^24 values of X, which the three equations are expected to reduce to one hypothesis, since 2^24/(2^8)^3 = 1. In some cases there could be more than one remaining hypothesis for X satisfying the last three equations; in that case the false hypotheses can be eliminated using K^13 = MC(SR(X)) ⊕ C^13. Using the values of K^13 and K^14, we verify these hypotheses using the key schedule.
The described attack thus determines K^13 and K^14, allowing the 256-bit master key of AES-256 to be recovered using three faulty ciphertexts. The summary of the attack is given in Algorithm 8.6.
The detection mechanisms rely on specialized circuits, called concurrent error detection (CED) [364], which identify a faulty computation through several redundancy techniques. For example, the detection countermeasure is usually implemented by duplicating the computation and finally comparing the results of the two computations. When a fault is thus detected, the device prevents the faulty ciphertext from being exposed to the adversary. In this countermeasure, however, the comparison step itself is prone to fault attacks. The infection countermeasure, on the other hand, aims to destroy the fault invariant by diffusing the effect of a fault in such a way that it renders the faulty ciphertext unexploitable. Infection countermeasures are preferred to detection as they avoid the use of attack-vulnerable operations such as comparison.

We first provide an overview of the four types of redundancy which are used to build CED-based countermeasures against DFA.
In [170], the authors propose a novel hardware-redundancy technique for AES to detect faults. Because an attacker can potentially inject the same faults into both of the AES circuits, straightforward hardware redundancy can be bypassed by the attacker. Furthermore, implementations where the key schedule is computed prior to the encryption and stored in a memory may not be protected by such countermeasures: the faults can target the memory storing the key, and the hardware redundancy checks will then miss the fault induction, as repeating the encryption will yield the same result each time. As shown in Fig. 8.24, the idea is therefore to mix byte states between the operations in the two pieces of hardware, in different ways and at different locations. Because the entire hardware is duplicated, hardware redundancy has low performance overhead, but the hardware overhead is approximately 200%.
To reduce the hardware overhead, [270] proposes a partial hardware-redundancy technique, as shown in Fig. 8.25. This technique focuses on a parallel AES architecture and S-box protection. The idea is to add an additional S-box to every set of four S-boxes, and to perform two tests of every S-box per encryption cycle (10 rounds). Although the hardware overhead is reduced to 26%, this scheme has a fault coverage of only 25% in any given clock cycle, because it can check only one S-box among every four per clock cycle.
FIGURE 8.26: Pipeline-based CED: the r pipeline stages, with their registers and round keys K1, ..., Kr, check each other via MUX-1 and MUX-2, with a final comparator on the ciphertext.
A technique that is suited for any pipeline-based block cipher design is proposed in [311] (Fig. 8.26). The key idea is to use different pipeline stages to check against each other by sliding the computations from one stage to another; the technique is based on the popular slide attack [51]. Let us assume the pipeline has r stages, denoted R_i, 1 ≤ i ≤ r. In the normal computation, the plaintext is processed by the first stage, then the second stage, and so on, with the r-th stage producing the ciphertext. Consider two successive encryptions with the same key K, with the round keys denoted K1, K2, ..., Kr. The first encryption is on a plaintext P, and the ciphertext is hence C = Enc(P, K). The intermediate state after the first round of the first encryption is P′ = R_1(P, K1). While the first round is operating, the plaintext is at the same time encrypted redundantly by the last stage of the pipeline, which is otherwise idle at this instant. The result P′ (if there is no fault) is fed back by the multiplexer (MUX-1 in Fig. 8.26) to the first stage of the pipeline and is encrypted with the round key K2. Thus, at the same time instant, two different redundant encryptions are in progress. This continues, and when the ciphertext C is produced by the r-th stage, the (r−1)-th stage produces C′ = R_{r−1}(R_{r−2}(... (R_1(P′, K2), ...), K_{r−1}), K_r). It is easy to check that if there is no fault, then C = C′, which is verified by the comparator. Compared to the original design, this CED provides 50–90% of the original throughput, depending on the frequency of the redundant check (from every round to every ten rounds in the case of AES-128). The hardware overhead is only 2.3%.
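The sliding check can be modeled in a few lines. The sketch below is a toy: a single invertible round function R stands in for every pipeline stage (as holds for the identical middle rounds of a cipher), and the round keys are arbitrary placeholders. The point is only the structure: the slid redundant copy must reproduce C, and a fault in either copy breaks the equality at the comparator.

```python
def R(x, k):
    """Toy invertible round: odd-constant multiply, xorshift mix, key addition."""
    x = (x * 0x9E3779B1) & 0xFFFFFFFF   # invertible mod 2^32 (odd multiplier)
    x ^= x >> 13                        # invertible xorshift mixing
    return x ^ k

def encrypt(p, keys):
    """Apply the round function once per round key."""
    x = p
    for k in keys:
        x = R(x, k)
    return x

keys = [0xDEAD, 0xBEEF, 0xC0DE, 0xF00D]   # hypothetical round keys K1..Kr
P = 0x12345678
C = encrypt(P, keys)                      # primary pipeline computation

P1 = R(P, keys[0])                        # redundant round 1, run in the idle stage
C_check = encrypt(P1, keys[1:])           # slid one stage behind the primary copy
assert C_check == C                       # fault-free: the comparator agrees

C_faulty = encrypt(P1 ^ 1, keys[1:])      # a single-bit fault in the redundant copy
assert C_faulty != C                      # ...is caught by the final comparison
```

Because R is invertible, any difference between the two copies necessarily survives to the comparison, which is the property the pipelined scheme relies on.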
8.8.3.1 Parity-1
A technique in which a single parity bit is used for the entire 128-bit state matrix is developed in [412]. The parity bit is checked once for the entire round, as shown in Fig. 8.27. This approach targets low-cost CED. Parity-1 is based on a general CED design for Substitution Permutation Networks (SPN) [179], in which the input parity of the SPN is modified according to its processing steps into the output parity and compared with the actual output parity of every round. The authors adapt this general approach to develop a low-cost CED. First, they determine the parity of the 128-bit input using a tree of XOR gates. Then, for the nonlinear S-box (an inversion in GF(2^8) followed by a linear affine transformation), they add one additional binary output to each of the 16 S-boxes. This additional S-box output combines the parity of the 8-bit input and the parity of the corresponding 8-bit output.
Each of the modified S-boxes thus has an 8-bit input and a 9-bit output. The additional single-bit outputs of the 16 S-boxes are used to modify the input parity across SubBytes. Because ShiftRows implements a permutation of bytes, it does not change the parity of the entire state matrix from input to output. MixColumns does not change the parity of the state matrix from input to output either; moreover, MixColumns does not change the parity of each individual column. Finally, the bit-wise XOR of the 128-bit round key requires a parity modification by a single precomputed parity bit of the round key. Because the output of a round is the input to the next round, the output parity of a round can be computed with the same hardware used for computing the input parity of the previous round.
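The claim that MixColumns preserves the parity of each column follows from its coefficients: each input byte enters a column's four outputs with the multipliers {2, 3, 1, 1}, and 2a ⊕ 3a ⊕ a ⊕ a = a in GF(2^8), so the XOR of a column (and hence its parity bit) is unchanged. A quick check, using the standard FIPS-197 MixColumns test vector plus random columns:

```python
import random

def gmul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        b >>= 1
    return p

def mixcolumn(col):
    """AES MixColumns applied to a single 4-byte column."""
    a0, a1, a2, a3 = col
    return [
        gmul(a0, 2) ^ gmul(a1, 3) ^ a2 ^ a3,
        a0 ^ gmul(a1, 2) ^ gmul(a2, 3) ^ a3,
        a0 ^ a1 ^ gmul(a2, 2) ^ gmul(a3, 3),
        gmul(a0, 3) ^ a1 ^ a2 ^ gmul(a3, 2),
    ]

def parity(col):
    """Single parity bit over a list of bytes."""
    p = 0
    for b in col:
        p ^= b
    return bin(p).count('1') & 1

# FIPS-197 test vector for MixColumns.
assert mixcolumn([0xDB, 0x13, 0x53, 0x45]) == [0x8E, 0x4D, 0xA1, 0xBC]

# Column parity (indeed the full byte-wise XOR of the column) is preserved.
random.seed(0)
for _ in range(1000):
    col = [random.randrange(256) for _ in range(4)]
    assert parity(mixcolumn(col)) == parity(col)
```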
Although this technique has only 22.3% hardware overhead, it achieves just 48%–53% fault coverage for the multiple-bit random fault model.
8.8.3.2 Parity-16
Parity-16 was first proposed in [45]. In this technique, each predicted parity bit is generated from an input byte. Then, the predicted parity bits and the actual parity bits of the output are compared to detect faults.

In [45], the authors propose the use of a parity bit associated with each byte of the state matrix of a 128-bit iterated hardware implementation with LUT-based S-boxes, as shown in Fig. 8.28. Predicted parity bits of the S-box outputs are stored as additional bits in the ROMs (nine bits instead of eight in the original S-boxes). In order to detect errors in the memory, the authors propose enlarging each S-box to a 9-bit input and 9-bit output in such a way that all the ROM words addressed with a wrong input address (i.e., an S-box input with a wrong associated parity) deliberately store values with a wrong output parity, so that the CED will detect the fault. As before, the parity bit associated with each byte is not affected by ShiftRows. In Parity-1, the global parity bit over the 128 bits remains unchanged after MixColumns; at the byte level, in contrast, the parity after MixColumns is affected. Therefore, Parity-16 requires the implementation of parity-prediction functions in MixColumns. Finally, the parity bits after AddRoundKey are computed as before, by adding the current parity bits to those of the corresponding round key. This technique incurs 88.9% hardware overhead because the LUT size is doubled. The throughput is 67.86% of the original.
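The extended-ROM idea is easy to model: each LUT word holds the substituted byte plus its precomputed parity bit, so any single-bit corruption of a stored word (or of the output byte on its way out) breaks the byte/parity agreement. A toy sketch, using a random byte permutation as a stand-in for the actual AES S-box table:

```python
import random

random.seed(0)
perm = list(range(256))
random.shuffle(perm)       # stand-in for the AES S-box table

def par(x):
    """Parity bit of a byte."""
    return bin(x).count('1') & 1

# 9-bit ROM word: substituted byte in bits 8..1, predicted parity in bit 0.
LUT = [(perm[x] << 1) | par(perm[x]) for x in range(256)]

# Fault-free: the stored parity always matches the stored byte.
for x in range(256):
    word = LUT[x]
    assert par(word >> 1) == (word & 1)

# A single-bit fault in the data byte is detected by the parity check.
word = LUT[0x3A] ^ (1 << 5)        # flip one data bit of a stored word
assert par(word >> 1) != (word & 1)
```

The full scheme additionally poisons the words at wrongly-addressed (bad input parity) locations, so that address faults are caught by the same output-parity comparison.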
8.8.3.3 Parity-32
As shown in Fig. 8.29, a technique that strengthen fault detection on S-boxes is pro-
posed in [269]. With respect to parity-16, this technique still uses one parity bit for each
byte in all the operations except SubBytes. It adds one extra parity bit for each of the
inputs and outputs of the S-box in SubBytes; one parity bit for the input byte and one for
the output byte. The actual output parity is compared with the predicted output parity,
and the actual input parity bit is compared with the predicted input parity. It has 37.35%
hardware overhead and 99.20% fault coverage.
Robust codes were first proposed in [177]. The idea is to use non-linear error-detecting codes (EDC) instead of linear codes such as parity. Robust codes can be used to extend the error coverage of any linear prediction technique for AES. The advantage of a non-linear EDC is its uniform fault coverage, unlike linear codes such as parity: if all data vectors and error patterns are equiprobable, the probability of injecting an undetectable fault is the same for all of them.
The architecture of AES with robust protection is presented in Fig. 8.30. In this architecture, two extra units are needed. One is the prediction unit at the round input, which includes a linear predictor, a linear compressor, and a cubic function. The other is the comparison unit at the output of the round, which includes a compressor, a linear compressor, and a cubic function. This architecture protects the encryption and decryption datapaths as well as the Key Schedule (Key Expansion) module.
The linear predictor and linear compressor are designed to generate a 32-bit output, and we call them the linear portion. The output of the linear portion is linearly related to the output of the AES round, as shown in Fig. 8.30. They offer a relatively compact design compared to the original AES round: the round function is simplified by XORing the bytes in the same column, and the effect of MixColumns is removed by the linear portion. As a result, the linear portion no longer needs to perform the multiplications associated with MixColumns or InvMixColumns. For the cubic function, the input is cubed in GF(2^r) to produce the r-bit output, which is thus non-linear with respect to the output of the round.
In the comparison unit, the compressor and the linear compressor are designed to generate a 32-bit output from the 128-bit round output; the bytes in the same column of the output are XORed. Again, the 32-bit output is cubed in the cubic function to generate an r-bit output, which is then compared with the output from the prediction unit.

This technique provides 1 − 2^{−56} fault coverage, and it has a 77% hardware overhead.
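The benefit of the cubic compression can be seen in a toy version with r = 8: cubing in GF(2^8) makes error masking data-dependent and uniformly rare, whereas for a purely linear check a fixed error on the check value is either always masked or never. Expanding (x ⊕ e)^3 over GF(2^8) shows that x^3 = (x ⊕ e)^3 reduces to (x/e)^2 ⊕ (x/e) ⊕ 1 = 0, which has exactly two roots; a brute-force sketch confirms it:

```python
def gmul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        b >>= 1
    return p

def cube(x):
    """x^3 in GF(2^8): a toy (r = 8) version of the cubic check function."""
    return gmul(gmul(x, x), x)

# For every nonzero error e on the compressed check value, count the data
# values x for which the cubic comparison is fooled: cube(x) == cube(x ^ e).
masked = [sum(cube(x) == cube(x ^ e) for x in range(256)) for e in range(1, 256)]

# Uniform coverage: each error is masked by exactly 2 of the 256 data values,
# so the detection probability is the same (254/256) for every error pattern.
assert min(masked) == 2 and max(masked) == 2
```

The production design compresses to 32 bits before cubing; the toy keeps r = 8 so the exhaustive count stays small, but the uniformity argument is the same.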
Although these countermeasures can thwart DFA, the designer needs to be cautious when implementing information redundancy-based techniques: because they increase the correlation of the circuit power consumption with the processed data, the side-channel leakage is also increased [235].
FIGURE 8.31: Hybrid redundancy: (a) algorithm level; (b) round level.
whether encryption or decryption is in progress. They then extend the proposed techniques
to full duplex mode by trading off throughput and CED capability.
As shown in Fig. 8.31(a), the algorithm-level CED approach exploits the inverse re-
lationship between the entire encryption and decryption. Plaintext is first processed by
the encryption module. After the ciphertext is available, the decryption module is enabled
to decrypt the ciphertext. While the decryption module is decrypting ciphertext, the en-
cryption module can process the next block of data or be idle. A copy of plaintext is also
temporarily stored in a register. The output of decryption is compared with this copy of
the input plaintext. If there is a mismatch, an error signal will be raised, and the faulty
ciphertext will be suppressed.
For AES, the inverse relationship between encryption and decryption exists at the round
level as well. Any input data passed successively through one encryption round are recovered
by the corresponding decryption round.
For almost all the symmetric block cipher algorithms, the first round of encryption
corresponds to the last round of decryption; the second round of encryption corresponds
to the next-to-the-last round of decryption, and so on. Based on this observation, CED
computations can also be performed at the round level. At the beginning of each encryption
round, the input data is stored in a register before being fed to the round module. After
one round of encryption is finished, output is fed to the corresponding round of decryption.
Then, the output of the decryption round is compared with the input data saved previously.
If they are not the same, encryption is halted and an error signal is raised. Encryption with
round-level CED is shown in Fig. 8.31(b).
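The round-level check can be sketched as follows, using a toy invertible round function in place of an AES round (the 32-bit toy round and all names below are our own, purely illustrative):

```python
# Sketch of round-level concurrent error detection: each round output is
# immediately run through the inverse round and compared with the saved
# round input; on a mismatch the faulty result is suppressed.

def round_fn(x, k):
    # toy invertible round: XOR the round key, then rotate left by 3
    x ^= k
    return ((x << 3) | (x >> 29)) & 0xFFFFFFFF

def inv_round_fn(x, k):
    x = ((x >> 3) | (x << 29)) & 0xFFFFFFFF
    return x ^ k

def encrypt_with_ced(block, round_keys, fault_at=None):
    state = block
    for r, k in enumerate(round_keys):
        saved = state                      # register holding round input
        state = round_fn(state, k)
        if r == fault_at:                  # model a fault injection
            state ^= 0x80
        if inv_round_fn(state, k) != saved:
            raise RuntimeError("fault detected: output suppressed")
    return state

keys = [0xDEADBEEF, 0x01234567, 0x89ABCDEF]
ct = encrypt_with_ced(0x11111111, keys)
try:
    encrypt_with_ced(0x11111111, keys, fault_at=1)
    assert False, "fault should have been detected"
except RuntimeError:
    pass
```

The structure mirrors the description above: the comparison happens once per round, so a fault is caught before the next round consumes the corrupted state.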
Depending on the block ciphers and their hardware implementation, each round may
consume multiple clock cycles. Each round can be partitioned into operations and sub-
pipelined to improve performance. Each operation can consume one or more clock cycles,
such that the operations of encryption and corresponding operations of decryption satisfy
the inverse relationship. As shown in Fig. 8.32(a), applying input data to the encryption
operation and the output data of the encryption operation to the corresponding inverse
operation in decryption yields the original input data. The boundary on the left shows the
rth encryption round while the boundary on the right shows the (n − r + 1)th decryption
round, where n is the total number of rounds in encryption/decryption. Fig. 8.32(a) also
shows that the first operation of the encryption round corresponds to the mth operation
of the decryption round, which is the last operation of the decryption round. Output from
operation one of encryption is fed into the corresponding inverse operation m of decryption.
Although these techniques have close to 100% fault coverage, their throughput is 73.45%
of the original AES in half-duplex mode. They can suffer from more than 100% throughput
overhead if the design is in full-duplex mode. The hardware overhead is minimal if both
encryption and decryption are on the chip. However, if only encryption or decryption is
used in the chip, it will incur close to 100% hardware overhead.
FIGURE 8.32: Hybrid redundancy. (a) Operation level. (b) Optimized version.
To reduce the hardware overhead of the previous technique, [349] proposes a hardware
optimization that reduces the hardware utilization significantly. Fig. 8.32(b) shows the
architecture. It divides a round function block into two sub-blocks and uses them alternately
for encryption (or decryption) and error detection. Therefore, no extra calculation block is
needed; only a pipeline register, a selector, and a comparator are added. The number of
operating cycles is doubled, but the operating frequency is boosted because the round
function block in the critical path is halved. Therefore, the technique provides 85.6%
throughput compared to 73.45% in the previous one. The hardware overhead also decreases
from 97.6% to 88.9%.
3. while i ≤ 2n do
4. λ ← RandomBit() // λ = 0 implies a dummy round
5. κ ← (i ∧ λ) ⊕ 2(¬λ)
12. i ← i + λ
13. end
14. R0 ← R0 ⊕ RoundFunction(R2, k^0) ⊕ β
15. return(R0)
Algorithm 8.7 depicts the infection countermeasure proposed in [131] for AES128. At the
beginning of this algorithm, plaintext P is copied to both R0 and R1 and a secret value β
is copied to R2. In this algorithm, every round of AES is executed twice. The redundant
round, which operates on R1, occurs before the cipher round, which operates on R0. There
are dummy rounds which occur randomly across the execution of this algorithm, in addition
to one compulsory dummy round in step 14. The input to the dummy round is a secret value
β and a secret key k^0, which is chosen such that RoundFunction(β, k^0) = β. To prevent
information leakage through side channels, e.g. power analysis, dummy SubByte, ShiftRow
and MixColumn operations are added to the 0th round and a dummy MixColumn operation
is added to the 10th round of AES128. The intermediate computations of the cipher, redundant
and dummy rounds are stored in C0, C1 and C2, respectively. A random bit λ decides the
course of the algorithm: λ = 1 selects a cipher or redundant round, while λ = 0 selects a
dummy round.
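The register-selection trick in step 5 of Algorithm 8.7 can be traced in a few lines, assuming ∧ and ¬ act bitwise on the single random bit λ (our reading of the notation):

```python
# Sketch of step 5 of Algorithm 8.7: kappa = (i AND lambda) XOR 2*(NOT lambda).
# With lambda = 0 the dummy state R2 (index 2) is selected regardless of i;
# with lambda = 1, an odd i selects the redundant state R1 and an even i
# selects the cipher state R0, so each AES round runs twice.

def kappa(i, lam):
    # lam is a single random bit; (i & lam) reduces to i & 1 when lam = 1
    return (i & lam) ^ (2 * (1 - lam))

assert kappa(5, 0) == 2   # dummy round on R2, regardless of i
assert kappa(5, 1) == 1   # odd i  -> redundant round on R1
assert kappa(6, 1) == 0   # even i -> cipher round on R0
```

Since i is incremented only when λ = 1 (step 12), the redundant round (odd i) always precedes its cipher round (even i), matching the description above.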
After the computation of every cipher round, the difference between C0 and C1 is trans-
formed by Some Non Linear Function (SNLF), which operates on each byte of the difference
(C0 ⊕ C1). SNLF maps every non-zero byte to a non-zero byte and SNLF(0) = 0. The authors
in [131] have suggested using inversion in GF(2^8) as SNLF. In case of fault injection in either
the cipher or the redundant round, the difference (C0 ⊕ C1) is non-zero and the infection
spreads in subsequent computations through R0 and R2 according to steps 9-11. Also, if the
output of the dummy round, C2, is not β, the infection spreads in the subsequent computations
through steps 8-11. Finally, in step 14, the output of the last cipher round is XORed with the
output of the dummy round and β, and the resulting value is returned.
R2 and R0 are infected in steps 10 and 11. After the infection steps, we obtain:

          | 0   0   0   0            |
R0 ⊕ R1 = | 0   0   0   ε ⊕ SNLF[ε]  |
          | 0   0   0   0            |
          | 0   0   0   0            |
Finally, in step 14, the dummy round operates on the infected R2, which further infects R0.
But the ShiftRow operation of the dummy round shifts the infection to column 3 and leaves
the faulty byte of R0 in column 4 unmasked. The output of the compulsory dummy round
differs from β in column 3 and therefore the final difference between the correct ciphertext
C and faulty ciphertext C* is:
           | 0   0   β8′ ⊕ β8     0            |
∴ C ⊕ C* = | 0   0   β9′ ⊕ β9     ε ⊕ SNLF[ε]  |      (8.31)
           | 0   0   β10′ ⊕ β10   0            |
           | 0   0   β11′ ⊕ β11   0            |
Since RoundFunction(β, k^0) = β, we have:

MC(SR(S(β))) ⊕ k^0 = β

Using this relation, the xor of RoundFunction(R2, k^0) and β in step 14 of Algorithm 8.7
can now be expressed as:

RoundFunction(R2, k^0) ⊕ β = MC(SR(S(R2))) ⊕ k^0 ⊕ MC(SR(S(β))) ⊕ k^0
                           = MC(SR(S(R2) ⊕ S(β)))                          (8.32)

since the SubByte operation is the only non-linear operation in the above equation, while
MC and SR are linear. If R2 = β, then the execution of the compulsory dummy round in
step 14 has no effect on the final output R0, but if R2 ≠ β, then the output of the compulsory
dummy round infects the final output R0. However, this infection can be removed using the
above derived equation and the desired faulty ciphertext can be recovered.
On the basis of Equation 8.32, the xor of the correct ciphertext C and faulty ciphertext C*
in Equation 8.31 can now be expressed as:

         | 0   0   3·x   0            |
C ⊕ C* = | 0   0   2·x   ε ⊕ SNLF[ε]  |
         | 0   0   1·x   0            |
         | 0   0   1·x   0            |
where x = S[β13 ⊕ SN LF [ε]] ⊕ S[β13 ] (for details refer [146]). Ideally, every byte of C ∗
should be infected with an independent random value but here the compulsory dummy
round in Algorithm 8.7 infects only column 3 of C ∗ and that too, with interrelated values
and leaves the rest of the bytes unmasked.
In the following discussion, we show the significance of this result by attacking the top
row of I^10. Subsequently, we show that the infection can be removed even if the fault is
injected in the input of the 9th cipher round. We prove this by mounting the classical Piret
& Quisquater attack [126] on the countermeasure [131].
Finally, in step 14, the dummy round operates on the infected R2, which further infects R0.
In this case, the ShiftRow operation of the dummy round does not shift the infection and
the erroneous byte of R0 in column 1 is masked. The final difference between the correct
ciphertext C and faulty ciphertext C* is:
           | ε ⊕ SNLF[ε] ⊕ β0′ ⊕ β0   0   0   0 |
∴ C ⊕ C* = | β1′ ⊕ β1                 0   0   0 |      (8.33)
           | β2′ ⊕ β2                 0   0   0 |
           | β3′ ⊕ β3                 0   0   0 |
where β0′, β1′, β2′, β3′ are the infected bytes of the compulsory dummy round output, and
ε = S[I0^10 ⊕ f] ⊕ S[I0^10]. Here, we cannot use the attack technique described in [42] directly,
because the erroneous byte of 10th cipher round has also been infected with the output of
compulsory dummy round in step 14. This is different from the case when fault is injected
in any of the last three rows of 10th cipher round input. In order to carry out the attack
[42], we need to remove the infection caused by the dummy round.
Now, we can use Equation 8.32 to write the above matrix as:
         | ε ⊕ SNLF[ε] ⊕ 2·y   0   0   0 |
C ⊕ C* = | 1·y                 0   0   0 |      (8.34)
         | 1·y                 0   0   0 |
         | 3·y                 0   0   0 |
where y = S[β0 ⊕ SN LF [ε]] ⊕ S[β0 ] (for details refer [146]). We can use the value of 1 · y
from C ⊕ C ∗ to remove the infection from C ∗ and therefore unmask the erroneous byte.
As a consequence, we can perform the attack suggested in [42] to get the key byte k0^11. By
attacking the top row, the attacker now has the flexibility to mount the attack on any of the
16 bytes of the 10th cipher round input instead of always targeting the last three rows.
Observation 2: It is quite evident from this attack that the infection mechanism used
in the countermeasure [131] is not effective. The purpose of this infection countermeasure
is defeated as we can easily remove the infection and recover the desired faulty ciphertext.
This is a major flaw in this countermeasure as it makes even the 9th round susceptible to
the fault attack which we will illustrate in the following discussion.
We now mount Piret & Quisquater's attack [126] on this countermeasure and recover the
entire key using only 8 faulty ciphertexts. Subsequently, we show that even if the random
dummy rounds occur, we can still mount this attack [126].
Attack in the Absence of Random Dummy Rounds. Consider the scenario where
the attacker influences the RandomBit function so that no dummy round occurs except
the compulsory dummy round in step 14. We observe that if a fault is injected in the 9th
cipher round, then the rest of the computation is infected thrice. Once, after the 9th cipher
round in step 11, then after the 10th cipher round in step 11 and finally after the execution
of compulsory dummy round in step 14. To be able to mount Piret & Quisquater’s attack
[126], we first analyze the faulty ciphertext and identify whether a fault was injected in the
input of 9th cipher round. After identifying such faulty ciphertexts, we remove the infection
caused by the output of compulsory dummy round and 10th cipher round. Once the infec-
tion is removed, we can proceed with the attack described in [126].
The attack procedure can be summarized as follows:
1. Suppose a random fault f is injected in the first byte of the 9th cipher round input.
Before the execution of step 14, the output of faulty computation differs from the
output of correct computation in 4 positions viz. 0, 13, 10 and 7 which comprises a
diagonal. But the execution of compulsory dummy round in step 14 infects all the 16
bytes of the faulty computation. Therefore, the resulting faulty ciphertext T ∗ differs
from the correct ciphertext T in 16 bytes. We use Equation 8.32 to represent this
difference as:
T ⊕ T* =
| m0 ⊕ 2F1 ⊕ 1F2   1F3        3F4 ⊕ 1F5 ⊕ 1F6        3F7       |
| 1F1 ⊕ 3F2        1F3        2F4 ⊕ 3F5 ⊕ 1F6        m1 ⊕ 2F7  |      (8.35)
| 1F1 ⊕ 2F2        3F3        m2 ⊕ 1F4 ⊕ 2F5 ⊕ 3F6   1F7       |
| 3F1 ⊕ 1F2        m3 ⊕ 2F3   1F4 ⊕ 1F5 ⊕ 2F6        1F7       |
where Fi , i ∈ {1, . . . , 7}, represents the infection caused by the compulsory dummy
round in step 14 and mj , j ∈ {0, 1, 2, 3}, represents the difference between the correct
and faulty computation before the execution of step 14 in Algorithm 8.7 (for details
refer [146]). Now, we can deduce the values of F1 and F2 from column 1, F3 from
column 2, F4 , F5 and F6 from column 3 and F7 from column 4 and thus remove the
infection caused by the compulsory dummy round from T ∗ .
2. After removing the infection caused by the compulsory dummy round, we get:

            | m0   0    0    0  |
   T ⊕ T* = | 0    0    0    m1 |
            | 0    0    m2   0  |
            | 0    m3   0    0  |
We can now remove the infection caused by the 10th cipher round. Each mj can be
written as zj ⊕ SN LF [zj ], j ∈ {0, 1, 2, 3}, where SN LF [zj ] represents the infection
caused in step 11 of Algorithm 8.7, after the execution of 10th cipher round and
zj represents the difference between the outputs of correct and faulty computations
before step 11 (for details refer [146]). If SNLF is implemented as inversion in GF(2^8),
we get two solutions of zj for every mj. Since the 4 equations represented by mj are
independent, we obtain 2^4 solutions for T ⊕ T*. Here, T is known, therefore we have
2^4 solutions for T* as well.
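The two-solution claim can be checked exhaustively in a few lines of Python (our own sketch; `snlf` uses a brute-force field inverse for clarity):

```python
# Sketch: with SNLF taken as inversion in GF(2^8), every byte value m that
# can be written as m = z XOR SNLF[z] has exactly two solutions z -- if z
# is a solution, so is its field inverse (z^2 + m*z + 1 = 0 has at most
# two roots in a field).

from collections import Counter

def mul(a, b):
    """GF(2^8) multiplication, AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = ((a << 1) ^ 0x1B) & 0xFF if a & 0x80 else a << 1
        b >>= 1
    return r

def snlf(z):
    """Inversion in GF(2^8), with SNLF(0) = 0 by convention."""
    if z == 0:
        return 0
    # brute-force inverse is fine for a 256-element field
    return next(x for x in range(1, 256) if mul(z, x) == 1)

counts = Counter(z ^ snlf(z) for z in range(256))
assert all(v == 2 for v in counts.values())   # two z per reachable m
assert len(counts) == 128                     # 128 reachable values of m
```

This is exactly why each of the four independent byte equations mj = zj ⊕ SNLF[zj] contributes a factor of two, giving the 2^4 candidate values for T*.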
3. After removing the infection caused by 10th cipher round, the attacker makes hy-
potheses on 4 bytes of the 10th round key k 11 and uses the faulty and correct output
where SNLF[b · f′], b ∈ {1, 2, 3}, is the infection caused in step 11 after the execution of
the 9th cipher round. The above set of equations is solved for all 2^4 possible values of T*
(for the complexity analysis of the attack, refer [146]).
(T ⊕ T ∗ )(4·(i+1))%16 = (T ⊕ T ∗ )(4·(i+1))%16+1
(T ⊕ T ∗ )(4·(i+1))%16+2 = 3 · (T ⊕ T ∗ )(4·(i+1))%16
(8.36)
(T ⊕ T ∗ )(4·(i+3))%16+2 = (T ⊕ T ∗ )(4·(i+3))%16+3
(T ⊕ T ∗ )(4·(i+3))%16 = 3 · (T ⊕ T ∗ )(4·(i+3))%16+2
where (T ⊕ T ∗ )j represents the j th byte in matrix T ⊕ T ∗ . One can see from Equation 8.35,
that the above relation arises because the compulsory dummy round uses the same value
to mask more than one byte of the faulty computation.
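The distinguisher can be checked numerically against the byte pattern of Equation 8.35 (our own sketch; the Fj and mj values below are random stand-ins for the unknown infection and difference bytes, with the faulted byte index i = 0):

```python
# Sketch: because the compulsory dummy round masks several bytes of the
# faulty computation with the SAME value Fj, the fixed byte relations of
# Equation 8.36 hold for T XOR T* and identify the desired faulty
# ciphertexts.

import random

def mul(a, b):
    """GF(2^8) multiplication, AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = ((a << 1) ^ 0x1B) & 0xFF if a & 0x80 else a << 1
        b >>= 1
    return r

random.seed(0)
F = [random.randrange(1, 256) for _ in range(8)]   # F[1..7] used
m = [random.randrange(1, 256) for _ in range(4)]

# T XOR T*, column-major byte order, per the structure of Equation 8.35
D = [
    m[0] ^ mul(2, F[1]) ^ F[2], F[1] ^ mul(3, F[2]),
    F[1] ^ mul(2, F[2]), mul(3, F[1]) ^ F[2],            # column 1
    F[3], F[3], mul(3, F[3]), m[3] ^ mul(2, F[3]),       # column 2
    mul(3, F[4]) ^ F[5] ^ F[6],
    mul(2, F[4]) ^ mul(3, F[5]) ^ F[6],
    m[2] ^ F[4] ^ mul(2, F[5]) ^ mul(3, F[6]),
    F[4] ^ F[5] ^ mul(2, F[6]),                          # column 3
    mul(3, F[7]), m[1] ^ mul(2, F[7]), F[7], F[7],       # column 4
]

i = 0  # faulted byte index
assert D[(4 * (i + 1)) % 16] == D[(4 * (i + 1)) % 16 + 1]
assert D[(4 * (i + 1)) % 16 + 2] == mul(3, D[(4 * (i + 1)) % 16])
assert D[(4 * (i + 3)) % 16 + 2] == D[(4 * (i + 3)) % 16 + 3]
assert D[(4 * (i + 3)) % 16] == mul(3, D[(4 * (i + 3)) % 16 + 2])
```

A ciphertext pair that fails these relations was not produced by a fault in the targeted 9th cipher round and is discarded.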
Simulation Results. We carried out Piret & Quisquater's attack [126] on Algorithm 8.7
using a random byte fault model with no control over fault localization. We implemented
Algorithm 8.7 in C and used the GNU Scientific Library (GSL) for the RandomBit function.
The simulation details are as follows:
2. Each test executes Algorithm 8.7 until 8 desired faulty ciphertexts are obtained. How-
ever, as the targeted (22 + d − 2)th RoundFunction can also be a dummy or a 10th re-
dundant round, the undesired faulty ciphertexts obtained in such cases are discarded.
Equation 8.36 can be used to distinguish between desired and undesired faulty cipher-
texts.
3. An average of the faulty encryptions over 1000 tests is taken, where the number of faulty
encryptions in a test = (8 desired faulty ciphertexts + undesired faulty ciphertexts).
The probability that the targeted RoundFunction is a 9th cipher round decreases with
higher values of d but still remains non-negligible. In other words, the higher the value of d,
the more faulty encryptions are required in a test, as is evident from Fig. 8.33.
FIGURE 8.33: Average number of faulty encryptions versus the number of random dummy rounds d.
Observation 3: The feasibility of Piret and Quisquater’s attack shows that the infection
method employed in the countermeasure [131] fails to protect against classical fault attacks.
1. If a fault is injected in any of the cipher, redundant or dummy round, all bytes in the
resulting ciphertext should be infected.
2. As shown, merely infecting all bytes in the output is not sufficient. Therefore, the
infection technique should result in such a faulty ciphertext that any attempt to
make hypotheses on the secret key used in AES is completely nullified.
3. The countermeasure itself should not leak any information related to the
RoundF unction computations which can be exploited through a side channel.
Given below is an algorithm designed to possess all the aforementioned properties. It
uses cipher, redundant and dummy rounds along the lines of Algorithm 8.7 but exhibits a
rather robust behaviour against fault attacks.
The following additional notation is used in this algorithm:
1. rstr: A ‘t’ bit random binary string, consisting of (2n) 1’s corresponding to AES
rounds and (t − 2n) 0’s corresponding to dummy rounds.
2. BLFN: A Boolean function that maps a 128 bit value to a 1 bit value. Specifically,
BLF N (0) = 0 and for nonzero input BLF N evaluates to 1.
3. γ: A one bit comparison variable to detect fault injection in AES round.
4. δ: A one bit comparison variable to identify a fault injection in dummy round.
6. κ ← (i ∧ λ) ⊕ 2(¬λ)
7. ζ ← λ · ⌈i/2⌉ // ζ is actual round counter, 0 for dummy
8. Rκ ← RoundFunction(Rκ, k^ζ)
9. γ ← λ(¬(i ∧ 1)) · BLFN(R0 ⊕ R1) // check if i is even
10. δ ← (¬λ) · BLFN(R2 ⊕ β)
11. R0 ← (¬(γ ∨ δ) · R0) ⊕ ((γ ∨ δ) · R2)
12. i ← i + λ
13. q ← q + 1
14. end
15. return(R0 )
Apart from these elements, Algorithm 8.8 exhibits the following features which make
it stronger than Algorithm 8.7:
1. In Algorithm 8.8, matrix R2 represents the state of the dummy round and is initial-
ized to a random value β. This state matrix R2 bears no relation with any of the
intermediate states or the round keys of AES. When a fault is induced in any of the
rounds, Algorithm 8.8 outputs a matrix R2 . For fault analysis to succeed, the faulty
output should contain some information about the key used in the cipher. However,
the new countermeasure outputs matrix R2 which is completely random and does not
have any information about the key used in the AES, which makes the differential
fault analysis impossible. Since in the case of fault injection Algorithm 8.8 outputs
dummy state R2 , the pair (β, k 0 ) should be refreshed in every execution4 .
2. In Algorithm 8.8, more than one dummy round can occur after the execution of last
cipher round and consequently the 10th cipher round is not always the penultimate
round.
3. Since the number of dummy rounds in Algorithm 8.8 is kept constant, the leakage of
timing information through a side channel is also prevented.
For a clear illustration, Table 8.2 shows the functioning of Algorithm 8.8.
If any of the cipher or redundant round is disturbed, then during the computation of
cipher round, (R0 ⊕ R1 ) is non-zero and BLFN(R0 ⊕ R1 ) updates the value of γ to 1. As
a result, R0 is replaced by R2 in step 11. Similarly, if the computation of dummy round is
faulty, (R2 ⊕ β) is non-zero and δ evaluates to 1. In this case too, R0 is replaced by R2 .
Also, if the state of comparison variables γ and δ is 1 at the same time, then in step 11, R0
is substituted by R2 as this condition indicates fault in comparison variables themselves. In
case of undisturbed execution, Algorithm 8.8 generates a correct ciphertext.
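The detection-and-infection logic of steps 9-11 can be sketched as follows (a toy model of our own, with 128-bit states reduced to plain integers):

```python
# Sketch of steps 9-11 of Algorithm 8.8: if the cipher and redundant
# states differ, or the dummy state drifts from beta, the output register
# R0 is silently replaced by the key-independent dummy state R2.

def blfn(v):
    """BLFN: 0 for a zero value, 1 for any non-zero value."""
    return 0 if v == 0 else 1

def infect(R0, R1, R2, beta):
    gamma = blfn(R0 ^ R1)        # fault in cipher or redundant round?
    delta = blfn(R2 ^ beta)      # fault in dummy round?
    flag = gamma | delta
    # step 11: R0 <- (not flag)*R0 XOR flag*R2
    return R2 if flag else R0

beta = 0x5A5A5A5A
assert infect(0x1234, 0x1234, beta, beta) == 0x1234        # clean run
assert infect(0x1234, 0x1235, beta, beta) == beta          # round fault
assert infect(0x1234, 0x1234, beta ^ 1, beta) == beta ^ 1  # dummy fault
```

Because the returned R2 is unrelated to any AES intermediate state or round key, the attacker's faulty ciphertext carries no exploitable key information.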
8.11 Conclusions
The chapter presents an overview of fault analysis and fault models. After introducing
Differential Fault Analysis, the chapter provides a detailed analysis of the comparisons and
inter-relationships between the various fault models which are assumed for DFA on AES. Sub-
sequently, we discuss the fault attacks on AES, starting from early efforts to recent attacks
on the cipher. The chapter deals with the attacks in two different directions: single and
multiple byte fault model based attacks, and attacks which target the data path and the
key-schedule. Finally, the chapter concludes with some suitable countermeasures based on
detection and infection and provides a comparative study of the schemes.
4 One should note that even a new pair of (β, k^0) cannot protect Algorithm 8.7 against the attacks described above.
“History is malleable. A new cache of diaries can shed new light, and
archeological evidence can challenge our popular assumptions.”
(Ken Burns)
American Director
The memory organization in a processor can significantly affect its performance. Of all
the memory that is present, the cache is arguably the most important when performance is
considered. Cache memories used in processors are equipped with several additional features
to meet the high throughput requirements of the processor. The pitfall with cache memories
is that they result in side-channel leakage leading to attacks on ciphers. The side-channel
leakage is exploited in several flavors of attacks, which are described in this chapter.
FIGURE: Memory organization in a processor — the L1 data and instruction caches sit closest to the processor, followed by a unified L2 cache and main memory; speed decreases and size increases down the hierarchy.
types of cache misses: compulsory misses, capacity misses, and conflict misses. Compulsory
misses are cache misses caused by the first access to a block that has never been used in
the cache. Capacity misses occur when blocks are evicted and then later reloaded into the
cache. This occurs when the cache cannot contain all the blocks needed during execution
of a program. Conflict misses occur when one block replaces another in the cache.
Aword = A mod 2^δ
Aline = ⌊A/2^δ⌋ mod 2^b

Note that the size of the address Aword is δ bits while the size of the line address is b bits.
Problems arise due to the many-to-one mapping from blocks to lines (as seen in Fig. 9.2).
The cache controller needs to know which block is present in a cache line before it can
decide if a hit or a miss has occurred. This is done by using an identifier called the tag. The
identifier denoted Atag for the address A is given by

Atag = ⌊⌊A/2^δ⌋/2^b⌋
Thus every line in the cache has an associated tag as shown in Fig. 9.3. For every memory
access, the tag for the address (Atag ) is compared with the tag stored in the cache. A match
results in a cache hit, otherwise a cache miss occurs.
Time-driven cache attacks on block ciphers monitor these hits and misses that occur
FIGURE 9.3: Cache lookup — the tag Atag of the address is compared with the tag stored at line Aline; a match (hit) selects the data, otherwise a miss occurs.
due to look-up tables used in the implementation. A look-up table is defined as a global
array in a program. For example, the C construct for a look-up table is as follows.
const unsigned char T0[256] = { 0x63, 0x7C, 0x77, 0x7B, ... };
Assuming that the base address for T0 is 0x804af40, Table 9.1 shows how the table gets
mapped for a 4KB direct-mapped cache with a cache line size of 64 bytes. Every block in
the table maps to a distinct cache line provided that the table has fewer blocks than the
lines in the cache.
The drawback of the direct mapped scheme is poor cache utilization and cache thrashing.
Consider a program which continuously reads data from addresses A and then A′ . The
addresses are such that they map into the same line in the cache (i.e. Aline = A′line ). Thus
every memory access would result in a cache miss, considerably slowing down the program
even though the other lines in the cache are unused. This is called cache thrashing.
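Two such conflicting addresses are easy to construct (a toy illustration of our own, for a direct-mapped cache with 64 lines of 64 bytes):

```python
# Sketch of cache thrashing in a direct-mapped cache: two addresses that
# differ only in their tag bits map to the same line, so alternating
# accesses evict each other on every reference.

LINE_SIZE, NUM_LINES = 64, 64

def line_of(addr):
    """Direct-mapped line index of a byte address."""
    return (addr // LINE_SIZE) % NUM_LINES

A = 0x10040                       # hypothetical address
A2 = A + LINE_SIZE * NUM_LINES    # same line index, different tag

assert A != A2
assert line_of(A) == line_of(A2)  # the two blocks contend for one line
```

Any two addresses separated by a multiple of the cache size (here 4KB) collide in this way, which is the access pattern that produces thrashing.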
An improved address translation scheme divides the cache into 2^s sets. Each set groups
w = 2^b/2^s cache lines. A block now maps to a set instead of a line, and can be present in
any of the w cache lines in the set. A cache which uses such an address translation scheme is
called a w-way set associative cache. The address translation for such a scheme is defined
as follows.

Aword = A mod 2^δ
Aset = ⌊A/2^δ⌋ mod 2^s
Atag = ⌊⌊A/2^δ⌋/2^s⌋
TABLE 9.1: Mapping of Table T0 to a Direct-Mapped Cache of size 4KB (2^δ = 64 and
2^b = 64)

Elements           | Address                 | Line | Tag
T0[0] to T0[63]    | 0x804af40 to 0x804af7f  | 61   | 0x804a
T0[64] to T0[127]  | 0x804af80 to 0x804afbf  | 62   | 0x804a
T0[128] to T0[191] | 0x804afc0 to 0x804afff  | 63   | 0x804a
T0[192] to T0[255] | 0x804b000 to 0x804b03f  | 0    | 0x804b
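The direct-mapped translation can be checked against the mapping in Table 9.1 with a few lines (our own sketch; δ = 6 word bits and b = 6 line bits for this cache):

```python
# Sketch of the direct-mapped address translation, verified against
# Table 9.1 (64-byte lines, 64 lines => a 4KB cache).

DELTA, B = 6, 6  # word-offset bits and line-index bits

def a_word(A):
    return A % (1 << DELTA)

def a_line(A):
    return (A >> DELTA) % (1 << B)

def a_tag(A):
    return A >> (DELTA + B)

base = 0x804af40                 # block holding T0[0] per Table 9.1
assert a_line(base) == 61 and a_tag(base) == 0x804a
assert a_line(base + 64) == 62                         # T0[64]
assert a_line(base + 128) == 63                        # T0[128]
assert a_line(base + 192) == 0                         # T0[192] wraps
assert a_tag(base + 192) == 0x804b                     # with a new tag
```

Note how the last block wraps around to line 0 while its tag changes, exactly as in the final row of the table.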
The average memory access time is given by

Average memory access time = Hit time + Miss Rate × Miss Penalty

where Hit time is the time to hit in the cache, while Miss Penalty is the time required to
replace the block from memory (i.e. the miss time). Miss Rate is the fraction of the memory
accesses that miss the cache.
FIGURE: One round of a substitution-permutation cipher — the round input bytes x0, x1, x2, ..., xn−1 are XORed with key bytes k0, k1, k2, ..., kn−1, the results s1, s2, s3, ..., sn pass through S-boxes S, and a diffusion layer produces the round output.
Algorithm 9.1 presents the AES-128 algorithm. The first operation on the input is the
AddRoundKeys, which serves to provide the initial randomness by mixing the input key.
The state is then subjected to 9 rounds to further increase the diffusion and confusion in
the cipher [370]. Each round comprises of 4 operations on the state: SubBytes, ShiftRows,
MixColumns, and AddRoundKeys. The state is then subjected to a final round, which has
all operations except the MixColumns operation. The four AES operations are defined as
follows:
FIGURE 9.5: Transformations on the AES state in one round — the 4 × 4 state of bytes si (0 ≤ i ≤ 15, arranged column-wise) passes through SubBytes (each si replaced by S(si)), ShiftRows, MixColumns, and AddRoundKey (XOR with the round key bytes ki).
• ShiftRows : Provides a cyclic shift of the i-th row in the state by i bytes towards the
left (where 0 ≤ i ≤ 3). That is, each byte in the i-th row is cyclically shifted to the
left by i bytes.
Starting from the 4 × 4 byte state, Fig. 9.5 shows the transformation it undergoes in a
round (for 1 ≤ r ≤ 9).
Of all operations, the SubBytes is the most difficult to implement. On 8-bit micro-
controllers, a 256-byte look-up table is ideal to perform this operation. The table provides
the necessary flexibility in terms of content, small footprint, and speed. For 32-bit platforms,
more efficient implementations can be built using larger tables. We give a brief description
of this method, which is known as the T-table implementation. T-table implementations were
first proposed in [104] and have been adopted by several crypto-libraries such as OpenSSL1 .
1 http://www.openssl.org
is observed from the power or electro-magnetic traces of the cryptographic device, the
adversary can then infer that the first and second accesses were to the same memory block
of the table (a collision) and therefore will have the same line address (Section 9.1.1). The
following relation can then be built: ⟨x0 ⊕ k0⟩ = ⟨x1 ⊕ k1⟩, where ⟨·⟩ indicates the address
of the memory block (the top log2 l bits of the table index). Thus the XOR of the secret
key bytes k0 and k1 can be inferred.
Since k0 and k1 together have an initial entropy of 2n, the entropy reduces to n + δ after
the cache traces are obtained. Equation 9.3 is the basic relation which can be deduced from
the cache traces. Trace-driven attacks use this relation along with cipher properties to get
more information about the secret key. Several trace-driven attacks have been published,
for example [288, 46, 216, 121, 58, 10, 127]. In this section we survey the various reported
trace-driven attacks.
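The basic relation can be sketched in a few lines (our own illustration; the key bytes are hypothetical stand-ins, not values from the text):

```python
# Sketch of the basic trace-driven relation: a collision between the first
# two table look-ups means <x0 ^ k0> = <x1 ^ k1>, where <.> keeps the top
# log2(l) bits of a table-index byte.  With l = 16 blocks, the attacker
# learns the top nibble of k0 ^ k1.

LOG2_L = 4  # the table occupies l = 2^4 = 16 memory blocks

def blk(v):
    """<v>: the memory-block address, i.e. the top log2(l) bits of v."""
    return v >> (8 - LOG2_L)

k0, k1 = 0x3A, 0x7F  # secret key bytes (unknown to the attacker)

# every plaintext pair (x0, x1) whose look-ups collide reveals blk(k0^k1)
for x0 in range(256):
    for x1 in range(256):
        if blk(x0 ^ k0) == blk(x1 ^ k1):          # observed collision
            assert blk(x0 ^ x1) == blk(k0 ^ k1)   # leaks key-XOR bits
```

Each observed collision thus gives log2 l bits of a key-byte XOR, which is the entropy reduction discussed above.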
implementation. If cache traces of the first round are considered then relationships between
the key bytes can be constructed of the form ⟨ki^(0) ⊕ kj^(0)⟩, where 0 ≤ i, j ≤ 15 and ki^(0)
and kj^(0) are the AES whitening keys as described in Algorithm 9.1. If log2 l bits of
information are revealed about the memory accesses, then the first round attack can reduce
the key space from 2^128 to 2^(128−15·log2 l) keys. Assuming 16 elements of the table fit in a
single cache line (thus the number of blocks occupied by the table is l = 256/16 = 16), the
key space for AES-128 is reduced to 2^68 keys. Fournier and Tunstall also showed that if the
second round cache traces were also considered then relationships between the two rounds
can be used to reduce the key space further. For example, if the first look-ups in two rounds
collide, then the following relationship can be built (refer Algorithm 9.1 and Fig. 9.5):

⟨s0^(0)⟩ = ⟨s0^(1)⟩
⟨s0^(0)⟩ = ⟨2 · S(s0^(0)) ⊕ 3 · S(s5^(0)) ⊕ S(s10^(0)) ⊕ S(s15^(0)) ⊕ k0^(1)⟩      (9.4)
⟨x0 ⊕ k0^(0)⟩ = ⟨2 · S(x0 ⊕ k0^(0)) ⊕ 3 · S(x5 ⊕ k5^(0))
                ⊕ S(x10 ⊕ k10^(0)) ⊕ S(x15 ⊕ k15^(0)) ⊕ k0^(1)⟩

Further, the key expansion algorithm of AES can be used to represent k0^(1) as k0^(0) ⊕ S(k13^(0)) ⊕
1. Thus there are 5 key bytes involved in the relation: k0^(0), k5^(0), k10^(0), k13^(0), and k15^(0). The
values of these bytes are unknown, however log2 l bits of their XOR relationships can be
determined from the first round attack (i.e., the first round attack gives ⟨k0^(0) ⊕ k5^(0)⟩, ⟨k0^(0) ⊕
k10^(0)⟩, ⟨k0^(0) ⊕ k13^(0)⟩, ⟨k0^(0) ⊕ k15^(0)⟩). Consequently the key space for the tuple of 5 key bytes
reduces from 2^40 to 2^24 (assuming log2 l = 4). A search in the key space of 2^24 which satisfies
Equation 9.4 is required to completely recover the key bytes.
The attack in [121] used adaptively chosen plaintexts. First x1 is varied until a cache hit
is obtained in the second table access, then x2 is varied until the third access results in a
cache hit, and so on until 15 cache hits are obtained in the first round. The first round cache
trace would then look as follows: (MHHHHHHHHHHHHHHH). In [127], Gallais, Kizhvatov,
and Tunstall observed that while varying xi (for 1 ≤ i ≤ 14), to obtain a collision in the
i-th memory access, the information in the cache trace to the right of i is ignored (i.e. all
traces for j (i < j ≤ 15) are ignored). They then provided an improved attack that utilizes
information in these parts of the trace as well. This reduced the expected number of traces
required in the first round of attack from 125 (in [121]) to around 15. They also applied
these enhancements to improve the second round attack on AES.
In [10], Aciiçmez and Koç showed that if a collision in the last round exists, say between the
i1-th and i2-th accesses (0 ≤ i1, i2 ≤ 15), then the following relation can be constructed:

⟨si1^(9)⟩ = ⟨si2^(9)⟩
⟨S^−1(yi1 ⊕ ki1^(10))⟩ = ⟨S^−1(yi2 ⊕ ki2^(10))⟩      (9.6)

Due to the non-linearity of the S-box, only the correct values of ki1^(10) and ki2^(10) obey the
relation for every ciphertext sample.
3. Encrypt x again and this time obtain the cache trace by monitoring its power con-
sumption.
In the second encryption, if a cache miss is observed in the i-th memory access of the
first round, it can be inferred with significant probability that the invalidated cache line
was accessed during the i-th memory access, where 0 ≤ i ≤ 15. This reveals the most
significant bits of the index of the table access (i.e. ⟨si^(0)⟩), and consequently bits of the key,
since ⟨ki^(0)⟩ = ⟨si^(0) ⊕ xi⟩. Bertoni et al. demonstrated this attack on the AES block cipher
with the S-box implemented using a table of 256 bytes. The attack was demonstrated using
the SimpleScalar simulator and software power models to determine the cache trace.
FIGURE 9.6: Cache Sets Corresponding to Table Tj for Evict+Time and Prime+Probe
Access Driven Attacks (shaded sets contain spy data; unshaded sets contain cipher data)
Fig. 9.6 shows the various stages of the cache sets used by the table Tj . There are two
possible outcomes after the second encryption:
• If ⟨si+j^(0)⟩ = ⟨s̃i+j^(0)⟩, implying ⟨ki+j^(0)⟩ = ⟨k̃i+j^(0)⟩, then the cache set M is always accessed
by both encryptions (in steps 1 and 4). Particularly, the second encryption would
always (with probability = 1) result in a cache miss for the access Tj[si+j^(0)].

• If ⟨si+j^(0)⟩ ≠ ⟨s̃i+j^(0)⟩, implying ⟨ki+j^(0)⟩ ≠ ⟨k̃i+j^(0)⟩, then for a value of x, the cache set M
is not accessed at Tj[s̃i+j^(0)]. However, the set may be accessed by the other memory
accesses to Tj. Thus for the second encryption (in step 4), the probability of a cache
miss is ≤ 1 for the access Tj[si+j^(0)].

If several values of x are considered, the ever-present cache miss at Tj[si+j^(0)] when ⟨si+j^(0)⟩ =
⟨s̃i+j^(0)⟩ would cause a slight increase in the expected encryption time compared to when
⟨si+j^(0)⟩ ≠ ⟨s̃i+j^(0)⟩, where the cache miss at Tj[si+j^(0)] is not always present. This difference in
time can be detected by an adversary to determine the correct value of ⟨si+j^(0)⟩, thereby
obtaining ⟨ki+j^(0)⟩.
In [56], Bogdanov, Eisenbarth, and Paar use the Evict+Time strategy to mount a dif-
ferential cache-collision attack on AES. The main idea of the attack is to choose pairs of
plaintexts such that they cause wide collisions: five AES S-box operations (one in the 2nd
round and four in the 3rd) have either pairwise equal or pairwise distinct values. The
evict+time technique is used to identify plaintext pairs with wide collisions. The existence
of a wide collision causes the second encryption, on average, to be faster by a margin
compared to when there is no wide collision. The wide collisions are then used to construct
a set of four non-linear equations, which when solved reveal parts of the key.
Prime+Probe: In the evict+time method, the time for the memory access to T_j[s_{i+j}^{(0)}]
(where 0 ≤ j ≤ 3 and i ∈ {0, 4, 8, 12}) gets reflected in the total execution time of the
cipher, thus leading to the attack. However, this technique delivers a low success rate due to the
presence of additional memory accesses and other code that executes during the encryption.
Further, there is considerable noise from sources such as instruction scheduling, conditional
branches, and cache contention, resulting in a low signal-to-noise ratio (SNR). In the
prime+probe method, smaller pieces of code are timed, leading to a higher attack success rate. The
steps involved in prime+probe are as follows:
1. Define an array A as large as the cache memory and read a value of A for every
memory block (thus filling the entire cache with A).
2. Trigger an encryption with a random input x.
3. For a guess of ⟨s̃_{i+j}^{(0)}⟩, determine the cache set that T_j[⟨s̃_{i+j}^{(0)}⟩] gets mapped into. Denote
this cache set as M.
4. Access A at indices which get mapped to the cache set M and time the accesses.
Fig. 9.6 shows the various stages of the cache sets used by the table Tj .
If ⟨s̃_{i+j}^{(0)}⟩ is correct (i.e., ⟨s̃_{i+j}^{(0)}⟩ = ⟨s_{i+j}^{(0)}⟩), then A's data present in the cache set M would be
evicted with probability = 1 during the encryption in step 2. However, this probability can
be less than one if ⟨s̃_{i+j}^{(0)}⟩ is incorrect. Further, an eviction of A's data at M results
in a cache miss in the fourth step, identified by a longer memory access time. However, if
A's data at M is not evicted, then the last step has a cache hit and therefore a shorter
access time. The correct ⟨s̃_{i+j}^{(0)}⟩ (and thus ⟨k̃_{i+j}^{(0)}⟩) can therefore be identified by repeating the
four steps with several inputs.
to be targeted [280, 384, 281]. Consider the first table access in the second round. From
Algorithm 9.1, Fig. 9.5, and the key scheduling algorithm of AES, this is
s_0^{(1)} = 2·S[x_0 ⊕ k_0^{(0)}] ⊕ 3·S[x_5 ⊕ k_5^{(0)}] ⊕ S[x_{10} ⊕ k_{10}^{(0)}] ⊕ S[x_{15} ⊕ k_{15}^{(0)}] ⊕ k_0^{(0)} ⊕ S(k_{13}^{(0)}) ⊕ 1 .    (9.7)
The value of ⟨s_0^{(1)}⟩ is affected by the keys k_0^{(0)}, k_5^{(0)}, k_{10}^{(0)}, k_{13}^{(0)}, and k_{15}^{(0)}, each occupying one
byte. The first round attack reveals log_2 l bits of each of these key bytes, leaving δ bits
unknown. Thus there is a space of 2^{5δ} keys that need to be searched. The evict+time or
prime+probe methods can be used along with Equation 9.7 to identify the correct key from
this key space.
Pre-emptive OS scheduling divides CPU time into equally spaced intervals called slices. At
the beginning of a slice, a process gets allocated the CPU and executes until it voluntarily
relinquishes the CPU or the time slice completes. A new process may then get allocated
the CPU through a mechanism known as context switching. In [272], Neve and Seifert suggest
that spy processes can exploit such schedulers to obtain covert information about a cipher's
execution, though no explicit details about the construction were given. In [385], Tsafrir,
Etsion, and Feitelson present a practical malicious code that can exploit context-switching
in OS schedulers. The intuition is that it starts executing at the beginning of a slice, but
yields the processor before the slice completes. Another process is then scheduled for the
remaining time interval in the slice. The authors show several applications of the malicious
code, such as denial of service, bypassing profiling and administrative policies, etc. Gullasch,
Bangerter, and Krenn use such a malicious code to develop a fine-grained access-driven cache
timing attack in [142].
The cipher then executes in the small time interval that remains in the time slice. Typically
the cipher is made to execute for only 200 clock cycles before getting blocked again. This
interval is just sufficient for the cipher to make one memory access. The next spy thread then
gets scheduled into the processor and the single access made by the cipher is determined
by measuring the memory access time to access its data. In this way, the cipher is executed
very slowly, and each memory access it makes can be tracked by the spy.
The ExecutionTime is the time required to perform an encryption, that is, from the time
the crypto-program receives its input to the time it produces an output. It is variations
in this component that are useful for the attack. The scaling factor A accommodates clock
skew between the server and client. The PropagationTime is the average latency of the com-
munication, while the Jitter absorbs all the randomness introduced by the network, load
on the machines, and any other source. The value of Jitter varies depending on the
distance between the server and client. For public-key ciphers, the variation in Execution-
Time is strong enough to overcome the jitter in LANs. In [66] and [65], two attacks over
LANs were demonstrated; the former was on RSA and the latter on ECDSA. For block
5 The evict+time attack discussed in Section 9.4.1 falls into both the time-driven and access-driven
categories. Like time-driven attacks, the encryption time of the cipher is used as the side-channel, and like
access-driven attacks, the cache sets accessed by the cipher are obtained by a spy which shares the cache
memory of the cipher process.
[Figure: Remote timing measurement setup. Server: a cryptographic system that performs an encryption on the received input. Client: 1. t_s = Read Timestamp; 2. Trigger Encryption; 3. t_e = Read Timestamp; 4. T = t_e − t_s.]
ciphers, the variations in ExecutionTime are comparatively small, making attacks over the
LAN more difficult. The remote timing attacks demonstrated on block ciphers had the
server and client running on the same host and communicating via TCP/IP sockets ([15]
and [406]). These attacks are applicable in current cloud computing infrastructures,
where several users (including malicious users) share the same hardware but in different
virtualized environments. The attack in [406] is demonstrated in one such environment.
Further, time-driven attacks could threaten future networks, where increased net-
work speeds would reduce the amount of jitter in the measurements, making variations in
the ExecutionTime more easily observable. Most time-driven attacks use these variations to
distinguish between a hit and a miss. This is discussed in the next part of the section.
[Figure 9.8 consists of two plots: (a) Encryption Time (clock cycles) versus Number of Cache Misses; (b) Frequency Distribution of encryption time (clock cycles) for the Collision and Average cases.]
FIGURE 9.8: Distributions of Encryption Time of OpenSSL AES Block Cipher on Intel
Core 2 Duo (E7500)
about the keys can be obtained by using timing instead of power traces [387]. The model by
Tsunoo et al. uses the fact that the execution time is linearly proportional to the number
of cache misses due to the look-up table used in the cipher’s implementation. This is seen
in Fig. 9.8(a), which plots the execution time versus number of cache misses due to the
look-up tables for the OpenSSL implementation of AES executing on an Intel Core 2 Duo
machine. Thus, on average, a longer execution time implies a higher number
of cache misses during the encryption. Fig. 9.8(b) shows two timing distributions
for an implementation of AES. The only difference between the two distributions is that,
on average, one of the distributions has one fewer cache miss than the other. This
distribution, called collision in Fig. 9.8(b), can be clearly distinguished by its shorter
encryption time.
This principle of distinguishing a hit from a miss was used in the first practical cache at-
tack [387]. The cipher targeted was MISTY1 [248], implemented on an Intel Pentium III
system running at 600 MHz. The attack required around 2^17 chosen plaintexts and suc-
cessfully retrieved the key in over 90% of the attacks. The attack was improved in [388]
to reduce the number of measurements required to around 2^14. In [386], encryption time
was used to distinguish between a cache hit and a cache miss in DES. The attack required
around 2^23 known plaintexts to retrieve the DES key. Later, Aciiçmez, Schindler, and Koç
used the same principle to mount a first round time-driven cache attack on AES [15].
set. The smaller number of cache misses in the correct set will result in a smaller average
execution time. This is used to distinguish the correct key.
Further, the number of cache misses forms a Gaussian distribution with mean (μ) and
variance (σ²) defined to be

μ = Σ_{m=1}^{l} m · Pr[m]
σ² = Σ_{m=1}^{l} m² · Pr[m] − μ² .    (9.12)
In a similar way another Gaussian distribution is obtained, which represents the number of
cache misses during an encryption given that one collision is always present (ever-present)
for a specific memory access. The distinguishability between the two distributions is used
in time-driven attacks (for example, Fig. 9.8(b), where the distribution called collision
has the ever-present collision, while the average case distribution does not). Tiri et al.
used estimation techniques from [241] to compute the number of measurements required to
distinguish the two distributions with a certain level of confidence.
FIGURE 9.9: Timing Profile for D0 vs x0 for AES on an Intel Core 2 Duo (E7500)
built for the secret key. A statistical comparison of this timing profile with the template
reveals the secret key. Below we provide details of the attack.
Building a Timing Profile: The AES input has sixteen bytes, as seen in Section 9.2.1.1.
For each byte, an array of size 256 is defined, which stores the average execution time.
We denote these arrays by A_i for 0 ≤ i ≤ 15 and their 256 elements as A_i[j] for 0 ≤
j ≤ 255. The adversary finds the encryption time for a randomly chosen input (say x =
(x_0 ∥ · · · ∥ x_i ∥ · · · ∥ x_{15}), 0 ≤ i ≤ 15) and updates the average timing of A_i[x_i] for all 16
arrays. This is repeated for around 2^24 inputs. The average time (t_avg) over all encryptions
is computed and deviation vectors are found as D_i[j] = A_i[j] − t_avg for 0 ≤ i ≤ 15 and
0 ≤ j ≤ 255. Each deviation vector forms a timing profile for an input byte. Fig. 9.9 shows
the timing profile for the first input byte of AES. It plots the value of x_0 on the x-axis and
D_0[x_0] on the y-axis. To attack a key byte, two such timing profiles are required: one for a
known key and the other for an unknown key.
Extracting Keys from the Timing Profiles: Each vertical line in the timing profile provides
an average deviation D_i[j], where 0 ≤ i ≤ 15 and 0 ≤ j ≤ 255. This deviation is an invariant
with respect to the index of the look-up table accessed in the i-th memory access of the
first round of AES. As seen in Algorithm 9.1, the i-th memory access to the look-up table
in the first round has the form s_i^{(0)} = x_i ⊕ k_i^{(0)}. Thus the time deviation is an invariant with
respect to s_i^{(0)} (i.e., D_i[s_i^{(0)}] is an invariant).
Now consider two timing profiles for the i-th byte (0 ≤ i ≤ 15): one from a known
key (denoted k̂_i^{(0)}) and another from an unknown key (denoted k̃_i^{(0)}). Further, D_i[s_i^{(0)}] is
an invariant, and there are two ways to obtain this invariant. Using k̂_i^{(0)}, the invariant can
be obtained from x̂_i ⊕ k̂_i^{(0)} for some x̂_i (0 ≤ x̂_i ≤ 255), and using k̃_i^{(0)}, the invariant can be
obtained from x̃_i ⊕ k̃_i^{(0)} for some x̃_i (0 ≤ x̃_i ≤ 255). Thus, with reference to the deviation
of time, D_i[x̃_i ⊕ k̃_i^{(0)}] = D_i[x̂_i ⊕ k̂_i^{(0)}]. With respect to the timing profiles, the invariant has
shifted from x̂_i in the known timing profile to x̃_i in the unknown timing profile. If this shift
can be determined, then the unknown key can be computed as k̃_i^{(0)} = x̃_i ⊕ x̂_i ⊕ k̂_i^{(0)}.
To determine the shift, the unknown key byte is guessed (say kguess) and, for every
possible value of x (0 ≤ x ≤ 255), a correlation coefficient is computed as follows:

CC_kguess = Σ_{x=0}^{255} D_i[x ⊕ k̂_i^{(0)}] × D_i[x ⊕ kguess]    (9.13)
There are thus 256 values of CC_kguess, which are arranged in decreasing order. The key
guess corresponding to the maximum value of CC_kguess is most likely the unknown secret
byte of the key (k̃_i^{(0)}).
Analysis of Bernstein's Attack: The attack works because, for a given platform and a given
i (0 ≤ i ≤ 15), the execution time varies depending on the value of
x_i ⊕ k_i^{(0)}. This variation in execution time is attributed to several causes, such as the number of
cache misses that occur during the encryption, interrupts, context switches, etc. However,
in the attack, we are only interested in the variation obtained after taking the average of
several encryptions. This measurement is not influenced by events that are non-periodic
with respect to time, provided an appropriate threshold is used to filter out excessively
large encryption times. The factors which cause variation in the average encryption time
are as follows.
• Conflict Misses. These cache misses occur due to memory accesses which evict
recently accessed data. Conflict misses can be non-periodic, but as pointed out by
Neve, Seifert, and Wang [273, 271], conflict misses can also be periodic in nature.
Periodic conflict misses cause cache misses in the same cache set at regular intervals of
time. This periodic form of conflict miss affects the average encryption time, because
it evicts the same data from the cache in every encryption.
• Micro-architectural Effects. There can be small variations in the time depending on
the micro-architecture of the system and the data being accessed. Bernstein gives
an example of recent stores to memory affecting the load time at certain
locations. Another example is loads from memory that cause conflicts in cache
banks. Further, micro-architectural components in cache memory, such as hardware
prefetchers, have also been found to affect information leakage.
Loaded Cache Variant: In [72], Canteaut, Lauradoux, and Seznec observe that the AES
execution behaves differently depending on the initial state of the cache. If the cache is
clean of any AES data before the start of encryption, then the timing side-channels mostly
detect cache misses. These variations in execution time are significant because data read
from RAM takes considerably more time than data read from the cache. In Bernstein's
attack, the cache is flushed of all AES data at the start of encryption. This is forced by
making system calls and by array manipulations. Both these operations can have a major
influence on the state of the cache.
If the encryption starts with a loaded cache, the variations in execution time no longer
depend on cache misses. The variations in time are due to conflict misses or the
micro-architecture of the processor. The variations due to the micro-architecture are more
subtle; therefore, Canteaut, Lauradoux, and Seznec provide a stronger variant of Bernstein's
attack for such models. The variant attack instead uses the correlation between the execution time
and the value of subsets of the bits of s_i^{(0)} (for 0 ≤ i ≤ 15).
Second Round Attack Variant: In theory, all bits of the whitening key can be obtained
from Bernstein's first round attack. This is unlike other first round cache attacks, where the
amount of information obtained about the whitening key is limited to a few most significant
bits of each byte. In practice, however, the result is no different from any other first round
cache attack: the recovered bits are limited just as in any other cache attack. In [271],
Neve, Seifert, and Wang provide a second round profiled cache-timing attack to reveal the
remaining key bits. The attack works by building templates based on s_i^{(1)} instead of s_i^{(0)}
for 0 ≤ i ≤ 15. Results show that all bits of the key are recovered with the same number of
timing measurements as the first round attack.
1. There should be variations in the way a crypto-system processes its inputs. These
variations must be manifested through the time required to execute an operation.
3. The adversary must be able to monitor the timing with an instrument powerful enough
to measure the variations in the signal carrying the timing information.
5. The attacker should know the design of the crypto-system. Note that in certain cases
the information needed can be learnt from the side-channel itself.
6. The attacker should have a synchronization signal to know when the crypto-system
starts and/or completes its processing.
If any of these criteria are not met, then timing attacks are prevented. It is therefore not
surprising that countermeasures try to make it difficult to fulfill one or more of these criteria.
Completely eliminating a criterion is possible, for example by removing the cache memory
from processor designs. However, this leads to significant performance degradation, which
far outweighs the security gains obtained. Thus the aim of practical countermeasures is to
strike the right balance between security and performance.
This section summarizes countermeasures for cache attacks from [288, 289, 280, 216,
298, 384, 121, 44, 72, 64, 98, 53, 402, 165, 245]. The techniques to counter timing attacks
form a subset of these. Depending on where they are applied, countermeasures can be
categorized as application layer or hardware layer countermeasures. The remainder of the
section discusses each category.
Implementations without Look-up Tables: For cache attacks, which rely on look-up tables
used in the cipher’s implementation, an obvious countermeasure would be to not use look-up
tables. One method is to design new ciphers which have no S-boxes, for example [200]. For
the general class of ciphers that do use S-boxes, implementations without look-up tables
would have a drastic effect on performance. An alternative implementation without
look-up tables is the use of bitslicing [48, 322], which allows multiple encryptions to be done
concurrently in a single processor. Bitslicing yields high encryption speeds; however, it is
restricted to certain non-feedback modes of operation, and therefore cannot be universally
applied.
In [313], we suggested a hybrid countermeasure for AES, which can protect against
cache attacks while providing the necessary performance. Our method is based on the
fact that attacks can exploit only the first, second, third, and last rounds of AES. The
countermeasure therefore implements these rounds without look-up tables, while
implementing the remaining intermediate rounds with look-up tables. The implementations
without look-up tables in the first, second, third, and last rounds shield the implementation
against cache attacks, while the look-up tables used in the intermediate rounds provide the
necessary performance.
Cache Warming: Load the contents of the look-up tables into the cache prior to encryption. As a
result, the accesses to the look-up table during the cipher execution would be cache hits.
In addition to the warming, the elements of the table should not be evicted during the
encryption [44]. This is not easy to guarantee.
Data-Oblivious Memory Access Pattern: Here the memory accesses are performed in an
order that is oblivious to the data passing through the algorithm. To
a certain extent, modern microprocessors provide such data-oblivious memory accesses by
reordering instructions. For example, four memory accesses to different locations, say A,
B, C, and D, could get executed in any of the 4! orders (say BCDA or DCAB). This
reordering increases the difficulty of trace-driven and certain time-driven and access-
driven attacks. However, the reordering is restricted to memory accesses which do not have
data dependencies. For block ciphers, such independent memory accesses are present within
a round but not across rounds, so the reordering only partially fulfills the data-oblivious requirement.
A naïve method to attain complete data-oblivious memory accesses is to read elements
from every memory block of the table, in a fixed order, and use just the one needed.
Another option is to add noise to the memory access pattern by adding spurious accesses,
for example by performing a dummy encryption in parallel to the real one. A third option
is to mask the sensitive table accesses with random masks that are stripped away at the
end of the encryption. Alternatively, the table can be permuted regularly to thwart attacks
using statistical analysis.
Generic program transformations are also present for hiding memory accesses [135].
However, there are huge overheads in performance and memory requirements. More practical
proposals have been developed using shuffling [431] and permutations [432].
Specialized Cache Designs have been proposed for thwarting cache attacks. They build
on the fact that information leakage is due to the sharing of cache resources, which leads to
cache interference. These solutions provide means of preventing access-driven attacks; their
effectiveness in blocking time-driven attacks has not yet been analyzed. In [298], Percival
suggests eliminating cache interference by modifying the cache eviction algorithms. The
modified eviction algorithms would minimize the extent to which one thread can evict data
from another thread.
In [290], Page proposed partitioning the cache memory: a direct-mapped cache
that can be dynamically partitioned into protected regions by the use of specialized
cache management instructions. By modifying the instruction set architecture and tagging
memory accesses with partition identifiers, each memory access is hashed into a dedicated
partition. Such cache management instructions are only available to the operating system.
While this technique prevents cache interference from multiple processes, the cache memory
is under-utilized due to rigid partitions. For example, a process may use very few cache lines
of its partition, but the unused cache lines are not available to another process.
In [402], Wang and Lee provide an improvement on the work by Page using a construct
called the partition-locked cache (PLCache), in which the cache lines of interest are locked in
the cache, thereby creating a private partition. These locked cache lines cannot be evicted by
cache accesses not belonging to the private partition. In the hardware, each cache
line requires additional tags comprising a flag to indicate whether the line is locked and an
identifier to indicate the owner of the cache line. The under-utilization of Page's partitioned
cache still persists, because the locked lines cannot be used by other processes even after
the owner no longer requires them.
Wang and Lee also propose a random-permutation cache (RPCache) in [402], which, as
the name suggests, randomizes the cache interference so that the difficulty of the attack
increases. The design is based on the fact that information is leaked only when cache
interference is present between two different processes. RPCache aims at randomizing such
interference so that no useful information is gleaned. The architecture requires an additional
piece of hardware called the permutation table, which maps the set bits in the effective address
to new set bits. These are then used to index the cache set array. Changing the
contents of the permutation table invalidates the respective lines in the cache. This
causes additional cache misses and a randomization of the cache interference. In [403],
Wang and Lee use an underlying direct-mapped cache and dynamically reprogrammable
cache mapping algorithms to achieve randomization. From a security perspective, this
technique is shown to be as effective in preventing attacks as RPCache, but with lower
performance overhead. In [204], Kong et al. show that partition-locked and random-
permutation caches, although effective in reducing performance overhead, are still vulnerable
to advanced cache attack techniques. They then go on to propose modifications to Wang
and Lee's proposals to improve the security [206, 205].
In [113], Domnitser et al. provide a low-cost solution to prevent access-driven attacks
based on the fact that the cipher evicts one or more lines of the spy data from the cache. The
solution, which requires small modifications of the replacement policies in cache memories,
restricts an application from holding more than a pre-determined number of lines in each
set of a set-associative cache. With such a cache memory, the spy can never hold all cache
lines in the set, so the probability that the cipher evicts spy data is reduced. By
controlling the number of lines that the spy can hold, a tradeoff between performance and
security can be achieved.
up a typical block cipher. For public-key ciphers, several instruction enhancements have
been suggested such as [137, 138, 395].
In [193], we suggested a new instruction to be added to the microprocessor ISA called
PERMS. This instruction allows any arbitrary permutation of the bits in an n bit word, and
can be used to accelerate diffusion layer implementations in block ciphers, thereby providing
high-speed encryptions.
Controlling Time: Current timing attacks require distinguishing between events by their
execution time, which requires highly accurate timing measurements. A popular
method for making these measurements is the RDTSC (read time stamp counter)
instruction, which is present in most modern processors. This instruction allows the
processor's time stamp counter (TSC) to be read. A typical measurement would look as
follows:
t1 = RDTSC();
function(); /* the operation that needs to be timed */
t2 = RDTSC();
t = t2 - t1;
return t;
Since the TSC is incremented every clock cycle, RDTSC can provide nanosecond-scale accuracy.
Denying the user access to the RDTSC instruction can prevent cache-timing attacks (both
time-driven and access-driven). However, the main drawback is that several benign applica-
tions (such as Linux kernels, multimedia games, and certain cryptographic algorithms [245])
rely on RDTSC. These applications would cease to function if the instruction were disabled. A
less stringent approach is therefore to fuzz the timestamp returned by the RDTSC instruction.
This is possible because, unlike timing attacks, benign applications do not require highly
accurate timing measurements. Time fuzzing can be done either by reducing the accuracy
of the measurement (such as by masking the least significant bits of the timestamp [280]) or
by reducing the precision of the measurement (such as by injecting noise into the times-
tamp [165]).
If t1 is the current value of the TSC, then masking returns ⌊t1/E⌋, where E is a time
duration called the epoch. Consequently, t has the form ⌊t2/E⌋ − ⌊t1/E⌋. The size of the epoch is
crucial. It should be large enough to protect against timing attacks, yet small enough to
not affect benign applications. In [394], Vattikonda et al. propose a way to bypass masking
schemes. The technique uses the fact that timing attacks do not need absolute timing
measurements; it is only required to distinguish between two or more events. In the
proposal, the adversary first loops continuously to synchronize with the start of an epoch,
denoted by t1e = ⌊t1/E⌋. Then the operation to be timed is called, and an RDTSC
instruction is invoked in a tight loop until the end of the epoch (detected by the change in the
RDTSC value). A counter c counts the number of RDTSC invocations made. The value of c
at the end of the loop can be used to distinguish between events executed by the operation,
as a smaller value of t2e = ⌊t2/E⌋ leads to a larger value of c.
c = 0;
/* Synchronize to the start of an epoch: spin while the masked value is unchanged */
t1e = RDTSC();
while(t1e == RDTSC());
t1e = RDTSC();
function(); /* the operation that needs to be timed */
/* Continuously scan the TSC until the end of the epoch */
t2e = RDTSC();
while(t2e == RDTSC()) c = c + 1;
return c and (t2e - t1e)
In the noise injection technique, a random offset in [0, E) is added to the timestamp
value returned by RDTSC. There are, however, two possible problems. The first is likely
to occur when multiple RDTSC instructions are invoked. The first invocation of RDTSC
returns a value of the form t1 + r1, while the second invocation returns a value of the form t2 +
r2, where t1 and t2 are the timestamp values and r1 and r2 are the random numbers
(0 ≤ r1, r2 < E). If the two invocations are done in quick succession, then t1 = t2, while it
is possible that r1 > r2. This gives the impression of time moving backwards. The second
drawback of the noise injection scheme is that the randomness is limited by the size of E. If
a sufficient number of time measurements are made and the average taken, the effect of the
injected noise can be eliminated.
In [245], Martin, Demme, and Sethumadhavan propose a scheme called Timewarp. The
scheme prevents the attack in [394] by restricting TSC reads to the end of an epoch.
Additionally, the RDTSC instruction is delayed by a random time (less than an epoch)
before the result is returned. Noise is further added to the TSC value to increase the
difficulty of the attack.
Even without the RDTSC instruction, it is possible to make highly accurate
timing measurements. For example, on multi-processor systems, a counter can be run in
a tight loop in a different thread and used to make sufficiently precise timing
measurements. Such counters are called virtual time-stamp counters (VTSCs). To prevent their use,
[245] proposes adding hardware to the processor which detects these loops
and injects random delays into them.
9.7 Conclusions
The chapter presents an overview of the memory hierarchy and the impact of cache memory
on improving the performance of processors. Subsequently, we deal with how the cache
leads to timing attacks against table-based software implementations of ciphers like
AES. The chapter presents a detailed overview of the three types of cache attacks, namely
trace-driven, access-driven, and time-driven. Finally, we present some discussion of possible
mitigations that can be adopted at either the application or hardware level against these
attacks.
[Figure: Experimental setup — a digital oscilloscope captures power traces from a SASEBO-GII board containing AES-128; the board receives the plaintext, returns the ciphertext, and provides a signal trigger to the oscilloscope.]
1. Design Under Test: The design under test for our experimentation is a standard
SASEBO board, which is used for evaluation. We provide a short description of the
SASEBO evaluation board here; more details can be found in [346]. There
are two FPGAs on this board: one is known as the control FPGA (Spartan-3A XC3S400A)
and the other as the cryptographic FPGA (Virtex-5 XC5VLX50). The control FPGA
contains the code for communication with the CPU [346]. It acts as a controller which
provides the input and the required control signals to the cryptographic FPGA. The cryptographic
FPGA contains the implementation of the cryptographic algorithm to be attacked.
Communication with the CPU is done through a USB cable. The plaintext is given to the
board by the CPU, and the ciphertext is sent back to the CPU for verification. Signals from the
board are fed to an oscilloscope to plot power traces.
The detailed diagram of SASEBO is shown in Fig. 10.2.
2. Oscilloscope: The oscilloscope is used for acquiring power traces. Upon receiving
the trigger signal, a power trace is captured and sent to the CPU. Triggering is very
important for obtaining power traces: it indicates the start of the encryption and helps
us identify the desired power traces. The oscilloscope should have a high sampling
rate. We use a Tektronix MSO4034B Mixed Signal Oscilloscope (2.5 GS/s, 350 MHz,
4 channels). We do not use any dedicated probes.
3. CPU: The heart of this setup is the CPU, which controls the whole setup along with
the operation of the other components. The interfaces between the CPU and the SASEBO
board, and between the CPU and the oscilloscope, are developed using C# code. Plaintext
is sent to the FPGA and ciphertext is received for verification. The power trace is
received from the oscilloscope and is used to attack the scheme.
(Figure: AES datapath on the cryptographic FPGA, showing SubBytes, ShiftRows, an inner-pipelining buffer of 1 clock (BUFFER 1), MixColumns, AddRoundKey with a 256-bit RoundKey, BUFFER 2 controlled by Addkey_enable, and BUFFER 3 controlled by Last_round producing the 256-bit ciphertext.)
A closer inspection shows the 10 rounds of encryption. Each of the 10 rounds corresponds
to the power consumption in a clock cycle with a time period of 500 ns (i.e., a 2 MHz clock
in the cryptographic FPGA of the SASEBO board). In Fig. 10.5 we show the zoomed view
of the last round of the encryption. The last-round power trace is used in the subsequent
attack, as one can develop a ciphertext-only attack using power analysis. Also, in AES-128,
recovering the last round key helps to recover the initial key of the cipher.
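The claim that the last round key reveals the master key can be checked directly: the AES-128 key schedule is invertible. The following sketch (our illustration, not code from the text) expands a key with the standard schedule and then reconstructs the master key from round-key 10 alone by running the schedule backwards.

```python
def _aes_sbox():
    # Build the AES S-Box from first principles: GF(2^8) inverse + affine map.
    def mul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= 0x11B        # AES irreducible polynomial x^8+x^4+x^3+x+1
            b >>= 1
        return r
    inv = [0] * 256
    for x in range(1, 256):
        for y in range(1, 256):
            if mul(x, y) == 1:
                inv[x] = y
                break
    rotl = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return [inv[x] ^ rotl(inv[x], 1) ^ rotl(inv[x], 2)
            ^ rotl(inv[x], 3) ^ rotl(inv[x], 4) ^ 0x63 for x in range(256)]

SBOX = _aes_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def _g(word, rnd):
    # RotWord + SubWord + round constant, as in the standard key schedule.
    w = [SBOX[b] for b in word[1:] + word[:1]]
    w[0] ^= RCON[rnd]
    return w

def expand(key):
    # Standard AES-128 key expansion into words w[0..43].
    w = [list(key[4 * i: 4 * i + 4]) for i in range(4)]
    for i in range(4, 44):
        t = _g(w[i - 1], i // 4 - 1) if i % 4 == 0 else w[i - 1]
        w.append([a ^ b for a, b in zip(w[i - 4], t)])
    return w

def recover_master_key(round10_key):
    # Walk the schedule backwards: w[i-4] = w[i] XOR f(w[i-1]).
    w = {40 + i: list(round10_key[4 * i: 4 * i + 4]) for i in range(4)}
    for i in range(43, 3, -1):
        t = _g(w[i - 1], i // 4 - 1) if i % 4 == 0 else w[i - 1]
        w[i - 4] = [a ^ b for a, b in zip(w[i], t)]
    return bytes(b for i in range(4) for b in w[i])
```

Expanding any 16-byte key and feeding the last four words back into `recover_master_key` returns the original key, which is why recovering the round-10 key in a last-round attack is as good as recovering the cipher key itself.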
FIGURE 10.6: Electronic Noise for 9th and 10th round of AES-128
A proper experimental setup should aim for fast power-trace acquisition over the huge
number of plaintexts required by the attacks. In order to obtain several power traces
without intervention, the captured power traces are sent online from the mixed signal
oscilloscope (MSO) to the PC as soon as they are captured. This facilitates on-the-fly
processing of power traces. The MSO also has 4 analog channels and 16 digital channels,
which facilitates on-chip debugging of the cryptographic design on the FPGA platform. The
TPP series passive probes of the MSO have an analog bandwidth of 1 GHz and less than
4 pF of capacitive loading. As these parasitic elements are small, the probes provide a
high-bandwidth path for the power signals from the SASEBO-GII FPGA board to the MSO,
leading to increased accuracy in the power trace measurements.
FIGURE 10.7: Different Power Consumption for 0 → 1 and 1 → 0 transition for CMOS
switch
and buses. On the contrary, they are incapable of estimating the power consumption of
combinational circuits, as the transitions there are unknown due to glitches.
In order to illustrate the above models, consider as an example a Linear Feedback Shift
Register (LFSR), which forms the basic building block of the keystream generators of
stream ciphers and is well suited for hardware implementation. The stream cipher using the
LFSR and a nonlinear function is shown in Fig. 10.8. The stream cipher was implemented
on a Xilinx XC3S400-5PQ208 FPGA, and the power traces were taken as the voltage drop
across a 1-ohm resistance placed between Vccint and GND. The power trace for 80
consecutive clock cycles after deasserting reset is shown in Fig. 10.9.
The power traces were taken at a resolution of 20 ns/pt. The Hamming Weight and
Hamming Distance estimates from the power models discussed above, computed for each
clock cycle after the start of the execution, are shown in Fig. 10.10(a) and Fig. 10.10(b).
The similarity of the Hamming Distance and Hamming Weight plots (Fig. 10.10(a) and
Fig. 10.10(b)) with Fig. 10.9 can be observed visually.
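The two power-model estimates are easy to compute from a register-level simulation. The sketch below (the LFSR width, taps, and seed are arbitrary illustrative choices, not the design from the text) clocks a Fibonacci LFSR and logs the Hamming Weight of the new state and the Hamming Distance between successive states for each cycle.

```python
def lfsr_profiles(state, taps, width, nclocks):
    """Clock a Fibonacci LFSR; return per-cycle Hamming Weight of the new
    state and Hamming Distance between successive states."""
    mask = (1 << width) - 1
    hw, hd = [], []
    for _ in range(nclocks):
        fb = 0
        for t in taps:                 # feedback = XOR of the tap bits
            fb ^= (state >> t) & 1
        nxt = ((state << 1) | fb) & mask
        hw.append(bin(nxt).count("1"))          # Hamming Weight model
        hd.append(bin(nxt ^ state).count("1"))  # Hamming Distance model
        state = nxt
    return hw, hd


# 80 clocks, mirroring the length of the measured trace in Fig. 10.9.
hw_profile, hd_profile = lfsr_profiles(0x8080, taps=(15, 13, 12, 10),
                                       width=16, nclocks=80)
```

Plotting `hw_profile` and `hd_profile` against the clock index gives curves of the kind shown in Fig. 10.10(a) and (b), which can then be compared against the measured trace.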
FIGURE 10.9: Power profile of the implemented stream cipher as obtained from the setup
FIGURE 10.10: (a) HD profile at each +ve clock edge, (b) Hamming Weight profile at each +ve clock edge; Initial Key: 0x8080_8080_8080_8080_8080
In this section, using the Hamming Distance power model, we discuss why a CMOS
gate can be subjected to power analysis. Consider a gate, denoted as y = f(x), where x
comprises the inputs. As an example, let us consider the 4 different energy consumptions
due to transitions of the gate, namely P0→0, P0→1, P1→0, and P1→1. Consider the
transitions of an AND gate, denoted as y = AND(a, b) = a ∧ b, where a and b are bits, in
Table 10.1. The energy levels, as stated before, are annotated in the fourth column. This
column can be used to estimate the average energy when the output bit is 0 or 1, namely
E(q = 0) and E(q = 1) respectively. We show that the power consumption of the device is
correlated with the value of the output bit. It may be emphasized that this observation is
central to the working of a DPA attack.
Observe that if the four transition energy levels are different, then in general |E(q =
0) − E(q = 1)| ≠ 0. This simple computation shows that if a large number of power traces
are accumulated and divided into two bins, one for q = 0 and the other for q = 1, the means
for the 0-bin and 1-bin computed, and then the difference-of-mean (DOM) computed, a
non-zero difference is expected. This forms the basis of Differential Power Analysis
(DPA), where the variation of the power of a circuit wrt. the data is exploited. This is more
powerful than the Simple Power Analysis (SPA) detailed in Chapter 7. In the following
section, we provide a discussion on the Difference-of-Mean (DoM) technique to perform a
DPA on a block cipher like AES.
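The AND-gate argument above can be checked numerically. In the sketch below the four transition energies are toy numbers chosen for illustration (they are not measurements from the text); traces are split into bins by the output bit q, and the difference of the bin means comes out clearly non-zero despite additive Gaussian noise.

```python
import random

# Assumed per-transition energies P_{prev->q} of the AND gate, arbitrary units.
E = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 2.5, (1, 1): 1.2}

def and_gate_energy(prev_q, q, sigma=0.2):
    """One noisy 'power sample' for an output transition prev_q -> q."""
    return E[(prev_q, q)] + random.gauss(0.0, sigma)

random.seed(1)
bin0, bin1 = [], []
prev = 0
for _ in range(20000):
    a, b = random.getrandbits(1), random.getrandbits(1)
    q = a & b                          # AND-gate output for random inputs
    (bin1 if q else bin0).append(and_gate_energy(prev, q))
    prev = q

# Difference of means between the q=1 bin and the q=0 bin: the DoM statistic.
dom = abs(sum(bin1) / len(bin1) - sum(bin0) / len(bin0))
```

Because the bin means differ, a wrong guess of q (i.e., a wrong key hypothesis in a real attack) mixes the bins and drives the DoM toward zero, which is exactly the distinguisher used in the next section.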
on the target bit is thus exploited to distinguish the correct key from the wrong ones. We
state the above attack more elaborately in the form of an algorithm.
Also, to observe the effect of noise, we superimpose the power traces discussed above with
Gaussian noise. As shown in Fig. 10.11(b), one can observe that the separation of the
correct key from the other ones happens more slowly, but is nevertheless clearly visible.
10.3.3 DOM and the Hamming Distance vs. Hamming Weight Model
These experiments show that DoM can be used as a distinguisher for the keys of a cipher,
the inherent reason being that the power profile of the device is data dependent. The
DoM technique is based on the correct classification of the power traces into two bins
depending on a function of the target bit. Assuming the power model is Hamming Weight,
we perform the classification based on the 0 or 1 value of a target bit. The attack outlined
previously was an example of such a hypothesis. The attack can be easily adapted to a
Hamming Distance power model, where the classification of the power traces is based on
the change or transition of the target bit across two successive clock cycles. The rest of
the attack remains the same.
FIGURE 10.11: Progress of Bias of keys for simulated Power Profiles of AES-128
(Fig. 10.12(a)). In fact, a closer look at the results shows that for the 3rd and 6th target
bits, DPA returns key 244 as the correct key. This could even trick an attacker into believing
that 244 might be the correct key. Thus we see that for 2500 traces DPA leads to indecisive
results. We therefore deploy a probable key analysis using the same number of traces.
(Figure panels, DPA bias (mV) plotted against time (µs) and key guess:)
(a) Target Bit 2, Correct key returned by DPA → 12 (labeled keys: 19, 12, 227, 86)
(b) Target Bit 3, Correct key returned by DPA → 244 (labeled keys: 244, 12, 76, 58)
(c) Target Bit 4, Correct key returned by DPA → 124 (labeled keys: 124, 12, 95, 117)
(d) Target Bit 6, Correct key returned by DPA → 244 (labeled keys: 91, 12, 244, 38)
FIGURE 10.12: AES Differential plots for some target bits of S-Box-15 with 2500 traces.
In order to depict the attack, we define the probable key matrix as a w × t matrix, where w
is the window size chosen for the attack and t is the number of target bits. The ith column
of the matrix is populated with the top w keys returned by a DoM-based DPA
targeting the ith bit chosen from the t target bits. The elements in the ith column are also
arranged in descending order of their DoM, and thus the first row of the matrix
represents the keys which have registered the highest DPA bias. In other words, these keys are
the ones that a classical DPA attack would have returned as correct keys. The window size
(w) for the experiment is 10, and hence we have a total of 10 probable keys in each column,
arranged in decreasing order of their DPA bias.
As the number of output bits of an AES S-Box is 8, we get a (10 × 8) probable key
matrix. Table 10.2 shows the probable key matrix for 2,500 traces. From the frequency
analysis (Table 10.3), we infer that key 12 is the most frequent key, with a frequency of
occurrence of 4. So by probable key analysis, key 12 is concluded to be the correct key.
TABLE 10.2: AES Probable Key Matrix for S-Box 15 with 2500 Traces
Bit1 Bit2 Bit3 Bit4 Bit5 Bit6 Bit7 Bit8
11 12 244 124 86 244 242 147
23 86 76 95 64 38 143 107
61 227 58 244 69 67 128 42
133 19 217 12 210 17 44 88
197 161 164 142 127 124 174 137
35 38 69 139 60 103 41 12
22 191 52 117 218 91 36 122
220 164 123 74 193 196 68 125
238 105 12 73 18 82 133 159
178 26 193 147 89 78 202 96
TABLE 10.3: Frequency Matrix for unmasked AES S-Box 15 with 2500 Traces
Key Freq. Key Freq. Key Freq. Key Freq. Key Freq. Key Freq. Key Freq.
12 4 96 1 41 1 73 1 105 1 142 1 210 1
244 3 11 1 42 1 74 1 107 1 143 1 217 1
38 2 17 1 44 1 76 1 117 1 159 1 218 1
69 2 18 1 52 1 78 1 122 1 161 1 220 1
86 2 19 1 58 1 82 1 123 1 174 1 227 1
124 2 22 1 60 1 88 1 125 1 178 1 238 1
133 2 23 1 61 1 89 1 127 1 191 1 242 1
147 2 26 1 64 1 91 1 128 1 196 1 Remaining
164 2 35 1 67 1 95 1 137 1 197 1 keys do
193 2 36 1 68 1 103 1 139 1 202 1 not occur
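The frequency analysis of Table 10.3 is just a count over the columns of Table 10.2. The sketch below transcribes the probable key matrix from Table 10.2 and lets a frequency count pick the PKDPA winner.

```python
from collections import Counter

# Probable key matrix of Table 10.2 (w = 10 rows, t = 8 target-bit columns).
probable_key_matrix = [
    [11, 12, 244, 124, 86, 244, 242, 147],
    [23, 86, 76, 95, 64, 38, 143, 107],
    [61, 227, 58, 244, 69, 67, 128, 42],
    [133, 19, 217, 12, 210, 17, 44, 88],
    [197, 161, 164, 142, 127, 124, 174, 137],
    [35, 38, 69, 139, 60, 103, 41, 12],
    [22, 191, 52, 117, 218, 91, 36, 122],
    [220, 164, 123, 74, 193, 196, 68, 125],
    [238, 105, 12, 73, 18, 82, 133, 159],
    [178, 26, 193, 147, 89, 78, 202, 96],
]

# Count how often each key candidate appears anywhere in the matrix
# (Table 10.3), then take the most frequent one as the PKDPA verdict.
freq = Counter(k for row in probable_key_matrix for k in row)
best_key, best_count = freq.most_common(1)[0]
```

Running this reproduces the conclusion in the text: key 12 occurs 4 times (key 244 occurs 3 times) and is declared the correct key.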
The choice of the window size is critical for the success of the attack. In the following we
discuss the relation between the choice of the window size and success of the attack through
a theoretical framework.
The probability of occurrence of any randomly chosen key in a column of the probable key
matrix is

Pr[Mi = 1] = w / 2^m, where 2^m is the number of guessed keys
For a random key, Pr[M < n/2] should be very high. Now, it can be noted that according
to PKDPA, if a key other than the correct key occurs n/2 or more times, then
it leads to an error. So, Pr[M ≥ n/2] gives the error probability of PKDPA. From the point
of view of probable key analysis, we want to keep Pr[M ≥ n/2] for a randomly chosen key
as low as possible. We then argue that, with the error probability significantly low for a
wrong key, if some key occurs n/2 or more times, then it must be the correct
key with a very high probability. This is because Pr[M ≥ n/2] is high for the correct key, as
it is correlated with all the target bits and hence tends to appear in all columns. Hence
we can use this to devise a good distinguisher between the correct key and wrong keys. In
the following paragraphs we draw a relation between the window-size parameter and the
error probability. We proceed by applying the multiplicative form of the Chernoff
bound [145], which is stated as follows:
Theorem 18 [145] Let (X1, X2, · · · , Xn) be independent random variables taking on
values 0 or 1, and assume that Pr(Xi = 1) = pi. If we let X = Σ_{i=1}^{n} Xi and µ be the
expectation of X, then for any δ > 0 we have

Pr[X ≥ (1 + δ)µ] ≤ ( e^δ / (1 + δ)^(1+δ) )^µ     (10.1)
We are interested in the Chernoff bound for the error probability Pr[M ≥ n/2]
of a random (or wrong) key. For the random variable M defined earlier, the expectation
µ = E(M) and δ > 0 are as below:

E(M) = (w / 2^m) n     (10.2)

δ = 2^(m−1)/w − 1     (10.3)

By Theorem 18, the error probability of a random key occurrence in the probable key
matrix is bounded by

Pr[M ≥ n/2] ≤ { e^(2^(m−1)/w − 1) / (2^(m−1)/w)^(2^(m−1)/w) }^((w / 2^m) n)     (10.4)
The R.H.S. of Equation 10.4 thus gives an upper bound on the error probability of a
randomly chosen (wrong) key passing the test. From the condition that δ > 0, we can easily
derive the upper bound for the window-size:

2^(m−1)/w − 1 > 0  ⇒  w < 2^(m−1)
We now give some theoretical results about the variation of the error probability with the
window-size. For AES we choose m = n = 8 and plot the upper bound on the error
probability given in Equation 10.4 for all possible values of the window-size (w < 2^(m−1))
in Fig. 10.13. We also plot the value of the confidence, defined as (1 − error
probability).
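The bound of Equation 10.4 is straightforward to evaluate numerically. The sketch below computes it for AES (m = n = 8) at a few window-sizes and confirms the trend discussed in the text: the error probability grows with w.

```python
import math

def pkdpa_error_bound(w, m=8, n=8):
    """Upper bound on Pr[M >= n/2] for a wrong key (Equation 10.4)."""
    mu = (w / 2 ** m) * n             # expectation, Equation 10.2
    delta = 2 ** (m - 1) / w - 1      # Equation 10.3; requires w < 2^(m-1)
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

# Error-probability bound at representative window-sizes for AES.
bounds = {w: pkdpa_error_bound(w) for w in (1, 10, 35, 100)}
```

The bound at w = 10 is still tiny, while at w = 35 it is already substantial, matching the practical observation below that the attack degrades sharply for w > 34.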
FIGURE 10.13: Error probability and Confidence vs. Window-size for the AES circuit
From Fig. 10.13 one can infer that as we keep increasing the window-size, the error
probability of PKDPA also increases, and it attains a very high value as w approaches
2^(m−1). In general we can say that the probability of error grows with the chosen
window-size:

Pr[M ≥ n/2]_{w1} > Pr[M ≥ n/2]_{w2},  if w1 > w2
Table 10.4 shows the error probabilities for AES w.r.t some discrete window-sizes. In
the next paragraph we study the effect of window-size on the number of traces required to
mount the attack.
The estimates of the error probability and confidence have been plotted in Fig. 10.14 w.r.t
the window-size. In addition, we plot the number of traces required to attack
AES S-Box 15 against the window-size (w). One can infer from the figure that the ideal size
of the window varies between 9 and 18; in this range the fewest traces are required to mount
PKDPA. If we increase w beyond 18, the trace count also increases rapidly. We also see that
beyond a certain value of the window-size, i.e., 34, there is a drastic increase in the number
of required traces. Indeed, our practical results show that for w > 34 the
probability Pr[M ≥ n/2] for wrong keys becomes significantly high. It can be seen that the
theoretical value of the error probability (ε) for w = 35 given in Table 10.4 is also high.
From our experimental results we have found that for all practical purposes w = 10 (AES)
yields good results.
The plots in Fig. 10.13 imply that the probability of error with window-size w = 1 is
FIGURE 10.14: Error probability, Confidence and # of Traces (S-Box - 15) vs. Window-
size for AES
the least and steadily increases for w > 1. It is interesting to note that w = 1 actually
represents classical DPA, where we look only for the key with the highest DPA bias. From
Table 10.4 we can see that the probability of error (ε) is significantly low for w = 1.
However, if we choose w = 10 (AES) then ε still remains very low. Conversely, we can say
that the level of confidence (1 − ε) in classical DPA is very high, while for PKDPA it is still
considerably high. In practice, we see that by slightly lowering the level of confidence we are
able to considerably reduce the number of traces required for a successful attack. However,
one should be careful in choosing the window-size, as a very high window-size could
lead to a drastic increase in the required number of traces.
There are several other improvements and variants of Differential Power Attacks.
The most commonly known is the Correlation Power Attack (CPA), where the
correlation coefficient is used as a statistical distinguisher to separate the correct key from
the wrong ones. In the next section, we provide a detailed discussion of the method.
columns of the matrix H and those of the matrix T. The similarity is typically computed
using the Pearson’s Correlation coefficient as defined in details below.
(Figure: the sixteen state registers R0–R15 of the AES datapath; toggling in the registers is measured by the Hamming Distance of the initial and final values.)
FIGURE 10.15: Computing Hamming Distance in the Registers due to Last Round Trans-
formation of AES
In the following, we detail the computation of the hypothetical power trace. The attacker
targets a specific register (see Fig. 10.15) and observes the corresponding ciphertext byte,
Cipher. As described before, to compute the hypothetical power value the attacker needs to
apply the Inverse ShiftRows to obtain the corresponding byte position which affects this
register position.
TABLE 10.5: Target Ciphertext Byte wrt. Register Position to Negotiate Inverse Shift
Row
Register Position 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Table 10.5 shows the mapping between the target register position and the ciphertext
byte which is to be traced back through the last round to find the previous value of the register.
We call this ciphertext byte SCipher. In order to trace back this byte, the attacker guesses
the key byte, denoted key, and obtains InvSubByte[SCipher ⊕ key]. The Hamming Distance
between this traced-back byte and the present ciphertext byte Cipher provides an estimate
of the hypothetical power.
result[i][j] = Σ_{k=0}^{NSample} (hPower[i][k] − meanH[i])(trace[j][k] − meanTrace[j]) / √( Σ_{k=0}^{NSample} (hPower[i][k] − meanH[i])² · Σ_{k=0}^{NSample} (trace[j][k] − meanTrace[j])² )
Algorithm 10.3 summarizes the steps to compute the correlation matrix and perform the
correlation power attack. It may be noted that i denotes the candidate key, and j represents
a time instant in the power trace at which the correlation is computed. It is expected that
after a sufficient number of traces, the correct key will have a significant peak in its
correlation plot wrt. the time interval of the encryption.
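The attack just described can be sketched end to end on simulated traces. The model below is an illustrative simplification (a single state byte, Hamming-Distance leakage plus Gaussian noise, ShiftRows omitted since only one byte is handled); the hypothesis and Pearson correlation follow the text.

```python
import random

def _aes_sbox():
    # AES S-Box computed from the GF(2^8) inverse plus the affine map.
    def mul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= 0x11B
            b >>= 1
        return r
    inv = [0] * 256
    for x in range(1, 256):
        for y in range(1, 256):
            if mul(x, y) == 1:
                inv[x] = y
                break
    rotl = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return [inv[x] ^ rotl(inv[x], 1) ^ rotl(inv[x], 2)
            ^ rotl(inv[x], 3) ^ rotl(inv[x], 4) ^ 0x63 for x in range(256)]

SBOX = _aes_sbox()
INV_SBOX = [0] * 256
for i, s in enumerate(SBOX):
    INV_SBOX[s] = i

hw = lambda x: bin(x).count("1")

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

random.seed(7)
TRUE_KEY = 0x3C
ciphers, traces = [], []
for _ in range(3000):
    prev = random.randrange(256)       # register value before the last round
    c = SBOX[prev] ^ TRUE_KEY          # last round: SubBytes + AddRoundKey
    traces.append(hw(prev ^ c) + random.gauss(0.0, 1.0))  # HD leakage + noise
    ciphers.append(c)

def hypothesis(key):
    # HD(InvSubBytes(Cipher XOR key), Cipher), as in the text.
    return [hw(INV_SBOX[c ^ key] ^ c) for c in ciphers]

best_key = max(range(256),
               key=lambda k: abs(pearson(hypothesis(k), traces)))
```

For the correct key guess the hypothesis equals the leaked Hamming Distance exactly, so its correlation dominates all 255 wrong guesses and `best_key` recovers the key byte.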
FIGURE 10.16: Progress of Bias of keys for CPA on simulated Power Profiles of AES-128
The efficiency of a CPA analysis is often measured by several metrics as published in the
literature. In the following section, we provide an overview on some of the most important
metrics, namely Guessing Entropy and Mean Time to Disclosure.
which the required k may belong. While the above definition for a given order is fixed wrt.
the remaining work load, the following definition of guessing entropy provides a more flexible
definition for the remaining work load. It measures the average number of key candidates
to test after the attack. We formally state the definition.
The Guessing Entropy of the adversary AEk ,L against a key class variable S is defined as:
In order to evaluate the Guessing Entropy (GE), we observe the ranks of the various
guessed key values in the guessing vector. For the experiments on CPA with the simulated
power traces (both without and with noise), in Fig. 10.17 we plot the ranking of the correct
key wrt. the wrong keys. The plots show that the rank of the correct key quickly drops to
zero; the guessing entropy thus reduces to 0 at that point, and the key is identified
after a certain number of traces. The figures show that with noise the Guessing Entropy
reaches zero more slowly, i.e., after more traces, than when the power traces are noise-free.
FIGURE 10.17: Guessing Entropy Plots for CPA on Simulated Power Traces
The above experiments on CPA were based on simulated power traces, and thus do
not exactly capture the real-life scenario. For completeness, in the following section we
present an overview of performing correlation power analysis using real-life power
traces.
dk = Corr(aFk∗ (X) + N, Fk (X)) for k ∈ K and returns the key k as the correct key such
that dk is maximum.
In most practical scenarios, the point of interest t∗ is not known beforehand. Thus,
in practice, DPA attacks are multivariate in nature, i.e., they take the leakages of multiple
sample points as input. The most common form of multivariate DPA attack applies a
univariate distinguisher to each of the sample points and then simply chooses the best result
among those. However, in a different strategy, the attacker sometimes uses a multivariate
distinguisher which jointly evaluates the power consumption at multiple sample points.
Such multivariate distinguishers are common in profiling attacks like the Template attack [86]
and the Stochastic attack [350, 183]. However, non-profiling attacks are vulnerable to a
decrease in the success rate when the output of a sample point with a high signal component
relative to the noise (often measured using the metric SNR) is combined with that of sample
points with a low signal-to-noise ratio. Thus, the definition of SNR and the ability to
measure it can be quite handy in evaluating DPA: for developing improvements of the
classical DPA and also for assessing the vulnerability of a given architecture against these
attacks.
Consider an AES implementation running on a hardware platform. We compute the
target variable S by taking the 128-bit Hamming Distance between the ciphertext and
the input to the previous round. We investigate the behaviour of the leakages of an AES
implementation over a range of sample points due to the computation of the intermediate
variable. For example, we examine the signal-to-noise ratio for 20,000 power traces over
300 sample points around the register update for the last round.
Recall that the power leakage is Lt = aS + N, where aS is the deterministic data-
dependent part of the power targeted by CPA. The overall power consumption of the
device can be depicted using the (conceptual) frequency distribution in Fig. 10.18. It
can be seen that the overall power consumption is centered around distinct voltage levels,
which indicate the various Hamming Distance classes, denoted by S. However, because of
the expected Gaussian noise N, the measurements are spread around each of these voltage
levels.
(Figure 10.18: conceptual frequency distribution of the power consumption, centered around the Hamming Distance classes S0, S1, S2, . . .)
The variation of the data-dependent leakage is computed as Var(E[Lt |S]), which indicates
the variations in leakage due to the target S at sample point t. It is intuitive to see that
this statistic captures the deterministic component of the power leakage and is useful to the
attacker (the more the variation, the more the detectability!). It may thus be stated that
variations in this parameter indicate the leakage of the target variable S; Var(E[Lt |S])
captures the signal content of a power trace. The noise component of the power consumption
can be computed as (Lt − E[Lt |S]). It may be observed in Fig. 10.18 that each of
the normal distributions, which are due to the noise, starts overlapping as this variance
increases. Thus the variance of the noise distribution, computed as Var(Lt − E[Lt |S]),
captures the noise of the power traces at instant t. This leads to the following definition of
the signal-to-noise ratio (SNR) of a power trace.
The ratio SNR thus quantifies the amount of information at a sample point and is
computed as:

SNR_t = Var(E[Lt |S]) / Var(Lt − E[Lt |S])
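The definition above translates directly into an estimation procedure. The sketch below (an illustrative model: single-byte Hamming-Distance classes 0..8, leakage Lt = S + N with assumed noise σ = 0.5) groups simulated samples by class, takes Var(E[Lt|S]) as the signal and Var(Lt − E[Lt|S]) as the noise, and forms their ratio.

```python
import random

random.seed(3)

# Simulated leakage: class S is an 8-bit Hamming Distance (0..8),
# leakage L = S + N with Gaussian noise (a = 1, sigma assumed 0.5).
samples = []
for _ in range(20000):
    s = sum(random.getrandbits(1) for _ in range(8))
    samples.append((s, float(s) + random.gauss(0.0, 0.5)))

# Group leakages by class and compute the per-class means E[L_t | S].
by_class = {}
for s, l in samples:
    by_class.setdefault(s, []).append(l)
means = {s: sum(v) / len(v) for s, v in by_class.items()}

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

signal = var([means[s] for s, _ in samples])      # Var(E[L_t | S])
noise = var([l - means[s] for s, l in samples])   # Var(L_t - E[L_t | S])
snr = signal / noise
```

With this model the signal variance is about 2 (the variance of an 8-bit Hamming Distance) and the noise variance about 0.25, so the estimated SNR comes out near 8.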
The SNR is a very useful metric for assessing the threat of a power attack on a given
implementation. The power consumption of the device can be denoted as Lt = Pdet + N,
where Pdet is the deterministic component of the power and N is the noise. The correlation
between the total leakage Lt and the hypothetical power value for the ith key Hi can then
be written as Corr(Hi, Lt) and expanded as:

Corr(Hi, Lt) = Corr(Hi, Pdet + N)
= Cov(Hi, Pdet + N) / √( Var(Hi)(Var(Pdet) + Var(N)) )
= [ E(Hi·(Pdet + N)) − E(Hi)E(Pdet + N) ] / [ √(Var(Hi)Var(Pdet)) · √(1 + Var(N)/Var(Pdet)) ]
= Corr(Hi, Pdet) / √(1 + 1/SNR)
In the above equations, Cov and E denote the covariance and expectation of random
variables. Also note that we have used the fact that Pdet and N are statistically
independent. The above result can be used to evaluate an architecture using simulated traces.
The simulations, as described previously, are based on a power model and derived with the
knowledge of the key. In such cases there is no noise, and the CPA computes the correlation
Corr(Hi, Pdet). The real correlation can then be computed from the knowledge of the
SNR using the above equation.
Having defined the SNR of the power traces, we investigate the performance of CPA
on a few architecture topologies. We start with the iterative version. The following results
are, however, based on actual power traces.
(Figure: iterative AES architecture; the plaintext enters in the 1st round, a single combinational block computes AES round i, and a register holds the state from the 1st round through the last round.)
and correlated with the hypothetical power values, which again depend on the key. In this
architecture, as all the S-Boxes are in parallel, when we target a specific S-Box, the 15
others are also operational.
FIGURE 10.20: Plot of Guessing Entropy for CPA on Parallel Architecture for AES
They add to the algorithmic noise for the attack. To emphasize this point, we call
this architecture a parallel architecture, as opposed to the serialized architecture (in the
next section). We acquire 70,000 power traces divided into sets of 3,000 traces each,
perform the CPA attack on them, and compute the average guessing entropy. The
corresponding plot is illustrated in Fig. 10.20.
(Figure: serialized AES architecture; a 128-bit state register feeds one byte at a time, via a 16-to-1 state-byte select, through a single shared 8-bit S-Box, and a substitution/diffusion select multiplexer chooses between the S-Box path and the 128-bit diffusion layer before updating the state register.)
FIGURE 10.22: Plot of Guessing Entropy for CPA on Serialized Architecture for AES
FIGURE 10.23: Frequency Plot of Signal Component of Power Traces for the zeroth
S-Box of Serialized Architecture for AES
tiplexer in Fig. 10.21 arbitrary. However, shuffling schemes have their limitations, as the
randomness provided is quite limited. In particular, if one observes the total power
consumed over the 16 clock cycles for the S-Box operations, one can still perform a CPA attack.
However, compared with the serialized implementation the SNR is lower, making it
harder to perform a CPA. We compare the above architectures by measuring their relative
SNRs, as discussed below.
For computing the SNR for the serialized architecture, we observe the power trace during,
say, the zeroth S-Box computation. Then, with the knowledge of the corresponding key byte
(note that we are computing the SNR for an evaluation and hence are aware
of the key), we compute the Hamming Distance corresponding to the zeroth S-Box. Note
that since we are targeting a particular S-Box, there can be 9 Hamming classes: from 0 to
8. We divide the power traces into these 9 classes and compute
the average of each class, pi. We also note the frequency of each class, i.e., the number of
traces falling into each of the Hamming classes, fi. The averaging is expected to remove
the effect of noise, and hence to give a measure of the signal. We compute the
variance using the standard formula:

Var(E[Lt |S]) = Σ_{i=0}^{8} fi pi² / Σ_{i=0}^{8} fi − ( Σ_{i=0}^{8} fi pi / Σ_{i=0}^{8} fi )²

Now the average is deducted from the power traces, thus removing the signal and providing the noise
distribution. We perform similar computations as for the signal to find the variance of the
noise. The ratio is finally computed to provide the SNR.
For the shuffled scheme, we compute the SNR for the total power of all the 16 S-Box
computations. This increases the SNR, as it can be checked that if a specific S-Box is
targeted, the SNR is theoretically expected to be (1/256)th of that of the serialized
implementation. For our computations of the SNR, we sum the power values in the traces
over the sample points corresponding to each of the sixteen cycles, and these sums are then
divided into the Hamming classes.
It may be noted that since we are considering the total power for all the S-Boxes, the
Hamming classes vary from 0 to 128. The same technique as above is applied for computing
the variance of the signal and of the noise to compute the SNR.

FIGURE 10.24: Comparison of SNR values of the Serialized Architecture for AES vs. the
Shuffled Architecture

Fig. 10.24 shows the SNR values of both architectures, indicating that the average SNR
for the shuffled architecture is almost one-third of that of the serialized architecture.
We also show the guessing entropy of a CPA attack mounted on the shuffled architecture
in Fig. 10.25. This may be compared with that of the serialized architecture, where the
guessing entropy reduced to 0 quite fast, indicating increased vulnerability
against CPA.
FIGURE 10.25: Plot of Guessing Entropy for CPA on Shuffled Architecture for AES
(averaged over all the S-Boxes)
the popular techniques. The other line of countermeasures builds circuits using logic
styles with data-independent power consumption. However, these methods are often
costly and not in accordance with standard CAD methodology.
The most popular approach is to randomize the intermediate results, which are computed
using conventional gates. The assumption is that the power consumption
of the device is uncorrelated with the actual data, as the data are masked with a random
value. The popularity of masking over other techniques stems from the fact that it can be
applied at the algorithmic level or at the gate level and does not rely on specially designed
gates. Further, an unmasked digital circuit can be converted into a masked version
in an automated fashion.
In masking, every intermediate value which is related to the key is concealed by a random value m, called the mask. Thus, we transform the intermediate value v as vm = v ∗ m, where m is randomly chosen and varies from encryption to encryption. The attacker does not know the value of m. The operation ∗ could be exclusive-or, modulo addition, or modulo multiplication. Boolean masking is the term used when ⊕ is the operation concealing the intermediate value. If the operation is addition or multiplication, the masking is referred to as arithmetic masking.
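As a quick illustration, both flavors of masking can be sketched in a few lines of Python (the function names and the 8-bit width are illustrative, not from the text):

```python
import secrets

def boolean_mask(v, bits=8):
    """Boolean masking: conceal v with a fresh random mask using XOR."""
    m = secrets.randbits(bits)        # new mask for every encryption
    return v ^ m, m

def arithmetic_mask(v, modulus=256):
    """Arithmetic masking: conceal v using addition modulo 2^8."""
    m = secrets.randbelow(modulus)
    return (v + m) % modulus, m

v = 0xA7
vm, m = boolean_mask(v)
assert vm ^ m == v                    # unmasking recovers v
vm, m = arithmetic_mask(v)
assert (vm - m) % 256 == v
```

The device only ever handles vm and m, never v itself, so a first-order power model sees values that are uncorrelated with v.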
Masking can be broadly classified into two types depending on whether it is applied at
the algorithmic level or at the gate level.
As discussed before in Chapter 4, a compact realization of the AES S-Box can be achieved by using composite fields. The transformation from GF(2^8) to GF((2^4)^2) is again linear, and thus can be masked using XORs quite conveniently. We consider the masking of the Galois field inverse in composite fields.
Recall the circuit description in Chapter 4. Let the irreducible polynomial of an element in GF((2^4)^2) be r(Y) = Y^2 + Y + µ (note we assume τ = 1). Let an element in the composite field be γ = γ1 Y + γ0 and let its inverse be δ = (γ1 Y + γ0)^−1 = (δ1 Y + δ0) mod (Y^2 + τY + µ). Thus, the inverse of the element is expressed by the following equations:
δ1 = γ1 d′ (10.7)
δ0 = (γ1 + γ0) d′ (10.8)
d = γ1^2 µ + γ1 γ0 + γ0^2 (10.9)
d′ = d^−1 (10.10)
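Equations 10.7–10.10 can be verified exhaustively with a minimal Python sketch. We assume GF(2^4) is built with the polynomial x^4 + x + 1 (the book's actual field polynomial may differ) and search for a constant µ that makes Y^2 + Y + µ irreducible; all helper names are ours:

```python
def gf16_mul(a, b):
    """Multiply in GF(2^4) with reduction polynomial x^4 + x + 1 (assumed)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13          # reduce by x^4 + x + 1
    return r

def gf16_inv(a):
    """Inverse via Fermat: a^15 = 1, so a^-1 = a^14."""
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r

# Y^2 + Y + mu is irreducible iff mu is not of the form g^2 + g
squares_plus = {gf16_mul(g, g) ^ g for g in range(16)}
mu = next(m for m in range(1, 16) if m not in squares_plus)

# check Equations 10.7-10.10: (g1 Y + g0)(d1 Y + d0) mod (Y^2 + Y + mu) = 1
for g1 in range(16):
    for g0 in range(16):
        if g1 == 0 and g0 == 0:
            continue
        d  = gf16_mul(gf16_mul(g1, g1), mu) ^ gf16_mul(g1, g0) ^ gf16_mul(g0, g0)
        dp = gf16_inv(d)
        d1 = gf16_mul(g1, dp)
        d0 = gf16_mul(g1 ^ g0, dp)
        y_coeff = gf16_mul(g1, d1) ^ gf16_mul(g1, d0) ^ gf16_mul(g0, d1)
        const   = gf16_mul(mu, gf16_mul(g1, d1)) ^ gf16_mul(g0, d0)
        assert y_coeff == 0 and const == 1
```

Since Y^2 ≡ Y + µ, the product reduces to (γ1δ1 + γ1δ0 + γ0δ1)Y + (µγ1δ1 + γ0δ0), which the loop checks equals 0·Y + 1 for every nonzero γ.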
Next we consider the masking of these operations. The masked input is thus (γ1 + mh)Y + (γ0 + ml), of which the inverse is to be computed such that the outputs of Equations 10.7–10.10 are also masked by random values m′h, m′l, md, m′d, respectively.
Let us consider the masking of Equation 10.7. However, we may note that in that case the computation uses terms like d′, which we should also mask. Thus one has to take care when adding correction terms that no intermediate value is correlated with values which an attacker can predict.
We thus mask d′ in the correction term as follows: (γ1 + mh)m′d + mh(d′ + m′d) + mh m′d + m′h.
We thus have:
Likewise, one can derive the remaining two equations (Equations 10.8 and 10.9) in the masked form. For Equation 10.9 (writing p0 for the constant µ), we have:
d + md = γ1^2 p0 + γ1 γ0 + γ0^2 + md
= (γ1 + mh)^2 p0 + (γ1 + mh)(γ0 + ml) + (γ0 + ml)^2 + (γ1 + mh)ml + (γ0 + ml)mh + mh^2 p0 + ml^2 + mh ml + md
= fd((γ1 + mh), (γ0 + ml), p0, mh, ml, md)
Masking Equation 10.10 involves masking an inverse operation in GF(2^4). Hence the same masking operations as above can be applied while reducing the inverse to one in GF(2^2). Thus, we can express an element δ of GF(2^4) as δ = Γ1 Z + Γ0, where Γ1 and Γ0 ∈ GF(2^2). Interestingly, in GF(2^2) the inverse is a linear operation, since x^−1 = x^2 and squaring is linear in characteristic 2, which makes masking easy! Thus we have (Γ + m)^−1 = Γ^−1 + m^−1. This reduces the gate count considerably.
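The linearity claim is easy to check exhaustively. The sketch below builds GF(2^2) with the polynomial x^2 + x + 1 and uses the fact that x^−1 = x^2 (since x^3 = 1 for nonzero x); the function names are ours:

```python
def gf4_mul(a, b):
    """Multiply in GF(2^2) with reduction polynomial x^2 + x + 1."""
    r = 0
    for _ in range(2):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 4:
            a ^= 0b111         # reduce by x^2 + x + 1
    return r

def gf4_inv(a):
    """In GF(2^2), x^-1 = x^2 (with 0 mapped to 0)."""
    return gf4_mul(a, a)

# squaring, and hence inversion, is additive in characteristic 2:
for g in range(4):
    for m in range(4):
        assert gf4_inv(g ^ m) == gf4_inv(g) ^ gf4_inv(m)
# and it really is the multiplicative inverse for nonzero elements:
for g in range(1, 4):
    assert gf4_mul(g, gf4_inv(g)) == 1
```

Because inversion distributes over ⊕, a masked inverse needs no correction terms at this level: the circuit simply inverts the masked share and the mask separately.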
Definition 10.8.1 An algorithm that evaluates an encryption function enc is order-d perfectly masked if for all d-tuples I1, . . . , Id of intermediate results we have:
Dx,k (R) = Dx′,k′ (R) for all pairs (x, k), (x′, k′)
Thus, for the above masking scheme, we need to argue that all the data-dependent intermediate operations fulfill Definition 10.8.1. The intermediate values in the above masking are masked data (a + ma), masked multiplications (a + ma)(b + mb), multiplications of masked values with masks (a + ma)mb, and masked squarings (a + ma)^2 and (a + ma)^2 p. The proof presented in [282] is divided into two parts: first it shows that all the above intermediate computations lead to outputs whose distributions are independent of the inputs, and second that the summation of the intermediate states is done securely.
We state the results without proofs. The distribution of (a + ma)(b + mb) is independent of a and b. We call this the random product distribution. As a special case, when b = 0, we have (a + ma)mb, which is independent of a.
The above results show that the masking scheme performs operations which are secure, in the sense that they produce distributions which are independent of the input, provided the masks are chosen independently and uniformly. However, when we combine the above masked values, more intermediate computations are performed. We need to ensure that the composition is also secure. The following result provides such a guarantee.
One should note that the independence holds for all values of i. Thus, to achieve independence, every summation of variables must start with the addition of an independent mask M. This result shows that the order in which the terms are added is important.
[Figure: datapath of the masked composite-field inversion, showing the blocks fd, fd′ (the masked GF(2^4) inverse), fγ1 and fγ0; the masked inputs are γ1 + mh and γ0 + ml, the masks are mh, ml, md, m′d, and the masked outputs are d + md, d^−1 + m′d, δ1 + m′h and δ0 + m′l]
ensure the prevention of any accidental unmasking of data. The following reuses of masks are suggested to simplify the circuitry.
In order to reduce the complexity of fγ1, the following reuse of masks is performed. Using m′d = ml and m′h = mh, we have the following:
For the sake of reducing the gate count of the circuit for fγ0, we can choose m′l = ml and m′d = mh = m′h. Thus we have:
In the following we detail the time steps in which the above intermediate computations are performed for overall security. We divide the intermediate computations into masked S-Box computations and correction terms, to stress that while the former are computations which are performed anyway in the unmasked S-Box circuit, the latter are extra correction steps and add to the overhead of the computation.
qm = q ⊕ mq
= (ab) ⊕ mq
= (am ⊕ ma)(bm ⊕ mb) ⊕ mq
= am bm ⊕ bm ma ⊕ am mb ⊕ ma mb ⊕ mq
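This bit-level masked AND can be checked exhaustively in a few lines of Python. The summation below starts from the fresh output mask mq, in line with the secure ordering discussed next (the function name is ours):

```python
from itertools import product

def masked_and(am, bm, ma, mb, mq):
    """Return qm = (a AND b) XOR mq, computed only from masked shares.
    The accumulation starts from the fresh mask mq so that every
    partial sum stays independent of the unmasked data."""
    qm = mq
    qm ^= ma & bm      # P2
    qm ^= am & bm      # P1
    qm ^= am & mb      # P3
    qm ^= ma & mb      # P4
    return qm

# exhaustive correctness check over all 32 input combinations
for a, b, ma, mb, mq in product((0, 1), repeat=5):
    am, bm = a ^ ma, b ^ mb
    assert masked_and(am, bm, ma, mb, mq) ^ mq == a & b
```

Note that the function never touches a or b directly; only the shares am, bm and the masks appear as operands.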
However, it should be noted that the order of the computations performed is of extreme importance. The correct order of performing the computations is as follows:
The ordering ensures that the unmasked values are not exposed during the computations. Further, it should be emphasized that one cannot reuse the mask values. For example, one may attempt to make one of the input masks, ma, the same as the output mask, mq. While this may seem harmless, it can defeat the purpose of the masking. We discuss this aspect in detail below.
In the above product, let us denote the intermediate terms as P1 = am bm, P2 = ma bm, P3 = am mb, P4 = ma mb. First, observe that adding any two of these products gives a data-dependent distribution. For example, consider P1 + P2 = a bm. This distribution depends on a, since a = 0 gives the constant zero distribution, while a ≠ 0 gives a uniform distribution. Likewise, P1 + P3 = b am and P2 + P4 = b ma, which are both dependent on b for their distributions.
Consider P1 + P4 = am bm + ma mb = (a + ma)(b + mb) + ma mb = a bm + ma b. This distribution depends on both a and b: fixing a = b = 0 gives the zero distribution, else we have a uniform distribution. This makes the distribution data dependent, and thus the combination does not work. Likewise, P2 + P3 does not work, as the resulting value is a mb + ma b, which again depends on a and b.
Hence, the only way of adding the terms seems to be by adding the output mask.
When we attempt to reuse the masks by setting mq = ma, we have P1 + ma = a bm + ma(bm + 1). The resulting distribution depends on a: when a = 0, we have the random product distribution, else when a ≠ 0, we have the uniform distribution. Similarly, P3 + ma = a mb + ma(mb + 1) is also data dependent.
So, we start with P2 + ma or P4 + ma. For the first choice, we observe P2 + ma = ma bm + ma = ma(bm + 1), which is a random product distribution. Next, adding P4 we obtain (P2 + ma) + P4 = ma(b + 1), which depends on b. Trying (P2 + ma) + P3 = a mb + ma(b + 1) is again data dependent; this can easily be observed, as setting a = 0 and b = 1 gives the zero distribution, else we have a uniform distribution.
However, trying (P2 + ma) + P1 = ma(bm + 1) + am bm = a bm + ma, which is a uniform distribution and is data independent. As we still need to combine P3 and P4, we attempt either of the following:
((P2 + ma) + P1) + P3 = a bm + ma + am mb
= ab + ma(mb + 1)
((P2 + ma) + P1) + P4 = a bm + ma + ma mb
= a bm + ma(mb + 1)
The former distribution depends on q = ab: when q = 0 we get the random product distribution, else when q ≠ 0 we get a different distribution depending on the value of q. The latter is also data dependent: when a = 0 we get the random product distribution, else we get the uniform distribution.
This shows that in the above masking of the AND gate, although no individual term is data dependent, there is no way to add them without producing an intermediate distribution that is data dependent. This shows the importance of Lemma 4, which tells us to add a uniform and independently chosen mask (in this case the mask mq) first, to make the addition of the intermediate terms always data independent.
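These distribution arguments can be verified by brute force over the masks. The sketch below enumerates the distribution of a partial sum for fixed (a, b) over all uniform mask values (the helper names are ours):

```python
from itertools import product
from collections import Counter

def partial_sum_dist(term_fn, a, b):
    """Distribution of a partial sum over all uniform mask values."""
    c = Counter()
    for ma, mb, mq in product((0, 1), repeat=3):
        am, bm = a ^ ma, b ^ mb
        c[term_fn(am, bm, ma, mb, mq)] += 1
    return c

# P1 + P2 = a.bm: its distribution differs between a = 0 and a = 1
bad = lambda am, bm, ma, mb, mq: (am & bm) ^ (ma & bm)
assert partial_sum_dist(bad, 0, 0) != partial_sum_dist(bad, 1, 0)

# starting from the fresh mask mq keeps the prefix data independent
good = lambda am, bm, ma, mb, mq: mq ^ (ma & bm) ^ (am & bm)
assert partial_sum_dist(good, 0, 0) == partial_sum_dist(good, 1, 1)
```

For `bad`, the a = 0 case collapses to the constant zero distribution while a = 1 is uniform; for `good`, the independent mask mq makes every prefix uniform regardless of (a, b).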
normal n-bit (or 1-bit) XOR gates. Also, it may be observed that the multipliers (or AND gates) operate pairwise on (am, bm), (bm, ma), (am, mb) and (ma, mb). The elements of each pair have no correlation with each other (if the mask values are properly generated) and are independent of the unmasked values a, b and q. As discussed in Section 10.2.3, one can obtain a transition table and compute the expected energy for generating q = 0 and q = 1. The gate now has 5 inputs, and thus there can be 4^5 = 1024 transitions (like Table 10.1).
If we perform a similar calculation as before for unmasked gates, we find that the energies required to process q = 0 and q = 1 are identical. Thus, if we compute the mean difference of the power consumptions over all the possible 1024 transitions for the two cases q = 0 and q = 1, we should theoretically obtain zero. Likewise, the energy levels do not depend on the inputs a and b, which supports the theory of masking and shows that the masked gate should not leak against a first-order DPA. However, in this analysis we assume that the CMOS gates switch once per clock cycle, which is true only in the absence of glitches.
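This claim can be checked by enumerating all 4^5 = 1024 input transitions of the glitch-free masked AND gate. The sketch below uses the output toggle count as a simple stand-in for the per-transition energy model of Section 10.2.3:

```python
from itertools import product

def masked_and_output(x):
    """Combinational output qm of the masked AND gate."""
    am, bm, ma, mb, mq = x
    return (am & bm) ^ (bm & ma) ^ (am & mb) ^ (ma & mb) ^ mq

# enumerate all 4^5 = 1024 transitions (old input vector -> new vector)
toggles = {0: 0, 1: 0}
count = {0: 0, 1: 0}
for old in product((0, 1), repeat=5):
    for new in product((0, 1), repeat=5):
        am, bm, ma, mb, mq = new
        q = (am ^ ma) & (bm ^ mb)    # unmasked result after the transition
        toggles[q] += masked_and_output(old) ^ masked_and_output(new)
        count[q] += 1

# without glitches, the mean output activity is identical for q = 0 and q = 1
assert toggles[0] / count[0] == toggles[1] / count[1]
```

Under the single-switch assumption the mean activity is balanced; the glitch scenarios discussed next break exactly this balance by making the gate toggle twice per cycle.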
FIGURE 10.27: Masked AND gate computing ab ⊕ mq from the inputs am, bm, ma, mb and mq
But glitches are a very common phenomenon in digital circuits, as a result of which CMOS gates switch more than once in a clock cycle before stabilizing to their steady states. One of the prime reasons for glitches in digital circuits is the different arrival times of the input signals, which may occur in practice due to signal skew, routing delays, etc. As can be seen, the circuit shown in Fig. 10.27 is unbalanced, which leads to glitches in the circuit.
The work proposed in [242] investigates various such scenarios which cause glitches and multiple toggles of the masked AND gate. The assumption is that each of the 5 input signals toggles once per clock cycle and that one of the inputs arrives at a different time instant than the others. Moreover, we assume that the delay between the arrival times of two distinct signals is more than the propagation time of the gate. As a special case, consider situations where only one of the five inputs arrives at a different moment of time than the remaining four inputs.
There exist 10 such scenarios, as each one of the 5 input signals can arrive either before or after the four other ones. In every scenario there exist 4^5 = 1024 possible combinations of transitions that can occur at the inputs. However, in each of the ten scenarios where the inputs arrive at two different moments of time, the output of the masked AND gate performs two transitions instead of one: one transition when the single input toggles, and another when the other four input signals toggle. Thus the transition table for such a gate would consist of 2048 rows, and we observe that the expected mean for the cases when q = qm ⊕ mq = 0 is different from that when q = 1. Similar results were found in the other scenarios as well.
In [243], the exact reason behind the attacks on masked AND gates was investigated. For each of the 1024 transitions resulting in the circuit due to the toggles in the input lines, the transitions were monitored on each of the internal lines. The numbers of transitions on the lines are denoted by T(am), T(bm), T(ma), T(mb), T(mq), . . . , T(i1), . . . , T(i7). It can be observed that the masking makes the correlation of the numbers of transitions T(am), T(bm), T(ma), T(mb), T(mq), T(i1), T(i2), T(i3), T(i4) with the unmasked signals a, b, and q zero. The inputs are masked, and hence it can be explained why the numbers of transitions T(am), T(bm), T(ma), T(mb), T(mq) are uncorrelated with the value of q.
The 4 multipliers never operate on an input like am together with the corresponding mask ma; rather they operate on independent inputs, such as am along with bm and mb. This makes the numbers of transitions at the outputs of the multipliers also uncorrelated with the unmasked values. Hence, it was precisely pin-pointed that the XOR gates were responsible for the correlation between the power and the value of q.
This was counterintuitive, as normally the output of an XOR gate switches whenever any of its inputs changes. Since the inputs to the 4 XOR gates (Fig. 10.27), i1, i2, i3 and i4, are uncorrelated with the unmasked values (as stated previously, the multipliers process values which are not correlated with the unmasked values), it was expected that the outputs of the XOR gates should also be uncorrelated with the unmasked values. However, XOR gates absorb certain transitions when both inputs change simultaneously or within a small interval of time. It was shown that, due to the differences in arrival times of the inputs to the XOR gates, the numbers of absorbed transitions of the XOR gates were correlated with the unmasked values a, b, and q. The absorbed transitions depend on the arrival times of the input signals i1 to i4, which in turn depend on the unmasked values. Thus the joint distribution of the arrival times of i1, . . . , i4 was established as the source of side-channel leakage of the masked gates.
The above discussion shows that the absorption property of the XORs is responsible for the power attacks. On a closer look, it may be observed that out of the 4 XOR gates of the masked AND gate, the one combining the mask mq and the multiplier output line i4 operates on data none of which depends on the unmasked inputs a or b. Hence the XOR gates actually responsible are the other three. If the architecture can ensure that these XOR gates do not absorb any transition, then the overall circuit can be secured. One possible solution is to properly time the circuit with suitable enabling signals. We may also want to properly balance the circuit, taking care that arbitrary balancing does not lead to the exposure of unmasked values. However, adding such measures also increases the circuit complexity.
10.9 Conclusions
In this chapter, we provided an overview of power attacks on implementations of cryptographic algorithms. The chapter provided an insight into the measurement setup and the underlying power models that explain why gates are vulnerable to power analysis. Subsequently, Differential Power Analysis was explained using the Difference of Means method, along with some improvements over the basic technique. We discussed techniques for performing power attacks using correlation analysis, thus defining the method of Correlation Power Analysis. Metrics for evaluating a Differential Power Analysis were subsequently discussed. The chapter presented power analysis results both for simulated power traces and for actual attacks on real traces captured from FPGAs for various types of AES architectures. The chapter concluded with popular countermeasures, with an emphasis on masking techniques and their limitations against first-order power attacks.
(Satires of Juvenal)
11.1 Introduction
With the increase in the applications and complexity of cryptodevices, the reliability of such devices has raised concern. Scan chains are the most popular testing technique due to their high fault coverage and low hardware overhead. However, scan chains open side channels for cryptanalysis, as discussed in Chapter 7. Scan chains can be used to access intermediate values stored in the flip-flops, and thereby to ascertain the secret information, the key. These attack techniques exploit the fact that most modern-day ICs are designed for testability. If scan chains are used for testing, they serve as a double-edged sword. In this test scheme all flip-flops (FFs) are connected in a chain, and the states of the FFs can be scanned out through the chain. Scan testing equips a user with two very powerful features, namely controllability and observability. Controllability refers to the fact that the user can set the FFs to a desired state, while observability refers to the power to observe the contents of the FFs. Both these desirable features of a scan-chain testing methodology can be fatal from the cryptographic point of view. In cryptographic algorithms, knowledge of intermediate values of the cipher can seriously reduce the complexity of breaking it. As seen in Section 7.7 of Chapter 7, the availability of scan chains to an attacker can be used efficiently to determine the key of a block cipher.
Such attacks have profound practical importance, as the security of the system can be compromised using unsophisticated methods. For example, keys can be extracted from popular pay-TV set-top boxes via scan chains at a cost of a few dollars. Naturally, the attack can be prevented by eliminating the test capabilities from the design. However, this increases the risk of shipping chips with defects, which may hamper the normal functionality of the designs. Hence, research is needed to study scan-chain based attacks against designs of standard ciphers, with the objective of developing suitable DFT techniques for cryptographic hardware.
In order to solve this challenging problem of efficiently testing cryptographic chips, several interesting solutions have been proposed. Important constraints on these designs are their hardware cost, test time, and the amount of testability achieved. In this chapter, we extend the attack on block ciphers introduced in Chapter 7 to stream ciphers. Subsequently, we provide an overview of some possible strategies to be incorporated in the design of test methodologies to make the design testable, yet secure.
present section, we provide an overview of the attacks on stream ciphers. These attacks show that scan chains provide the controllability and observability needed to attack any cryptographic algorithm.
[Figure: generic LFSR-based stream cipher, with n-bit configuration registers CR1, . . . , CRn loaded from a memory of w-bit words, n-bit shift registers SR1, . . . , SRn initialized from the w-bit seed, and a combining function F producing the key-stream bit ki, which is XORed with the message bit mi to give the ciphertext bit ci]
The stream cipher can be attacked through the use of scan chains, enabling the adversary to obtain the message data from the ciphertext. The attack has three broad stages. At the outset, the attacker scans out the entire contents of the internal CR and SR flip-flops, but he does not yet know their positions in the scan chains.
In the first stage of the attack, the adversary ascertains the structure of the seed. For this, the attacker inputs an all-zero seed and applies one clock cycle. Thus the memory gets accessed at a location addr(0), whose content is denoted (addr(0)), and the corresponding polynomial gets loaded into the CR registers. Now the attacker sets the device back to test mode and scans out the data. It may be noted that all the SRs get their values directly from the seed and are thus all zero. Hence the number of ones in the scanned-out data pattern is s.wt(addr(0)), as there are s LFSRs. Now the attacker sets one of the input seed bits to one, and the above steps are repeated. Depending on whether this bit goes to the memory or to the SRs, the number of ones is either s.wt(addr(1)) or s.wt(addr(0)) + 1. Since both of these cannot be the same (as s > 1), and the attacker knows the value of s.wt(addr(0)), he can tell whether the set bit is a part of the w1 bits or not. Continuing this for all the w bits of the seed, the attacker learns which bits of the seed are used to initialize the SRs. The attacker also learns the positions of the CR registers in the scanned-out pattern.
In the second step, the attacker identifies the ordering of the CR and SR registers. The attacker first feeds an all-one input through the scan-input pin in test mode. Thus, without loss of generality, if we assume that the LFSRs have an even number of bits, the feedback bit is zero. If the device is now scanned out, one can identify the locations of the SR[n] bits (see figure) in the scanned-out pattern. Similarly, toggling between normal and test mode, one can identify all the groups of the SR bits. At this point, however, he does not know the relationship of the SR bits in the scanned-out pattern with the corresponding LFSRs.
For this, the attacker applies a simple trick and sets the CR bits to 10 . . . 01 through the scan chains (note that he has already ascertained the positions of the CR bits in the scanned-out pattern). The adversary also sets one of the SR[1] bits to one. After n clock cycles, all the SR bits of a particular register become one, and their positions are revealed in the scan chain.
Finally, the attacker uses the data scanned out at the beginning, together with his knowledge of the positions of the SR and CR registers in the scanned-out data, to determine the contents of the SR and CR registers at the beginning. This information helps him obtain the key-stream, which in turn yields the message bits from the ciphertext bits.
The above generic technique for attacking stream ciphers using scan chains is illustrated
next, with an example attack on the Trivium stream cipher.
First, the key and the Initialization Vector (IV) are used to initialize the internal states
of the cipher which are then updated using Equation (11.2) but without generating the
key-stream.
for i = 1 to 4 × 288 do
  t1 ← s66 ⊕ s91 · s92 ⊕ s93 ⊕ s171
  t2 ← s162 ⊕ s175 · s176 ⊕ s177 ⊕ s264
  t3 ← s243 ⊕ s286 · s287 ⊕ s288 ⊕ s69
  (s1, s2, s3, . . . , s93) ← (t3, s1, s2, . . . , s92)
  (s94, s95, s96, . . . , s177) ← (t1, s94, s95, . . . , s176)
  (s178, s179, s180, . . . , s288) ← (t2, s178, s179, . . . , s287)
end for    (11.2)
for i = 1 to N do
  t1 ← s66 ⊕ s93
  t2 ← s162 ⊕ s177
  t3 ← s243 ⊕ s288
  zi ← t1 ⊕ t2 ⊕ t3
  t1 ← t1 ⊕ s91 · s92 ⊕ s171
  t2 ← t2 ⊕ s175 · s176 ⊕ s264
  t3 ← t3 ⊕ s286 · s287 ⊕ s69
  (s1, s2, s3, . . . , s93) ← (t3, s1, s2, . . . , s92)
  (s94, s95, s96, . . . , s177) ← (t1, s94, s95, . . . , s176)
  (s178, s179, s180, . . . , s288) ← (t2, s178, s179, . . . , s287)
end for    (11.3)
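The key-stream loop of Equation (11.3) translates directly into a short Python sketch. Note that an all-zero internal state is a fixed point of the update, which is exactly the property the counter-bit attack described below exploits (the bit-list representation is ours):

```python
def trivium_round(s):
    """One update of Equation (11.3); s is a list of 288 bits, with
    s[i-1] holding bit s_i. Returns (new_state, keystream_bit)."""
    t1 = s[65] ^ s[92]
    t2 = s[161] ^ s[176]
    t3 = s[242] ^ s[287]
    z = t1 ^ t2 ^ t3
    t1 ^= (s[90] & s[91]) ^ s[170]
    t2 ^= (s[174] & s[175]) ^ s[263]
    t3 ^= (s[285] & s[286]) ^ s[68]
    # shift the three registers, inserting t3, t1, t2 at their heads
    s = [t3] + s[:92] + [t1] + s[93:176] + [t2] + s[177:287]
    return s, z

# an all-zero state stays all zero and produces only zero key-stream bits
state = [0] * 288
for _ in range(5):
    state, z = trivium_round(state)
    assert z == 0 and state == [0] * 288
```

This explains why, in the scan attack below, forcing an all-zero state leaves every state bit untouched: only the external counter changes.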
the user to scan in a desired pattern in test mode, run the circuit in normal mode, and then scan out the pattern to verify the functioning of the device. In the case of crypto devices, this feature equips an attacker with a tool by virtue of which he can gain knowledge of the internal intermediate state of the crypto algorithm. He exploits this knowledge to break the cryptosystem. The literature shows that these attacks are quite easy to implement and do not require any sophisticated setup.
The attacker can ascertain the positions of the counter bits by exploiting the Key-IV input pattern. As per the design in [375], the user is required to give a padded input pattern, (Key80, 0^13, IV80, 0^112, 1^3), through the Key_IV padding line. Here Key80 represents the 80 bits of the key, while 0^13 denotes a sequence of 13 0's. Since this pattern is given from outside, the attacker inputs a pattern of all zeros, i.e., 0^288. He then runs the system in normal mode for (2^10 − 1) cycles. Since the internal state of the cipher is set to all zeros, then according to Equation (11.3) running the cipher causes no change in the 288-bit internal state register or the temporary registers, which always remain in an all-zero state. The only thing that changes is the 11-bit counter. Since the system runs for (2^10 − 1) cycles, 10 bits of the counter will be set to 1. The attacker then scans out the pattern, and the bits which are 1 are concluded to be the 10 counter bits. The 11th counter bit can be determined by scanning in the scanned-out pattern again and running the system for 1 cycle. This sets the 11th counter bit to 1. The attacker then scans out the pattern to get the bit position.
The attacker requires 288 clock cycles to scan in a pattern and the same number of clock cycles to scan it out.
It might be noted that to mount the scan attack the attacker need not know which bit of the counter corresponds to which bit position, but only the positions of all the counter bits as a whole. The number of clock cycles required is (288 + 2^10 − 1 + 288 + 288 + 1 + 288) = 2176.
This part of the attack is straightforward. The attacker again tries to exploit the key-IV input pattern. He first resets the circuit and then gives a pattern in which only key1 = 1 and the remaining bits are 0 during the Key-IV setup of Equation (11.1), i.e.,
He then runs the system for 1 clock cycle to load the first key bit, after which he scans out the entire pattern in test mode. The output pattern will have 1's in two positions. One of these corresponds to the counter LSB, which the attacker already knows, while the other corresponds to s1. So the bit position of s1 in the output pattern is ascertained. He proceeds likewise, setting to 1 only the bit he wants to find in the input pattern and then running the corresponding number of clock cycles in normal mode. Typically, to find the ith bit position he has to run the system for i clock periods.
He thus finds the bit positions of all the 288 state bits in the output pattern. The number of clock cycles required is Σ_{i=1}^{288} i + 288^2 = 124560.
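The cycle count can be checked directly: as counted in the text, locating bit i costs i normal-mode cycles, plus one 288-cycle scan per bit.

```python
# sum of the per-bit run cycles, plus one 288-cycle scan for each of the 288 bits
total = sum(range(1, 289)) + 288 ** 2
assert total == 124560
```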
The attacker has already determined the positions of (288 + 11) = 299 internal registers. He then finds the temporary registers t1, t2, t3. The following equations can be derived from Equation (11.2).
The attacker sets the input pattern in such a way that, when he runs the cipher in normal mode after the key-IV load, only one of the temporary registers changes its value to 1 while the others are 0. For example, the attacker can set s66 = 1.
He then loads the Key-IV pattern in normal mode and runs for 1 more clock cycle, which sets t1 = 1. He scans out the pattern, and as he already knows all the other bit positions, he gets the position of t1. This requires 2 × 288 clocks: 288 for scanning in and 288 for scanning out. He repeats the process to get t2. Once he obtains t1 and t2, the position of t3 becomes evident. Thus, he requires a total of 4 × 288 clocks.
This concludes the first phase of the attack, with the attacker now having knowledge of all the bit positions in the internal state of the cipher. He then advances to the next phase, where he attempts to decipher the cryptogram from the knowledge of the internal state.
Pi = Ki ⊕ Ci
So our basic objective is to get all the key-stream bits. The attacker has already scanned out the internal state of Trivium after getting hold of the device, and has ascertained the positions of the s-bits in the scan chain. This information will be used to decipher the cryptogram and obtain the plaintext. We proceed by computing the previous state from the current state. Table 11.1 gives a clear picture of the relation between the present and previous states of Trivium. As is clear from the encryption algorithm, the current state is a right shift of the previous state, with the first bit of each register being a non-linear function of some other bits. So, our task remains to calculate 'a', 'b' and 'c'. Observe the following equations:
t1 = s66 ⊕ s93 (11.4)
t1 = t1 ⊕ s91 · s92 ⊕ s171 (11.5)
(s94 , s95 , · · · , s177 ) ← (t1 , s94 , · · · , s176 ) (11.6)
Equations (11.4) and (11.5) can be combined to get,
t1 = s66 ⊕ s93 ⊕ s91 · s92 ⊕ s171 (11.7)
It should be noted that if we clock Trivium in this configuration, register s94 gets loaded with t1 and the other bits are shifted to the right. So, we can say that what is s67 now must have been s66 in the previous state, what is s93 now must have been s92 in the previous state, and so on. Hence, from Equations (11.6) and (11.7) and by referring to Table 11.1, we have the following equation:
s94 = a ⊕ s67 ⊕ s92 · s93 ⊕ s172
⇒ a = s94 ⊕ s67 ⊕ s92 · s93 ⊕ s172
Similarly, 'b' and 'c' can be deduced from the following sets of equations:
s178 = b ⊕ s163 ⊕ s176 · s177 ⊕ s265
⇒ b = s178 ⊕ s163 ⊕ s176 · s177 ⊕ s265
And,
s1 = c ⊕ s244 ⊕ s287 · s288 ⊕ s70
⇒ c = s1 ⊕ s244 ⊕ s287 · s288 ⊕ s70
Hence, we can compute all the previous states given a single current state of the internal registers. Once a state is obtained, one can easily get the key-stream bit by following Equation (11.3). The key-stream bit, when XORed with the ciphertext bit of that state, produces the corresponding plaintext bit.
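The state-reversal equations can be sanity-checked against the forward update of Equation (11.3): stepping an arbitrary state forward and then applying the formulas for a, b and c must recover it. A minimal sketch, with our own bit-list representation (s[i-1] holds bit s_i):

```python
import random

def trivium_forward(s):
    """One state update of Equation (11.3), key-stream bit omitted."""
    t1 = s[65] ^ s[92] ^ (s[90] & s[91]) ^ s[170]
    t2 = s[161] ^ s[176] ^ (s[174] & s[175]) ^ s[263]
    t3 = s[242] ^ s[287] ^ (s[285] & s[286]) ^ s[68]
    return [t3] + s[:92] + [t1] + s[93:176] + [t2] + s[177:287]

def trivium_backward(s):
    """Recover the previous state from the current one via a, b, c."""
    a = s[93] ^ s[66] ^ (s[91] & s[92]) ^ s[171]    # a = s94 + s67 + s92.s93 + s172
    b = s[177] ^ s[162] ^ (s[175] & s[176]) ^ s[264]
    c = s[0] ^ s[243] ^ (s[286] & s[287]) ^ s[69]
    # undo the three shifts: drop the inserted t-bits, restore the dropped bits
    return s[1:93] + [a] + s[94:177] + [b] + s[178:288] + [c]

random.seed(1)
state = [random.randint(0, 1) for _ in range(288)]
assert trivium_backward(trivium_forward(state)) == state
```

Here a, b and c are exactly the bits (s93, s177, s288 of the previous state) that each register shift discards, which is why the three derived equations suffice to run the cipher backwards.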
In summary, scan attacks constitute a very powerful attack model. Several ciphers have been attacked in the literature, exploiting similar controllability and observability through scan chains. These works show that block ciphers, stream ciphers, and even public-key hardware cannot be safely tested through plain scan chains. This raises the important question of how to test cryptographic devices through specially designed scan chains, so that the normal user gets the controllability and observability needed for testability while the attacker does not. The further constraint of a low hardware overhead makes the task even more challenging. Allowing untested components in a cryptographic circuit is also dangerous, as they may be avenues for further attacks. Thus testing features are a necessity, but they must not open side channels to the adversary.
[Figure: the crypto core receives the key through a Mirror Key Register (MKR), driven by a test controller through the Load_Key, Scan_mode, Enable_scan_in and Enable_scan_out control signals]
FIGURE 11.3: Secure Scan Architecture with Mirror Key Register [55]
An interesting alternative, and one of the best methods, was proposed in [55, 421], where a secure scan-chain architecture with a mirror key register was used to provide both testability and security. Fig. 11.3 shows the diagram of the secure scan architecture. The design uses the idea of a special register, called the mirror key register (MKR), which during encryption is loaded with the key stored in a separate register. During encryption, the design is in a secure mode and the scan chains are disabled. When the design is in test mode, the design is in the insecure mode and the scan chains are enabled; during this time the MKR is detached from the key register. The transition from the insecure mode to the secure mode happens by setting the Load_key signal high and the Enable_scan_in and Enable_scan_out signals low. However, the transition from the secure mode to the insecure mode happens only through a power-off state and reversing the above control signals. It is expected that powering off removes the content of the MKR, and thus does not reveal the key to a scan-chain based attacker.
But this method has the following shortcomings:
• Security is derived from the fact that switching off the power destroys the data in the registers. So, if the secret is permanently stored on-chip (e.g., credit cards, cell-phone SIM cards, access cards), the information exists inside the chip even after the power is turned off. It can then be extracted from a device having such a scan chain in the insecure mode.
• One of the most secure modes of operation of a block cipher like AES is Cipher Block Chaining (CBC), where the ciphertext at any instant of time depends on the previous block of ciphertext [368]. If testing is required at an intermediate stage, the device needs to be switched off. Thus, to resume data encryption, all the previous blocks have to be encrypted again. This entire process also has to be synchronized with the receiver which is decrypting the data. Therefore, such modes of block ciphers cannot be tested efficiently using this scheme.
In [161], a secure scan-chain architecture was proposed based on a lock-and-key technique. The architecture also uses a test key, like [149], to secure the vital information of the chip. The design has a Test Security Controller (TSC) which compares the key. When the key is successfully entered, a finite state machine (FSM) switches the chip to a secured mode allowing normal scan-based testing. Otherwise, the device goes to an insecure mode and remains stuck until an additional Test Control pin is reset. The design suffers from the problem of large overhead due to the design of the TSC. The TSC itself uses a large number of flip-flops (for LFSRs and FSMs), which require BIST for testing, leading to an inefficient design. Further, the design uses an additional key (known as the test key) for security. If the cipher uses an n-bit key for its operation (like 128 bits for AES-Rijndael), a brute-force attack would require 2^n operations to break the system. Any successful attack should break the key with a complexity less than that of a brute-force effort. If the design uses additional key bits for the security of the TSC (say t bits), then with a total of n + t key bits the design provides security equivalent to that of min(n, t) bits. Hence, using additional key bits for scan chains is not a viable solution.
In [263] a scan tree based architecture with an aliasing-free compactor was proposed for
testing cryptographic devices. However, the design suffers from a large area overhead
due to the compactors and their testing circuitry. Moreover, the normal CAD flow does
not support the design of scan tree structures.
One of the recent techniques for implementing secure scan-chains is the Flipped-scan
technique [357]. In this scheme inverters are introduced at random points in the scan-chain.
Security lies in the fact that an attacker cannot guess the positions of the inverters with
a probability significantly greater than 1/2. However, this scheme is vulnerable to an attack
which we call the reset attack. In standard VLSI design, each FF is accompanied by either
a synchronous or an asynchronous RESET which initializes the FF to zero. An attacker can
assert the reset signal and obtain the scan-out pattern by operating the crypto chip in test
mode. The scan-out pattern would look like series of 0's interleaved with series of 1's; the
places where the polarity is reversed are the locations where an inverter has been inserted.
The following example illustrates the attack.
Example 22 Suppose there is a scan-chain of 10 D-FFs, and inverters are placed at
position numbers 1, 3, 6, 7 and 11, out of 11 possible positions, as shown in Fig. 11.4. We
apply a reset signal and scan out the pattern. We will get a pattern X1, X2, ..., X10,
where Xi = a(i+1) ⊕ a(i+2) ⊕ ... ⊕ a11, and ai is 1 if there is an inverter in the ith link and
ai = 0 otherwise.
We will get this pattern in the order X10, X9, ..., X1. In this example,
X10 = a11
X9 = a10 ⊕ a11
X8 = a9 ⊕ a10 ⊕ a11
...
X1 = a2 ⊕ a3 ⊕ ... ⊕ a11
From the above set of equations it follows that we will get the sequence X =
{0, 0, 1, 1, 1, 0, 1, 1, 1, 1}. As previously stated, inverters have been inserted at those positions
where the polarity is reversed in the pattern. Here, the positions where polarity reversals occur
are 3, 6, 7 and 11. The presence of an inverter at the 11th position is detected from the fact
that the first bit obtained while scanning out is 1. Whether there is an inverter at the first
position can be deduced easily by feeding in a bit and observing whether the output toggles.
Thus, inverters are placed at positions 1, 3, 6, 7 and 11. By the above procedure, one can
always ascertain the locations of the inverters. Therefore, the design in [357] is no longer secure.
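The reset attack of Example 22 can be simulated directly. This is a Python sketch with function names of our own; link i denotes the position before the ith FF, as in the example:

```python
def scan_out_after_reset(inverter_links, n):
    """Scan-out of a reset flipped-scan chain of n FFs, in the order
    X_n, ..., X_1, where X_i = XOR of a_j over links j = i+1 .. n+1."""
    return [sum(1 for j in inverter_links if j > i) % 2 for i in range(n, 0, -1)]

def reset_attack(observed):
    """Recover inverter links 2 .. n+1 from the reset scan-out pattern."""
    n = len(observed)
    X = observed[::-1]                            # X[0] = X_1, ..., X[n-1] = X_n
    links = {n + 1} if X[n - 1] else set()        # first bit out is X_n = a_{n+1}
    links |= {i + 1 for i in range(1, n)          # a_{i+1} = X_i XOR X_{i+1}:
              if X[i - 1] ^ X[i]}                 # a polarity reversal in the pattern
    return links                                  # link 1 needs one extra shifted-in bit

observed = scan_out_after_reset({1, 3, 6, 7, 11}, n=10)    # Example 22
assert observed == [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
assert reset_attack(observed) == {3, 6, 7, 11}   # link 1 is found by the toggle test
```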
The XOR-chain architecture can be used for testability, as we can have a unique output
pattern for a given input pattern. The following theorem summarizes the fact that the
XOR-chain maintains a one-to-one correspondence between the test input and output
patterns, and thus can be used to test the device.
Theorem 19 [261] Let X = {X1 , X2 , · · · , Xn } be the vector space of inputs to the XOR-
chain and Y = {Y1 , Y2 , · · · , Yn } be the vector space of outputs from the XOR-chain. Then
there is a one-to-one correspondence between X and Y, if the XOR chain is reset before
feeding in the pattern X.
The proof has essentially two parts. First, we prove the one-to-one correspondence
between the input and what gets into the FFs.
Let the existence of an XOR gate be denoted by a variable ai ∈ {0, 1} such that if ai = 1
then there is an XOR gate before the ith FF; otherwise, if ai = 0, there is no XOR gate.
At any instant of time t the state of the internal FFs of the XOR-chain can be expressed
as follows:
where,
Solving the above system of equations by multiplying these equations with appropriate
values of ai and subsequent XORing, we get the following expression:
a1.S2^t ⊕ a1.a2.S3^t ⊕ · · · ⊕ (a1.a2...a(n−1)).Sn^t = S1^t ⊕ X(n−t) ⊕ (a1.a2...an).Sn^(t−1)
For testing, before scanning in a pattern, the circuit is always reset; hence, all FFs
initially hold 0. In the above equation, Sn^(t−1) is the state of the nth FF at the (t−1)th clock
cycle. However, that should be 0 because of the initial reset, so the term (a1.a2...an).Sn^(t−1)
is always 0, independent of the values of a1, a2, ..., an. Thus the resultant equation can be
rewritten as:
It can be inferred from the above that, given the states of all the FFs at an instant (i.e., S1^t, ..., Sn^t)
and the configuration of the XOR-chain (i.e., a1, ..., an), one can find out what the input to
the chain was. Thus we can conclude that, for a fixed X and configuration (a1, a2, ..., an), the states of
the internal FFs have a one-to-one mapping with the input vector space X.
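The one-to-one property is easy to check exhaustively for a small chain. The sketch below assumes a simplified XOR-chain model — the XOR gate before the ith FF, present iff ai = 1, XORs the incoming bit with that FF's current content. This wiring is our assumption for illustration; the exact circuit of [261] may differ, but the same triangular argument (each final FF state depends on one fresh input bit plus strictly older ones) applies:

```python
from itertools import product

def shift_into_xor_chain(bits, a):
    """Shift `bits` into a reset XOR-chain and return the final FF contents.
    a[i] = 1 means an XOR gate sits before FF i (simplified model, an assumption)."""
    n = len(a)
    S = [0] * n                            # initial reset: all FFs hold 0
    for x in bits:
        # comprehension reads the old S throughout, so all FFs update in lockstep
        S = [(x if i == 0 else S[i - 1]) ^ (a[i] & S[i]) for i in range(n)]
    return tuple(S)

a = [1, 0, 1, 0]                           # XOR gates before FFs 1 and 3 (example)
states = {shift_into_xor_chain(bits, a) for bits in product([0, 1], repeat=4)}
assert len(states) == 16                   # all 2^4 inputs land in distinct states
```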
In order to mount a scan attack on crypto hardware in general, and on a stream cipher in
particular, an attacker must first ascertain the structure of the scan-chain. So the main aim
of any prevention mechanism is to thwart such an attempt. Similar to the philosophy of the
flipped scan chain, the security of our scheme relies on the fact that if the positions of the XOR
gates are unknown to the attacker, then it is computationally infeasible for him to determine
the structure of the scan-chain. As a rule of thumb, we insert m = ⌊n/2⌋ XOR
gates in a scan-chain with n FFs. We show in the following that even if the attacker knows
the number of XOR gates (m) and the number of FFs, the probability of
successfully determining the structure is about 1/2^n. The following theorem states the fact
that it is infeasible for the attacker to determine the scan structure.
Theorem 20 [261] For an XOR scan-chain with n FFs and m = ⌊n/2⌋ XOR gates, the
probability that an attacker with knowledge of n guesses the correct structure is nearly 1/2^n.
Let the number of FFs be n and the number of XOR gates be m. Hence, the number of
possible structures is nCm, where nCm denotes the number of ways of choosing m unordered
elements from n elements. Let us assume that the attacker uses the knowledge that the number
of XOR gates is half the number of FFs, so that we have m = ⌊n/2⌋. Therefore, the
probability of guessing the correct structure is 1/nC⌊n/2⌋.
In order to compute the bound, we use the fact that nCr attains its maximum value at
r = ⌊n/2⌋. We combine this with the fact that this maximum binomial coefficient must be
greater than the average of all binomial coefficients, taken as r runs from 0 to n.
Thus, nC⌊n/2⌋ > 2^n/(n + 1), i.e., nC⌊n/2⌋ > 2^(n−log2(n+1)). And for large values of n,
n − log2(n + 1) ≈ n. Hence, the probability of guessing the correct scan structure is nearly 1/2^n.
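The bound is easy to verify numerically (Python's `math.comb` gives nCr):

```python
import math

def guess_probability(n):
    """Probability of guessing the XOR-chain layout when m = floor(n/2):
    1 / C(n, floor(n/2))."""
    return 1 / math.comb(n, n // 2)

# C(n, floor(n/2)) > 2^n / (n+1), so the guess probability is below
# (n+1)/2^n, i.e. essentially 1/2^n once n is large.
for n in (16, 64, 128):
    assert math.comb(n, n // 2) > 2**n / (n + 1)
    assert guess_probability(n) < (n + 1) / 2**n
```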
The XOR-chain model requires insertion of XOR gates at random points of a scan
chain. In terms of gate count, the XOR-chain is far more costly than its nearest counterpart,
flipped-scan. Table 11.2 compares the hardware overhead of the proposed prevention scheme
with that of [357].
Subsequently, several developments and interesting solutions on scan chains for secure
testing have evolved [125, 124, 361, 362]. But in spite of the defences, this topic remains an
important field of activity with the emergence of new attacks [166].
11.5 Conclusions
This chapter described the potential security threats posed by conventional scan chains
used to provide DFT (Design for Testability) in cryptographic hardware. We presented a
case-study of an attack on a generic stream cipher hardware module, and then showed
an actual attack on a standard stream cipher, Trivium. Finally, we discussed some
potential countermeasures that allow testability while reducing the attack opportunities
for an adversary.
12.1 Introduction
Reuse-based System-on-Chip design using hardware intellectual property (IP) cores has
become a pervasive practice in the industry. These IP cores usually come in the following
three forms: synthesizable Register Transfer Level (RTL) descriptions in Hardware Descrip-
tion Languages (HDLs) (“Soft IP”); gate-level designs directly implementable in hardware
(“Firm IP”); and GDS-II design database (“Hard IP”). The approach of designing complex
systems by integrating tested, verified and reusable modules reduces the SoC design time
and cost dramatically [73].
Unfortunately, recent trends in IP-piracy and reverse-engineering efforts to produce
counterfeit ICs have raised serious concerns in the IC design community [73, 277, 172,
85, 215]. IP piracy can take diverse forms, as illustrated by the following scenarios:
• A chip design house buys the IP core from the IP vendor, and makes an illegal copy or
“clone” of the IP. The IC design house then uses the IP without paying the required
royalty, or sells it to another IC design house (after minor modifications) claiming the
IP to be its own design [172].
• An untrusted fabrication house makes an illegal copy of the GDS-II database supplied
by a chip design house, and then manufactures and sells counterfeit copies of the IC
under a different brand name [77].
• A company performs post-silicon reverse-engineering on an IC to manufacture its
illegal clone [336].
These scenarios demonstrate that all parties involved in the IC design flow are vulnerable
to different forms of IP infringement, which can result in loss of revenue and market share.
Obfuscation is a technique that transforms an application or a design into one that is
functionally equivalent to the original but is significantly more difficult to reverse engineer.
Software obfuscation to prevent reverse-engineering has been studied widely in recent years
[97, 430, 155, 153, 95, 96]; however, the techniques of software obfuscation are not directly
applicable to HDL because the obfuscated HDL can result in potentially unacceptable design
overhead when synthesized.
Although design modifications to prevent the illegal manufacturing of ICs by fabrication
houses have been proposed [336] before, such techniques are not useful in preventing the
theft of soft IPs. Furthermore, they do not provide protection against possible IP piracy from
the SoC design house. In this chapter, we present two low-overhead techniques each of which
can serve as the basis for a secure SoC design methodology through design obfuscation and
authentication performed on the RTL design description. We follow a key-based obfuscation
approach, where normal functionality is enabled only upon application of a specific input
initialization key sequence. The majority of commercial hardware IPs come in the RTL
(“soft”) format, which offers better portability by allowing design houses to map the circuit
to a preferred platform in a particular manufacturing process [231].
In this chapter, we have demonstrated design obfuscation for both gate-level [77] and
RTL IPs. The main idea behind the IP protection technique for gate-level designs is: (a)
finding the optimal modification locations in a gate-level netlist by structural analysis, and
(b) designing a gate-level modification circuit to optimally modify the netlist. While direct
structural analysis and the structural modification proposed in [77] is suitable for gate-level
IPs, the proposed obfuscation approach for gate-level IPs cannot be used for RTL IPs for
the following reasons: (a) RTL IPs do not directly provide high-level structural information;
and (b) any obfuscation performed on an RTL IP should maintain its portability. Since a
majority of the hardware IPs are actually delivered in RTL format, there is a need to develop
low-overhead protection measures for them against piracy. In this chapter, we propose an
anti-piracy approach for RTL IPs based on effective key-based hardware obfuscation. The
basic idea is to judiciously obfuscate the control and data flow of an RTL IP in a way that
prevents its normal mode operation without application of an enabling key at its primary
input.
We provide two RTL IP obfuscation solutions, which differ in level of protection and
computation complexity: 1) the first technique, referred to as the "STG modification ap-
proach", converts an RTL description to the gate level, obfuscates the gate-level netlist, and
then decompiles it back to RTL; 2) the second technique, referred to as the "CDFG modifica-
tion approach", avoids the forward compilation step and applies obfuscation at the register
transfer level by modifying its control and data flow constructs, which is facilitated through
generation of a CDFG of the RTL design. The first approach can provide a higher level of
obfuscation but is computationally more expensive than the second.
We derive appropriate metrics to quantify the obfuscation level for both the approaches.
We compare the two approaches both qualitatively and quantitatively. We show that an
important advantage of the proposed obfuscation approach is that the level of protection
can be improved with minimal additional hardware overhead by increasing the length of the
initialization key sequence. We show that this is unlike conventional encryption algorithms,
e.g., the Advanced Encryption Standard (AES), whose hardware implementations usually
incur much larger overhead for increased key lengths.
Finally, along with obfuscation, we show that the proposed approaches can be extended
to embed a hard-to-remove “digital watermark” in the IP that can help to authenticate the
IP in case it is illegally stolen. The authentication capability provides enhanced security
while incurring little additional hardware overhead.
The rest of the chapter is organized as follows. In Section 12.2, we present related work
and the motivation behind this work. In Section 12.3, we describe the functional obfuscation
technique for IP based on modifications of the State Transition Graph (STG) of the circuit,
for gate–level designs. In Section 12.4 we describe the extension of the idea to generate RTL
designs. In Section 12.5, we describe IP protection based on the Control and Data Flow
Graph (CDFG) extracted from the RTL. In Section 12.5.2, we compare the relative merits
and demerits of the two proposed techniques. In Section 12.6, we present a theoretical analysis
of the proposed obfuscation schemes to derive metrics for the level of obfuscation. In Section
12.7, we present automated design flows, and simulation results for several open-source IP
cores. In Section 12.8, we describe a technique to decrease the overhead by utilizing the
normally unused states of the circuit.
machine state after execution as the original. Other software obfuscation approaches include
self-modifying code [168] (code that generates other code at run-time), self-decryption of
partially encrypted code at run-time [352, 35], and code redundancy and voting to pro-
duce “tamper-tolerant software” (conceptually similar to hardware redundancy for fault
tolerance) [162]. A general shortcoming of these approaches is that they do not scale well
in terms of memory footprint and performance as the size of the program (or the part
of the program to be protected) increases [84]. Hence, RTL obfuscation approaches moti-
vated along similar lines are also likely to result in inefficient circuit implementations of
the obfuscated RTL with unacceptable design overhead. Also, the value of such techniques
is a matter of debate, because it has been theoretically proven that software obfuscation
in the sense of obfuscating the "black-box functionality" does not exist [40]. In contrast, we
modify both the structure and the functionality of the circuit description in question;
hence the above impossibility result does not apply in our case.
FIGURE 12.1: Schemes for Boolean function modification and modification cell.
en = 1, if g = 1 for a given set of primary inputs and state element output state, fmod = f
and the test pattern is a failing test pattern. To increase the amount of dissimilarity between
the original and modified designs, we should try to make g evaluate to logic-1 as often as
possible. At first glance, the trivial choice seems to be g = 1; however, in that case the
input logic cone is not expanded, and thus the number of failing vectors reported by a
formal verification approach is limited. For any given set of inputs, frequent evaluation to
logic-1 is achieved by a logic function which is the logical-OR of the input variables.
For the other “all zero” input combination of P2 , f = 0. Let the number of possible cases
where f = 1 at g = 0 be Ng0 . Then, the total number of failing input patterns:
Note that M attains the maximum (ideal) value of 0.5 in this case when Ng0 = 1. The
theoretical maximum, however, is not a very desirable option because it keeps the input
space limited to 2^p1 possible vectors. Again, if p = 0, i.e., g is generated by a completely
different set of primary inputs which were not included in f , then:
M = (1/2) · (1 + (Ng0 · 2^(−p1) − 1)/2^p2)     (12.6)
Larger values of Ng0 and smaller values of p2 for a given p1 help to increase M. Note that,
unlike the first case, Ng0 is guaranteed to be non-zero. This property effectively makes M
larger in case (b) than in case (a) for most functions. However, in the second
case M < 1/2, i.e., M cannot attain the theoretical maximum value.
FIGURE 12.2: The proposed functional and structural obfuscation scheme by modification
of the state transition function and internal node structure.
Selection of the Modification Kernel Function (g): Although the above analysis
points to the selection of primary inputs or state element outputs to design the MKF (g)
satisfying the condition p = 0, in practice, this could incur a lot of hardware overhead to
generate the OR-functions corresponding to each modified node. An alternative approach
is to select an internal logic node of the netlist to provide the Boolean function g. It should
have the following characteristics:
1. The modifying node should have a very large fan-in cone, which in turn would sub-
stantially expand the logic cone of the modified node.
3. It should not have any node in its fan-in cone which is in the fan-out cone of the
modified node.
Conditions (2) and (3) are essential to prevent any combinational loop in the modified
netlist. Such a choice of g does not, however, guarantee it to be an OR-function and is
thus sub-optimal.
reset to its initial state, forcing the circuit to be in the obfuscated mode. Depending on the
applied input sequence, the FSM then goes through a state transition sequence and only on
receiving N specific input patterns in sequence, goes to a state which lets the circuit operate
in its normal mode. The initial state and the states reached by the FSM before a successful
initialization constitute the “pre-initialization state space” of the FSM, while those reached
after the circuit has entered its normal mode of operation constitute the “post-initialization
state space”. Fig. 12.2 shows the state diagram of such an FSM, with P0→P1→P2 being
the correct initialization sequence. The input sequence P 0 through P 2 is decided by the IP
designer.
The FSM controls the mode of circuit operation. It also modifies selected nodes in the
design using its outputs and the modification cell (e.g. M1 through M3 ). This scheme is
shown in Fig. 12.2 for a gate–level design that incorporates modifications of three nodes n1
through n3 . The MKF can either be a high fan-in internal node (avoiding combinational
loops) in the unmodified design, or the OR-function of several selected primary inputs. The
other input (corresponding to the en port of the modification cell) is a Boolean function of
the inserted FSM state bits with the constraint that it is at logic-0 in the normal mode. This
modification ensures that when the FSM output is at logic-0, the logic values at the modified
nodes are the same as the original ones. On the other hand, in the obfuscated mode, for any
FSM output that is at logic-1, the logic values at the modified nodes are inverted if g = 1
and logic-0 if g = 0. Provided the modified nodes are selected judiciously, modifications
at even a small number of nodes can greatly affect the behavior of the modified system.
This happens even if the en signal is not always at logic-0. In our implementation, we chose
to have the number of outputs of the inserted FSM as a user-specified parameter. These
outputs are generated as random Boolean functions of the state element bits at design time
with the added constraint that in the normal mode, they are at logic-0. The randomness of
the Boolean functions adds to the security of the scheme. Such a node modification scheme
can provide higher resistance to structural reverse-engineering efforts than the scheme in
[78].
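The behavior of the modification cell described above can be written out as a truth function. This is a Python sketch of the description in this paragraph, not of any particular cell netlist:

```python
def modification_cell(f, g, en):
    """Modified node value, per the description above:
    en = 0 (normal mode)        -> original node value f;
    en = 1, g = 1 (obfuscated)  -> inverted value of f;
    en = 1, g = 0 (obfuscated)  -> forced to logic-0."""
    if en == 0:
        return f
    return 1 - f if g == 1 else 0

# Normal mode is transparent; obfuscated mode corrupts the node value.
assert all(modification_cell(f, g, 0) == f for f in (0, 1) for g in (0, 1))
assert modification_cell(1, 1, 1) == 0 and modification_cell(0, 1, 1) == 1
assert modification_cell(1, 0, 1) == 0
```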
flow with knowledge of initialization sequence. Corresponding to each state in the pre-
initialization state space, we arrange to have a particular pattern to appear at a sub-set
of the primary outputs when a pre-defined input sequence is applied. Even if a hacker
arranges to by-pass the initialization stage by structural modifications, the inserted FSM
can be controlled to have the desired bit-patterns corresponding to the states in the pre-
initialization state space, thus revealing the watermark. For post-silicon authentication, scan
flip-flops can be used to bring the design to the obfuscated mode. Because of the
widespread use of full-scan designs, the inserted FSM flip-flops can always be controlled
to have the desired bit-patterns corresponding to the states in the authentication FSM,
thus revealing the watermark. Fig. 12.3 illustrates the modification of the state transition
function for embedding authentication signature in the obfuscated mode of operation.
To mask or disable the embedded signature, a hacker needs to perform the following
steps, assuming a purely random approach:
1. Choose the correct inserted FSM state elements (np) from all the state elements
(nt). This has (nt choose np) possible choices.
2. Apply the correct input vector at the ni input ports where the vectors are to be
applied to get the signature at the selected no output ports. This is one out of 2^ni
choices.
3. Choose the no primary outputs at which the signature appears from the total set of
primary outputs (npo). This has (npo choose no) possibilities.
4. For each of these recognized no outputs, identify it to be one among the 2^(2^(ni+np))
possible Boolean functions (in the obfuscated mode) of the ni primary inputs and np
state elements, and change it without changing the normal functionality of the IP.
Hence, in order to mask one signature, the attacker has to make exactly one correct
choice from among N = (nt choose np) · 2^ni · (npo choose no) · 2^(2^(ni+np)) possible
choices, resulting in a masking success probability of Pmasking ≈ 1/N. To appreciate
the scale of the challenge, consider a case with nt = 30, np = 3, ni = 4, no = 4 and
npo = 16. Then, Pmasking ≈ 10^−47. In actual IPs, the masking probability would be
substantially lower because of higher values of np and nt.
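The masking-probability estimate can be reproduced numerically. Parameter names mirror the text (with `np_` for np, to avoid a Python naming clash):

```python
import math

def masking_probability(nt, np_, ni, no, npo):
    """P_masking = 1/N, where N counts the choices an attacker must get right:
    the FSM state elements, the input vector, the signature outputs, and the
    Boolean function realized at each chosen output in the obfuscated mode."""
    N = (math.comb(nt, np_)        # choose the np inserted FSM FFs out of nt
         * 2**ni                   # the one correct input vector
         * math.comb(npo, no)      # choose the no signature outputs
         * 2**(2**(ni + np_)))     # Boolean functions of ni + np variables
    return 1 / N

p = masking_probability(nt=30, np_=3, ni=4, no=4, npo=16)
assert 1e-47 < p < 1e-46           # ~10^-47, as estimated in the text
```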
where FI and FO are the number of nodes in the fan-in and the fan-out cone of the node,
respectively. FImax and FOmax are the maximum number of fan-in and fan-out nodes in
the circuit netlist and are used to normalize the metric. w1 and w2 are weights assigned to
the two factors, with 0 ≤ w1, w2 ≤ 1 and w1 + w2 = 1. We chose w1 = w2 = 0.5, which gives
the best results in terms of obfuscation, as shown in the next section. Note that 0 < Mnode ≤ 1.
Because of the widely differing values of FOmax and FImax in some circuits, it is important
to consider both the sum and the product terms involving FO/FOmax and FI/FImax. Considering
only the sum or the product term results in an inferior metric that fails to capture the
actual suitability of a node, as observed in our simulations.
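One plausible form of the metric, consistent with the description (normalized fan-in and fan-out cone sizes, both a sum term and a product term, weights w1 + w2 = 1, and 0 < Mnode ≤ 1), can be sketched as follows. The exact equation below is our reconstruction for illustration, not necessarily the book's formula:

```python
def node_rank(FI, FO, FI_max, FO_max, w1=0.5, w2=0.5):
    """Hypothetical reconstruction of the node-ranking metric M_node:
    a weighted combination of the (normalized) sum and product of the
    fan-in/fan-out cone sizes. Satisfies 0 < M_node <= 1 for FI, FO >= 1."""
    fi, fo = FI / FI_max, FO / FO_max       # normalized cone sizes, each in (0, 1]
    return w1 * (fi + fo) / 2 + w2 * (fi * fo)

# Nodes with large fan-in AND fan-out cones rank highest, as the text requires.
hub = node_rank(FI=90, FO=80, FI_max=100, FO_max=100)
leaf = node_rank(FI=5, FO=2, FI_max=100, FO_max=100)
assert 0 < leaf < hub <= 1
```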
FIGURE 12.4: Hardware obfuscation design flow along with steps of the iterative node
ranking algorithm.
However, additional overhead is incurred for storing the input sequences in an on-chip
ROM. To increase the security of the scheme, the chip designer can arrange for an instance-
specific initialization sequence to be stored in a one-time programmable ROM. In that
case, following the approach in [30], we can have the activating patterns be a simple
logic function (e.g., an XOR) of the patterns read from the ROM and the output of a
Physically Unclonable Function (PUF) block. The patterns are written to the ROM post-
manufacturing after receiving instructions from the chip designer, as suggested in [30].
FIGURE 12.6: Challenges and benefits of the HARPOON design methodology at different
stages of a hardware IP life cycle.
Because the output of a PUF circuit is not predictable before manufacturing, it is not
possible to have the same bits written into the programmable ROMs for each IC instance.
Fig. 12.5 shows this scheme.
The manufacturing house manufactures the SoC from the design provided by the design
house and passes it on to the test facility. If a PUF block has been used in the IC, the test
engineer reports the output on the application of certain vectors back to the chip designer.
The chip designer then calculates the specific bits required to be written in the one-time
programmable ROM. The test engineer does so and blows a one-time programmable
fuse, so that the output of the PUF block is no longer visible at the output. The test
engineer then performs post-manufacturing testing, using the set of test vectors provided by
the design house. Ideally, all communication between parties associated with the design flow
should be carried out in an encrypted form, using symmetric-key or public-key cryptographic
techniques such as Diffie-Hellman key exchange [336]. The tested ICs are passed to the system designer
along with the initialization sequence (again in an encrypted form) from the design house.
The system designer integrates the different ICs in the board-level design and arranges to
apply the initialization patterns during “booting” or a similar initialization phase. Thus,
the initialization patterns for the different SoCs need to be stored in Read Only Memory
(ROM). In most ASICs composed of multiple IPs, several initialization cycles are typically
needed at start-up to get into the “steady-stream” state, which requires accomplishing
certain tasks such as initialization of specific registers [254]. The system designer can easily
utilize this inherent latency to hide the additional cycles due to initialization sequences from
the end user.
Finally, this secure system is used in the product for which it is meant. It provides
the end-user with the assurance that the components have gone through a secure and
piracy-proof design flow. Fig. 12.6 shows the challenges and benefits of the design flow from
the perspectives of different parties associated with the flow. It is worth noting that the
proposed design methodology remains valid for a SoC design house that uses custom logic
blocks instead of reusable IPs. In this case, the designer can synthesize the constituent logic
blocks using the proposed obfuscation methodology for protecting the SoC.
FIGURE 12.7: Example of a Verilog RTL description and its obfuscated version [82]: a)
original RTL; b) technology independent, unoptimized gate-level netlist obtained through
RTL compilation; c) obfuscated gate-level netlist; d) decompiled obfuscated RTL.
FIGURE 12.8: Design transformation steps in course of the proposed RTL obfuscation
process.
The modified, re-synthesized (to allow better intermingling of the inserted circuit modifi-
cations) gate-level design is then decompiled to regenerate the RTL of the code, without
maintaining high-level HDL constructs. Instead, the modified netlist is traversed recursively
to reconstruct the Boolean equations for the primary output nodes and the state element
inputs, expressed in terms of the primary inputs, the state-element outputs and a few se-
lected high fanout internal nodes. The redundant internal nodes are then removed. This
“partial flattening” effect hides all information about the modifications performed in the
netlist. Optionally, the obfuscation tool maintains a list of expected instances of library
datapath elements, and whenever these are encountered in the netlist, their outputs are
related through proper RTL constructs to their inputs. This ensures regeneration of the
same datapath cells on resynthesis of the RTL.
As an example, consider the simple Verilog module “simple” which performs addition or
subtraction of two bits depending on the value of a free running 1-bit counter, as shown in
Fig. 12.7(a). Fig. 12.7(b)-(d) shows the transformation of the design through the proposed
obfuscation process. The decompiled RTL in Fig. 12.7(d) shows that the modification cell
and the extra state transition logic are effectively hidden, and isolation of the correct
initialization sequence can be difficult even for such a small design. The major semantic
effects of obfuscation are the change and replacement of high-level RTL constructs (such as
if...else, for, while, case, assign, etc.) in the original RTL, and the replacement of internal
nodes and registers. Furthermore, internal register, net and instance names are changed to
arbitrary identifiers to make the code less comprehensible.
After the gate-level modification, the modified netlist is decompiled to produce a de-
scription of the circuit which, although technically an RTL functionally equivalent to the
modified gate-level netlist, is extremely difficult for a human reader to comprehend. In
addition, the modifications made to the original circuit remain well-hidden. A forward anno-
tation file indicates relevant high-level HDL constructs and macros to be preserved through
this transformation. These are maintained during the RTL compilation and decompilation
steps. From the unmapped gate-level netlist, we look for specific generic gates that can be
decompiled to an equivalent RTL construct; e.g., a multiplexer can be mapped to an equiva-
lent if...then...else construct or a case construct. The datapath modules or macros are
transformed into appropriate operators. For example, an equation n1 = s1·d1 + s2·d2 + s3·d3
can be mapped to a case construct. Fig. 12.8 shows the design transformation steps during
the obfuscation process. We present an analysis of the security of this scheme in Section
12.6.
FIGURE 12.10: Example of hosting the registers of the mode-control FSM [83].
FIGURE 12.11: Examples of control-flow obfuscation: (a) original RTL, CDFG; (b) ob-
fuscated RTL, CDFG [83].
element. An example is shown in Fig. 12.10, where the 8-bit register reg1, referred to as the
“host register”, has been expanded to 12-bits to host the mode-control FSM in its left 4-bits.
When these 4-bits are set at values 4′ h1 or 4′ h2, the circuit is in its normal mode, while
the circuit is in its obfuscated mode when they are at 4′ ha or 4′ hb. Note that extra RTL
statements have been added to make the circuit functionally equivalent in the normal mode.
The obfuscation level is improved by distributing the mode-control FSM state elements in
a non-contiguous manner inside one or more registers, if possible.
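The register-hosting idea can be sketched in software. Below is a minimal Python model assuming the widths and state encodings of the Fig. 12.10 example; the helper names pack and mode are illustrative, not part of the proposed tool flow:

```python
# Toy model of "register hosting" (Fig. 12.10): an 8-bit data register is
# widened to 12 bits, and the upper 4 bits hold the mode-control FSM state.
# State values and widths follow the example in the text.

NORMAL_STATES = {0x1, 0x2}      # 4'h1, 4'h2 -> normal mode
OBFUSCATED_STATES = {0xA, 0xB}  # 4'ha, 4'hb -> obfuscated mode

def pack(fsm_state, data):
    """Pack a 4-bit FSM state and an 8-bit datum into one 12-bit host register."""
    assert 0 <= fsm_state < 16 and 0 <= data < 256
    return (fsm_state << 8) | data

def mode(host_reg):
    """Decode the operating mode from the upper 4 bits of the host register."""
    state = (host_reg >> 8) & 0xF
    if state in NORMAL_STATES:
        return "normal"
    if state in OBFUSCATED_STATES:
        return "obfuscated"
    return "unreachable"

reg1 = pack(0x1, 0x5A)
print(hex(reg1), mode(reg1))   # upper nibble 4'h1 -> normal mode
print(mode(pack(0xB, 0x5A)))   # upper nibble 4'hb -> obfuscated mode
```

Distributing the FSM bits non-contiguously, as the text suggests, would amount to scattering the masked bit positions instead of using one contiguous nibble.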
Modifying CDFG Branches: After the FSM has been hosted in a set of selected host
registers, several CDFG nodes are modified using the control signals generated from this
FIGURE 12.13: Example of RTL obfuscation by CDFG modification: (a) original RTL;
(b) obfuscated RTL [83].
FSM. The nodes with large fanout cones are preferentially selected for modification, since
this ensures maximum change in functional behavior at minimal design overhead. Three
example modifications of the CDFGs and the corresponding RTL statements are shown in
Fig. 12.11. The registers reg1, reg2 and reg3 are the host registers. Three “case()”, “if()” and
“assign” statements in Fig. 12.11(a) are modified by the mode-control signals cond1, cond2
and cond3, respectively. These signals evaluate to logic-1 only in the obfuscation mode
because the conditions reg1 = 20'habcde, reg2 = 12'haaa and reg3 = 16'hb1ac correspond to
states which are only reachable in the obfuscation mode. Fig. 12.11(b) shows the modified
CDFGs and the corresponding RTL statements.
Besides changing the control-flow of the circuit, functionality is also modified by in-
troducing additional datapath components. However, such changes are done in a manner
that ensures sharing of the additional resources during synthesis. This is important since
datapath components usually incur large hardware overhead. An example is shown in Fig.
12.12, where the signal out originally computes (a + b) × (a − b). However, after modification
of the RTL, it computes (a + b) in the obfuscated mode, allowing the adder to be shared in
the two modes and the outputs of the multiplier and the adder to be multiplexed.
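The steps above can be sketched behaviorally. This is a minimal Python stand-in for the Fig. 12.12 datapath (function and signal names are illustrative), showing how the adder output serves both modes while a final mux selects the result:

```python
# Sketch of the mode-dependent datapath of Fig. 12.12: in the normal mode
# the output is (a + b) * (a - b); in the obfuscated mode it is (a + b).
# The adder is evaluated once and "shared" by both paths; the final mux on
# the mode signal mirrors how synthesis would multiplex the two outputs.

def datapath(a, b, obfuscated_mode):
    s = a + b                            # shared adder
    d = a - b                            # subtractor (normal path only)
    p = s * d                            # multiplier (normal path only)
    return s if obfuscated_mode else p   # output mux on the mode signal

print(datapath(7, 3, False))  # normal mode: (7+3)*(7-3) = 40
print(datapath(7, 3, True))   # obfuscated mode: 7+3 = 10
```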
Generating Obfuscated RTL: After the modifications have been performed on the
CDFG, the obfuscated RTL is generated from the modified CDFGs by traversing each
of them in a depth-first manner. Fig. 12.13(a) shows an example RTL code and Fig.
12.13(b) shows its corresponding obfuscated version. A 4-bit FSM has been hosted in regis-
ters int_reg1 and int_reg2. The conditions int_reg1[13:12] = 2'b00, int_reg1[13:12] = 2'b01,
int_reg2[13:12] = 2'b00 and int_reg1[13:12] = 2'b10 occur only in the obfuscated mode. The
initialization sequence is in1 = 12'h654 → in2 = 12'h222 → in1 = 12'h333 → in2 = 12'hacc →
in1 = 12'h9ab. Note the presence of dummy state transitions and out-of-order state transition
RTL statements. The outputs res1 and res2 have been modified by two different
modification signals. Instead of allowing the inputs to appear directly in the sensitivity list
of the “if()” statements, it is possible to derive internal signals (similar to the ones shown
in Fig. 12.11(b)) with complex Boolean expressions which are used to perform the modifi-
cations. The output res1 has been modified following the datapath modification approach
using resource sharing.
Authentication features might be embedded in the RTL by the same principle as de-
scribed in Section 12.3.5. RTL statements describing the state transitions of the authen-
tication FSM can be integrated with the existing RTL in the same way the statements
corresponding to the obfuscation FSM are hidden. In case the unused states are difficult to
derive from the original RTL, it can be synthesized to a gate-level netlist and the same
technique based on sequential justification as described in Section 12.3.5 might be applied.
Table 12.1 compares the relative advantages and disadvantages of the two proposed
techniques. Although the decompilation based approach potentially can hide the modifica-
tions better than the direct RTL modification based approach (as shown by our theoretical
analysis of their obfuscation levels in Section 12.6 and by our simulation results), it also
loses major RTL constructs and creates a description of the circuit which might result in
an unoptimized implementation on re-synthesis. Hence, we provide the IP designer with a
choice where either of the techniques might be chosen based on the designer’s priority. For
example, if the IP is going to be released to an untrustworthy SoC design house with a prior
record of practising IP piracy, the STG modification scheme might be used. On the other
hand, if the IP is to be released to a comparatively more trustworthy SoC design house where
the design specifications are very aggressive, the CDFG modification based approach might
be used.
where Nc,orig is the total number of high-level RTL constructs in the original RTL; Ne,obfus
is the number of extra state elements included in the obfuscated design; Nw,orig is the total
number of internal wire declarations in the original RTL and Nraw,obfus is the number of
reg, assign and wire declarations in the obfuscated RTL. Note that 0 ≤ Msem ≤ 1, with a
higher value implying better obfuscation. Msem represents a measure of semantic difference
between the obfuscated and the unobfuscated versions of the RTL, by taking into consideration
the constructs introduced in the obfuscated code and the constructs removed from
the original code. This is the weakest attack, with the adversary having very little chance
of figuring out the obfuscation scheme for large RTLs which have undergone a complete
change of the “look-and-feel”.
the modification scheme for a modified node with fanin cone size fi has a computational
complexity O(2^fi).
Graph Isomorphism Comparison: After the ROBDD of the modified node has been
expressed in the form shown in Fig. 12.14, each sub-graph below the node en should be
compared with the ROBDD graph for f for isomorphism. Proving graph isomorphism is a
problem with computational complexity between P and NP, with the best-known heuristic
having complexity 2^O(√(n log n)) for a graph with n vertices [169]. Hence, establishing the
equivalence for f through graph isomorphism has a computational complexity 2^O(√(fi log fi))
for a node with fanin cone size fi. Let fi be the average fanin cone size of the failing
verification nodes. Hence, overall this sub-problem has a computational complexity
O(2^fi · 2^O(√(fi log fi))), which must be solved for each of the F dissimilar nodes.
Compare Point Matching: So far we have assumed that the adversary would be able
to associate the dissimilar nodes in the obfuscated design with the corresponding nodes in
the original design and would then try to decipher the obfuscation scheme. This is expected
to be relatively easy for the primary output ports of the IP because the IP vendor must
maintain a standard interface even for the obfuscated version of the IP, and hence the
adversary can take advantage of this name matching. However, the names of the state
elements can be changed arbitrarily by the IP vendor and hence, finding the corresponding
state elements to compare is a functional compare point matching problem [32], which is
extremely computationally challenging because of the requirement to search through (SN )!
combinations, where SN is the number of dissimilar state elements. Hence, we propose the
following metric to quantify the level of protection of the proposed STG modification based
obfuscation scheme in providing protection against structural analysis:
Mstr = F · 2^fi · 2^√(fi log fi) + (SN)!     (12.9)
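Eqn (12.9) can be evaluated numerically to see how the two terms trade off. The parameter values below are illustrative, and the logarithm base (unspecified in the metric) is assumed here to be 2:

```python
import math

# Numerical evaluation of the structural-obfuscation metric of eqn (12.9):
#   Mstr = F * 2^fi * 2^sqrt(fi * log fi) + (SN)!
# fi: average fanin cone size of the failing verification nodes,
# F:  number of dissimilar nodes, SN: number of dissimilar state elements.
# Parameter values are illustrative; log is assumed base 2.

def m_str(F, fi, SN):
    term_sim = F * 2.0 ** fi                         # simulation-based equivalence
    term_iso = 2.0 ** math.sqrt(fi * math.log2(fi))  # graph-isomorphism heuristic
    return term_sim * term_iso + math.factorial(SN)

F, fi, SN = 50, 20, 16
m = m_str(F, fi, SN)
print(f"Mstr ~ 2^{math.log2(m):.1f}")
```

With these values the factorial term (SN)! dominates, which matches the observation below that adding FSM flip-flops is an effective (if costly) way to raise the obfuscation level.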
Observations from the Metric: From the above metric, the following can be observed
which act as a guide to the IP designer to design a well-obfuscated hardware IP following
the STG modification-based scheme:
1. Those nodes which have larger fanin cones should be preferably modified because this
would increase fi in eqn.(12.9), thus increasing Mstr .
2. An inserted FSM with larger number of flip-flops increases its obfuscation level be-
cause SN increases. Also, as shown previously in this section, there is an exponential
dependence of the probability of breaking the scheme by simulation based reverse-
engineering on the length of the initialization key sequence. Hence, it is evident that
FSM design and insertion to attain high levels of obfuscation incur greater design
overhead. Thus the IP designer must trade-off between design overhead and the level
of security achievable through obfuscation.
3. Modification of a larger number of nodes increases F , which in turn increases the level
of obfuscation.
The resistance of the CDFG modification based obfuscation scheme to structural analysis is estimated
by the degree of difficulty faced by an adversary in discovering the hosted mode-control FSM
and the modification signals. Consider a case where n mode-control FSM state-transition
statements have been hosted in an RTL with N blocking/non-blocking assignment state-
ments. However, the adversary does not know a-priori how many registers host the mode-
control FSM. Then, the adversary must correctly figure out the hosted FSM state transition
statements from one out of Σ_{k=1}^{n} C(N, k) possibilities. Again, each of these choices for a given
value of k has k! associated ways to arrange the state transitions (so that the initialization
key sequence is applied in the correct order). Hence, the adversary must correctly identify
one out of Σ_{k=1}^{n} C(N, k) · k! possibilities. The other feature that needs to be deciphered to
break the scheme is the mode-control signals. Let M be the total number of blocking, non-blocking
and dataflow assignments in the RTL, and let m be the size of the modification
signal pool. Then, the adversary must correctly choose m signals out of M, which is one out
of C(M, m) choices. Combining these two security features, we propose the following metric
to estimate the complexity of the structural analysis problem for the CDFG modification
based design:
Mstr = ( Σ_{k=1}^{n} C(N, k) · k! ) · C(M, m)     (12.10)
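A direct evaluation of eqn (12.10) with illustrative parameter values shows how quickly the adversary's search space grows:

```python
from math import comb, factorial, log2

# Evaluation of eqn (12.10) for the CDFG-modification scheme:
#   Mstr = ( sum_{k=1}^{n} C(N, k) * k! ) * C(M, m)
# n: hosted FSM state-transition statements, N: blocking/non-blocking
# assignment statements, M: total assignments, m: modification-signal
# pool size. The parameter values below are illustrative only.

def m_str_cdfg(n, N, M, m):
    placements = sum(comb(N, k) * factorial(k) for k in range(1, n + 1))
    return placements * comb(M, m)

n, N, M, m = 4, 60, 100, 8
val = m_str_cdfg(n, N, M, m)
print(f"Mstr ~ 2^{log2(val):.1f}")
```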
FIGURE 12.15: Flow diagram for the proposed STG modification-based RTL obfuscation
methodology [82].
12.7 Results
In this section we first describe the automated design flows for the two proposed ob-
fuscation techniques, followed by the simulation results of application of the techniques to
open-source IPs and benchmark circuits.
FIGURE 12.16: Flow diagram for the proposed CDFG modification-based RTL obfusca-
tion methodology [83].
represented by the obfuscation metrics (Msem and Mstr ), and the maximum allowable area
overhead. It starts with the design of the mode-control FSM based on the target Msem and
Mstr . The outputs of this step are the specifications of the FSM which include its state
transition graph, the state encoding, the pool of modification signals, and the initialization
key sequence. Random state encoding and a random initialization key sequence are gener-
ated to increase the security. Note that in the STG modification-based approach, we do
not start with explicit target values of the metrics, because these two parameters cannot
be predicted a-priori in this technique. However, the target area overhead is an indirect
estimate of these parameters, and the decompilation process automatically ensures a high
value of Msem , while an optimal node modification algorithm ensures a high value of Mstr .
TABLE 12.3: Overall Design Overheads for Obfuscated IP Cores (STG Modification based
Results at Iso-delay)
STG Modification based Approach
IP Core Area (%) Delay (%) Power (%)
DES 6.09 1.10 5.05
AES 5.25 2.00 5.45
FDCT 5.22 0.95 5.55
Average 5.52 1.35 5.35
approach, as predicted in Section 12.6. The value of Msem was very close to the ideal value
of 1.0 for the STG modification based approach; however, it is closer to 0.75 on average for
the CDFG modification based approach. Again, this is an expected trend as predicted in
Section 12.6.
For all the individual circuit modules, the observed area overhead was less than 10%, the
power and delay overheads were within acceptable limits (target delay overhead was set at
0% for the CDFG modification based scheme). The maximum run-times of the obfuscation
programs for the individual modules was 29 seconds for the CDFG modification based
approach, and 37 seconds for the STG modification based approach. Table 12.3 shows the
overall design overheads after re-synthesis of the multi-module IP cores from the obfuscated
RTL, which are again all within acceptable limits.
FIGURE 12.17: Scheme with initialization key sequences of varying length (3, 4 or 5).
this flexibility comes at a price – the multi-key IP core versions usually have greater area
than the baseline designs supporting only a single key length [19].
Table 12.4 shows the area overhead effect of supporting multiple keys lengths on the
proposed schemes for the IP modules presented in Table 12.3, compared to two versions of
a commercially available AES core [19]. The key lengths for our proposed schemes were 4
(baseline), 6 (1.5X) and 8 (2X), while those for the AES implementations were 128 (base-
line), 192 (1.5X) and 256 (2X). From this table, it is clearly evident that the proposed
approaches are more scalable than the AES hardware implementations with respect to the
increase in key length.
TABLE 12.6: Area and Power Overhead for ISCAS-89 Benchmarks Utilizing Unused
States (at Iso-delay)
Circuit # of Gates % Ar. Ov.1 % Ar. Ov.2 %Pow. Ov.1 %Pow. Ov.2
with respect to the length of the initialization key sequence. We investigated the scalability
of the two proposed techniques with increasing initialization key sequence length, vis-a-vis
that of commercially available hardware implementations of AES with increasing key length.
Again, for the proposed obfuscation schemes we considered a key length of 4 as the baseline
case, while a 128-bit key for AES was considered baseline. Table 12.5 shows the decrease in
throughput with the increase of key length. Once again, the simulation results showed that
the proposed schemes had superior throughput scalability compared to the hardware
implementations of AES when the key length is increased.
12.8 Discussions
In this section, we describe a technique to decrease the hardware overhead by
utilizing the normally “unused states” (states which never arise during normal operation).
TABLE 12.7: Area and Power Overhead for IP Cores Utilizing Unused States (at Iso-delay)
IP Module % Ar. Ov.1 % Ar. Ov.2 %Pow. Ov.1 %Pow. Ov.2
sbox 5.95 7.46 15.19 17.67
AES inv_sbox 6.99 8.25 8.98 15.27
key_expand 3.93 5.47 13.80 15.93
Overall 4.57 6.07 13.03 16.14
key_sel 4.81 8.91 2.66 5.35
DES crp 4.75 6.77 1.75 4.79
Overall 4.77 7.69 2.03 4.97
used to encode the states in the obfuscated mode. The unused states were found using se-
quential justification by Synopsys Tetramax. The results were taken without integrating any
mode-control FSM, considering 5–6 randomly chosen state-elements in the original circuit,
having an initialization state space with 4 states, and (for result set “1”) an authentication
state space consisting of 4 states. State encoding with 6 state elements was required
in cases where sufficient unused states were not available from 5 state elements. Table 12.7
shows the corresponding figures for two open-source IP cores.
12.9 Conclusions
The design and manufacturing of complex modern SoCs is made feasible by the existence
of pre-verified, high-performance hardware intellectual property cores. However, this makes
the hardware IP cores susceptible to piracy. We have presented a paradigm of hardware IP
protection based on Design Obfuscation, whereby the design is made less understandable
and more difficult to reverse-engineer. The proposed obfuscation approaches provide active
defense against IP infringement at different stages of SoC design and fabrication flow, thus
protecting the interests of multiple associated parties. The obfuscation steps can be easily
automated and integrated in the IP design flow, and they do not affect the test/verification of
a SoC design for legal users. We have shown that they incur low design and computational
overhead and cause minimal impact on end-user experience. The proposed approaches are
easily scalable to large IPs (e.g. processors) and in terms of the level of security. Further
improvement in hardware overhead can be obtained by utilizing the normally unused states of
the circuit.
In the next chapter, we introduce another major threat in the domain of hardware
security, which is drawing a lot of research attention globally: hardware Trojans. The issue
of hardware Trojans also arises from the horizontal model of semiconductor design and
manufacturing. We give examples of hardware Trojans, their modes of operation, the threats
posed by them, and then move to design and testing techniques to mitigate the threats.
Hardware Trojans
• “The cyber threat is serious, with potential consequences similar in some ways to
the nuclear threat of the Cold War. The cyber threat is also insidious, enabling
adversaries to access vast new channels of intelligence about critical U.S. enablers
(operational and technical; military and industrial) that can threaten our national
and economic security.”
• “Current DoD actions, though numerous, are fragmented. Thus, DoD is not pre-
pared to defend against this threat. DoD red teams, using cyber attack tools which
can be downloaded from the Internet, are very successful at defeating our systems.”
13.1 Introduction
The horizontal business model that has become prominent in recent years in semiconductor
design and manufacturing has relinquished the control that IC design houses had over the
design and manufacture of ICs, thus making them vulnerable to different security attacks.
Fig. 13.1 illustrates the level of trust at different steps of a typical IC life-cycle [107]. Each
party associated with the design and manufacture of an IC can be a potential adversary
who inserts malicious modifications, referred to as hardware Trojans [17]. Concern about this
vulnerability of ICs and the resultant compromise of security has been expressed globally
[109, 36], especially since several unexplained military mishaps are attributed to the presence
of malicious hardware Trojans [17, 199].
Ideally, any undesired modification made to an IC should be detectable by pre-silicon
verification/simulation and post-silicon testing. However, pre-silicon verification or simula-
tion requires a golden model of the entire IC. This might not be always available, especially
for IP based designs where IPs can come from third-party vendors. Besides, a large multi-
module design is usually not amenable to exhaustive verification. Post-silicon, the design can
be verified either through destructive de-packaging and reverse-engineering of the IC [107],
or by comparing its functionality or circuit characteristics with a golden version of the IC.
However, existing state-of-the-art approaches do not make destructive verification of ICs
scalable. Moreover, as pointed out in [107], it is possible for the adversary to insert
Trojans in only some ICs on a wafer, not the entire population, which limits the usefulness
of a destructive approach.
Traditional post-manufacturing logic testing is not suitable for detecting hardware Tro-
jans. This is due to the stealthy nature of hardware Trojans and the inordinately vast spectrum
of possible Trojan instances an adversary can employ. Typically, the adversary would design
a Trojan that triggers a malfunction only under rare circuit conditions in order to evade
detection. Due to the finite size of the test set, the rare condition for activation of the Trojan
might not be realized during the testing period, especially if the Trojan acts as a sequential
state machine or “time-bomb” [408]. Special algorithms must be developed to increase the
accuracy of logic testing techniques, and these are described in the next chapter. On the
other hand, the techniques for detecting Trojans by comparison of the “side-channel pa-
rameters” such as power trace [23] or delay [167], are limited by the large process-variation
effect in nanoscale IC technologies, reduced detection sensitivity for ultra-small Trojans and
measurement noise [23].
In the next section, we give classification and examples of different hardware Trojans.
(e) Analog Trojan triggered based on logic value; (f) Analog Trojan triggered based on circuit activity [116]
The trigger mechanisms can be of two types: digital and analog. Digitally triggered Trojans
can again be classified into combinational and sequential types.
Fig. 13.3(a) shows an example of a combinationally triggered Trojan where the occur-
rence of the condition A = 0, B = 0 at the trigger nodes A and B causes a payload node
C to have an incorrect value at Cmodified. Typically, an adversary would choose an ex-
tremely rare activation condition so that it is very unlikely for the Trojan to trigger during
conventional manufacturing test.
Sequentially triggered Trojans (the so-called “time bombs”), on the other hand, are
activated by the occurrence of a sequence, or a period of continuous operation. The simplest
sequential Trojans are synchronous stand-alone counters, which trigger a malfunction on
reaching a particular count. Fig. 13.3(b) shows a synchronous k-bit counter which activates
when the count reaches 2k − 1, by modifying the node ER to an incorrect value at node
ER⋆ . An asynchronous version is shown in Fig. 13.3(c), where the count is increased not
by the clock, but by a rising transition at the output of an AND gate with inputs p and
q. The trigger mechanism can also be hybrid, where the counts of both a synchronous and
an asynchronous counter simultaneously determine the Trojan trigger condition, as shown
in Fig. 13.3(d). Note that more complex state machines of different types and sizes can
be used to generate the trigger condition based on a sequence of rare events. In general,
it is more challenging to detect sequential Trojans using conventional test generation and
application, because it requires satisfying a sequence of rare conditions at internal circuit
nodes to activate them. The number of such sequential trigger conditions for arbitrary
Trojan instances can be unmanageably large for a deterministic logic testing approach.
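A toy behavioral model of the synchronous counter Trojan of Fig. 13.3(b) illustrates why a bounded test set is likely to miss the trigger (the class and parameter names are illustrative):

```python
# Toy simulation of the synchronous "time-bomb" Trojan of Fig. 13.3(b):
# a free-running k-bit counter triggers the payload when the count reaches
# 2^k - 1. A test session shorter than that many clock cycles never
# activates the Trojan, so conventional logic testing misses it.

class CounterTrojan:
    def __init__(self, k):
        self.k = k
        self.count = 0
        self.triggered = False

    def clock(self):
        if self.count == (1 << self.k) - 1:
            self.triggered = True      # payload fires: ER is modified to ER*
        else:
            self.count += 1

trojan = CounterTrojan(k=16)
test_cycles = 10_000                   # shorter than the 2^16 - 1 trigger count
for _ in range(test_cycles):
    trojan.clock()
print("detected during test?", trojan.triggered)   # False: test too short
```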
The trigger-mechanism can also be analog in nature, where on-chip sensors are used
to trigger a malfunction. Fig. 13.3(e) shows an example of an analog trigger mechanism
where the inserted capacitance is charged through the resistor if the condition q1 = 1, q2 = 1
is satisfied, and discharged otherwise, causing the logic threshold to be crossed after a
large number of cycles. A different analog Trojan trigger mechanism (see Fig. 13.3(f)) was
proposed in [116], where higher circuit activity and the resultant rise of temperature was
used to trigger the malfunction, through a pair of ring oscillators and a counter.
Trojans can also be classified based on their payload mechanisms into two main classes
– digital and analog. Digital Trojans can either affect the logic values at chosen internal
payload nodes, or can modify the contents of memory locations. Analog payload Trojans,
on the other hand, affect circuit parameters such as performance, power and noise margin.
Fig. 13.4(a) shows an example where a bridging fault is introduced using an inserted resistor,
while Fig. 13.4(b) shows an example where the delay of the path is affected by increasing
the capacitive load.
Apart from triggering logic errors in the IC, the Trojan can also be designed to assist
in software-based attacks like privilege escalation, login backdoor and password theft [199].
A type of hardware trojan attracting considerable research attention in recent years is the
“information leakage” Trojan, which leaks secret information via an interface “backdoor”.
It could also involve a side-channel attack where the information is leaked through the power
trace [222, 221]. We would investigate a Trojan that helps the leakage of secret information
later in this chapter. Another type of Trojan payload proposed is one implementing a
“Denial of Service” (DoS) attack, which causes a system functionality to be unavailable.
In the rest of the section, we describe two novel Trojan designs, capable of causing potent
“information leakage attacks”.
FIGURE 13.5: Examples of (a) combinational and (b) sequential hardware Trojans that
cause malfunction conditionally; Examples of Trojans leaking information (c) through logic
values and (d) through side-channel parameter [221].
FIGURE 13.6: Example of multi-level attacks for both ASIC and FPGA realizations of
cryptographic hardware.
applying all the three patterns in the correct sequence and triggering the Trojan is given by:

Ptrigger = (1/2^30)^3 = 1/2^90 ≈ 10^-27     (13.1)
which is minuscule. Hence, we can safely assume that only the deployer can trigger the
Trojan by a “chosen plain-text” attack, by controlling the plain-text input to the encrypter.
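The arithmetic behind eqn (13.1) is easy to check directly, assuming three independent 30-bit trigger patterns:

```python
# Arithmetic behind eqn (13.1): each 30-bit trigger pattern matches a random
# input with probability 2^-30, and all three must match in sequence, so a
# random test triggers the Trojan with probability (2^-30)^3 = 2^-90.

p_single = 2.0 ** -30          # one 30-bit pattern matching by chance
p_trigger = p_single ** 3      # all three patterns, in the right order
print(p_trigger)               # ~8.1e-28, on the order of 10^-27
```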
The Trojan activates after detecting this sequence of patterns through a sequence detector.
The Trojan also includes delay-based glitch generator circuitry which generates
glitches (FAULT_GLITCH) by XOR-ing the system clock with a delayed version of
itself, as shown in Fig. 13.7(a). On activation, the Trojan waits for seven clock cycles
before enabling a multiplexer through the CLK_SEL signal that lets a narrow glitch
(FAULT_GLITCH) be applied at the clock input instead of the system clock. This
causes a setup time violation in the input flip-flops when the seventh round cipher-text is
fed-back to the input of the encryption hardware to start the eighth round encryption [355].
Thus, the inserted Trojan hardware injects a fault in the AES circuit at the beginning of
the eighth round during encryption. It has been shown [262] that by analyzing two faulty
cipher-texts corresponding to two known plain-texts, the 128-bit AES key can be deduced
exactly without any brute-force search. If only one plain-text cipher-text pair is known, the
key can be deduced exactly by a brute-force search of the order of 232 .
The adversaries want that only they, and no other party, would be able to deduce the
secret key from the faulty cipher-text. To achieve this, they can mask the cipher-text by
adding a Linear Feedback Shift Register (LFSR) to the design. The LFSR remains active
for the time taken to encrypt a single plain-text by the AES hardware. The state transitions
of the LFSR are controlled by the EN_LFSR signal which is synchronized with the event
of multiple successful pattern matches. This concept takes its motivation from the type of
information leakage Trojan described in [221]. The infeasibility of recovering the key with
a modified faulty cipher-text is shown next in Section 13.3.1.
2δ = S^-1(x1 ⊕ kp) ⊕ S^-1(x′1 ⊕ kp)
δ = S^-1(x2 ⊕ kq) ⊕ S^-1(x′2 ⊕ kq)
δ = S^-1(x3 ⊕ kr) ⊕ S^-1(x′3 ⊕ kr)
3δ = S^-1(x4 ⊕ ks) ⊕ S^-1(x′4 ⊕ ks)
where x1, x2, x3, x4 are the actual cipher-text bytes, and x′1, x′2, x′3, x′4 are the corresponding faulty
cipher-text values. Here δ ∈ {0, . . . , 255} and S^-1 represents the InverseSubByte operation
of AES. We now prove by contradiction that the masked output cipher-text will not reveal
the secret key.
Let us assume that the masked cipher reveals the actual key, that x1 in the above equations
represents ci, and that the corresponding faulty cipher byte and the masked faulty cipher byte
are x′1 and x′1 ⊕ α, where α is a non-zero mask value generated by the LFSR. Then the
above equations should give the same quartet of key bytes {kp, kq, kr, ks} with the masked
value. In that case only the first equation changes and the rest of the equations remain
unchanged. Hence, from the first equation we can write,
S^-1(x1 ⊕ kp) ⊕ S^-1(x′1 ⊕ α ⊕ kp) = S^-1(x1 ⊕ kp) ⊕ S^-1(x′1 ⊕ kp)
which implies
S^-1(x′1 ⊕ α ⊕ kp) = S^-1(x′1 ⊕ kp)
which implies x′1 = x′1 ⊕ α and α = 0, since the S^-1 mapping is bijective. This conclusion
contradicts our assumption.
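The bijectivity argument can be checked empirically with a random byte permutation standing in for the AES S-box (purely illustrative; the real AES S-box is also a bijection, so the conclusion is identical):

```python
import random

# Empirical check of the proof above: because S (and hence S^-1) is a
# bijection on bytes, S^-1(x' ^ alpha ^ k) == S^-1(x' ^ k) forces alpha == 0.
# A random 8-bit permutation stands in for the AES S-box here.

random.seed(1)
sbox = list(range(256))
random.shuffle(sbox)                      # a random bijection on bytes
inv_sbox = [0] * 256
for x, y in enumerate(sbox):
    inv_sbox[y] = x                       # build the inverse mapping

x1, k = 0x3A, 0x7C                        # arbitrary faulty byte and key byte
solutions = [a for a in range(256)
             if inv_sbox[x1 ^ a ^ k] == inv_sbox[x1 ^ k]]
print(solutions)                          # [0]: only alpha = 0 satisfies it
```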
FIGURE 13.8: Power trace of circuit (a) without Trojan and (b) with Trojan.
then using it to retrieve the encryption key, for somebody who is not part of the nexus. To
discover the scheme, a third-party (who is not part of the nexus) must perform two tasks
successfully:
• Activate the inserted Trojan by applying the three correct patterns (P 1, P 2 and P 3),
and,
• Identify the bit positions of the output cipher-text whose values have been inverted
by the Trojan LFSR.
In general, if each of the Trojan activation sequence vectors is M-bit long and the
length of the initialization sequence is N, the complexity of activating the Trojan by a
brute-force method is O(2^(M·N)). To perform the second task successfully by brute force
(since a third-party has no way of knowing this information), P bit positions out of the 128
bits of the AES cipher-text must be chosen, and corresponding to each of the assumed choices,
an average of 2^32 operations must be performed to calculate the key. Only one of these
operations of overall complexity O(C(128, P) · 2^32) will yield the correct key. Hence, for a
third-party to actually launch a successful fault-attack on the above hardware will require
brute-force operations of complexity O(2^(M·N) · C(128, P) · 2^32). For example, with M = 10,
N = 3, P = 15 (i.e. 15 bits of the output cipher-text were flipped), the above complexity
is ≈ 2^126, which is comparable to the complexity of finding a 128-bit AES encryption key
by a brute-force search.
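The ≈ 2^126 figure quoted above can be reproduced directly from the stated complexity expression:

```python
from math import comb, log2

# Reproducing the complexity estimate above: activating the Trojan costs
# O(2^(M*N)) and unmasking costs O(C(128, P) * 2^32), giving an overall
# brute-force complexity of 2^(M*N) * C(128, P) * 2^32. We work in log2
# to keep the numbers readable.

M, N, P = 10, 3, 15
bits = M * N + log2(comb(128, P)) + 32
print(f"overall complexity ~ 2^{bits:.0f}")   # ~2^126, as stated in the text
```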
transient power trace of the circuit. Fig. 13.8 shows the simulated power traces of the circuit
with and without Trojan. Table 13.1 shows the percentage increase in the average power
consumption of the infected design as compared to the golden design, and Table 13.2 shows
the hardware overhead. As is evident from these two tables, the Trojan is small relative to
the original circuit and has negligible effect on the average power consumption, and is thus
extremely difficult to detect using side-channel techniques which are commonly affected by
experimental noise and process variation effects.
To show the effectiveness of the above multi-level attack scenario, a fault attack was
launched using a glitch as shown in Fig. 13.9. When the effect of the masking by the output
of the LFSR was not considered, the fault analysis technique described in [262] yielded an
incorrect key, which was different in all the sixteen bytes compared to the original key.
When the effect of the masking was taken into consideration, the correct key was recovered
through the fault analysis attack, as expected.
FIGURE 13.10: The c17 ISCAS–85 circuit and three different inserted combinational
Trojans. Trojan–1 and Trojan–3 are triggered and cause malfunction for 2 out of 32 possible
input vectors. Trojan–2 is triggered relatively more often and causes malfunction for 4 out
of the 32 possible input vectors.
can evade the detection techniques using simulation, verification or post-silicon testing,
or (b) the malicious parties would neutralize such steps. Consider the scenario where every
person associated with the design and manufacturing flow is potentially untrusted. Then,
one of the possible prevention techniques is to verify the design against every preceding
level of design abstraction, by first reverse-engineering the design back to higher levels of
design abstraction. For example, the GDS-II of the design should be independently reverse-
engineered and compared with the design at gate-level, RTL and behavioral level. Such a
technique can however, be extremely expensive, and may not guarantee security against all
forms and complexities of multi-level attacks.
Our definitions and notation would follow those given in [207], and would use relatively
simple analytical models of circuit failure.
Suppose a circuit starts operating at time t = 0 and remains operational until it is hit
by a failure for the first time at time t = T . Then T is the lifetime of the component, and let
T be a random variable distributed according to the probability density function (pdf) f (t).
If F (t) is the cumulative distribution function (cdf) for T , then f (t) and F (t) are related
by:

f(t) = dF(t)/dt,    F(t) = Prob.{T ≤ t} = ∫_0^t f(τ) dτ     (13.2)
Being a density function, f(t) satisfies:

f(t) ≥ 0 for t ≥ 0   and   ∫_0^∞ f(t) dt = 1     (13.3)
The cdf, F(t), of a given circuit is the probability that the circuit will fail at or before time t. Its counterpart, the reliability R(t), is the probability that the circuit will survive at least until time t, and is given by:

R(t) = Prob{T > t} = 1 − F(t)    (13.4)
Perhaps the most important quantitative metric used to estimate reliability is the failure
rate or hazard rate, denoted by λ(t), and defined as:
λ(t) = f(t)/(1 − F(t)) = f(t)/R(t)    (13.5)
λ(t) denotes the conditional instantaneous probability of a circuit failing, given that it is yet to fail at time t. Since dR(t)/dt = −f(t), we have λ(t) = −(1/R(t))·dR(t)/dt. For circuits which have a failure rate that is constant over time (which is the most commonly considered model for ICs), i.e. λ(t) = λ, we have:

dR(t)/dt = −λR(t)    (13.6)
Solving eqn. (13.6) with the initial condition R(0) = 1 leads to R(t) = e^{−λt}, and correspondingly f(t) = λe^{−λt}.
Besides λ, another important parameter that characterizes circuit reliability is the mean
time to failure (MTTF), which is the expected lifetime of the circuit, and is given (assuming
a constant failure rate) by:
MTTF = E[T] = ∫_0^∞ t·f(t) dt = ∫_0^∞ λt·e^{−λt} dt = 1/λ    (13.9)
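As a quick numerical sanity check of eqn. (13.9), the sample mean of exponentially distributed lifetimes converges to 1/λ. The Monte-Carlo sketch below (the function name and parameter values are ours, for illustration only) estimates the MTTF by simulation:

```python
import random

def empirical_mttf(lam, n_samples=200_000, seed=1):
    """Average first-failure time over n_samples circuits whose
    lifetimes are exponentially distributed with failure rate lam."""
    rng = random.Random(seed)
    return sum(rng.expovariate(lam) for _ in range(n_samples)) / n_samples

lam = 0.5                        # constant failure rate
print(empirical_mttf(lam))       # converges to 1/lam = 2.0
```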
A major part of our analysis for different types of hardware Trojans would be devoted
to modelling how the presence of an undetected hardware Trojan modifies the values of
the failure rate and MTTF of a circuit. Depending on the nature of the inserted hardware
Trojan, the circuit may or may not recover its original functionality. However, in the analyses
that follow, we would assume the interval between the start of functioning of a circuit and
its first functional failure due to the activation of a hardware Trojan as defining the lifetime
of the circuit.
TABLE 13.3: Combinational Trojans: Trigger Condition and Trojan Activation Probability

Trojan     Trigger Condition      Trojan Activation Probability
Trojan–1   N10 = 0, N19 = 0       2/32
Trojan–2   N10 = 0, N11 = 0       4/32
Trojan–3   N10 = 0, N16 = 0       2/32
FIGURE 13.11: Failure trends of the c17 circuit with three types of inserted combinational
Trojans (shown in Figs. 13.10(b)–(d)).
where the effective failure rate λ_eff is the sum of the failure rate of the original (Trojan-free) circuit (λ_org) and λ_tro, the contribution of the included Trojan in increasing the failure rate. Thus,

λ_eff = λ_org + λ_tro    (13.11)

∆MTTF = MTTF_eff − MTTF_org = −MTTF_org/(1 + λ_org/λ_tro)    (13.12)
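For concreteness, eqn. (13.12) agrees with the direct difference of the two MTTFs, taking MTTF = 1/λ under the constant-failure-rate model. The rate values below are illustrative, not from the chapter:

```python
lam_org, lam_tro = 1e-4, 5e-5      # illustrative failure rates

lam_eff = lam_org + lam_tro        # eqn. (13.11)
mttf_org, mttf_eff = 1 / lam_org, 1 / lam_eff

delta_direct = mttf_eff - mttf_org
delta_formula = -mttf_org / (1 + lam_org / lam_tro)   # eqn. (13.12)

print(delta_direct, delta_formula)   # both about -3333.3
```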
To observe the actual failure trend of the c17 circuit, we simulated the three circuits shown in Figs. 13.10(b)-(d) with 100 sets of random vectors, each set consisting of 32 vectors. However, each set of 32 vectors did not consist of all possible vectors 00000 to 11111, and hence, the set of test vectors was truly random. Figs. 13.11(a)-(c) show frequency histograms of the number of failures (n) that occur in the intervals 0-8, 9-16, etc., in a set of 32 vectors, considering the average over 100 vector sets. The upper ranges of the intervals are denoted by r. Thus, time is discretized in terms of the number of vectors applied. This curve is representative of the failure probability density function of the Trojan-infected circuit. The best-fitted curves of the form n = a·e^{−br} were obtained using MATLAB.
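The same exponential fit can be reproduced without MATLAB. A minimal sketch using a log-linear least-squares fit follows; the histogram numbers are synthetic, not the measured c17 data:

```python
import math

def fit_exponential(r, n):
    """Least-squares fit of n = a * exp(-b * r) via a log transform,
    i.e. an ordinary linear fit of ln(n) against r."""
    x, y = r, [math.log(v) for v in n]
    m = len(x)
    xbar, ybar = sum(x) / m, sum(y) / m
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
            sum((xi - xbar) ** 2 for xi in x)
    b = -slope
    a = math.exp(ybar + b * xbar)
    return a, b

# Synthetic histogram data shaped like the failure counts in Fig. 13.11
r = [8, 16, 24, 32]
n = [20.0, 9.0, 4.1, 1.8]
a, b = fit_exponential(r, n)
print(a, b)   # b plays the role of a failure-rate-like decay constant
```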
The parameter b, which is interpreted as representative of the M T T F of the circuit, can
also be estimated by averaging the time to failure for the different vector sets, following the
Method of Moments [207]. Table 13.4 shows the estimated and average value (as obtained
by averaging the first instance of failure of the vector sets) of the M T T F (considering only
the presence of Trojan), along with the 95% confidence interval for the estimated value.
Note that the MTTF is in terms of the number of input vectors applied before the first circuit failure. From the plots and the table, it is evident that the data and the data trends fit the theoretically predicted trend very well.
FIGURE 13.12: c17 with an asynchronous binary counter-type Trojan. The reset signal
for the Trojan counter has not been shown for simplicity.
Clearly, f(t) ≥ 0 for non-negative values of t as defined above, and it can be easily verified that the above definition of f(t) satisfies ∫_0^∞ f(t) dt = 1 as follows:

∫_0^∞ f(t) dt = ∫_0^{T_tro} λe^{−λt} dt + ∫_0^∞ e^{−λT_tro} δ(t − T_tro) dt = 1    (13.16)
using the sampling property of δ(t), i.e. ∫_{−∞}^{∞} f(t)δ(t − t_0) dt = f(t_0). The failure probability F(t) is then given by:

F(t) = ∫_0^t f(τ) dτ = 1 − e^{−λt} for t < T_tro, and 1 for t ≥ T_tro    (13.17)
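Eqn. (13.17) can be checked by simulation: a lifetime distributed as min(Exp(λ), T_tro) has exactly this cdf. A small Monte-Carlo sketch, with illustrative parameter values:

```python
import math
import random

lam, T_tro = 0.2, 10.0           # illustrative failure rate and Trojan time
rng = random.Random(42)

# Lifetime = intrinsic exponential failure, cut off by the guaranteed
# Trojan activation at t = T_tro (the delta-function term in f(t)).
samples = [min(rng.expovariate(lam), T_tro) for _ in range(100_000)]

for t in (2.0, 5.0, 12.0):
    empirical = sum(s <= t for s in samples) / len(samples)
    predicted = 1 - math.exp(-lam * t) if t < T_tro else 1.0   # eqn. (13.17)
    print(t, round(empirical, 3), round(predicted, 3))
```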
To determine the MTTF, we consider two different cases: (a) 1/λ ≥ T_tro and (b) 1/λ < T_tro. When 1/λ ≥ T_tro, the circuit is expected to fail due to the activation of the Trojan before it fails due to other reasons. However, there is still a finite probability of the circuit failing in the time interval 0 ≤ t < T_tro. Hence, if 1/λ ≥ T_tro,

MTTF_eff = T_tro − T_tro·∫_0^{T_tro} f(t) dt = T_tro·[1 − (1 − e^{−λT_tro})] = e^{−λT_tro}·T_tro    (13.18)
MTTF_eff = MTTF_org = 1/λ    (13.20)
and the corresponding change in M T T F is:
∆M T T F = 0 (13.21)
Fig. 13.12 shows the c17 circuit with an inserted sequential Trojan. The Trojan is an
asynchronous binary counter which increases its count whenever a positive edge occurs on
the net Ttrojan , which in turn is derived by AND-ing nodes N11, N16 and N19. The Trojan
flips the logic–value at the primary output node N 23 whenever the count reaches (2n − 1).
Two different such Trojans were considered: Table 13.5 shows the Trojan clocking and acti-
vation conditions, along with the activation probability estimated through simulations using
3200 test vectors. Fig. 13.13 shows the simulated failure trends of the c17 circuit infected
with two types of Trojans: 3–bit asynchronous counter (Trojan-4) and 4-bit asynchronous
counter (Trojan-5). Table 13.6 shows the M T T F extracted from the best–fitted curve and
by averaging the simulation data. Again, the M T T F derived from the best–fit curve and
from the simulation data agree very well, and as expected, the circuit with the more difficult
to activate Trojan (Trojan-5) has a smaller M T T F .
Thus, for all the types of Trojans considered by us, we found a close match with the theoretically predicted results. Next, we consider an interesting example of a hardware Trojan, whereby ring oscillators are instantiated in an FPGA to cause a rise in operating temperature. Such a rise in operating temperature causes a decrease in IC reliability, which is estimated quantitatively in Section 13.5.8.
FIGURE 13.13: Failure trends of the c17 circuit with two types of inserted sequential
Trojans (described in Sec. 13.4.3).
FIGURE 13.14: Xilinx Virtex-II configuration bitstream file organization [159, 257].
TABLE 13.7: Virtex-II (XC2V40) Column Types and Frame Count [159]
Column Type Columns Frames per Column
Input/Output Block (IOB) 2 4
Input/Output Interface (IOI) 2 22
Configurable Logic Block (CLB) 8 22
Block RAM (BRAM) 2 62
BRAM Interconnect (BRAM_INT) 2 22
Global Clock (GCLK) 1 4
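From the column counts and frames-per-column figures in Table 13.7, the total number of configuration frames for the device follows directly; a small sketch of the bookkeeping:

```python
# (columns, frames per column) for each column type, from Table 13.7
columns = {
    "IOB":      (2,  4),
    "IOI":      (2, 22),
    "CLB":      (8, 22),
    "BRAM":     (2, 62),
    "BRAM_INT": (2, 22),
    "GCLK":     (1,  4),
}
total_frames = sum(n * f for n, f in columns.values())
print(total_frames)   # 400 configuration frames in total
```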
Other than the configuration bits in the frames, the .bit file also contains other auxiliary parts, for example, a “header” with the ASCII-encoded name of the top-level module of the design, the device family and a timestamp; cyclic redundancy check (CRC) words; and a long trailing series of zeros at the end of the file to signify “no operation” (NOOP). The default size of the configuration bitstream can be deduced as:
The order in which the different sections are present in the bitstream configuration file is:
header words, frame words, CRC words, and trailing (NOOP) words.
of the original circuit and the Trojan are independent of each other as there is no
connection between the ports of the original design and the hardware Trojan. We call
these Type-I Trojans in this chapter.
In the first case, the Trojan insertion operation by bitstream modification is relatively easy to perform. In contrast, Trojan insertion in the second case requires detailed knowledge of the correlation between the routing of the interconnects in the FPGA fabric and the bitstream. Even in a relatively low-end (by modern standards) FPGA model such as the Virtex-II (device XC2V40), this is an extremely complicated exercise in the absence of supporting documentation providing at least some basic information. Nevertheless, this has been attempted previously [274]. However, the approach taken by the authors in [274] does not work directly on the .bit file; it first derives the design description in another proprietary plaintext format called the Xilinx Design Language (“.xdl” being the file extension), modifies the .xdl file, and then transforms it back to the .bit format. The .xdl format again has no official public documentation, and future support for this format by the FPGA CAD tools is not guaranteed by Xilinx. Hence, in this work we adopt the more generic approach of directly modifying the .bit file, and we consider only Type-I Trojans.
Fig. 13.15 shows the steps of the proposed attack. “Golden.bit” represents the .bit file
corresponding to the original design. The modification program takes this as one of the
inputs, and searches a database of .bit files corresponding to different Type-I Trojans se-
quentially. The Type-I Trojans are designed such that they are physically restricted within
only one of the four halves (upper, lower, right and left) of the FPGA, both in terms of
on-chip resource usage as well as the I/O pin usage. The modification program first checks the bit at index 1 of the “Status Register” (STAT) at address 0x00111. If this bit is 1, the following configuration bitstream is triple-DES encrypted; otherwise, if it is 0, the configuration bitstream is unencrypted [159]. If the program finds that the “Golden.bit” file
is encrypted, it exits without performing any modification. Otherwise, the Trojan insertion
algorithm identifies the blocks in the bitstream based on which “column” they represent;
finds the resources on the FPGA which are unutilized by the original design; finds whether
it is possible to integrate the selected Trojan bitstream with the original bitstream; finally,
if integration is possible, integrates the two bitstreams and disables the CRC. If it fails in
finding a proper Trojan .bit file, it exits without performing the modification. Note that the
proposed attack depends on the availability of resources in proper locations of the FPGA
to be successful. Fig. 13.16 shows two possible cases of successful Trojan insertion where
there is no connection or overlap between the original circuit and the Trojan circuit.
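The decision flow described above can be sketched in a few lines. All helpers, field names and the status-word layout below are illustrative stand-ins, not the actual attack code or real Virtex-II register contents:

```python
ENCRYPTION_BIT = 1  # bit index 1 of the STAT register (per the text)

def is_encrypted(stat_register_word: int) -> bool:
    # If this bit is set, the configuration bitstream is triple-DES encrypted.
    return bool((stat_register_word >> ENCRYPTION_BIT) & 1)

def try_insert_trojan(golden, trojan_db, stat_word):
    if is_encrypted(stat_word):
        return None                      # exit without modification
    for trojan in trojan_db:             # search the Trojan .bit database
        if resources_available(golden, trojan):
            merged = merge_bitstreams(golden, trojan)
            return disable_crc(merged)
    return None                          # no compatible Trojan found

# Stub implementations so the sketch runs end-to-end (all hypothetical):
def resources_available(golden, trojan):
    return trojan["half"] not in golden["used_halves"]

def merge_bitstreams(golden, trojan):
    return {**golden, "trojan": trojan["name"]}

def disable_crc(bits):
    return {**bits, "crc": "disabled"}

golden = {"used_halves": {"left"}, "crc": "enabled"}
db = [{"name": "T1", "half": "left"}, {"name": "T2", "half": "right"}]
print(try_insert_trojan(golden, db, stat_word=0b00))   # inserts T2
print(try_insert_trojan(golden, db, stat_word=0b10))   # None: encrypted
```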
To correctly parse the bitstream, it is essential to know the different types of columns
in the bitstream (IOB, BRAM, CLB, etc.); the number of each type of column for a given
device; the number of frames per column; and the number of 4-byte words in a frame. After this information is known [159], the sizes of the blocks are calculated as follows (assuming we are considering the “CLB” type columns, and using Table 13.7):
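The counting rule can be sketched as below. The words-per-frame value is device-specific and must come from the device documentation [159]; the number used here is only a placeholder assumption:

```python
def block_size_bytes(num_columns, frames_per_column, words_per_frame):
    """Size of one column-type block: every frame holds
    words_per_frame 4-byte words."""
    return num_columns * frames_per_column * words_per_frame * 4

WORDS_PER_FRAME = 26   # placeholder; the actual value depends on the device
# Table 13.7: 8 CLB columns, 22 frames per column
print(block_size_bytes(8, 22, WORDS_PER_FRAME))
```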
FIGURE 13.18: Effects of inserting 3–stage ring–oscillator based MHL in Virtex-II FPGA:
(a) rise of temperature as a function of the number of inserted 3-stage ring oscillators; (b)
theoretical aging acceleration factor as a function of percentage FPGA resource utilization
by ring-oscillator Trojans.
bitstream, the elevated operating temperature, and the two LEDs lighting up, as shown in
Fig. 13.17.
Next we discuss some possible hardware techniques to prevent or detect the proposed
attack.
difficult for an adversary to insert hardware Trojans by direct modification of the bitstream.
Even if we assume that in-field read-back from the configuration memory is possible, the adversary would not be able to calculate the exact bitstream to be put into the configuration memory so that the hardware de-scrambler de-scrambles it as intended. Here, the
challenge is to keep the scrambling algorithm secret. This is a low-cost alternative to FPGAs
supporting encrypted bitstreams.
13.6 Conclusion
The issue of hardware Trojans and effective countermeasures against them have drawn
considerable interest in recent times. In this chapter, we have presented a comprehensive
study of different Trojan types, and demonstrated a few potent attacks. Considering the
varied nature and size of hardware Trojans, it is likely that a combination of techniques,
both during design and testing, would be required to provide an acceptable level of secu-
rity. Design-time approaches would span various levels of design descriptions. On the other
hand, post-silicon validation would require a combination of logic and side-channel test
approaches to cover Trojans of different types and sizes under large parameter variations.
In the next chapters, we explore different approaches to detect and/or provide protection
against hardware Trojans.
“The addition of any function not visualized in the original design will
inevitably degenerate structure. Repairs also, will tend to cause deviation
from structural regularity since, except under conditions of the strictest
control, any repair or patch will be made in the simplest and quickest way.
No search will be made for a fix that maintains structural integrity.”
14.1 Introduction
In this chapter, we propose a novel testing methodology for protection against hardware Trojans, referred to as MERO (Multiple Excitation of Rare Occurrence). MERO comprises a statistical test generation approach for Trojan detection and a coverage determination approach for quantifying the level of trust. The main objective of the proposed methodology is
to derive a set of test patterns that is compact (minimizing test time and cost), while maxi-
mizing the Trojan detection coverage. The basic concept is to detect rare or low probability
conditions at the internal nodes, and then derive an optimal set of vectors that can trigger
each of the selected low probability nodes individually to their rare logic values multiple
times (e.g. at least N times, where N is a given parameter). As analyzed in Section 14.2.1,
this increases the probability of detection of an arbitrary Trojan instance. By increasing the
toggling of nodes that are random-pattern resistant, it improves the probability of activat-
ing an unknown Trojan compared to purely random patterns. The proposed methodology
is conceptually similar to the N-detect test [31, 302] used in stuck-at ATPG (automatic test
pattern generation), where a test set is generated to detect each single stuck-at fault in a
circuit by at least N different patterns to improve test quality and defect coverage [302]. In
this chapter, we focus on digital Trojans [408], which can be inserted into a design either in
a design house (e.g. by untrusted CAD tool or IP) or in a foundry. We do not consider the
Trojans where the triggering mechanism and/or effect are analog in nature (e.g. thermal
fluctuations).
Since the proposed detection is based on functional validation using logic values, it is
robust with respect to parameter variations and can reliably detect very small Trojans,
e.g. the ones with few logic gates. Thus, the technique can be used as complementary to
the side-channel Trojan detection approaches [23, 309, 38, 167], which are more effective in
detecting large Trojans (e.g. ones with area > 0.1% of the total circuit area). In side-channel approaches, the existence of a Trojan is determined by noting its effect on one or more physical
side-channel parameters, such as current or delay. Besides, the MERO approach can be used
to increase the detection sensitivity of many side-channel techniques such as the ones that
monitor the power/current signature, by increasing the activity in a Trojan circuit [38].
Using an integrated Trojan coverage simulation and test generation flow, we validate the
approach for a set of ISCAS combinational and sequential benchmark circuits. Simulation
results show that the proposed test generation approach can be extremely effective for
detecting arbitrary Trojan instances of small size, both combinational and sequential.
The rest of the chapter is organized as follows. Section 14.2 describes the mathematical
justification of the MERO methodology, the steps of the MERO test generation algorithm
and the Trojan detection coverage estimation. Section 14.3 describes the simulation setup
and presents results for a set of ISCAS benchmark circuits with detailed analysis. Section
14.4 concludes the chapter.
ab′ = 1 is satisfied at least 8 times (the maximum number of states of a 3-bit counter),
then the Trojan would be activated. Next, we present a mathematical analysis to justify
the concept.
E_AB = p_j·T    (14.2)
In the context of this problem, we can assume pj > 0, because an adversary is unlikely to
insert a Trojan which would never be triggered. Then, to ensure that the Trojan is triggered
at least once when T test vectors are applied, the following condition must be satisfied:
p_j·T ≥ 1    (14.3)
From inequality (14.1), let us assume T = c·N/p_1, where c ≥ 1 is a constant depending on the actual test set applied. Inequality (14.3) can then be generalized as:

S = c·(p_j/p_1)·N    (14.4)
where S denotes the number of times the trigger condition is satisfied during the test proce-
dure. From this equation, the following observations can be made about the interdependence
of S and N :
As p_i < 1 ∀i = 1, 2, · · · , q, with the increase in q, S decreases for a given c and
N. In other words, with the increase in the number of trigger nodes, it becomes more
difficult to satisfy the trigger condition of the inserted Trojan for a given N. Even if
the nodes are not mutually independent, a similar dependence of S on q is expected.
3. The trigger nodes can be chosen such that p_i ≤ θ ∀i = 1, 2, · · · , q, where θ is defined
as a trigger threshold probability. Then as θ increases, the corresponding selected rare
node probabilities are also likely to increase. This will result in an increase in S for a
given T and N , i.e. the probability of Trojan activation would increase if the individual
nodes are more likely to get triggered to their rare values.
All of the above predicted trends were observed in our simulations, as shown in Section
14.3.
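The trends in the observations above can be illustrated numerically, assuming mutually independent trigger nodes so that p_j = θ^q; all constants below are illustrative:

```python
theta, c, N, p1 = 0.1, 1.0, 1000, 0.1   # illustrative values

for q in (2, 4, 8):
    p_j = theta ** q                    # joint trigger probability
    S = c * (p_j / p1) * N              # eqn. (14.4)
    print(q, S)
# S falls from 100 (q=2) to 1 (q=4) to 1e-4 (q=8): each extra trigger
# node makes the Trojan much harder to activate for a fixed N.
```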
14.2.5 Choice of N
Fig. 14.2 shows the trigger and Trojan coverage for two ISCAS-85 benchmark circuits
with increasing values of N , along with the lengths of the corresponding test-set. From
these plots it is clear that, similar to N-detect tests for stuck-at faults where defect coverage
typically improves with increasing N, the trigger and Trojan coverage obtained with the
MERO approach also improves steadily with N, but then both saturate around N = 200
and remain nearly constant for larger values of N . As expected, the test size also increases
with increasing N . We chose a value of N = 1000 for most of our experiments to reach a
balance between coverage and test vector set size.
1. Improvement of test quality: We can consider the number of nodes observed along with
the number of nodes triggered for each vector during test generation. This means that,
at steps 13-14 of Algorithm 14.1, a perturbation is accepted if the sum of triggered and
(a)
(b)
FIGURE 14.1: Impact of sample size on trigger and Trojan coverage for benchmarks
c2670 and c3540, N = 1000 and q = 4: (a) deviation of trigger coverage, and (b) deviation
of Trojan coverage.
observed nodes improves over the previous value. This comes at extra computational cost
to determine the number of observable nodes for each vector. We note that for a small
ISCAS benchmark, c432 (an interrupt controller), we can improve the Trojan coverage
by 6.5% with negligible reduction in trigger coverage using this approach.
2. Observable test point insertion: We note that insertion of very few observable test
points can achieve significant improvement in Trojan coverage at the cost of small
design overhead. Existing algorithms for selecting observable test points for stuck-at
fault tests [130] can be used here. Our simulation with c432 resulted in about 4%
improvement in Trojan coverage with 5 judiciously inserted observable points.
3. Increasing N and/or increasing the controllability of the internal nodes: Internal node
controllability can be increased by judiciously inserting a few controllable test points or
by increasing N. It is well-known in the context of stuck-at ATPG that scan insertion
improves both controllability and observability of internal nodes. Hence, the proposed
(a)
(b)
FIGURE 14.2: Impact of N (number of times a rare point satisfies its rare value) on the
trigger/Trojan coverage and test length for benchmarks (a) c2670 and (b) c3540.
14.3 Results
14.3.1 Simulation Setup
The test generation and the Trojan coverage determination methodologies were implemented in three separate C programs. All three programs can read a Verilog netlist and create a hypergraph from the netlist description. The first program, named RO-Finder (Rare Occurrence Finder), is capable of functionally simulating a netlist for a given set of input patterns, computing the signal probability at each node and identifying nodes with low signal probability as rare nodes. The second program, MERO, implements Algorithm
FIGURE 14.3: Integrated framework for rare occurrence determination, test generation
using MERO approach, and Trojan simulation.
TABLE 14.1: Comparison of Trigger and Trojan Coverage Among ATPG Patterns [246],
Random (100K, input weights: 0.5), and MERO Patterns for q = 2 and q = 4, N = 1000,
θ = 0.2
                       ATPG Patterns             Random (100K Patterns)      MERO Patterns
Circuit   Nodes       q=2          q=4           q=2          q=4            q=2          q=4
         (Rare/   Trig.  Troj. Trig.  Troj.  Trig.  Troj. Trig.  Troj.  Trig.  Troj. Trig.  Troj.
         Total)   Cov.   Cov.  Cov.   Cov.   Cov.   Cov.  Cov.   Cov.   Cov.   Cov.  Cov.   Cov.
                  (%)    (%)   (%)    (%)    (%)    (%)   (%)    (%)    (%)    (%)   (%)    (%)
c2670 297/1010 93.97 58.38 30.7 10.48 98.66 53.81 92.56 30.32 100.00 96.33 99.90 90.17
c3540 580/1184 77.87 52.09 16.07 8.78 99.61 86.5 90.46 69.48 99.81 86.14 87.34 64.88
c5315 817/2485 92.06 63.42 19.82 8.75 99.97 93.58 98.08 79.24 99.99 93.83 99.06 78.83
c6288 199/2448 55.16 50.32 3.28 2.92 100.00 98.95 99.91 97.81 100.00 98.94 92.50 89.88
c7552 1101/3720 82.92 66.59 20.14 11.72 98.25 94.69 91.83 83.45 99.38 96.01 95.01 84.47
s13207‡ 865/2504 82.41 73.84 27.78 27.78 100 95.37 88.89 83.33 100.00 94.68 94.44 88.89
s15850‡ 959/3004 25.06 20.46 3.80 2.53 94.20 88.75 48.10 37.98 95.91 92.41 79.75 68.35
s35932‡ 970/6500 87.06 79.99 35.9 33.97 100.00 93.56 100.00 96.80 100.00 93.56 100.00 96.80
Avg. 724/2857 74.56 58.14 19.69 13.37 98.84 88.15 88.73 72.30 99.39 93.99 93.50 82.78
‡
These sequential benchmarks were run with 10,000 random Trojan instances to reduce run time of Tetramax
14.1 described in Section 14.2.2 to generate the reduced pattern set for Trojan detection.
The third program, TrojanSim (Trojan Simulator), is capable of determining both trigger and Trojan coverage for a given test set using a random sample of Trojan instances. A q-trigger random Trojan instance is created by randomly selecting the trigger nodes from the list of rare nodes. We consider one randomly selected payload node for each Trojan.
Fig. 14.3 shows the flowchart for the MERO methodology. Synopsys TetraMAX was used to
justify the trigger condition for each Trojan and eliminate the false Trojans. All simulations
and test generation were carried out on a Hewlett-Packard Linux workstation with a 2 GHz
dual-core processor and 2GB main memory.
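The rare-node identification step performed by RO-Finder can be illustrated on a toy netlist (the three-gate circuit below is ours, not an ISCAS benchmark): estimate each node's signal probability with random vectors and flag nodes whose rarer logic value occurs with probability below θ.

```python
import random

def simulate(a, b, c):
    """A made-up 3-gate netlist used only for illustration."""
    n1 = a and b        # AND
    n2 = n1 and c       # AND: logic-1 is rare here
    n3 = not n2         # NOT: logic-0 is rare here
    return {"n1": n1, "n2": n2, "n3": n3}

rng = random.Random(7)
N_VEC = 100_000
counts = {"n1": 0, "n2": 0, "n3": 0}
for _ in range(N_VEC):
    vals = simulate(rng.random() < 0.5, rng.random() < 0.5,
                    rng.random() < 0.5)
    for node, v in vals.items():
        counts[node] += v

theta = 0.2
for node, ones in counts.items():
    p_one = ones / N_VEC
    rare = min(p_one, 1 - p_one) < theta   # rare value on either polarity
    print(node, round(p_one, 3), "rare" if rare else "not rare")
```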
Table 14.1 compares the trigger and Trojan coverage obtained with stuck-at ATPG patterns (generated using the algorithm in [246]), weighted random patterns and MERO test patterns. It also lists the number of total nodes in the circuit and the number of rare nodes identified by the RO-Finder tool based on signal probability. The signal probabilities were estimated through simulations with a set of 100,000 random vectors. For the sequential circuits, we assume full-scan implementation. We consider 100,000 random instances of Trojans following the sampling policy described in Section 14.2.4, with one randomly selected payload node for each Trojan. Coverage results are provided in each case for two different trigger point counts, q = 2 and q = 4, at N = 1000 and θ = 0.2.
Table 14.2 compares the reduction in the length of the test set generated by the MERO test generation method against 100,000 random patterns, along with the corresponding run-times for the test generation algorithm. This run-time includes the execution time for TetraMAX to validate 100,000 random Trojan instances, as well as the time to determine the coverage by logic simulation. We can make the following important observations from these two tables:
1. The stuck-at ATPG patterns provide poor trigger and Trojan coverage compared to
MERO patterns. The increase in coverage between the ATPG and MERO patterns is
more significant in the case of a higher number of trigger points.
2. From Table 14.2, it is evident that the reduced pattern set with N = 1000 and θ = 0.2
provides comparable trigger coverage with a significant reduction in test length. The
average improvement in test length for the circuits considered is about 85%.
(a) (b)
FIGURE 14.4: Trigger and Trojan coverage with varying number of trigger points (q) for
benchmarks (a) c3540 and (b) c7552, at N = 1000, θ = 0.2.
(a) (b)
FIGURE 14.5: Trigger and Trojan coverage with trigger threshold (θ) for benchmarks (a)
c3540 and (b) c7552, for N = 1000, q = 4.
TABLE 14.3: Comparison of Sequential Trojan Coverage between Random (100K) and
MERO Patterns, N = 1000, θ = 0.2, q = 2
Trigger Cov. for 100K Random Vectors (%) Trigger Cov. for MERO Vectors (%)
Ckt. Trojan State Count Trojan State Count
0 2 4 8 16 32 0 2 4 8 16 32
s13207 100.00 100.00 99.77 99.31 99.07 98.38 100.00 100.00 99.54 99.54 98.84 97.92
s15850 94.20 91.99 86.79 76.64 61.13 48.59 95.91 95.31 94.03 91.90 87.72 79.80
s35932 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Avg. 98.07 97.33 95.52 91.98 86.73 82.32 98.64 98.44 97.86 97.15 95.52 92.57
Trojan Cov. for 100K Random Vectors (%) Trojan Cov. for MERO Vectors (%)
Ckt. Trojan State Count Trojan State Count
0 2 4 8 16 32 0 2 4 8 16 32
s13207 95.37 95.37 95.14 94.91 94.68 93.98 94.68 94.68 94.21 94.21 93.52 92.82
s15850 88.75 86.53 81.67 72.89 58.4 46.97 92.41 91.99 90.62 88.75 84.23 76.73
s35932 93.56 93.56 93.56 93.56 93.56 93.56 93.56 93.56 93.56 93.56 93.56 93.56
Avg. 92.56 91.82 90.12 87.12 82.21 78.17 93.55 93.41 92.80 92.17 90.44 87.70
Table 14.3 presents the trigger and Trojan coverage, respectively, obtained by 100,000 randomly generated test vectors and by the MERO approach for three large ISCAS-89 benchmark circuits. The superiority of the MERO approach over the random test vector generation approach in detecting sequential Trojans is evident from this table.
Although these results have been presented for a specific type of sequential Trojan (counters which increase their count conditionally), they are representative of other sequential Trojans whose state transition graph (STG) has no “loop”. The STG for such an FSM is shown in Fig. 14.6. This is an 8-state FSM which changes its state only when a particular internal node condition Ci is satisfied at state Si, and the Trojan is triggered when the FSM reaches state S8. The example Trojan shown in Fig. 13.5(b) is a special case of this model, where the conditions C1 through C8 are identical. If each of the conditions
Ci is as rare as the condition a = 1, b = 0 required by the Trojan shown in Fig. 13.5(b),
then there is no difference between these two Trojans as far as their rareness of getting
triggered is concerned. Hence, we can expect similar coverage and test length results for
other sequential Trojans of this type. However, the coverage may change if the FSM struc-
ture is changed (as shown with dotted line). In this case, the coverage can be controlled by
changing N .
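The rareness argument can be made concrete for this no-loop FSM: if each condition Ci holds with independent probability p per applied vector (our modeling assumption, not the chapter's), the expected number of vectors before the Trojan triggers is n_states/p. A small simulation sketch:

```python
import random

def vectors_to_trigger(p, n_states=8, rng=random.Random(0)):
    """Vectors applied until the no-loop FSM reaches its final state;
    the FSM advances whenever its rare condition holds (prob. p)."""
    state = vectors = 0
    while state < n_states:
        vectors += 1
        if rng.random() < p:
            state += 1
    return vectors

p = 0.05
mean = sum(vectors_to_trigger(p) for _ in range(5000)) / 5000
print(mean)   # close to 8 / 0.05 = 160 vectors
```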
14.4 Conclusions
Conventional logic test generation techniques cannot be readily extended to detect hard-
ware Trojans because of the inordinately large number of possible Trojan instances. In this
chapter, we have presented a statistical Trojan detection approach using logic testing where
the concept of multiple excitation of rare logic values at internal nodes is used to generate
test patterns. Simulation results show that the proposed test generation approach achieves
about 85% reduction in test length over random patterns for comparable or better Trojan
detection coverage. The proposed detection approach can be extremely effective for small
combinational and sequential Trojans with small number of trigger nodes, for which side-
channel analysis approaches cannot work reliably. Hence, the proposed detection approach
can be used as complementary to side-channel analysis based detection schemes. Future
work can be directed towards improving the test quality which will help in minimizing the
test length and increasing Trojan coverage further.
• “Cyber is a complicated domain. There is no silver bullet that will eliminate the
threats inherent to leveraging cyber as a force multiplier, and it is impossible to
completely defend against the most sophisticated cyber attacks.”
15.1 Introduction
In Chapter 14, we saw an example of the application of logic testing in detecting hardware Trojans. However, it was mentioned and shown that such a technique, while effective for simple and ultra-small Trojan circuits (consisting of a few gates) with simple trigger conditions, generally finds it difficult to effectively detect larger Trojans with complex trigger conditions. In addition, logic testing based techniques are simply unable to detect Trojans which do not affect the logic functionality of the original circuit.
An alternative to logic testing is to apply techniques which rely on detecting variations in the measurement of observable physical “side-channel” parameters, such as the power signature [23, 38, 308, 309] or delay [167, 310] of an IC, in order to identify a structural change in the design. Such approaches have the advantage that they do not require triggering a malicious change and observing its impact at the primary output. The major disadvantage is the extensive process variations, which can cause extreme variations in the measured side-channel parameter, e.g. 20X power and 30% delay variations in 180 nm technology [60]. Existing side-channel approaches suffer from one or more of the following shortcomings: 1) they do not scale well with increasing process variations; 2) they consider only die-to-die process variations and do not consider local within-die variations; and 3) they require design modifications which can potentially be compromised by an adversary. The detection sensitivity degrades with increasing parameter variations or decreasing Trojan size. Besides, the effect of process variations is compounded by measurement noise (electrical and environmental), which makes isolating a Trojan's effect even more difficult.
In this chapter, we propose two novel techniques for side-channel parameter analysis based Trojan detection that are effective under large process-induced noise and parameter variations. The first technique is a multiple-parameter side-channel analysis approach for effective detection of complex Trojans under large process-induced parameter variations. The concept takes its inspiration from multiple-parameter testing [191], which considers the correlation of the intrinsic leakage (IDDQ) to the maximum operating frequency (Fmax) of the circuit in order to increase the sensitivity of IDDQ testing to distinguish fast, intrinsically leaky ICs from defective ones. Instead of using only the power signature (which is highly vulnerable to parameter variations [23, 60]), the proposed side-channel approach achieves a high signal-to-noise ratio (SNR) by exploiting the intrinsic dependencies between the active-mode supply current (IDDT) and the maximum operating frequency (Fmax) of a circuit to identify the ICs affected with malicious hardware in a non-invasive manner. We provide a theoretical analysis regarding the effect of various components of process variations on the relationship between the parameters. The proposed approach requires no modification to the design flow and incurs no hardware overhead.
The second technique is a region-based side-channel analysis approach, in which the effect
of a Trojan relative to the effect of process noise is magnified by
using a suitable vector generation algorithm. The vector generation approach can selectively
activate one region at a time while minimizing activity in other regions of the circuit under
test (CUT). Assuming that the effect of the Trojan on the side-channel parameter will
manifest itself only on the measured value for a single region, we can use the measurements
from other regions as the golden comparison point to identify the presence of a Trojan
within that region. By decreasing the golden current for comparison, the sensitivity of
Trojan detection can be increased using partitions which have mutually exclusive activity
for the chosen vector set. The test vectors can also be designed [81] to increase the activity
in parts of the Trojan circuit, even if the Trojan’s malicious output does not get generated
or propagated to the primary outputs during the testing process. In addition, we propose a
novel side-channel analysis approach which we refer to as self-referencing. The concept is to
decompose a large design into a group of functional blocks, and then consider one functional
block at a time to determine the existence of Trojan using only transient supply current
(IDDT ) measurement. Next, we will use a Region Slope Matrix to compare the current values
of different regions and reduce the effect of process variations. Any effect due to the presence
of a Trojan is reflected in the matrix values and helps identify the infected chips from the
golden chips, even under process variations. The method can also be used to localize the
Trojan and identify the region in which it is present.
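The self-referencing idea can be illustrated with a small sketch (not the authors' implementation; the per-region data, the tolerance, and the median-based flagging rule are illustrative assumptions for this example):

```python
# Illustrative sketch of self-referencing with a Region Slope Matrix. Entry
# (i, j) is the ratio of region i's IDDT to region j's: a global process shift
# scales all regions together and leaves the ratios intact, while a Trojan
# perturbs only its own region's row.
from statistics import median

def region_slope_matrix(region_iddt):
    """region_iddt: per-region average IDDT values for one chip."""
    n = len(region_iddt)
    return [[region_iddt[i] / region_iddt[j] for j in range(n)] for i in range(n)]

def flag_infected_regions(golden, suspect, tolerance=0.05):
    """Flag regions whose slope-matrix row deviates from the golden chip."""
    gm, sm = region_slope_matrix(golden), region_slope_matrix(suspect)
    n = len(golden)
    flagged = []
    for i in range(n):
        # Median relative deviation of row i; a Trojan in region i shifts the
        # whole row, while a Trojan elsewhere shifts only one column of it.
        devs = [abs(sm[i][j] - gm[i][j]) / gm[i][j] for j in range(n) if j != i]
        if median(devs) > tolerance:
            flagged.append(i)
    return flagged

golden = [1.00, 1.20, 0.90, 1.10]    # per-region IDDT of a golden chip
shifted = [1.10, 1.32, 0.99, 1.21]   # +10% everywhere: process shift only
infected = [1.00, 1.20, 0.99, 1.10]  # +10% in region 2 only: Trojan
```

A chip at a different process corner scales all regions uniformly and raises no flags, while extra Trojan current in a single region flags exactly that region, which is what enables localization.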
FIGURE 15.1: (a) Average IDDT values at 100 random process corners (with maximum
variation of ±20% in inter-die Vth ) for c880 circuit. The impact of Trojan (8-bit comparator)
in IDDT is masked by process noise. (b) Corresponding Fmax values. (c) The Fmax vs. IDDT
plot shows the relationship between these parameters under inter-die process variations.
Trojan-inserted chips stand out from the golden trend line. (d) The approach remains
effective under both inter-die and random intra-die process variations. A limit line is used
to account for the spread in IDDT values from the golden trend line.
We demonstrate that, by calibrating the variation in the side-channel parameters amidst process noise, this
novel multiple-parameter approach allows higher detection sensitivity compared to existing
side-channel Trojan detection approaches.
the critical path of the circuit. The spread in IDDT due to variation easily masks the effect
of the Trojan, making it infeasible to isolate from process noise, as shown in Fig. 15.1(a).
The problem becomes more severe with decreasing Trojan size or increasing variations in
transistor parameters. To overcome this issue, the intrinsic relationship between IDDT and
Fmax can be utilized to differentiate between the original and tampered versions. The plot
for IDDT vs. Fmax for the ISCAS–85 circuit c880 is shown in Fig. 15.1(c). It can be observed
that two chips (e.g. Chipi and Chipj ) can have the same IDDT value, one due to presence
of Trojan and the other due to process variation. By considering only one side-channel
parameter, it is not possible to distinguish between these chips. However, the correlation
between IDDT and Fmax can be used to distinguish malicious changes in a circuit under
process noise. The presence of a Trojan will cause the chip to deviate from the trend line. As
seen in Fig. 15.1(c), the presence of a Trojan causes a variation in IDDT when compared to
a golden chip, while it does not have a similar effect on Fmax as that induced by process
variation; i.e., the expected correlation between IDDT and Fmax is violated by the presence
of the Trojan.
Note that in the proposed multiple-parameter approach, Fmax is used for calibrating
the process corner of the chips. In practice, the delay of any path in the circuit can be used
for this purpose. Hence, it becomes difficult for an attacker to know in advance which path
delay will be used for calibrating process noise. Since a typical design will have an exponentially
large number of paths, it is infeasible for an attacker to manipulate all circuit paths in order
to hide the Trojan effect. Furthermore, even if the path is guessed by the attacker, a Trojan
is likely to increase both delay and activity of the path on which it is inserted. Hence, a
chip containing the Trojan will deviate from the expected IDDT vs. Fmax trend line, where
both current and frequency increase or decrease simultaneously. Finally, in order to alter
the Fmax in a way that the Trojan evades the multiple-parameter detection approach (i.e., it
falls within the limit line in Fig. 15.1(c)), the adversary needs to know the exact magnitude
of process variation for each path of each chip, which is difficult to estimate at fabrication
time [60].
Fig. 15.1(d) shows the effect of random intra-die process variations on the IDDT
and Fmax values for 1000 instances of the c880 circuit with and without Trojan. We
performed Monte Carlo simulations in HSPICE using inter-die (σ = 10%) and intra-die
(σ = 6%) variations in Vth . In this case, the transistors on the same die can have random
variations on top of a common inter-die shift from the nominal process corner. The trend
line is obtained by using polynomial curve fitting of order three in MATLAB, which matches
the trend obtained by considering only inter-die process variation effects. We observe that
there is a spread in the values from the trend line due to random intra-die variations, but
the spread is much reduced in comparison to the spread in each side-channel parameter
alone, because we consider the multiple-parameter relationship. By computing the spread
in IDDT values for a given Fmax , corresponding to a particular inter-die process corner, we
can estimate the sensitivity of the approach in terms of Trojan detection. Any Trojan which
consumes current less than the amount of spread will remain undetected. The limit line is
obtained by scaling the trend line by a spread factor, computed from the mean and standard
deviation of the actual spread in IDDT values for a given Fmax over the sample of ICs; it
allows us to identify all the Trojan instances without any error, even for a small
Trojan. Next, we provide a theoretical basis for the existence of a trend line between IDDT
and Fmax under process variations.
where VDD is the supply voltage, α is the velocity saturation index with 1 ≤ α ≤ 2, Vth
is the transistor threshold voltage, and kg is a gate-dependent constant. Consider the i-th
IC from a lot of manufactured ICs. The effect of process variations can cause the threshold
voltage to vary from die to die (inter-die variations) or within a die (intra-die variations). As
shown in Fig. 15.2, if these variations are modelled using Gaussian distributions around the
nominal threshold voltage of the process, the Vth of a transistor on a die can be expressed
as the sum of the nominal threshold voltage (VT ), the inter-die component of variation
(∆VT i ) for the i-th IC, and ∆vT g , the random component of variation. The ∆VT i value is
common for every device in the i-th IC, while the ∆vT g value varies from gate to gate. Eqn.
(15.1) can be re-written as:

$$I_g = k_g \left(V_{DD} - V_T - \Delta V_{Ti}\right)^{\alpha} \left(1 - \frac{\Delta v_{Tg}}{V_{DD} - V_T - \Delta V_{Ti}}\right)^{\alpha} \tag{15.2}$$

Expanding binomially, assuming $\frac{\Delta v_{Tg}}{V_{DD} - V_T - \Delta V_{Ti}} \ll 1$ and discarding higher-order terms, the above equation can be approximated as:

$$I_g \approx k_g \left(V_{DD} - V_T - \Delta V_{Ti}\right)^{\alpha} \left(1 - \alpha \, \frac{\Delta v_{Tg}}{V_{DD} - V_T - \Delta V_{Ti}}\right) \tag{15.3}$$

Summing over all the switching gates of the i-th IC, the total transient current is:

$$I_{ddt,i} = \sum_{g \in IC_i} I_g \approx k_{av} \cdot n_{tot,i} \cdot \left(V_{DD} - V_T - \Delta V_{Ti}\right)^{\alpha} - \alpha \left(V_{DD} - V_T - \Delta V_{Ti}\right)^{\alpha - 1} \sum_{g \in IC_i} k_g \Delta v_{Tg} \tag{15.4}$$

where $k_{av} \cdot n_{tot,i} = \sum_{g \in IC_i} k_g$ and $n_{tot,i}$ is the total number of switching gates in the IC.
The term $\sum_{g \in IC_i} k_g \Delta v_{Tg}$ represents the sum of several random variables, each distributed
with mean $\mu = 0$ and variance (say) $\sigma^2$. Hence, by the Central Limit Theorem [291], their
sum tends to a Gaussian distribution with zero mean, so to first order the total transient
current reduces to its average value:

$$I_{ddt,i} \approx k_{av} \cdot n_{tot,i} \cdot \left(V_{DD} - V_T - \Delta V_{Ti}\right)^{\alpha} \tag{15.5}$$
where βg is another gate-dependent constant. Applying the same approximations, the delay
of the gate is given by:
Summing the delays for the gates on the critical path of the i-th IC:

$$T_{crit,i} = \sum_{g \in P_{crit,i}} t_{dg} \approx \beta_{av} \cdot n_{crit,i} \cdot \left(V_{DD} - V_T - \Delta V_{Ti}\right)^{-\alpha} \tag{15.8}$$

where $n_{crit,i}$ is the number of gates on the critical path $P_{crit,i}$ of the i-th IC. Hence, the
maximum operating frequency of the i-th IC is given by:

$$f_{max,i} = \frac{1}{T_{crit,i}} \approx \frac{1}{\beta_{av} \, n_{crit,i}} \cdot \left(V_{DD} - V_T - \Delta V_{Ti}\right)^{\alpha} \tag{15.9}$$
Combining equations (15.5) and (15.9), the relationship between $I_{ddt,i}$ and $f_{max,i}$ is given
by:

$$\frac{I_{ddt,i}}{f_{max,i}} \approx k_{av} \cdot \beta_{av} \cdot n_{tot,i} \cdot n_{crit,i} \tag{15.10}$$
This shows that the relationship between the transient switching current and the maxi-
mum operating frequency of an IC at different process corners is linear, based on first-order
approximation. In the presence of a Trojan circuit in the i-th IC with $n_{trojan,i}$ switching
gates, the value of the transient current changes to:

$$I_{ddt,i,trojan} \approx k_{av} \cdot \left(n_{tot,i} + n_{trojan,i}\right) \cdot \left(V_{DD} - V_T - \Delta V_{Ti}\right)^{\alpha} \tag{15.11}$$
while the expression for the maximum operating frequency remains unchanged, if the Trojan
is not inserted on the critical path. Hence, for the i-th IC, in the presence of Trojan, the
relationship between Iddt,i and fmax,i changes to:
$$\frac{I_{ddt,i,trojan}}{f_{max,i}} \approx k_{av} \cdot \beta_{av} \cdot \left(n_{tot,i} + n_{trojan,i}\right) \cdot n_{crit,i} \tag{15.12}$$
Comparing equations (15.10) and (15.12), we observe that the primary effect of the
inserted Trojan is the change in the slope of the linear relationship between the transient
current and the maximum operating frequency in a tampered IC. From these two equations,
we can observe that there is a strong correlation between dynamic current and circuit delay
(which, for the critical path of the circuit, translates to maximum operating frequency or
Fmax ) through the common parameter Vth . Due to process variation, if the current shifts
because of a change in Vth , the frequency ($f = 1/t_d$) also shifts in the same direction.
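The first-order relationship derived above can be checked numerically. The following sketch (all constants are arbitrary illustrative values, not taken from the chapter) sweeps the inter-die Vth shift and confirms that IDDT/Fmax stays constant for a golden chip per Eqn. (15.10), while a Trojan changes only this slope per Eqn. (15.12):

```python
# Numeric check of the first-order alpha-power model with illustrative constants.
VDD, VT, ALPHA = 1.0, 0.3, 1.3           # supply, nominal Vth, saturation index
K_AV, BETA_AV = 2e-6, 4e-9               # averaged gate constants (assumed)
N_TOT, N_CRIT, N_TROJAN = 10000, 40, 20  # gate counts (assumed)

def iddt(dvt, n_extra=0):
    # Eqns. (15.5)/(15.11): total transient switching current
    return K_AV * (N_TOT + n_extra) * (VDD - VT - dvt) ** ALPHA

def fmax(dvt):
    # Eqn. (15.9): maximum operating frequency
    return (VDD - VT - dvt) ** ALPHA / (BETA_AV * N_CRIT)

corners = (-0.05, 0.0, 0.05)             # inter-die Vth shifts
golden_slopes = [iddt(d) / fmax(d) for d in corners]
trojan_slopes = [iddt(d, N_TROJAN) / fmax(d) for d in corners]
```

The $(V_{DD} - V_T - \Delta V_{Ti})^{\alpha}$ factor cancels in the ratio, so each list holds one constant slope across all corners, and the two slopes differ exactly by the factor $(n_{tot,i} + n_{trojan,i})/n_{tot,i}$.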
The IDDT vs. Fmax relationship for a Trojan-free design can be determined either at
design time by simulation or by monitoring a number of golden ICs. Once the relationship is
determined, we need to define the limit line that distinguishes variation-induced shift from
Trojan-induced one. Judicious selection of the limit line should minimize the probability of
false negatives as well as take into consideration other design marginalities (such as
crosstalk and power supply noise). Note that although Fmax is a unique parameter and does not
depend on the applied input vector, the average IDDT is a function of the applied input
vector, and hence, should be measured for a set of input vectors for each golden IC. A set of
patterns that maximizes the activity in the Trojan circuit, while reducing the background
current, is likely to provide the best signal-to-noise ratio.
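The trend-line and limit-line construction can be sketched as follows on synthetic data (the chapter fits an order-3 polynomial in MATLAB; a first-order least-squares fit in plain Python is used here for brevity, and the population values are made up):

```python
# Sketch: fit a golden IDDT-vs-Fmax trend line, derive a limit line from the
# residual spread, and flag chips that exceed it. Synthetic data throughout.
import random
import statistics

random.seed(1)

# Synthetic golden population: IDDT tracks Fmax linearly across process
# corners, with a small residual spread from intra-die variation.
fmax = [random.uniform(0.8, 1.2) for _ in range(200)]    # normalized Fmax
iddt = [1.5 * f + random.gauss(0, 0.005) for f in fmax]  # normalized IDDT

# First-order least-squares trend line.
n = len(fmax)
mx, my = sum(fmax) / n, sum(iddt) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(fmax, iddt))
         / sum((x - mx) ** 2 for x in fmax))
intercept = my - slope * mx

def trend(f):
    return slope * f + intercept

# Limit line: trend shifted by mean + 3*std of the golden residual spread.
residuals = [y - trend(x) for x, y in zip(fmax, iddt)]
limit = statistics.mean(residuals) + 3 * statistics.pstdev(residuals)

def is_suspect(f, i_meas):
    """Flag a chip whose IDDT exceeds the limit line at its measured Fmax."""
    return (i_meas - trend(f)) > limit
```

A Trojan adds switching current without a matching Fmax shift, so a tampered chip lands above the limit line while golden chips stay within the spread.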
The sensitivity of the proposed approach can be expressed as:

$$\text{Sensitivity} = \frac{I_{tampered} - I_{original}}{I_{original}} \times 100\% \tag{15.13}$$

where $I_{original}$ and $I_{tampered}$ are obtained from equations (15.5) and (15.11), respectively. The
detection sensitivity of the proposed approach reduces with decreasing Trojan size and
increasing circuit size. In order to extend the approach for detecting small
sequential/combinational Trojans in large circuits (with > 10^5 transistors), we need to
improve the SNR using appropriate side-channel isolation techniques.
Clearly, the sensitivity can be improved by increasing the current contribution of
the Trojan circuit relative to that of the original circuit. Next, we describe a test generation
technique used to reduce Ioriginal and increase its difference from Itampered .
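The sensitivity metric of Eqn. (15.13) is straightforward to compute; the numbers below are purely illustrative, chosen to mirror the Trojan I row of Table 15.1 (a fixed Trojan current differential, with gating shrinking the background current):

```python
# Eqn. (15.13) as a helper; input values are illustrative units, not
# measurements from the chapter.
def sensitivity(i_tampered, i_original):
    """Percentage change in the side-channel current due to the Trojan."""
    return (i_tampered - i_original) / i_original * 100.0

# Same Trojan differential (2.63 units); gating reduces the background
# current from 100 to ~21.6 units, raising sensitivity from ~2.63% to ~12.2%.
ungated = sensitivity(102.63, 100.0)
gated = sensitivity(24.19, 21.56)
```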
1. The blocks should be reasonably large to cancel out the effect of random parameter
variations, but small enough to minimize the background current.
Next, we consider each module separately to generate test vectors for activating it. The
test vector generation algorithm needs to take into account two factors:
1. Only one region must be activated at a time. If the inputs to different modules are mu-
tually exclusive and the regions have minimal interconnection, it is easy to maximally
activate one region while minimizing activity in other regions. If complex intercon-
nections exist between the modules, the inputs need to be ranked in terms of their
sensitivity towards activating different modules and the test generation needs to be
aware of these sensitivity values.
2. When a particular region is being activated, the test vectors should try to activate
possible Trojan trigger conditions and should be aimed at creating activity within most
of the innumerable possible Trojans. This motivates us to consider a modified version
of the statistical test generation approach (MERO) proposed in [81] for maximizing
Trojan trigger coverage. Note that, unlike functional testing approaches, the Trojan
payload need not be affected during test time, and the observability of Trojan effect
on the side-channel parameter is enough to signify the presence of the Trojan.
For each module Mi , we use connectivity analysis in order to assign weights to the
primary inputs in terms of their tendency to maximize activity in the region under consid-
eration while minimizing activity in other regions. This step can also identify control signals
which can direct the activity exclusively to particular regions. Next, we generate weighted
random input vectors for activating the region under consideration and perform functional
simulation using a graph-based approach, which lets us estimate the activity within each
region for each pair of input vectors. We sort the vectors based on a metric Cij which is
higher for a vector pair which can maximally activate module Mi while minimizing activity
in each of the other modules. Then, we prune the vector set to choose a reduced but highly
efficient vector set generated by the MERO algorithm, which is motivated by the N -detect
test generation technique [31]. In this approach, we identify internal nodes with rare values
within each module, which can be candidate trigger signals for a Trojan. Then we identify
the subset of vectors which can take the rare nodes within the module to their rare values
at least N times, thus increasing the possibility of triggering the Trojans within the region.
Once this process is completed for all the regions, we combine the vectors and generate a
test suite which can be applied to each chip for measuring supply current corresponding to
each of its regions.
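The vector-ranking step can be sketched as follows; the exact form of the metric $C_{ij}$ and the activity estimator are illustrative stand-ins for the weighted-random generation and graph-based functional simulation described above:

```python
# Illustrative sketch of region-aware vector ranking. The metric
# C_ij = (activity in target module) / (1 + activity elsewhere) is one
# plausible form; activity() stands in for graph-based functional simulation.
def rank_vectors(vector_pairs, activity, target, modules):
    """Sort vector pairs so those maximizing target-module activity while
    minimizing activity in other modules come first."""
    def c_metric(pair):
        in_target = activity(pair, target)
        elsewhere = sum(activity(pair, m) for m in modules if m != target)
        return in_target / (1.0 + elsewhere)
    return sorted(vector_pairs, key=c_metric, reverse=True)

# Toy activity table: estimated toggle counts per (vector pair, module).
act = {("v1", "M0"): 10, ("v1", "M1"): 1,
       ("v2", "M0"): 8,  ("v2", "M1"): 0}
ranked = rank_vectors(["v1", "v2"], lambda p, m: act[(p, m)], "M0", ["M0", "M1"])
# "v2" ranks first for target M0: 8/1 beats 10/2.
```

The pruning that follows (keeping only vectors that drive rare nodes to their rare values at least N times, per MERO) would then filter this ranked list.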
For a functional test of a multi-core processor, we can use specially designed small test
programs which are likely to trigger and observe rare events in the system such as events on
the memory control line or the most significant bits of the datapath, multiple times. In general,
a design is composed of several functional blocks, and activity in many of these blocks
can be turned off using input conditions. For example, in a processor, activity in the floating
point unit (FPU), branch logic or memory peripheral logic can be turned off by selecting an
integer ALU operation. Many functional blocks are pipelined. In these cases, we will focus
on one stage at a time and provide initialization to the pipeline such that the activities of
all stages other than the one under test are minimized by ensuring that the corresponding
stage inputs do not change.
For each IC from a population of untrusted ICs, the dynamic current (IDDT ) values are
collected for a pre-defined set of test vectors, which are generated with the vector generation
algorithm. The operating frequencies (Fmax ) for these ICs are determined using structural
or functional delay testing approach. Next, average IDDT vs. Fmax measurements for the
untrusted ICs are compared against that of a golden IC for each test pattern. A mismatch
between the two allows us to identify a Trojan by effectively removing the contribution of
process noise.
FIGURE 15.4: Schematic showing the functional modules of the AES cipher test circuit.
The AES “Key Expand” module is clock-gated and operand isolation is applied to the
“SBOX” modules to reduce the background current, thereby improving detection sensitivity.
The Trojan instance is assumed to be in the logic block 2.
FIGURE 15.5: The correlation among IDDT , IDDQ and Fmax can be used to improve
Trojan detection confidence.
other modules can be power gated. It should be noted that depending on the functionality
of the CUT, it might not be always possible to switch-off certain units whose outputs
feed other dependent modules. Thus, when testing for Trojan in module 4, we cannot
shut off the modules 1 and 3 that affect the controllability (and hence the activity) of the
internal nodes in 4. One major concern with introducing power gating during test time to
increase Trojan detection sensitivity is that the attacker could use the same control signals
to disable the Trojan during testing; the adversary must not be able to exploit the
sensitivity improvement techniques to reduce the Trojan contribution. In this case, however,
it is difficult for the adversary to distinguish between the normal functional mode and the
Trojan detection mode, since the decision about which blocks are turned off or biased is
taken dynamically. Hence, the attacker cannot use the power gating techniques to reduce
the current contribution of the inserted Trojan.
In addition to the transient current (IDDT ), a Trojan will also contribute to the leakage current (IDDQ ). Moreover, similar to
IDDT , the value of IDDQ increases monotonically with the Fmax for a given design from
one process corner to another. Thus any decision derived from studying the IDDT vs. Fmax
relation can be reinforced by observing the IDDQ vs. Fmax relation for the same set of ICs.
For example, if one of the ICs is observed to have considerably larger IDDT and IDDQ values
but smaller Fmax compared to the other, then it is highly likely to be structurally different
and to be affected with a Trojan intrusion. Similar to IDDT , the value for IDDQ is input-
dependent, thus, carefully chosen input vectors can reduce the leakage current (by 30–40%).
Thus a low-leakage vector can improve the IDDQ sensitivity of a Trojan. Fig. 15.5(a) shows
the IDDQ vs. Fmax plot for the 8-bit ALU benchmark with and without Trojan instances.
To understand the joint effect of the three variables, we simulated the c880 ISCAS-85 cir-
cuit with and without an 8-bit comparator Trojan. Fig. 15.5(c) shows a 3-D plot of IDDT ,
IDDQ and Fmax , with projections on the IDDQ –Fmax and IDDT –Fmax planes for the cor-
responding 2-D plots. We can observe that a Trojan instance clearly isolates a chip in the
multiple-parameter space from process induced variations.
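A minimal sketch of the reinforced decision rule, assuming the golden trend lines and limit values have already been fit as before (the linear and quadratic trend models and limit values below are placeholders, not fitted results):

```python
# Sketch: flag a chip only when it violates BOTH the IDDT-vs-Fmax and the
# IDDQ-vs-Fmax golden trends, reducing false alarms from marginalities that
# perturb a single parameter. Trend forms and limits are placeholders.
IDDT_LIMIT, IDDQ_LIMIT = 0.02, 0.01

def iddt_trend(f):
    return 1.5 * f        # linear trend, per Eqn. (15.10)

def iddq_trend(f):
    return 0.2 * f * f    # non-linear leakage trend (assumed quadratic form)

def suspect(fmax, iddt, iddq):
    dev_t = iddt - iddt_trend(fmax)   # deviation from the IDDT trend
    dev_q = iddq - iddq_trend(fmax)   # deviation from the IDDQ trend
    return dev_t > IDDT_LIMIT and dev_q > IDDQ_LIMIT
```

A Trojan adds both switching and leakage current without a matching Fmax shift, so it violates both trends at once; a marginality such as coupling noise typically perturbs only one.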
During side-channel testing, the choice of testing conditions can have a significant impact
on sensitivity of Trojan detection. For instance, the placement of the current sensor to
measure IDDT for the chip is an important parameter. Note that in a non-invasive approach
for Trojan detection, the current sensors are not inserted within the chip; moreover, on-chip
sensors could themselves be tampered with by the attacker. It is therefore advisable to
measure the current as close to the pins as possible.
If we measure the current drawn from the power supply, the averaging effect of the bypass
capacitors on the board-level can cause a negative impact on Trojan detection sensitivity.
Also, if the current sensing can be done at individual VDD pins at the chip level, instead
of at the common supply node at the board level, the background current is divided among
the pins and reduced to a considerably smaller value per measurement. It can also help in isolating the Trojan effect if the functional
regions being activated draw supply current dominantly from different VDD pins. In this
context, a region-based Trojan detection approach described in [308] explains how one can
use the IDDT values for different regions to calibrate the process noise assuming that the
Trojan affects only one or few of the regions’ currents. Moreover, the IDDT values obtained
for different test vector sets which selectively activate one region compared to another can
be used as reference parameters, as described in [114], in order to detect a Trojan which is
localized in one region of the IC.
The value of the supply voltage and the operating frequency during testing can also
be used to get better Trojan detection sensitivity by our approach. As the supply voltage
is reduced below nominal, the gates start switching slowly. Also, the dynamic and leakage
current get reduced. As we measure average current over a clock period as the current
corresponding to a pair of test vectors, the value contains components from both switching
current and the leakage current. Based on the equations derived in Section 15.3.1 and the
trend lines in Fig. 15.5(a), the relation between Fmax and IDDT is linear, while that between
Fmax and IDDQ is non-linear. Hence, if the measured average current is dominated by the
leakage component, the overall relationship shows a non-linear trend; if the trend remains
close to linear, it is easier to draw a limit line and determine a threshold for
characterizing process variations. We can reduce the leakage component by measuring the
average leakage current for the same vector and subtracting it from the measured switching
current to extract the actual IDDT . On the other hand, we can get similar sensitivity by
measuring the current for shorter period of time (i.e. at high operating frequency) which
leaves very little margin beyond the critical path delay, or by testing at a lower supply
voltage, when the gate delays increase and eat up the slack for the particular operating
frequency. If other low-power design techniques are built into the design, like applying
body-bias to reduce leakage or clock/supply gating or adaptive voltage scaling for different
functional regions, these can be used to our advantage in order to increase Trojan detection
sensitivity.
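The leakage-subtraction step described above can be sketched as follows (the sample format and values are hypothetical):

```python
# Sketch: the current averaged over one clock period mixes switching and
# leakage components; measuring the quiescent current for the same vector and
# subtracting it isolates the actual IDDT.
def average_current(samples, dt, period):
    """Average supply current over one clock period of a sampled waveform."""
    n = round(period / dt)
    return sum(samples[:n]) / n

def extract_iddt(samples, dt, period, iddq):
    """Remove the leakage component measured for the same input vector."""
    return average_current(samples, dt, period) - iddq
```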
15.4 Results
In this section, we describe the simulation and measurement setup for validating the
proposed approach and provide corresponding results.
1. Trojan I: It consists of a 24-bit counter with a “gated” clock and occupies ∼ 1.1%
of the AES circuit area. The gating signal for the clock is an internal signal with
very low signal probability (obtained by simulating each test circuit with random
input patterns). When the clock is not gated, the counter counts through the states
and when it reaches the maximum count, an AND gate connected to all the outputs
triggers the Trojan output and an XOR gate modifies the payload node. The low signal
probability of the internal gating signal makes it highly unlikely that the counter will
reach the maximum count within the testing period.
2. Trojan II: It consists of a 10-bit counter which occupies ∼ 0.4% of the AES circuit
area. Instead of a clock signal, the counter has an internal signal as its state transition
enable. The counter moves from one state to the next only when it encounters a
positive edge on this internal signal. An internal signal with very low activity is chosen
from both the test circuits.
3. Trojan III: The third Trojan is a variant of Trojan II with fewer flip-flops. It
contains a 3-bit counter and occupies only ∼ 0.11% of the AES circuit area.
4. Trojan IV: This is a purely combinational Trojan which occupies a meagre 0.04% of
the AES circuit area. It compares an 8-bit input to a pre-defined constant and triggers
a malfunction only when the input matches it. The Trojan trigger condition is
derived from eight rare nodes of the circuit, such that the probability of occurrence
of all eight rare values at the Trojan input is extremely low.

FIGURE 15.6: IDDT vs. Fmax relationship for both golden and tampered AES and IEU
circuits showing the sensitivity of our approach for detecting different Trojan circuits.

TABLE 15.1: Trojan Detection Sensitivity for Different Trojan Sizes in AES

Trojan Type       Trojan Size   Sensitivity (w/o gating)   Sensitivity (w/ gating)
I (seq, 24-FF)    1.10%         2.63%                      12.20%
II (seq, 10-FF)   0.40%         1.70%                      8.60%
III (seq, 3-FF)   0.11%         0.81%                      3.53%
IV (comb, 8-bit)  0.04%         0.23%                      1.12%
15.4.1.2 Results
Fig. 15.6(a) shows a plot of IDDT vs. Fmax for the AES circuit, with and without an
inserted Trojan of type I. From this plot, it is observed that the current differential due
to the Trojan circuit is only 2.63% at different process corners. For smaller Trojan circuits
(Trojan II-IV), this difference can be less prominent and is, therefore, likely to be masked
by process noise or design marginalities. Thus the clock gating and operand isolation as
discussed in Section 15.3.2.2 were implemented to improve the Trojan detection sensitivity
in the AES test circuit. As a result of selective gating, it was possible to reduce the average
activity per node significantly (from 0.16 to 0.05). Fig. 15.6(b) shows the average IDDT
vs. Fmax plots for Trojan I, with power gating applied, which increases the sensitivity
from 2.63% (Fig. 15.6(a)) to 12.2%. The sensitivity for different Trojan sizes is shown in
Table 15.1. From the above results we conclude that selective gating or operand isolation
can be extremely effective in improving the resolution of side-channel analysis using the
current signature. Fig. 15.6(c) and 15.6(d) show IDDT vs. Fmax trends for the 32-bit IEU
circuit, which show a sensitivity reduction with decrease in Trojan size. These sensitivity
values can be improved by choosing proper low-activity vectors which reduce the background
current. The improvement in sensitivity for different Trojan circuits due to low-activity
vectors is shown in Table 15.2.

TABLE 15.2: Trojan Detection Sensitivity for Different Trojan Sizes in IEU

Trojan Type       Trojan Size   Sensitivity (high act. vectors)   Sensitivity (low act. vectors)
I (seq, 24-FF)    1.40%         2.14%                             3.83%
II (seq, 10-FF)   0.50%         1.00%                             1.80%
III (seq, 3-FF)   0.14%         0.72%                             1.29%
IV (comb, 8-bit)  0.05%         0.45%                             0.84%

FIGURE 15.7: Effect of random process variations is observed by performing Monte Carlo
simulations with inter-die σ = 10% and random intra-die σ = 6% for the 32-bit IEU circuit
with Trojan IV inserted. Using a 2% sensitivity limit line to accommodate the random
process variation effects, we can obtain 99.8% detection accuracy and limit the false alarms
to 0.9%.
Fig. 15.7 shows the results of performing Monte Carlo simulations for 1000 instances of
the 32-bit IEU circuit with and without Trojan IV inserted. Here we consider both die-to-
die and within-die variations as well as uncorrelated variations between NMOS and PMOS
threshold voltages. Using a 2% sensitivity limit line, we obtain 99.3% Trojan detection
accuracy, with 0.3% false alarms. Hence, the multiple-parameter approach is shown to work
even under random process variation effects on top of inter-die variations. However, these
results were obtained at nominal supply voltage of 1V and relatively low operating frequency
of 200 MHz (clock period = 5 ns). As described in Section 15.3.2.4, we can use supply
voltage scaling and frequency scaling during testing to make the measured supply current
get dominated by the switching current (IDDT ) only by reducing the slack when no switching
activity takes place in the circuit. This can give better Trojan detection sensitivity as shown
in Fig. 15.8. By decreasing the clock period to 3 ns, we limit the idle time within the
measurement period. Also, by reducing the supply voltage to 0.8V, the switching speed of
the gates gets reduced and the slack decreases further.
FIGURE 15.8: Choosing a faster clock period (3ns) and lowering the supply voltage from
1V to 0.8V gives better Trojan detection accuracy.
FIGURE 15.9: (a) Test PCB schematic. (b) Test circuit schematic. (c) Experimental
setup. (d) Snapshot of measured IDDT waveform from oscilloscope.
bypass capacitors. A differential probe was used to measure the voltage waveforms, which
were recorded using an Agilent mixed-signal oscilloscope (100 MHz, 2 GSa/s). The waveforms
were synchronized with a 10 MHz clock input and were recorded over 16 cycles corresponding
to a pattern of 16 input vectors. A "SYNC" signal marks the
first input vector in the set, so that the current can be measured for the same vectors in all
cases. Frequency (an estimate of Fmax ) was measured using a 15-inverter-chain ring
oscillator circuit, mapped to different parts of the FPGA, with an on-chip counter as
described in [295]. We performed experiments with 10 FPGA chips from the same lot, which were
placed in the same test board using a BGA socket, with the same design mapped into each
chip. The test setup is shown in Fig. 15.9.

FIGURE 15.10: Measurement results for 10 FPGA chips showing IDDT vs. Fmax trend for
the IEU test circuit with and without a 16-bit sequential Trojan (0.14% of original design).

FIGURE 15.11: Measured IDDT vs. Fmax results for 8 golden and 2 Trojan chips for the
IEU circuit with and without a 4-bit sequential Trojan (0.03% of original design).
15.4.2.2 Results
The experimental results for multiple-parameter testing approach are shown in
Fig. 15.10. The results show that while measurements of IDDT only (Fig. 15.10(a)) may not
be able to capture the effect of a Trojan under parameter variations, multiple-parameter
based side-channel analysis can be effective to isolate it. For a set of golden chips, IDDT vs.
Fmax follows an expected trend under process noise and deviation from this trend indicates
the presence of structural changes in the design. Fig. 15.10(b) shows this scenario for 10
FPGA chips, 8 golden and 2 with Trojans (16-bit sequential Trojan). The ones with Trojans
stand out from the rest in the IDDT vs. Fmax space. Note that some design marginalities,
such as small capacitive coupling, which cause localized variation, can make the IDDT vs.
Fmax plot for golden chips deviate from the linear trend. Also, a better trend can be
obtained by performing measurements over a larger population of chips than was available here.
Fig. 15.11(a) shows the measured IDDT vs. Fmax trend for a 4-bit sequential Trojan,
which occupied 0.03% of logic resources in the FPGA. By drawing a limit line with a sensi-
tivity of 2%, we get errors in Trojan detection. Lowering the sensitivity to 1% will decrease
the number of false negatives (Trojan chips classified as golden), but increase the number
of false positives (golden chips classified as Trojan). To improve the sensitivity of Trojan
detection, we subtracted the background current (current measured with no input activity)
for each chip, and the corresponding IDDT vs. Fmax trend is shown in Fig. 15.11(b). Even
with a sensitivity of 1%, we can now clearly identify the Trojan chips without any errors.

FIGURE 15.12: Sensitivity of Trojan detection decreases with Trojan size but improves
by proper test vector selection.
Fig. 15.12 shows the variation in Trojan detection sensitivity with Trojans of various sizes
and with sets of input test vectors with differing activity levels. It is clear from this graph
that the sensitivity of Trojan detection decreases with decrease in Trojan size, and for very
small Trojans, we need to use sensitivity improvement techniques to avoid classification
errors. The sensitivity towards Trojan detection by measuring from individual pins com-
pared to the sensitivity when measuring the total current is plotted in Fig. 15.13. It can
be observed that when activating the multiplier which is spread out over a large part of
the FPGA, we do not get much improvement in sensitivity. However, the supply current
corresponding to the logic operations shows clear improvement in sensitivity of ∼1.25X
over the overall current sensitivity, for the pin R2 which is closest to the placement of the
logic block of the IEU on the FPGA. The sensitivity can potentially be improved further
by integrating current sensors into the packaging closer to the pins and by using current
integration circuitry to perform the averaging.
FIGURE 15.14: Improvement of Trojan coverage by logic testing approach with insertion
of observable test points (5 and 10).
be sufficient to detect the Trojan if the effect on the side-channel parameter being measured
is substantial, relative to the background value. Hence, the proposed methodology can also
be integrated with logic-testing based Trojan detection approaches (such as MERO [81]) to
provide comprehensive coverage for Trojans of different types and sizes.
The overall coverage can be estimated by a statistical sampling approach, in which a
random sample of Trojan instances of a specific size (e.g. 100K) is chosen from the Trojan
population and the percentage of Trojans in the sample detected by a given test-set is
determined using functional simulation. Trojan detection coverage for a particular test set
is defined as:
Coverage = (Number of Trojans detected / Number of sampled Trojans) × 100%   (15.14)
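The sampling-based coverage estimate of eqn. (15.14) can be sketched as follows. This is a minimal illustration, not the original tool flow: `detects` is a user-supplied functional-simulation oracle, and the toy Trojan model (trigger-value tuples) is purely hypothetical.

```python
import random

def trojan_coverage(trojan_population, test_set, detects, sample_size=1000, seed=0):
    """Estimate Trojan detection coverage (eqn. 15.14) by random sampling.

    `detects(trojan, test_set)` is a user-supplied functional-simulation
    oracle returning True if any vector in `test_set` exposes the Trojan.
    """
    rng = random.Random(seed)
    sample = rng.sample(trojan_population, min(sample_size, len(trojan_population)))
    detected = sum(1 for t in sample if detects(t, test_set))
    return 100.0 * detected / len(sample)

# Toy illustration: Trojans are trigger-value tuples; a Trojan counts as
# "detected" if some test vector matches its trigger pattern exactly.
population = [(i % 2, (i // 2) % 2) for i in range(100)]
tests = [(0, 1), (1, 1)]
cov = trojan_coverage(population, tests, lambda t, ts: t in ts, sample_size=100)
```

With the toy population above, half the sampled Trojans match a test vector, so the estimated coverage is 50%.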
One can also use coverage enhancement techniques like test-point insertion to enhance
logic-testing based Trojan detection. Similar to the current sensor integration for improving the sensitivity of the side-channel technique, we can insert low-overhead test points to increase the observability of poorly observable internal circuit nodes by making them primary outputs. To reduce pin overhead, we can multiplex the test points onto existing pins. Similarly,
controllable test point insertion can be used to improve trigger coverage. In order to observe
the effect of observable test points, we performed simulations with 5 and 10 inserted test
points (TP). To select the test points, the nodes were ranked in descending order based on
the following metric:
M = (fin + fout) / (|fin − fout| + 1)   (15.15)
where fin and fout represent the sizes of the fanin and fanout cones of a node, respectively.
The metric indicates that nodes closer to the primary inputs/outputs have less chance of
getting selected. Fig. 15.14 shows the effect of test point insertion on the Trojan coverage
as compared to a baseline case with no inserted test point, for three sequential (ISCAS’89)
benchmark circuits with N = 1000, q = 2 and θ = 0.2. As observed from this plot, test point
insertion helps to improve the Trojan coverage considerably for some circuits and helps to
reduce the gap between trigger coverage and Trojan coverage.
We computed the Trojan detection coverage for different ISCAS-85 benchmark circuits
for a population of 100,000 10-input combinational Trojans, using the MERO logic testing
algorithm as well as the combined side-channel and MERO approach, as shown in Table
15.3. We used the sensitivity value derived in Section 15.3.2 to determine if a Trojan is
detected using side-channel approach. It can be readily observed that although the logic
testing approach in isolation achieves relatively poor coverage for large Trojans (size ≥ 8
FIGURE 15.15: Complementary nature of MERO and Side-channel analysis for maximum
Trojan detection coverage in 32-bit IEU.
inputs), the total coverage is almost 100% for most circuits when the two approaches are
combined.
We also analysed the complementary nature of coverage by the logic testing and side-
channel approaches of Trojan detection. Fig. 15.15 shows the Trojan coverage for logic
testing approach (MERO) and side-channel approach without any sensitivity improvement
techniques applied, as well as the total coverage for the combined approach, for a 32-bit
Integer Execution Unit (IEU) for Trojans of different sizes. It can be observed that for
larger Trojans with 8 or more inputs, the detection coverage of the MERO approach is
much inferior to that for the side-channel based multi-parameter testing. On the other hand,
smaller Trojans are easier to trigger and detect using logic testing, but their contribution
to side-channel parameter may be difficult to distinguish. From this analysis, we note that
Trojans of different types and sizes can be detected with high confidence by the combined
approach.
FIGURE 15.16: (a) A simple test circuit: a 4-bit Arithmetic Logic Unit (ALU). (b) A
combinational Trojan inserted into the subtractor.
FIGURE 15.17: (a) Comparison of supply current between golden and tampered chip for
four regions of a 4-bit ALU. (b) Correlation of region currents at different process points
for golden and tampered ICs.
of regions in an IC, the method is scalable with increasing process noise. To increase the
Trojan detection sensitivity, we propose a region-based vector generation approach, which
tries to maximize the Trojan effect while minimizing the background current. The current values of the n regions are then compared with one another using a slope heuristic, and the resultant region slope matrix is used to compare one chip with another. We validate the proposed approach
using both simulation and measurements for several large open source designs.
The idea of self-referencing can be illustrated using an example 4-bit ALU, as shown in
Fig. 15.16(a). The ALU contains four distinct functional units (FUs) – adder, subtractor,
multiplier and shifter, which are activated based on the input “opcode” value. There are two
4-bit operands and a 4-bit output. In such a circuit, a single region or FU can be selectively activated by a proper choice of opcode, so we can easily generate test vectors that target
separate activation of the four regions. We consider three different process corners (nominal
±25%) for the entire design (modeled as a change in the transistor threshold voltage VT )
and simulate the design in HSPICE for four different vector pairs which activate each of
the four regions separately. We also measure the background current. The Trojan circuit,
as shown in Fig. 15.16(b) was designed to invert an output bit of the subtractor if two
input bits were equal. We simulated the circuit with the Trojan in the subtractor module
(occupying 2.7% area of the ALU) at the nominal process corner for the same set of vectors.
Fig. 15.17(a) shows the plot of the average IDDT values for the four different vectors
activating the four different regions without the background current. We can observe that the tampered circuit consumes more current for the vector which activates the subtractor region. We plot the current for one region (adder) with respect to that for another (subtractor) for a
set of golden and tampered chips at 20 different process points in Fig. 15.17(b). We expect
a correlation between the region currents across process corners. However, since there is
a Trojan in the subtractor, it shows uncorrelated behavior in supply current. Hence, the
current for the adder can be used to calibrate the process noise and check for the presence
of Trojan in other modules. In real life, since we do not know the region which contains the
Trojan, we need to compare each region with all others. This also allows us to cancel out
the effect of random and systematic intra-die process variations, as explained next.
Ignoring all second order terms involving both random and systematic shifts of the
threshold voltage, the above equation can be approximated by:
Ig ≈ k·[(VDD − VT)² − 2(VDD − VT)(ΔvTg1 + ΔVTi)] − 2k(VDD − VT)·ΔvTg2   (15.17)

where the bracketed part is constant for each gate g ∈ Ri, while the last term is random for each gate g ∈ Ri.
Summing the currents for all the switching gates of the region Ri , the total switching
current for region Ri is:
Ii = Σ_{g∈Ri} Ig = k·ni·[(VDD − VT)² − 2(VDD − VT)(ΔvTg1 + ΔVTi)] − 2k(VDD − VT)·Σ_{g∈Ri} ΔvTg2   (15.18)

where ni is the number of switching gates in region Ri. Now, the term Σ_{g∈Ri} ΔvTg2 represents the sum of ni (normally distributed) random variables, each with mean μ = 0 and standard deviation σT (say). Hence, by the Central Limit Theorem [291], the average (1/ni)·Σ_{g∈Ri} ΔvTg2 is approximately normally distributed with mean μ = 0 and a reduced standard deviation σT/√ni. Hence, for a reasonably large value of ni, this term is approximately equal to zero.
Hence, the difference between the currents of regions Ri and Rj can be expressed as:
Ii − Ij|observed = k·[(VDD − VT)² − 2(VDD − VT)ΔvTg1]·(ni − nj) − 2k(VDD − VT)(ni·ΔVTi − nj·ΔVTj) = c1(ni − nj) + c2(ni·ΔVTi − nj·ΔVTj)   (15.21)

where c1, c2 are constants, and the term c2(ni·ΔVTi − nj·ΔVTj) is due to systematic intra-die variation. If the contribution due to the intra-die systematic component is
negligible, the above expression can be re-written as:
Sij,golden = (ni − nj)/ni = Sij,observed   (15.24)
Similarly, it can be shown that Sji,golden = Sji,observed . This shows that under negligible
systematic intra-die variations, the ratio of the difference in the switching currents of two
regions and the current of each region should remain approximately unchanged. This equal-
ity fails to be satisfied in case one of the regions is modified by the insertion of a Trojan,
because then the switching current of the gates constituting the Trojan circuit disturbs the
balance. This observation is the main motivation behind using the region slope values for
reducing the process noise. For a circuit with N regions, if we compute the region slope
values for all pairs of regions, we obtain an N × N region slope matrix, with zeros on the
diagonal. It is observed that systematic variations still cause some variations in the Re-
gion Slope values, but the effect of process variation has been reduced greatly compared to
the variations in individual current values, thus giving us improved sensitivity for Trojan
detection.
Next, we describe the major steps which constitute our self-referencing approach for
Trojan detection.
The Trojan detection sensitivity of this approach reduces with decreasing Trojan or
increasing circuit size. In order to detect small sequential/combinational Trojans in large
circuits (> 105 transistors), we need to improve the SNR (Signal-to-Noise Ratio) using
appropriate side-channel isolation techniques. At a single VT point the sensitivity, for an
approach where transient current values are compared for different chips, can be expressed
as:
Sensitivity = (Itampered,nominal − Igolden,nominal) / (Igolden,process variation − Igolden,nominal)   (15.25)
Clearly, the sensitivity can be improved by increasing the current contribution of the Trojan
circuit relative to that of the original circuit. We can divide the original circuit into several
small regions and measure the supply current (IDDT ) for each region. The relationship
between region currents also helps to cancel the process variation effects. In Fig. 15.17(a), if
we consider the “slope” or relative difference between the current values of ‘add’ and ‘sub’
regions, we can see that there is a larger shift in this value due to Trojan than in the original
current value due to process variations. We refer to this approach as the self-referencing
approach, since we can use the relative difference in the region current values to detect a
Trojan by reducing the effect of process variations.
The major steps of the self-referencing approach are as follows. First, we need to perform
a functional decomposition to divide a large design into several small blocks or regions, so
that we can activate them one region at a time. Next, we need a vector generation algorithm
which can generate vectors that maximize the activity within one region while producing
minimum activity in other regions. Also, the chosen set of test vectors should be capable
of triggering most of the feasible Trojans in a given region. Then, we need to perform self-
referencing among the measured supply current values. For this we use a region slope matrix
as described earlier. Finally, we reach the decision making process which is to compare the
matrix values for the test chip to threshold values derived from golden chips at different
process corners, in order to detect the presence or absence of a Trojan. Next we describe
each of the steps in detail.
1. The blocks should be reasonably large to cancel out the effect of random parameter
variations, but small enough to minimize the background current. It should also be
kept in mind that if the regions are too small, the number of regions can become
unreasonably large for the test vector generation algorithm to handle.
2. The decomposition should be done such that the test generation process can increase the activity of one block (or a few blocks) while minimizing the activity of all others.
3. The decomposition process can be performed hierarchically. For instance, a system-
on-a-chip (SoC) can be divided into the constituent blocks which make up the system.
But, for a large SoC, one of the blocks could itself be a processor. Hence, we need to
further divide this structural block into functional sub-blocks.
1. Only one region must be activated at a time. If the inputs to different modules are mu-
tually exclusive and the regions have minimal interconnection, it is easy to maximally
activate one region while minimizing activity in other regions. If complex intercon-
nections exist between the modules, the inputs need to be ranked in terms of their
sensitivity towards activating different modules and the test generation needs to be
aware of these sensitivity values.
2. When a particular region is being activated, the test vectors should try to activate
possible Trojan trigger conditions and should be aimed at creating activity within
most of the innumerable possible Trojans. This motivates us to consider a statistical
test generation approach like the one described in [81] for maximizing Trojan trigger
coverage. Note that, unlike functional testing approaches, the Trojan payload need not
be affected during test time, and the observability of Trojan effect on the side-channel
parameter is ensured by the region-based self-referencing approach described earlier.
Fig. 15.18 shows a flow chart of the test vector generation algorithm on the right. For
each region, we assign weights to the primary inputs in terms of their tendency to maximize
activity in the region under consideration while minimizing activity in other regions. This
step can also identify control signals which can direct the activity exclusively to particular
regions. Next, we generate weighted random input vectors for activating the region under
consideration and perform functional simulation using a graph-based approach, which lets
us estimate the activity within each region for each pair of input vectors. We sort the vectors
based on a metric Cij which is higher for a vector pair which can maximally activate region
Ri while minimizing activity in each of the other regions. Then, we prune the vector set to
choose a reduced but highly efficient vector set generated by a statistical approach such as
MERO [81]. In this approach (motivated by the N-detect test generation technique), within
a region, we identify internal nodes with rare values, which can be candidate trigger signals
for a Trojan. Then we identify the subset of vectors which can take the rare nodes within
the region to their rare values at least N times, thus increasing the possibility of triggering
the Trojans within the region. Once this process is completed for all the regions, we combine
the vectors and generate a test suite which can be applied to each chip for measuring supply
current corresponding to each of its regions.
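The N-detect-style pruning step above can be sketched as a single greedy pass. This is a simplified illustration, not the actual MERO implementation: `rare_value` stands in for a hypothetical functional simulator callback that returns the set of rare nodes driven to their rare values by a given vector, with the rare nodes and their rare values assumed precomputed.

```python
def prune_vectors(vectors, rare_value, N):
    """Greedy MERO-style pruning sketch: keep a vector only if it drives
    some rare node to its rare value while that node is still below N hits,
    so every rare node ends up excited at least N times where possible."""
    counts = {}
    kept = []
    for vec in vectors:
        hits = rare_value(vec)  # set of rare nodes this vector excites
        if any(counts.get(n, 0) < N for n in hits):
            kept.append(vec)
            for n in hits:
                counts[n] = counts.get(n, 0) + 1
    return kept

# Toy illustration with a lookup-table "simulator": vector 3 is redundant
# once node "a" has already been excited N = 2 times.
hits_of = {1: {"a"}, 2: {"a"}, 3: {"a"}, 4: {"b"}}
kept = prune_vectors([1, 2, 3, 4], hits_of.get, 2)
```

In the toy run, vectors 1 and 2 satisfy the N = 2 requirement for node "a", vector 3 adds nothing and is dropped, and vector 4 is kept for node "b".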
For functional test of a multi-core processor, we can use specially designed small test
programs which are likely to trigger and observe rare events in the system such as events on
the memory control line or most significant bits of the datapath multiple times. In general
FIGURE 15.18: The major steps of the proposed self-referencing methodology. The steps
for test vector generation for increasing sensitivity and threshold limit estimation for cali-
brating process noise are also shown.
a design is composed of several functional blocks and activity in several functional blocks
can be turned off using input conditions. For example, in a processor, activity in the floating
point unit (FPU), branch logic or memory peripheral logic can be turned off by selecting an
integer ALU operation. Many functional blocks are pipelined. In these cases, we will focus
on one stage at a time and provide initialization to the pipeline such that the activities of
all stages other than the one under test are minimized by ensuring that the corresponding
stage inputs do not change. Next we describe how the self-referencing approach can be
applied to compare the current values for different regions and identify the Trojan-infected
region.
In this step, we measure the current from different blocks which are selectively activated,
while the rest of the circuit is kept inactive by appropriate test vector application. Then
the average supply current consumed by the different blocks is compared for different chip
instances to see whether the relations between the individual block currents are maintained.
Any discrepancy in the “slope” of the current values between different blocks indicates the
presence of Trojan. This approach can be hierarchically repeated for further increasing sen-
sitivity by decomposing the suspect block into sub-blocks and checking the self-referencing
relationships between the current consumed by each sub-block.
The flowchart for this step is shown in Fig. 15.18. Note that the best Trojan detection
capability of region-based comparison will be realized if the circuit is partitioned into regions
of similar size. The region slope matrix is computed by taking the relative difference between
the current values for each region. We estimate the effect of process variations on the “slopes”
to determine a threshold for separating the golden chips from the Trojan-infested ones. This
can be done by extensive simulations or by measurements from several known-golden chips. As previously discussed, for a design with n regions, the region slope matrix is an n × n matrix,
Sij = (Ii − Ij)/Ii,   ∀ i, j ∈ [1, n]   (15.26)
For each region, we get 2n − 1 slope values, one of which is ‘0’, since the diagonal elements Sii are zero.
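The region slope matrix of eqn. (15.26) is straightforward to compute from the measured region currents. The current values below are illustrative only.

```python
def region_slope_matrix(currents):
    """Compute the n x n region slope matrix S with
    S[i][j] = (I_i - I_j) / I_i  (eqn. 15.26); the diagonal is zero."""
    n = len(currents)
    return [[(currents[i] - currents[j]) / currents[i] for j in range(n)]
            for i in range(n)]

# Illustrative average region currents (arbitrary units) for four regions.
S = region_slope_matrix([10.0, 8.0, 10.0, 5.0])
```

Note that the matrix is not antisymmetric: S[0][1] = (10 − 8)/10 = 0.2 while S[1][0] = (8 − 10)/8 = −0.25, because each row is normalized by its own region current.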
The intra-die systematic variation is eliminated primarily because we use the current
from an adjacent block, which is expected to suffer similar variations, to calibrate process
noise of the block under test. The intra-die random variations can be eliminated by considering the switching of a large number of gates. In our simulations, we find that even the switching of 50 logic gates in a block can effectively cancel out random deviations in the supply current.
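This averaging effect can be checked with a small Monte Carlo sketch. The σ value, trial count, and the assumption of independent, normally distributed per-gate VT shifts are all illustrative, not taken from the original experiments.

```python
import random
import statistics

def residual_spread(n_gates, trials=2000, sigma=1.0, seed=1):
    """Std-dev of the average random V_T shift across n_gates switching
    gates. By the Central Limit Theorem this falls off roughly as
    sigma / sqrt(n_gates)."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n_gates))
             for _ in range(trials)]
    return statistics.stdev(means)

spread_1 = residual_spread(1)    # single gate: spread ~ sigma
spread_50 = residual_spread(50)  # 50 gates: spread ~ sigma / sqrt(50)
```

With 50 switching gates the residual spread shrinks by roughly a factor of √50 ≈ 7, consistent with the claim that a modest number of switching gates suffices to cancel the random component.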
The deviation of a test chip k from the nominal golden chip is quantified as:

D(k) = Σ_{i=1}^{N} Σ_{j=1}^{N} (Sij|Chip k − Sij|golden,nominal)²   (15.27)
The limiting “threshold” value for golden chips can be computed by taking the difference
D(golden, process variations) as defined by
Threshold = Σ_{i=1}^{N} Σ_{j=1}^{N} (Sij|golden,process variation − Sij|golden,nominal)²   (15.28)
Any variation beyond the threshold is attributed to the presence of a Trojan. The steps for
computing the golden threshold limits are illustrated on the left side of Fig. 15.18. Since, unlike conventional testing, a clean go/no-go decision is difficult to achieve, we derive a measure of confidence in the trustworthiness of each region of a chip using an appropriate metric. We compare the average supply current consumed by the different blocks for
different chip instances to see whether the expected correlation between the individual block
currents is maintained. The Trojan detection sensitivity of the self-referencing approach can
be defined as
Sensitivity = D(tampered, nominal) / Threshold   (15.29)
Since the slope values are less affected by process variations compared to the current
values alone, we expect to get better sensitivity compared to eqn. (15.25). Note that since
we perform region-based comparison, we can localize a Trojan and repeat the analysis
within a block to further isolate the Trojan. This approach can be hierarchically repeated
to increase the detection sensitivity by decomposing a suspect block further into sub-blocks
and applying the self-referencing approach for those smaller blocks. We can also see that
the region-based self-referencing approach is scalable with respect to design size and Trojan
size. For the same Trojan size, if the design size is increased two-fold, we can achieve the same sensitivity by dividing the circuit into twice as many regions. Similarly, we can divide
the circuit into smaller regions to increase sensitivity towards detection of smaller Trojan
circuits.
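Putting eqns. (15.27)–(15.29) together, a minimal decision sketch might look as follows. The slope matrices are illustrative, and the threshold is taken here as the worst-case golden deviation over the simulated process corners, which is one reasonable reading of eqn. (15.28).

```python
def distance(S_chip, S_nominal):
    """D(k) from eqn. (15.27): squared deviation of a chip's region slope
    matrix from the nominal golden one."""
    n = len(S_chip)
    return sum((S_chip[i][j] - S_nominal[i][j]) ** 2
               for i in range(n) for j in range(n))

def classify(S_test, S_nominal, S_process_corners):
    """Flag a chip as Trojan-infected if D(test) exceeds the worst-case
    golden deviation across the simulated process corners (eqn. 15.28)."""
    threshold = max(distance(S, S_nominal) for S in S_process_corners)
    return distance(S_test, S_nominal) > threshold

# Illustrative 2x2 slope matrices (values invented for the sketch).
nominal = [[0.0, 0.2], [-0.25, 0.0]]
corners = [[[0.0, 0.21], [-0.26, 0.0]], [[0.0, 0.19], [-0.24, 0.0]]]
golden_chip = [[0.0, 0.205], [-0.255, 0.0]]   # within process spread
trojan_chip = [[0.0, 0.35], [-0.10, 0.0]]     # slope balance disturbed
```

The golden chip's deviation falls inside the process-induced spread, while the Trojan chip's disturbed slope values push D(k) well beyond the threshold.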
FIGURE 15.19: Self-referencing methodology for detecting Trojan in the 32-bit ALU and
FIR circuits. Blue and red lines (or points) denote golden and Trojan chips, respectively.
FIGURE 15.20: Sensitivity analysis with (a) different number of regions, (b) different
circuit sizes, and (c) different Trojan sizes.
TABLE 15.4: Probability of Detection and Probability of False Alarm (False Positives)
Circuit Name TN(%) FP(%) FN(%) TP(%)
32-bit ALU 99.10 0.90 5.90 94.10
FIR 97.72 2.28 6.60 93.40
pairs activating the subtracter module. The region slope matrix for this case is shown in
Fig. 15.19(b). This matrix corresponds to 8 regions, since each of the four structurally separate regions of the ALU is further divided into two sub-blocks corresponding to the two different
vector pairs which share the same opcode values. It can be readily observed that increasing
the number of regions increases the sensitivity of Trojan detection.
Fig. 15.19(c) shows the simulation results for the FIR design. The test vectors are chosen
by the MATLAB tool and used to dominantly activate different regions of the design. The
region slope matrix is computed for 50 golden chips and 50 Trojan-infected chips and we can
successfully detect the Trojan-infected region (region 4). Fig. 15.20 shows the variation in
sensitivity of the self-referencing approach by varying different parameters of the ALU. For
a 16-bit ALU, we see that increasing the number of regions helps increase the sensitivity in
Fig. 15.20(a). In Fig. 15.20(b), we plot the sensitivity of the approach for increasing circuit
sizes. Finally in Fig. 15.20(c), we show that increasing the number of regions also helps
to keep the sensitivity nearly constant as we scale down the Trojan size. The percentage
of true positives, true negatives, false positives and false negatives as obtained from the
Monte Carlo simulations are presented in Table 15.4. We used a process point with 20% VT
variation to compute the threshold. For smaller circuits and larger Trojans the sensitivity
is higher and hence, the accuracy of classification is also better.
FIGURE 15.21: Experimental results for 8 golden and 2 tampered FPGA chips. Region
slope matrix for (a) 32-bit DLX processor; (b) 32-bit ALU. The limit lines are obtained by
analyzing the 8 golden chips. The red points denote the values for the Trojan-containing
test chips while the blue points denote the values for the golden chips.
which contains the Trojan in both cases. Next, we repeat the procedure using test vectors
which only activate the four sub-regions inside the 32-bit ALU and identify that the Trojan
is located within the subtracter.
15.5.6 Conclusions
In this chapter, we have described techniques for hardware Trojan detection that depend
on the effect of an inserted Trojan on the current consumption of the circuit. The main
challenge is to have high Trojan detection sensitivity in the presence of process variation
noise. In addition, we have shown how to increase the sensitivity of the techniques further
by combining them with logic testing techniques. We have demonstrated the effectiveness
of the two proposed experimental techniques, and have given theoretical justifications for
them.
“Byrne’s Law: In any electrical circuit, appliances and wiring will burn
out to protect fuses.”
Robert Byrne
16.1 Introduction
In the two previous chapters, we have examined testing techniques for Trojan detection.
Logic-testing based Trojan detection techniques are effective for small Trojans that do affect
the circuit logic functionality, and side-channel analysis based testing techniques are effective
for complex Trojans that affect the side-channel parameters (current signature, delay, etc.)
substantially. However, there can be a third line of thought when one tries to protect against hardware Trojans: what if the original circuit, by its very design, provides protection against Trojan insertion, or at least resists attempts at Trojan insertion?
In this chapter, we explore three conceptually different design techniques directed towards Trojan detection following the above-mentioned line of thought. The first technique relies on obfuscation of circuit functionality through structural modifications that help in reducing the potency of an inserted hardware Trojan. The technique is developed based on the argument that any successful attempt to insert a hardware Trojan would require a thorough understanding of the circuit functionality on the part of an adversary. However, as a result of the adopted functional obfuscation through structural modifications, an adversary might find it difficult to correctly comprehend the circuit functionality. This in turn might cause an inserted Trojan either to not trigger, or to become more vulnerable to logic-testing based Trojan detection techniques, like the one proposed in Chapter 14. A conceptually similar protection scheme was earlier suggested in [399] for software, whereby malicious software modification is prevented by decreasing the comprehensibility of a program.
The second technique relies on structural and functional isolation of modules on an FPGA to prevent the possibility of their interacting in a malicious manner.
The third and final technique provides an FPGA-based design infrastructure to insert extra logic that performs run-time execution monitoring to detect and prevent any imminent circuit logic malfunction.
FIGURE 16.1: The obfuscation scheme for protection against hardware Trojans: (a) mod-
ified state transition graph and (b) modified circuit structure.
easily hidden from the end-user by utilizing the inherent latency of most ASICs during
a “boot-up” or similar procedure on power-ON [77]. We consider the post-manufacturing
testing phase to be “trusted”, such that there is no possibility of the secret initialization key being leaked to the adversary in the fab. This is a commonly accepted convention which
was first explicitly stated in [107]. To protect against the possibility of a user releasing the initialization key sequence of the design into the public domain, a user-specific initialization key sequence, or in the extreme case an instance-specific initialization key sequence, might be employed.
To “blow up” the size of the obfuscation state space, a number of extra state elements
are added depending on the allowable hardware overhead. The size of the obfuscation state
space has an exponential dependence on the number of extra state elements. An inserted
parallel finite state machine (PSM) defines the state transitions of the extra state elements.
However, to hide a possible structural signature formed by the inserted PSM, the circuit description of the PSM is folded into the modified state machine of the original circuit (MOSM) (as shown in Fig. 16.1(b)) to generate an integrated state machine. A logic re-synthesis step is performed, including logic optimization under design constraints in terms
of delay, area or power. In effect, the circuit structures such as the input logic cones of the
original state elements change significantly compared to the unobfuscated circuit, making
reverse-engineering of the obfuscated design practically infeasible for an adversary. This effect
is illustrated in Section 16.2.5 through an example benchmark circuit.
To increase the level of structural difference between the obfuscated and the original
circuits, the designer can choose to insert modification cells as proposed in [77] at selected
internal nodes. Furthermore, the level of obfuscation can be increased by using more states
in the obfuscated state space. This can be achieved by: 1) adding more state elements to
the design and/or 2) using more unreachable states from the original design. However, this
can increase the design overhead substantially. In Section 16.4.1 we describe a technique to
reduce the design overhead in such cases.
Selected states in the isolation state space can also serve the purpose of authenticating
the ownership of the design, as described in [77]. Authentication for sequential circuits is
usually performed by embedding a digital watermark in the STG of the design [276, 381],
and our idea of hiding such information in the unused states of the circuit is similar to
[426]. A digital watermark is a unique characteristic of the design which is usually not
part of the original specification and is known only to the designer. Fig. 16.1 shows such a
scheme where the states S0A , S1A and S2A in the isolation state space and the corresponding
output values of the circuit are used for the purposes of authenticating the design. The
design goes through the state transition sequence S0O →S0A →S1A →S2A on the application
of the sequence A1 →A2 →A3 . Because these states are unreachable in the normal mode of
operation, they and the corresponding circuit output values constitute a property that was
not part of the original design. As shown in [77], the probability of determining such an
embedded watermark and masking it is extremely small, thus establishing it as a robust
watermarking scheme.
assuming k·2^(−p) ≪ 1. Similarly, the probability that the simulations started in the initialization state space or the isolation state space and remained confined there is:

P(T ⊆ I ∪ U′) = [1 − k·2^(−p) / ((f1·2^n·B + M)·(1 − 2^(−p)))] · f1·2^n·B / (f1·2^n·B + M)   (16.3)

where U′ denotes the complement set of U. Again, the probability that the simulations
started in the normal state space, and remained confined there is:
P(T ⊆ U) = M / (f1·2^n·B + M)   (16.4)
To maximize the probability of keeping the simulations confined in the obfuscation state
space, the designer should ensure:
P(T ⊆ I ∪ U′) ≫ P(T ⊆ U) + P(T ⊆ I ∪ U)   (16.5)

Approximating M ≫ k·2^(−p)/(1 − 2^(−p)), and simplifying, this leads to:

f1·2^n·B ≫ M   (16.6)
This equation essentially implies the size of the obfuscation state space should be much
larger compared to the size of the normal state space, a result that is intuitively expected.
From this analysis, the two main observations are:
• The size of the obfuscation state space has an exponential dependence on the number of extra state elements added.
• In a circuit where the size of the used state space is small compared to the size of the
unused state space, higher levels of obfuscation can be achieved at lower hardware
overhead.
As an example, consider the ISCAS-89 benchmark circuit s1423 with 74 state elements (i.e. N = 74), and more than 2^72 unused states (i.e., 2^N − M > 2^72) [424]. Then, M < 1.42×10^22, and considering 10 extra state elements added (i.e. n = 10), f1 > 0.0029 for eqn. (16.6) to hold.
Thus, expanding the state space in the modified circuit by about 3% of the available unused
state space is sufficient in this case.
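The arithmetic of this example can be reproduced under the assumption that B equals the number of unused states of the original design, i.e. B = 2^N − M; with that reading, the s1423 boundary numbers reproduce f1 ≈ 0.0029.

```python
def min_f1(N, n, M):
    """Value of f1 at which f1 * 2**n * B equals M in eqn. (16.6), under
    the assumption that B = 2**N - M (number of unused original states).
    f1 must exceed this value (by a wide margin) for eqn. (16.6) to hold."""
    B = 2 ** N - M
    return M / (2 ** n * B)

# s1423 boundary case: N = 74, n = 10, M at its upper bound 2^74 - 2^72.
f1_bound = min_f1(74, 10, 3 * 2 ** 72)
```

At the bound, B = 2^72 and f1_bound = 3/1024 ≈ 0.0029, matching the figure quoted in the text.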
test vectors. However, let the actual rare logic value probabilities of these internal nodes be
pi + ∆pi , for the i-th trigger node. Then, the Trojan would be actually activated once (on
average) by:
N′ = 1 / ∏_{i=1}^{q} (pi + Δpi) = N / ∏_{i=1}^{q} (1 + Δpi/pi)   (16.8)
test vectors. The difference between the estimated and the actual number of test vectors before the Trojan is activated is ΔN = N − N′, which leads to a percentage normalized difference:

ΔN/N (%) = [1 − 1 / ∏_{i=1}^{q} (1 + Δpi/pi)] × 100%   (16.9)
To appreciate the effect that Δp and q have on this change in the average number of vectors that can activate the Trojan, assume Δpi/pi = f ∀ i = 1, 2, …, q; then eqn. (16.9) can be simplified to:

ΔN/N (%) = [1 − 1/(1 + f)^q] × 100%   (16.10)
Fig. 16.2 shows this fractional change plotted vs. the number of trigger nodes (q) for
different values of the fractional mis-estimation of the signal probability (f ). From this plot
and eqns. (16.9) and (16.10), it is evident that:
• The probability of the Trojan getting detected by logic testing increases as the number
of Trojan trigger nodes (q) increases. However, it is unlikely that the adversary will
use more than 10 trigger nodes, because otherwise, as shown by our simulations, it
becomes extremely difficult to trigger the Trojans at all.
• For values 2 ≤ q ≤ 10, the number of random input patterns required to activate the
Trojan decreases sharply with q. The improvement is more pronounced at higher values
of f. This observation validates the rationale behind an obfuscation-based design
approach that prevents the adversary from correctly estimating the signal probabilities
at the internal circuit nodes.
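Eqn. (16.10) is easy to evaluate numerically. The sketch below tabulates the fractional change for a few illustrative (f, q) pairs (these are not necessarily the exact values plotted in Fig. 16.2):

```python
def delta_N_percent(f: float, q: int) -> float:
    """Percentage change in the average number of vectors needed to
    trigger a Trojan, per eqn. (16.10): (1 - 1/(1+f)^q) * 100%."""
    return (1.0 - 1.0 / (1.0 + f) ** q) * 100.0

# Tabulate the effect of mis-estimation f and trigger-node count q.
for f in (0.1, 0.5, 1.0):
    for q in (2, 4, 10):
        print(f"f={f:.1f} q={q:2d} -> {delta_N_percent(f, q):6.2f}%")
```

The table confirms the two observations above: the change grows with q, and faster for larger f.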
FIGURE 16.2: Fractional change in average number of test vectors required to trigger a
Trojan, for different values of average fractional mis-estimation of signal probability f and
Trojan trigger nodes (q).
FIGURE 16.3: Comparison of input logic cones of a selected flip-flop in s15850: (a) original
design and (b) obfuscated design.
FIGURE 16.4: Steps to find unreachable states for a given set of S state elements in a
circuit.
input logic cones (up to 4 levels) of a selected flip-flop in the gate-level netlist of the s15850
ISCAS-89 benchmark and its obfuscated version, shown in Fig. 16.3. Similarly significant
structural differences were observed for all the benchmark circuits we considered. If the
adversary is not in possession of an unmodified reference gate-level design, the task is even
more difficult, as the adversary would have no idea about the netlist structure of the original
design. The theoretical complexity of reverse-engineering similar key-based obfuscation
schemes for circuits has been analyzed in [229, 208], where such systems were shown to be
"provably secure" because of the high computational complexity.
logic-1 event in those nodes, we select a set of candidate trigger nodes with Sp less than a
specified trigger threshold (θ). Next, starting from a large set of weighted random vectors, a
smaller testset is generated to excite each of these candidate trigger nodes to its rare value
at least N times, where N is a given parameter. This is done because excitation of each
rare node individually to its corresponding rare value multiple times is likely to increase
the probability of the Trojans triggered by them to get activated, as shown by the analysis
in [81]. We have observed through extensive simulations on both combinational (ISCAS-85)
and sequential (ISCAS-89) benchmark circuits that such a statistical test generation
methodology can achieve higher Trojan detection coverage than a weighted random vector
set, with an 85% reduction in test length on average [81]. Note that for sequential circuits,
we assume a full-scan implementation. Sequential justification is applied to eliminate false
Trojans, i.e. Trojans which cannot be triggered during the operation of the circuit.
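The statistical test generation idea can be sketched as a greedy vector-compaction loop (a toy stand-in, not the actual tool described above; the node names and excitation model below are made up for illustration):

```python
import random

def compact_testset(vectors, excites, N):
    """Greedy reduction in the spirit of the statistical test generation
    described above: a vector is kept only if it pushes at least one
    candidate rare node closer to its target of N excitations."""
    counts = {}            # rare node -> times excited by kept vectors
    kept = []
    for v in vectors:
        nodes = excites(v)                      # rare nodes v excites
        if any(counts.get(nd, 0) < N for nd in nodes):
            kept.append(v)
            for nd in nodes:
                counts[nd] = counts.get(nd, 0) + 1
    return kept

# Illustrative use: 3 rare nodes, each excited by ~30% of random vectors.
random.seed(1)
rare_nodes = ["n1", "n2", "n3"]
vecs = list(range(1000))
table = {v: {nd for nd in rare_nodes if random.random() < 0.3} for v in vecs}
reduced = compact_testset(vecs, lambda v: table[v], N=5)
print(f"reduced {len(vecs)} vectors to {len(reduced)}")
```

Every candidate rare node is still excited at least N times by the reduced set, which is the property the compaction must preserve.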
nodes for which the Sp values differ by a pre-defined threshold. The difference in estimated
Sp prevents an adversary from exploiting the true rare events at the internal circuit nodes
to design a hard-to-detect Trojan. On the other hand, true non-rare nodes may appear as
rare in the obfuscated design, potentially serving as decoys for the adversary. These two
effects are summed up by the increase in Trojan detection coverage due to obfuscation.
The coverage increase is estimated by comparing the respective coverage values obtained
for the obfuscated and the original design for the same number of test patterns.
FIGURE 16.6: Variation of protection against Trojans in s1196 as a function of (a), (b)
and (c): the number of added flip-flops (n); (d), (e) and (f): the number of original state
elements used in state encoding (S). For (a), (b) and (c), four original state elements were
selected for state encoding, while for (d), (e) and (f), four extra state elements were added.
The RTL is then integrated with the original gate-level netlist, with appropriate control
signals to enable the operation in the two different modes. The modified circuit is then
re-synthesized under input design constraints using Synopsys Design Compiler to generate
the obfuscated version of the circuit. If the area of the re-synthesized circuit is larger than
the user-specified area overhead constraint, S and n are each decreased by one and the
process is repeated until the area constraint is satisfied for the obfuscated design.
We assumed the Trojan model shown in Fig. 13.5(a). We wrote three C programs to
estimate the effectiveness of the proposed obfuscation scheme for protection against Trojans.
The signal probabilities at the internal nodes are computed by the program RO-Finder
(Rare Occurrence Finder). The testset for Trojan detection, achieving multiple excitation
of rare trigger conditions, is generated by the program ReTro (Reduced pattern generator
for Trojans). The reduced pattern set, obtained by eliminating the patterns with states in
the obfuscation state space, is generated by a TCL program. The decrease in Trojan potency
and the increase in Trojan detectability are then estimated by a cycle-accurate simulation
of the circuit using the simulator TrojanSim (Trojan Simulator). TetraMax is used for
sequential justification of the Trojan triggering conditions.
Fig. 16.5 shows the steps to estimate the effectiveness of the obfuscation scheme [79]. The
entire flow was integrated with the Synopsys design environment using TCL scripts. A LEDA
250nm standard cell library was used for logic synthesis. All simulations, test generation
and logic synthesis were carried out on a Hewlett-Packard Linux workstation with a 2GHz
dual-core processor and 2GB RAM.
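A rare-occurrence finder of this kind can be approximated in a few lines: simulate random input vectors and estimate each internal node's logic-1 probability (a sketch over a toy two-gate netlist, not the actual RO-Finder tool):

```python
import random

def estimate_signal_probs(netlist, inputs, n_vectors=10000, seed=0):
    """Estimate P(node = 1) for each internal node by random simulation.
    netlist: dict node -> (fn, fanin_names), listed in topological order."""
    rng = random.Random(seed)
    ones = {node: 0 for node in netlist}
    for _ in range(n_vectors):
        values = {i: rng.randint(0, 1) for i in inputs}
        for node, (fn, fanin) in netlist.items():
            values[node] = fn(*(values[f] for f in fanin))
            ones[node] += values[node]
    return {node: cnt / n_vectors for node, cnt in ones.items()}

# Toy netlist: g1 = a AND b (P ~ 1/4), g2 = g1 AND c (rare: P ~ 1/8).
netlist = {
    "g1": (lambda a, b: a & b, ("a", "b")),
    "g2": (lambda g1, c: g1 & c, ("g1", "c")),
}
probs = estimate_signal_probs(netlist, ["a", "b", "c"])
print(probs)
```

Nodes whose estimated probability falls below the trigger threshold θ would then be reported as candidate rare nodes.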
TABLE 16.1: Effect of Obfuscation on Security Against Trojans (100,000 random patterns,
20,000 Trojan instances, q = 2, k = 4, θ = 0.2)

Benchmark   Trojan      Obfus. Flops   Benign        False Prob.   Func. Troj.
Circuit     Instances   (n + S)        Trojans (%)   Nodes (%)     Cov. Incr. (%)
s1488       192         8              38.46         63.69         0.00
s5378       2641        9              40.13         85.05         1.02
s9234       747         9              29.41         65.62         1.09
s13207      1190        10             36.45         83.59         0.56
s15850      1452        10             40.35         68.95         2.65
s38584      342         12             33.88         81.83         0.45
TABLE 16.2: Effect of Obfuscation on Security Against Trojans (100,000 random patterns,
20,000 Trojan instances, q = 4, k = 4, θ = 0.2)

Benchmark   Trojan      Benign        False Prob.   Func. Troj.
Circuit     Instances   Trojans (%)   Nodes (%)     Cov. Incr. (%)
s1488       98          60.53         71.02         12.12
s5378       331         70.28         85.05         15.00
s9234       20          62.50         65.62         25.00
s13207      36          80.77         83.59         20.00
s15850      124         77.78         79.58         18.75
s38584      11          71.43         77.21         50.00
16.4 Results
To verify the trends predicted in Section 16.2.3, we investigated the effects of adding
extra state elements (n) and of unreachable states determined from a variable number of
existing state elements (S) on the level of protection against Trojans. Fig. 16.6 shows the variation
in the percentage of Trojans rendered benign, percentage of internal nodes with false signal
probability, and the percentage increase in detectability of Trojans for the s1196 benchmark
circuit. These plots clearly show the increasing level of protection against Trojans with the
increasing size of the obfuscation state space, which matches the theoretical predictions in
Section 16.2.3.
Table 16.1 and Table 16.2 show the effects of obfuscation on increasing the security
against hardware Trojans for a set of ISCAS-89 benchmark circuits with 20,000 random
instances of suspected Trojans, a trigger threshold (θ) of 0.2, and trigger node counts (q) of 2
and 4, respectively. An optimized vector set was generated using N = 1000. The same value of n + S
applies to both sets of results. The length of the initialization key sequence was 4 (k = 4)
for all the benchmarks. The effect of obfuscation was estimated by three metrics: (a) the
fraction of the total population of structurally justifiable Trojans becoming benign; (b) the
difference between the signal probabilities at internal nodes of the obfuscated and original
circuit, and (c) the improvement in the functional Trojan coverage, i.e. the increase in the
percentage of valid Trojans detected by logic testing. Note that the number of structurally
justifiable Trojans (as determined by TetraMax) decreases with the increase in the number
of trigger nodes of the Trojan, and increasing size of the benchmark circuits. From the tables
it is evident that the construction of the obfuscation state space with even a relatively small
number of state elements (i.e. a relatively small value of n + S) still makes a significant
fraction of the Trojans benign. Moreover, it obfuscates the true signal probabilities of a
large number of nodes.
FIGURE 16.7: Effect of obfuscation on Trojans: (a) 2-trigger node Trojans (q = 2), and
(b) 4-trigger node Trojans (q = 4).
The obfuscation scheme is more effective for 4-trigger node Trojans.
This is expected since a Trojan with larger q is more likely to select at least one trigger
condition from the obfuscation state space.
Fig. 16.7 shows the two different effects by which Trojans are rendered benign (as
discussed in Section 16.2.3): some of them are triggered only in the obfuscation state
space, while the effects of some propagate to the primary output only in the obfuscation
state space. In these plots, the greater effectiveness of the obfuscation approach for 4-trigger
node Trojans is again evident.
Fig. 16.8 shows the improvement in Trojan detection coverage in the obfuscated design
compared to the original design for the same number of random vectors. This plot illustrates
the net effect of the proposed obfuscation scheme in increasing the level of protection against
Trojans, with an average increase of 14.83% for q = 2 and 20.24% for q = 4. The greater
effectiveness for q = 4 agrees with the theoretical observation in Section 16.2.4.
Table 16.3 shows the design overheads (at iso-delay) and the run-time for the proposed
obfuscation scheme. The proposed scheme incurs modest area and power overheads, and
the design overhead decreases with increasing size of the circuit. The results and trends are
comparable with the STG modification based watermarking schemes proposed in [276, 426].
As mentioned earlier, the level of protection against Trojans can be increased by choosing
a larger n + S value at the cost of greater design overhead. The run-time presented in the
table is dominated by TetraMax, which takes more than 90% of the total time for sequential
justifications.
16.4.1 Discussions
16.4.2 Application to Third-party IP Modules and SoCs
Pre-designed third-party hardware IP blocks, which have been supplied either as syn-
thesizable “Register Transfer Level” (RTL) descriptions (also known as “soft macros”), or
as synthesized gate-level netlists (also known as “firm macros”), can be modified to imple-
ment the proposed methodology. For RTL descriptions, obfuscation of an IP module can
be achieved in two ways. In the first, “direct method”, the used and unused states of the
circuits can be identified by direct analysis of the control and data flow graph (CDFG) of
the derived from the RTL, and then additional RTL code can be automatically generated
to realize the change in the STG of the circuit. An automated design flow for performing
similar key-based control-flow obfuscation for the purpose of IP protection of RTL designs
has been previously proposed in [80]. In the second, “indirect method”, the RTL description
can be synthesized to a gate-level netlist to apply the proposed technique.
The proposed technique can also be extended to multi-IP system-on-chips (SoCs), even
in those cases where the communication fabric of the SoC is custom-designed. This is possi-
ble since the proposed methodology does not depend on the structure and communication
protocols used by the communication fabric of the SoC. A SoC design methodology that
employs key-based obfuscation of hardware IP modules has been previously proposed in
[77].
FIGURE 16.9: Comparison of conventional and proposed SoC design flows. In the pro-
posed design flow, protection against malicious modification by untrusted CAD tools can
be achieved through obfuscation early in the design cycle.
tially used as Trojan triggers or payloads. Moreover, since a large number of states belongs to
the obfuscation state space, a Trojan inserted randomly by an automation tool is very likely
to be effective only in the obfuscation mode. Note that since we obfuscate the gate-level
netlist, protection against CAD tools can be achieved during the design steps following logic
synthesis (e.g. during physical synthesis and layout).
To increase the scope of protection by encompassing the logic synthesis step, we propose
a small modification in the obfuscation-based design flow. Fig. 16.9 compares a conventional
IP-based SoC design flow with the proposed modified design flow. In the conventional de-
sign flow, the RTL is directly synthesized to a technology mapped gate-level netlist, and
obfuscation is applied on this netlist. However, in the modified design flow, the RTL is first
compiled to a technology independent (perhaps unoptimized) gate-level description, and
obfuscation is applied on this netlist. Such a practice is quite common in the industry, and
many commercial tools support such a compilation as a preliminary step to logic synthesis
[374]. The obfuscated netlist is then optimized and technology mapped by a logic synthesis
tool. Note that the logic synthesis step now operates on the obfuscated design, which pro-
tects the design from potential malicious operations during logic synthesis. Also, the RTL
compilation (without logic optimization) is a comparatively simpler computational step for
which the SoC design house can employ a trusted in-house tool. This option provides an
extra level of protection.
This proposed obfuscation methodology also provides protection against malicious CAD
tools in Field Programmable Gate Array (FPGA) based design flows. As noted in [107], the
main threat of Trojan insertion in such a flow comes from the CAD tools which convert the
RTL description of a design to the FPGA device specific configuration bitstream. Typically,
the fabric itself can be assumed to be Trojan-free [107]. Similar to the SoC design flow,
we propose a small modification to the FPGA design flow that maximizes the scope of
protection against FPGA CAD tools. Fig. 16.10 shows the proposed design flow. The RTL
FIGURE 16.10: Proposed FPGA design flow for protection against CAD tools.
FIGURE 16.11: Obfuscation for large designs can be efficiently realized using multiple
parallel state machines, constructed from new states (due to additional state elements) as
well as unreachable states of the original state machine.
FIGURE 16.12: Functional block diagram of a crypto-SoC showing a possible Trojan attack
to leak the secret key stored inside the chip. Obfuscation coupled with bus scrambling can
effectively prevent such an attack.
verification between the integer unit in Fig. 16.12 and a functionally equivalent reference
design to find the port association.
However, if all the modules in the given SoC are obfuscated using the proposed approach,
it would be practically infeasible for a formal verification tool to establish structural equiv-
alence [77]. The other choice left to the attacker is to simulate the circuit by applying input
vectors. For simplicity, assume all modules in the SoC are initialized simultaneously and
by the same initialization key sequence. Then, to reach the normal mode of operation, the
adversary first needs to apply the correct (unknown) initialization vectors in the correct order
to enable the normal operating mode of the IC. Only then would the adversary be able to establish
the actual bus order through simulations, the complexity of which has already been shown
to be extremely high. The probability of succeeding in reaching the normal mode by the
application of random vectors to the primary inputs of a SoC with M primary inputs and an
initialization key sequence of length N is 1/2^(M·N). Assuming the SoC shown in Fig. 16.12 has
32 inputs, and assuming the length of the initialization key sequence to be 4, the probability
of the adversary taking the obfuscated SoC to the normal mode is 2^(−128) ≈ 10^(−39). The width
of the data bus for the key is typically 128 or 256, which would increase the complexity
exponentially. A similar argument can be presented for Trojans of the type shown in Fig.
13.5(d).
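The probability figure above can be verified with one line of arithmetic on the stated parameters (M = 32 primary inputs, key sequence length N = 4):

```python
# Probability of randomly reaching normal mode: 1 / 2^(M*N),
# with M = 32 primary inputs and key sequence length N = 4.
M_inputs, N_len = 32, 4
p = 2.0 ** -(M_inputs * N_len)
print(f"p = {p:.3e}")
```

This evaluates to about 2.9×10^(−39), consistent with the ~10^(−39) quoted above.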
Usually, automatic place-and-route CAD tools for FPGAs do not isolate cores placed on
the same FPGA. Hence, for large designs, it becomes extremely difficult to verify the
connectivity in such overlapped place-and-route situations. The authors implemented the moats
by performing extensive manual placement of the cores on the FPGA fabric, and verified
the connectivity using custom tools they wrote.
Fig. 16.13 shows a scenario where several cores on a FPGA are isolated by restricting
their placement to different non-overlapping locations of the FPGA. If the interconnects
connecting the different “switch-boxes” on the FPGA routing fabric are restricted to a
width w, then it is sufficient to separate the modules by a moat of width w while performing
placement. Thus, w = 2 in the above scenario. However, restricting the interconnect lengths
to some limiting values usually results in higher routing-related overhead. For the Xilinx
FPGA platform, the authors reported 15% average increase in area and 19% average increase
in delay.
to prevent any disastrous circuit malfunction from happening. Thus, instead of fighting an
invisible enemy, the approach is to detect any deviation from the expected circuit operation,
and take proactive preventive measures when such a situation is detected.
Since it might sometimes be difficult to distinguish between a circuit malfunction due to
a Trojan and one due to a circuit defect, these techniques are essentially variants of fault-tolerant
circuit design techniques, which often use checksums at important internal circuit nodes to detect
the state of the circuit. The design is analyzed, and the given HDL code is modified to
instantiate special reconfigurable, synthesizable error-checking hardware primitives. The
authors term such add-on "infrastructure logic" "DEFENSE" (an acronym for "Design-for-Enabling-Security").
The authors describe the features of a commercial CAD software
tool to perform the above task, with the user having the liberty to specify the nodes to be
monitored.
The checker circuitry periodically performs sanity checking during run-time (without
disrupting normal functionality), and reports any deviation from expected behavior on-
the-fly. The assertions required to perform the sanity checking, and the comparison of the
observed internal logic values with the expected values are performed by re-configurable
FSMs called "Security Monitors" (SMs). These FSMs have to be specified by the designer.
SMs get their inputs from distributed, pipelined MUX networks termed “Security Probe
Networks” (SPNs). SPNs and SMs are configured by a master controller called the “Security
and Control Processor" (SECORPRO). The configuration of one SM disrupts neither the
normal system operation nor the checking activity of another SM. If any deviation from
expected behavior is detected, the SECORPRO is informed by the SM that detected the
deviation, and this causes the SECORPRO to take remedial measures, such as disabling the
misbehaving core. Fig. 16.14 shows the overall scheme. System recovery has to be performed
by supervising hardware/software, based on any signal that SECORPRO might generate.
The DEFENSE logic is inserted at the RTL, and then the modified RTL can be subjected
to the usual IP–based IC design flow. The SECORPRO is configured by first decrypting an
encrypted configuration bitstream stored on a secure flash memory module inaccessible to
the chip manufacturer, and any potential attacker. However, the authors do not mention
the hardware or performance overhead figures.
"The mantra of any good security engineer is: Security is not a product,
but a process. It's more than designing strong cryptography into a
system; it's designing the entire system such that all security measures,
including cryptography, work together."
Bruce Schneier
17.1 Introduction
The pervasive use of hardware devices makes them vulnerable to physical attacks due
to their easy availability to adversaries. To have a secure “digital life”, we need to ensure
the security of the hardware components along with that of the software. As mentioned
in previous chapters, hardware counterfeiting is pervasive in the modern world, and causes
great revenue loss to the hardware industry. It also compromises a user’s safety and security
as she is unsure about the origin of the electronic devices being used, which might carry
out surreptitious malicious activities.
Most practical security mechanisms are based on some secret information. In traditional
cryptography-based solutions, this secret information is used as an input (key) to the
encryption/decryption algorithm. While cryptographic algorithms are mathematically secure
against attack, it is known that a digitally-stored secret key in on/off-chip memory can be
vulnerable to physical attacks. In security tokens such as smart cards, the secret key is stored
in on-chip non-volatile memory, but FPGAs instead store the key in off-chip memory.
Physically Unclonable Functions (PUFs) offer an efficient alternative to storing
secret keys in on/off-chip memory. They exploit the uncontrollable, intrinsic physical
characteristics of silicon devices that arise from process variations in the manufacturing of
integrated circuits (ICs). A PUF maps a set of digital input vectors, known as challenges,
to a corresponding set of outputs, known as responses, for use as a unique fingerprint after
the required amount of post-processing. The basic property of a PUF is that the challenge-to-response
mapping is instance-specific and cannot be replicated by any known physical
means. Thus, it can be used as a volatile key storage mechanism and deters invasive
attacks aimed at key discovery.
2. Unique: Γ(c) contains some information about the identity of the physical entity em-
bedding Γ.
5. Unpredictable: given only a set Q = {(ci , ri = Γ(ci ))}, it is hard to predict ru = Γ(cu )
where cu ∉ Q.
Research in the field of PUFs started in 2000 with the seminal work of Lofstrom et al.
[224], which exploited mismatch in silicon devices for the identification of ICs. In 2001, Pappu et
al. [293] presented the concept of a physical one-way function, which subsequently led to the
idea of the PUF [128]. Since then, different types of PUFs have been proposed. The type of PUF
that soon drew the most attention of researchers is the Silicon PUF (sPUF). B. Gassend
et al. [128] were the first to implement a silicon PUF in commodity Field Programmable Gate
Arrays (FPGAs). Table 17.1 shows a chronological list of progress on PUF design starting
from 2000. A more detailed study of the different types of PUFs is presented in [234]. In
the next section we present a detailed classification of PUFs depending on diverse aspects:
• The Optical PUF (see Fig. 17.1) could be considered the first proposed PUF [294],
although it was originally proposed as the physical embodiment of a (cryptographic)
one-way function. The core component of an optical PUF is a transparent token with
randomly doped scattering particles. When irradiated with a laser, a complex image with
bright and dark spots arises, a so-called "speckle pattern". A Gabor filter turns out
to be a good feature extractor for such a pattern, and the filter output is the response
of the optical PUF, while the physical parameters of the laser (location, orientation,
wavelength) constitute the challenge. Due to the complex nature of the interaction
of the laser light with the scattering particles, the responses are highly random and
unique. The high dependence of the response on the exact microscopic physical details
of the optical token causes two identically produced tokens to exhibit different responses
to the same challenge, and prevents a particular token from being cloned with high
precision.
• The Silicon PUF exploits uncontrollable CMOS manufacturing variations, which result
from unavoidable imperfections in modern IC fabrication processes. Manufacturing
variations of parameters such as dopant concentrations and line widths manifest
themselves as differences in timing behavior between instances of the same IC. These
timing differences can be measured using a suitable circuit and, if desired, encoded
into a digital value. Ideally, a silicon PUF should not require a deviation from the
normal CMOS processing steps, and should be implementable using standard EDA design
flows. It has been observed that these PUFs are quite sensitive to temperature variations,
and compensation schemes for this effect have to be implemented to make
the system work properly. More details about silicon PUFs are presented in [234].
• Challenge-Response Pairs: The number of CRPs must be very large; often it is exponential
with respect to some system parameter, for example, the number of components
used for building the PUF.
• Practicality and Operability: The CRPs should be sufficiently stable and robust to
variations in environmental conditions and multiple readings.
• Access Mode: Any entity that has access to the Strong PUF can apply multiple challenges
to it and can read out the corresponding responses. There is no protected,
controlled or restricted access to the PUF's CRPs.
• Security: Without physically possessing a Strong PUF, neither an adversary nor the
PUF's manufacturer can correctly predict the response to a randomly chosen challenge
with high probability.
Weak PUFs possess only a small number of fixed challenges, and their responses may remain
secret and internal.
It is worth mentioning that the meanings of strong and weak pseudorandom functions in
classical cryptography are different from the notions of weak and strong PUFs described
above. In cryptography, weak does not refer to the number of CRPs, but to the ability of
the adversary to select his queries adaptively.
PUFs, for example when rPUF-based wireless access tokens are re-used, the new user would
not be able to access privacy-sensitive information of the previous user of the token.
Physical reconfiguration changes the physical structure of the PUF. In [214], the authors
describe physical reconfiguration of an optical PUF: driving the laser at a higher current
creates a beam of higher intensity, which melts the polymer locally and enables the
scattering particles to reposition. After a short time the laser beam is removed and
the structure cools down such that the particles freeze. Physical reconfiguration of silicon
PUFs is possible only when they are implemented on reconfigurable hardware like FPGAs.
This is a clear indication of the inapplicability of physical reconfiguration to most
PUF implementations. In [184], the authors present the concept of the Logically Reconfigurable
PUF (LR-PUF), a practical alternative to physically reconfigurable PUFs. LR-PUFs
consist of a PUF with a control logic that changes the challenge/response behavior
of the LR-PUF according to its logical state, without physically replacing or modifying the
underlying PUF.
All reconfigurable PUFs have forward and backward unpredictability: the former assures
that responses measured before the reconfiguration event are invalid thereafter, while the
latter assures that an adversary with access to a reconfigured PUF cannot estimate the
PUF behavior before reconfiguration.
We now concentrate on silicon PUF designs.
t_delay = 1 / (2 · n_stages · f_osc)
selects two distinct ROs, say Ra and Rb, and compares their frequencies to generate the
response by eqn. (17.1):

r = 1 if fa > fb, and 0 otherwise    (17.1)

where fa and fb are the frequencies of Ra and Rb, respectively.
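The RO-pair comparison of eqn. (17.1) amounts to a one-line rule. The sketch below models it, with optional Gaussian measurement noise standing in for environmental variation (the noise model is our assumption, for illustration only):

```python
import random

def ro_response(f_a: float, f_b: float, noise: float = 0.0,
                rng: random.Random = random.Random(0)) -> int:
    """1-bit RO PUF response per eqn. (17.1): r = 1 iff f_a > f_b.
    Optional Gaussian measurement noise models environmental variation."""
    fa = f_a + rng.gauss(0.0, noise)
    fb = f_b + rng.gauss(0.0, noise)
    return 1 if fa > fb else 0

# Two ROs whose nominal frequencies differ due to manufacturing variation:
print(ro_response(100.30, 100.10))             # noise-free comparison
print(ro_response(100.30, 100.10, noise=0.5))  # may flip when offset ~ noise
```

When the frequency offset is comparable to the noise, the response bit becomes unstable; this is exactly the instability the reliability metric of Section 17.5.2 quantifies.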
If the delay offset between the two paths is too small, the setup time constraint or hold
time constraint of the flip-flop arbiter might be violated, and its output will not depend on
the outcome of the race any more, but be determined by random noise (as the flip-flop will
go to a metastable state). The effect of such metastability would be manifested as statistical
noise in the PUF responses.
1. For each stage, the inverter is duplicated, and a pair of multiplexor (MUX) and
demultiplexor (DEMUX) is added to select either of the inverters to be connected
in the inverter loop, and,
2. All the inverters are replaced with 2-input NOR or NAND gates, and the second
inputs of the gates are connected to a reset signal that replaces the power up and
power off operations.
The addition of the reset signal, as shown in Fig. 17.5, eliminates the power-off operation
required before every new measurement. It allows the ring to be brought into the all-0's state
by simply driving the reset signal to logic-1; a HIGH-to-LOW transition of the reset signal starts
the evaluation of the BRPUF for a new challenge. The output of any stage can be used as the
(1-bit) response, and the same stage must be used consistently for all challenges. One important
point is that the settling time of the BRPUF, the time required by the
ring to settle into a stable state, must be estimated accurately. Wrong estimation of this
time manifests as random noise in the PUF response and reduces reliability.
FIGURE 17.6: (a) Basic delay element. (b) Loop PUF structure.
FIGURE 17.7: Memory PUF cell variants: (a) SRAM PUF cell; (b) Butterfly PUF cell;
(c) Latch PUF cell.
particular control words. Let fX, fY and fZ be the frequencies of the LPUF due to control words
X, Y and Z, respectively. The response generated by the controller might be defined by
eqn. (17.3):
r = (ID0, ID1, ID2)    (17.3)
configuration, all unused configuration memory cells are initialized to some predefined values
to deter accidental damage like short-circuits.
FIGURE 17.8: Configuration memory and its use to initialize D Flip-flop [233].
• If read-back is done after the capture event, then the current values of the flip-flops, rather
than their initial values, can be inspected.
The authors of [233] observed that the amount of randomness present in the power-up values of
D flip-flops is limited, because the distribution of power-up states of the flip-flops is
non-uniform and skewed towards the value '0'. This implies that further post-processing of these
values is required to make them suitable for use as a random key.
Next, we consider some performance metrics that help us quantitatively estimate
the quality of a given PUF design.
17.5.1 Uniqueness
It represents the ability of a PUF instance to uniquely distinguish itself among a set of
PUF instances of the same PUF type on different chips. The average Hamming Distance (HD)
is used to evaluate uniqueness; ideally, this value should be close to 50%, which
means half the bits differ on average. Let Ri(C) be the n-bit response of the PUF
instance on chip i due to challenge set C. The average inter-chip HD among k chips is
given by eqn. (17.4):

Uniqueness = (2 / k(k−1)) · Σ_{i=1}^{k−1} Σ_{j=i+1}^{k} HD(Ri(C), Rj(C))/n × 100%    (17.4)
TABLE: PUF quality metrics proposed by different researchers.
Hori et al.:   Randomness, Steadiness, Correctness, Diffuseness, Uniqueness
Maiti et al.:  Uniformity, Bit-aliasing, Uniqueness, Reliability
Su et al.:     Probability of Misidentification
Uniqueness is an estimation of the inter-chip variation in terms of the PUF responses, and
the accuracy of the estimation can be improved using a large population of PUF instances.
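The uniqueness metric can be computed as the average pairwise inter-chip HD, as in this sketch (the 8-bit responses are made up):

```python
from itertools import combinations

def hamming_distance(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length bit-strings differ."""
    return sum(x != y for x, y in zip(a, b))

def uniqueness(responses: list[str]) -> float:
    """Average pairwise inter-chip HD as a percentage of response length
    (ideal: ~50%). responses[i] is the n-bit response of chip i."""
    n = len(responses[0])
    k = len(responses)
    total = sum(hamming_distance(a, b) / n
                for a, b in combinations(responses, 2))
    return total / (k * (k - 1) / 2) * 100.0

# Hypothetical 8-bit responses from 3 chips:
print(uniqueness(["10110010", "01100110", "11010001"]))
```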
17.5.2 Reliability
The ideal PUF should have a perfectly consistent (or reliable) response for a given challenge. However, dynamic noise, due to variations in operating voltage and temperature, affects the different circuit components in a non-uniform manner. As a result, a certain level of inconsistency/instability is introduced into the PUF response and is considered as error. A simple estimation of this error can be obtained from the deviation of the PUF response from the golden response (the ideal response, collected at ambient temperature). The average intra-chip Hamming Distance can be used to estimate this deviation. Let R_i^(C) be the golden response of the PUF instance on chip i for challenge set C, compared with m n-bit PUF responses R_{i,t} collected in m different working environments. The error rate (λ) is defined by eqn. (17.5):

λ_i = (1/m) Σ_{t=1}^{m} (HD(R_i^(C), R_{i,t}) / n) × 100%    (17.5)
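The error rate of eqn. (17.5) can be computed directly from a golden response and a set of re-measurements; a minimal sketch with hypothetical bit strings:

```python
def error_rate(golden, repeated):
    # eqn. (17.5): average intra-chip HD between the golden response and
    # m re-measurements, expressed as a percentage of the response length n
    hd = lambda a, b: sum(x != y for x, y in zip(a, b))
    n, m = len(golden), len(repeated)
    return sum(hd(golden, r) / n for r in repeated) / m * 100.0

# golden 8-bit response versus three noisy re-measurements:
print(error_rate("10110010", ["10110010", "10110011", "00110010"]))
```

Reliability is then simply 100% minus this error rate.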
17.5.3 Uniformity
There should be a uniform distribution of 0's and 1's in a given PUF response r of PUF instance i. For truly random PUF responses, this proportion should be 50%. The uniformity metric can be estimated by eqn. (17.6):

ϕ_i = (1/n) Σ_{l=1}^{n} r_{i,l} × 100%    (17.6)
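Eqn. (17.6) is just the fraction of 1s in one chip's response; a one-line sketch:

```python
def uniformity(response):
    # eqn. (17.6): percentage of 1s in a single chip's n-bit response
    return sum(int(bit) for bit in response) / len(response) * 100.0

print(uniformity("10110010"))  # 4 ones out of 8 bits -> 50.0
```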
17.5.4 Bit-aliasing
Bit-aliasing happens when different chips produce nearly identical PUF responses, which is undesirable. We estimate the bit-aliasing of the l-th bit of the PUF responses for a given challenge as the percentage Hamming Weight (HW) of the l-th response bit across the k devices, by eqn. (17.7):

β_l = (1/k) Σ_{i=1}^{k} r_{i,l} × 100%    (17.7)

where r_{i,l} is the l-th bit of the PUF response on chip i.
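Bit-aliasing looks at one bit position across chips rather than one chip's whole response; a sketch over hypothetical 4-bit responses:

```python
def bit_aliasing(responses, l):
    # eqn. (17.7): percentage Hamming weight of bit position l across k chips
    k = len(responses)
    return sum(int(r[l]) for r in responses) / k * 100.0

chips = ["1011", "0011", "1010"]   # hypothetical responses from 3 chips
print([bit_aliasing(chips, l) for l in range(4)])
```

A value of 0% or 100% at some position (here, positions 1 and 2) indicates that the bit is aliased, i.e., fixed to the same value on every chip.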
17.5.5 Bit-dependency
The autocorrelation test can be used to detect correlation between the bits of a response, and thus to measure the randomness of the PUF response. Systematic aspects of process variation may show up as significant correlation at particular intervals; since multi-bit responses are extracted from a common fabric, it is possible for spatial correlation to appear. The autocorrelation can be defined by eqn. (17.8):

ρ_xx(j) = (1/n) Σ_{i=1}^{n} R_i ⊕ R_{i−j}    (17.8)

where R_i is the i-th bit of the n-bit response being observed, and ρ_xx(j) is the autocorrelation coefficient with lag j. This value tends toward 0.5 for an uncorrelated bit-string and toward 0 or 1 for a correlated bit-string.
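The autocorrelation of eqn. (17.8) can be sketched as follows; treating the indices cyclically is an assumption, since the text does not specify how i − j is handled at the boundary.

```python
def autocorrelation(bits, j):
    # eqn. (17.8): fraction of positions where the response differs from
    # itself at lag j (indices taken cyclically, an assumption)
    n = len(bits)
    return sum(int(bits[i]) ^ int(bits[i - j]) for i in range(n)) / n

# an alternating string is maximally correlated at lag 1:
print(autocorrelation("10101010", 1))  # -> 1.0
```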
a mathematical approximator which models the original PUF behavior up to some small error. None of the known silicon PUFs is mathematically unclonable.
The second property, unpredictability, states that an adversary cannot predict the response to a new challenge from a known set of CRPs. Classically, the notion of unpredictability of a random function f is formalized by the following security experiment, consisting of a learning phase and a challenge phase. In the learning phase, the adversary learns the evaluations of f on a set of input challenges {x_1, x_2, ..., x_n} (which may be given from outside or chosen by the adversary). Then, in the challenge phase, the adversary must return (x, f(x)) for some x ∉ {x_1, x_2, ..., x_n}. Usually, unpredictability is measured in terms of the entropy (more specifically, the average min-entropy) of the PUF distribution.
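The learning/challenge experiment can be phrased as a small simulation. This is a toy formalisation, not a standard API; the "leaky" PUF (response = parity of the challenge) and its adversary are hypothetical, chosen so that the adversary always wins the game.

```python
import random
random.seed(0)

def unpredictability_game(puf, adversary, n_learn=64, n_bits=7):
    # learning phase: the adversary sees CRPs for n_learn distinct challenges;
    # challenge phase: it must predict the response of one unseen challenge
    challenges = random.sample(range(2 ** n_bits), n_learn + 1)
    learned = [(c, puf(c)) for c in challenges[:n_learn]]
    x = challenges[n_learn]                 # x is not in the learned set
    return adversary(learned, x) == puf(x)  # True = adversary wins

leaky_puf = lambda c: bin(c).count("1") & 1           # response = challenge parity
adversary = lambda learned, x: bin(x).count("1") & 1  # models the leak exactly
print(unpredictability_game(leaky_puf, adversary))    # -> True
```

A PUF with high min-entropy would leave any efficient adversary with only a negligible winning probability.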
We now describe several security applications of PUFs.
extractors was discussed in [62]. A first efficient hardware implementation of an HDA was given in [61].
Informally, an HDA comprises two phases, as shown in Fig. 17.9:
1. Generation Phase: It starts with the measurement of the fuzzy secret R, which in this context is the PUF response(s). The Secure Sketch procedure SS generates (public) helper data S ← SS(R) that will be used in the reproduction phase and can be stored in a public database. The privacy amplification procedure (Ext) extracts a (secret) key K ← Ext(R) that can be used as a secret in cryptographic primitives.
2. Reproduction Phase: In this phase the same data is measured, but due to the fuzzy nature of the source, the produced response R′ will not be exactly the same as R. However, if R′ lies sufficiently close to R, the response reconciliation procedure (Rec) is able to reproduce the fuzzy secret R ← Rec(R′, S) with the help of the previously produced helper data S. This is also known as information reconciliation. As in the Generation phase, the privacy amplification procedure (Ext) extracts the (secret) key K ← Ext(R).
Now, we discuss one implementation of Gen and Rep using a set of hash functions (a universal hash family) H and an error correcting code C. The parameters [n, k, d] of the linear code C are determined by the length of the response R and the number t of errors to be corrected. Both the generation and the key reproduction phases are explained below:
1. Generation Phase: (K, W = [W_1, W_2]) ← Gen(R). This phase starts by assigning a random code word C_s ← C to the response R. Then the first component of the helper data vector, W_1 = C_s ⊕ R, is calculated by SS. Next, a hash function h_i is randomly selected from H by Ext, and this sets W_2 = i. The key is calculated as K = h_i(R).
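The Gen/Rep pair can be sketched with a repetition code standing in for the linear code C and SHA-256 standing in for the universal hash family H (both substitutions are illustrative assumptions; a 5× repetition code corrects up to 2 bit-flips per data bit).

```python
import hashlib
import secrets

REP = 5  # repetition factor: corrects up to 2 bit-flips per data bit

def encode(data_bits):
    # repetition-code codeword: each data bit repeated REP times
    return [b for b in data_bits for _ in range(REP)]

def decode(noisy):
    # majority vote within each REP-bit group
    return [int(sum(noisy[i:i + REP]) > REP // 2) for i in range(0, len(noisy), REP)]

def gen(R):
    # generation phase: helper data W1 = Cs XOR R, key K = H(R)
    data = [secrets.randbelow(2) for _ in range(len(R) // REP)]
    Cs = encode(data)
    W1 = [c ^ r for c, r in zip(Cs, R)]
    K = hashlib.sha256(bytes(R)).hexdigest()  # stands in for the universal hash h_i
    return K, W1

def rep(R_noisy, W1):
    # reproduction phase: decode W1 XOR R' back to Cs, recover R, re-derive K
    Cs = encode(decode([w ^ r for w, r in zip(W1, R_noisy)]))
    R = [w ^ c for w, c in zip(W1, Cs)]
    return hashlib.sha256(bytes(R)).hexdigest()

R = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # hypothetical 10-bit PUF response
K, W1 = gen(R)
R_noisy = R[:]
R_noisy[3] ^= 1                      # one bit flips on re-measurement
print(rep(R_noisy, W1) == K)         # -> True: key reproduced despite the noise
```

Note that W1 alone leaks nothing about R beyond what the code structure reveals, which is why it can sit in a public database.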
used or proposed to use PUFs as primitives. To date, researchers have been able (fully or partially) to apply three traditional cryptographic attacks, described next.
attacks, and it is built following three key principles: (a) mixing multiple delay lines, (b)
transformations of the challenge bits, and (c) combination of the outputs from multiple
lines.
The relative ease of modeling delay-based PUFs has a direct implication for their usability as security primitives. Mitigation of modeling attacks is possible by ensuring that an adversary cannot easily challenge the PUF and directly access its response. The "Controlled PUF" was introduced to prevent modeling attacks. Here, a chosen-challenge attack is prevented by placing a random hash function at the PUF's input, and a random hash function is also placed at the output of the PUF to prevent raw responses from being accessed. Clearly, this does not solve the fundamental weakness of delay-based PUFs to modeling attacks, so invasive attacks to collect CRPs are still possible.
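The controlled-PUF construction can be sketched as a wrapper; SHA-256 stands in for the two random hash functions, and `raw_puf` is a hypothetical raw challenge→response function, not a real device interface.

```python
import hashlib

def make_controlled_puf(raw_puf):
    # wrap a raw PUF with hash functions at its input and its output
    h = lambda data: hashlib.sha256(data).digest()
    def controlled(challenge: bytes) -> bytes:
        hashed_challenge = h(challenge)   # blocks chosen-challenge attacks
        raw_response = raw_puf(hashed_challenge)
        return h(raw_response)            # raw responses never leave the device
    return controlled

# toy raw PUF standing in for the physical circuit:
toy_raw = lambda c: bytes(b ^ 0xA5 for b in c)
cpuf = make_controlled_puf(toy_raw)
print(cpuf(b"challenge-1").hex()[:16])
```

An attacker now sees only hashed challenge/response pairs, which removes the structure that delay-model attacks exploit, though the underlying PUF remains modelable if its raw CRPs can be probed invasively.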
We introduce a novel modeling attack on PUFs using ideas of evolutionary computing
in the next chapter.
Scott Adams
Dilbert Comic strip, 1989.
18.1 Introduction
In the previous chapter, we introduced Physically Unclonable Function (PUF) circuits and discussed some machine learning based attacks on them. In this chapter, we take a different approach to modelling PUF circuits. One of the features of modelling PUF circuits through machine learning techniques is that, except for arbiter PUFs [220], other PUFs cannot be modelled satisfactorily in a way that suggests which machine learning technique to apply to model them. Thus, we should concentrate on developing techniques that model an arbitrary given PUF instance accurately and with little computational effort, rather than trying to find a correlation between a particular modelling technique and the corresponding type of PUF amenable to being modelled by it. This observation suggests heuristic techniques, which are effective in estimating input–output relationships when the nature of the data is discrete and the relationship is either unknown or exceedingly complex.
Evolutionary computation [2] provides algorithms which are often extremely successful in finding patterns in data with the above-mentioned characteristics. Probably the most widely known class of algorithms under this umbrella are the so-called genetic algorithms [133, 203]. In these, a computer program iteratively solves (or approximately solves) a given problem using biological concepts such as genetic crossover, mutation, etc., and Darwinian natural selection. The characteristic steps of these algorithms are:
1. Set iteration count i = 0, and randomly generate a large initial population of individuals, where each individual represents a feasible solution to the problem at hand. Often, the characteristics of the individuals are encoded by a binary string called a chromosome.
2. Generate new individuals by applying the genetic operators, such as crossover and mutation, to members of the current population.
3. Rank the individuals based on a fitness function. If a member of the current population
seems a good enough solution to the problem, or if the number of generations has
reached a pre-defined threshold, STOP. Else, continue and eliminate the members
with fitness function values below a pre-decided threshold. This simulates survival of
the fittest, a basic tenet of Darwinian natural selection.
4. If the number of iterations or the computation time has exceeded a pre-decided threshold, STOP and declare the top-ranked member found so far to be the solution. Else, generate a new population, set i ← i + 1 and go back to step (2).
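The steps above can be sketched as a toy genetic algorithm. The fitness function here (count of 1 bits, the classic "OneMax" problem) and all parameters are illustrative assumptions, not the scheme used later in this chapter.

```python
import random
random.seed(1)

def fitness(chrom):
    # toy fitness: number of 1 bits in the chromosome (the "OneMax" problem)
    return sum(chrom)

def evolve(pop_size=40, length=16, generations=10, keep=0.5):
    # random initial population of binary-string chromosomes
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)       # rank by fitness
        if fitness(pop[0]) == length:             # good-enough solution found
            break
        survivors = pop[:int(pop_size * keep)]    # eliminate the least fit
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, length)     # single-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(length)
            child[i] ^= random.random() < 0.1     # occasional point mutation
            children.append(child)
        pop = survivors + children                # new population, next iteration
    return max(pop, key=fitness)

best = evolve()
print(fitness(best), "out of 16")
```

Because the survivors are carried over unchanged (an elitist choice), the best fitness in the population can never decrease across generations.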
• The set of terminals (i.e., the independent variables of the problem, zero-argument
functions, and random constants) for each branch of the to-be-evolved program.
• The set of primitive operators called functions, for each branch of the to-be-evolved
program.
• A way to measure the fitness of a given individual: the “fitness function” (not to be confused with the primitive operators called functions).
• Finally, the termination criterion and method for designating the end result of the
iterations.
FIGURE 18.2: Example of expression trees and their crossover: (a) and (b) parent expres-
sion trees; (c) and (d) children expression trees resulting from the crossover of the parent
trees.
Our FPGA-based RO-PUF (shown in Fig. 18.3) is a variation of the design described in [373, 238]. There are two banks of 128 3-stage ring oscillators; the outputs of each bank of ring oscillators are connected to the inputs of a 128:1 multiplexer. Through a 7-bit input which acts as the challenge, the i-th ring oscillator output is selected from both banks by the multiplexers, 1 ≤ i ≤ 128. The multiplexer outputs are used to drive two counters, and a comparator compares the count values to estimate the oscillation frequencies of the ring oscillators. The output of the comparator (1 bit) is considered to be the response for the challenge at the input. Because of device-level process variation effects, the oscillation frequencies of the i-th ring oscillator pair from the two banks are not identical, potentially resulting in a difference in count between the two counters; the 0 or 1 value at the comparator output shows whether the selected ring oscillator from bank-1 has a lower or a higher frequency than the selected ring oscillator from bank-2 (a tie in count results in a 0 being output). To eliminate bias due to placement and routing, the i-th ring oscillator pairs from the two banks are manually laid out as symmetrically as possible, and defined to be hard macros on the FPGA on which they are to be mapped.
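To make the architecture concrete, here is a toy software model of the RO-PUF: each bank holds 128 oscillators whose frequencies get a Gaussian offset standing in for process variation. The nominal frequency and spread are invented parameters, not measurements from the actual design.

```python
import random
random.seed(42)

N_RO, F_NOM, SIGMA = 128, 200e6, 2e6   # assumed nominal frequency and variation

class ROPUF:
    # toy model: two banks of ring oscillators with Gaussian frequency offsets
    def __init__(self):
        self.bank1 = [random.gauss(F_NOM, SIGMA) for _ in range(N_RO)]
        self.bank2 = [random.gauss(F_NOM, SIGMA) for _ in range(N_RO)]

    def response(self, challenge):
        # 7-bit challenge selects the i-th RO pair; the 1-bit response says
        # which oscillator is faster (ties give 0, as in the text)
        i = challenge & 0x7F
        return 1 if self.bank1[i] > self.bank2[i] else 0

puf = ROPUF()
crps = [(c, puf.response(c)) for c in range(128)]   # the full CRP space
print(sum(r for _, r in crps), "ones out of 128 responses")
```

In this idealised model each instance's responses are perfectly stable; a real device would add the dynamic noise discussed in the previous chapter.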
18.3 Methodology
The particulars of the genetic programming scheme implemented by us are:
• The terminals are: variables attached to the inputs of the PUF circuit and their logical
complements. For example, if A1 and A2 are the two input variables, the possible
terminal nodes are A1, A2, !A1 and !A2.
• The functions are: AND (“&”) and OR (“|”). For simplicity, we consider only a min-
imal functionally complete set of operators {AND, OR, NOT}. Thus, all our expres-
sions would be Boolean expressions composed of the input variables and their logic
complements. Also, we allow each operator to operate on two variables at a time.
• The fitness of an individual is judged by evaluating the Boolean expression it represents on the given binary challenges: (a) during the training (i.e., model building) phase, by the PUF response prediction success rate for 32 arbitrarily recorded CRPs of a given PUF, and (b) during the validation (i.e., model accuracy estimation) phase, by the PUF response prediction success rate for the remaining 96 challenges outside the training set. Note that a given 7-input PUF instance has 128 (= 32 + 96) CRPs.
• The termination criteria are: the prediction accuracies of the best individuals from two successive generations differ by less than 0.5%, the number of generations has reached 10, or the prediction accuracy for the validation CRP set is 100%.
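The fitness evaluation can be sketched by representing candidate individuals as Boolean expression trees over the terminals and functions listed above. The tuple encoding, the model, and the CRPs below are hypothetical.

```python
# an expression tree is nested tuples: ("&", l, r), ("|", l, r), or a leaf
# string such as "A3" / "!A3" naming a challenge bit or its complement

def evaluate(tree, challenge_bits):
    if isinstance(tree, str):                      # terminal node
        neg = tree.startswith("!")
        bit = challenge_bits[int(tree.lstrip("!A"))]
        return bit ^ 1 if neg else bit
    op, left, right = tree
    l, r = evaluate(left, challenge_bits), evaluate(right, challenge_bits)
    return (l & r) if op == "&" else (l | r)

def prediction_accuracy(tree, crps):
    # fitness: percentage of recorded CRPs the candidate model predicts
    hits = sum(evaluate(tree, c) == r for c, r in crps)
    return hits / len(crps) * 100.0

# hypothetical model and two 7-bit CRPs (challenge as a bit list, 1-bit response):
model = ("|", ("&", "A0", "!A3"), "A6")
crps = [([1, 0, 0, 0, 1, 1, 0], 1), ([0, 1, 1, 1, 0, 0, 0], 0)]
print(prediction_accuracy(model, crps))  # -> 100.0
```

During training this score would be computed over the 32 recorded CRPs, and during validation over the remaining 96.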
18.4 Results
18.4.1 Experimental Setup
The experiments were run on a laptop with a 2.10 GHz CPU and 3 GB of main memory.
An Altera Cyclone-III FPGA was used to implement six instances of the 7-input RO-
PUF (PUF-1 through PUF-6 hereafter) shown in Fig. 18.3, using the Quartus-II design
environment from Altera. The genetic programming methodology was implemented in C.
Each of the two reproduction schemes (elitist and tournament selection) was executed 100 times for each of the six PUFs. Fig. 18.4 shows the GP methodology implemented by us.
Fig. 18.5 shows the variation of prediction accuracy over generations for two reproduc-
tion schemes - Tournament Selection Model and Elitist Model, averaged over the results
from the six PUF instances. From this plot, it is clear that Tournament Selection Model
achieves greater prediction accuracy on average than the Elitist model, and PUF-6 was
most successfully modelled. One interesting feature is that the prediction accuracy shows a slight degradation in some generations compared to the previous generation, but eventually the performance improves. This characteristic of degradation in the objective function with subsequent improvement is termed genetic drift in the genetic algorithms literature [203].
Fig. 18.6 shows the variation of the prediction accuracy for the two different reproduction
schemes (averaged over the six PUF instances), when the population size varies from 100 to
1000. Table 18.1 shows the execution time of the genetic programming scheme for the same
population sizes, again for the two different reproduction schemes (averaged over the six
PUF instances). From Fig. 18.6 and Table 18.1, it is apparent that the prediction accuracy
is independent of the reproduction scheme for population sizes below 250, and improves at
a higher rate in the tournament selection scheme. However, this improvement in prediction accuracy comes at a cost of about a 40% increase in execution time with respect to the elitist model.
FIGURE 18.5: Variation of prediction accuracy with generation number for two reproduction schemes – Tournament Selection Model and Elitist Model, averaged over the results from the six PUF instances.
FIGURE 18.6: Impact of population size on prediction accuracy for two reproduction schemes – Tournament Selection Model and Elitist Model, averaged over the results from the six PUF instances.
Table 18.2 shows the best estimator Boolean expressions obtained for the 6
PUFs, considering both the elitist and the tournament selection models. Here “·” stands for
Boolean AND, “+” stands for Boolean OR and “!” stands for Boolean inversion. Fig. 18.7
shows the best prediction accuracy for the elitist model for the six PUF instances, for the
100 models built for each. Again, we can observe that the model built for PUF-6 was the
most accurate.
18.4.2 Conclusions
Thus, we have demonstrated the effectiveness of an evolutionary computation technique, Genetic Programming, in modelling 7-input RO-PUFs mapped on FPGAs. We envisage a single RO-PUF instance's behavior as an unknown 7-input Boolean function, and in 10 generations (iterations), derive a Boolean function, based on only 1/4 of the truth table of the function, that can accurately predict 84% (in the best case) of the remaining 3/4 of the truth table. The technique is also computationally inexpensive, and gives satisfactory results at reasonable runtimes. There is scope for future research on this topic to be directed towards modelling other types of PUFs with larger challenge spaces, and more efficient implementation of the methodology.
FIGURE 18.7: Best prediction accuracy for the six different PUF instances across 100 models built through Genetic Programming.
[1] 39th International Symposium on Computer Architecture (ISCA 2012), June 9-13,
2012, Portland, OR, USA. IEEE, 2012.
[6] Masayuki Abe, editor. Topics in Cryptology - CT-RSA 2007, The Cryptographers’
Track at the RSA Conference 2007, San Francisco, CA, USA, February 5-9, 2007,
Proceedings, volume 4377 of Lecture Notes in Computer Science. Springer, 2006.
[7] M. Abramovici and P. L. Levin. Protecting integrated circuits from silicon Tro-
jan horses. Military Embedded Systems, 2009. http://www.mil-embedded.com/
articles/id/?3748.
[8] Onur Aciiçmez. Yet Another MicroArchitectural Attack: Exploiting I-Cache. In Peng Ning and Vijay Atluri, editors, CSAW, pages 11–18. ACM, 2007.
[9] Onur Aciiçmez, Billy Bob Brumley, and Philipp Grabher. New Results on Instruction
Cache Attacks. In Mangard and Standaert [244], pages 110–124.
[10] Onur Aciiçmez and Çetin Kaya Koç. Trace-Driven Cache Attacks on AES (Short
Paper). In Peng Ning, Sihan Qing, and Ninghui Li, editors, ICICS, volume 4307 of
Lecture Notes in Computer Science, pages 112–121. Springer, 2006.
[11] Onur Aciiçmez, Çetin Kaya Koç, and Jean-Pierre Seifert. On the Power of Simple
Branch Prediction Analysis. IACR Cryptology ePrint Archive, 2006:351, 2006.
[12] Onur Aciiçmez, Çetin Kaya Koç, and Jean-Pierre Seifert. Predicting secret keys via
branch prediction. In Abe [6], pages 225–242.
[13] Onur Aciiçmez, Shay Gueron, and Jean-Pierre Seifert. New Branch Prediction Vul-
nerabilities in OpenSSL and Necessary Software Countermeasures. In Steven D. Gal-
braith, editor, IMA Int. Conf., volume 4887 of Lecture Notes in Computer Science,
pages 185–203. Springer, 2007.
505
[14] Onur Aciiçmez and Werner Schindler. A Vulnerability in RSA Implementations Due
to Instruction Cache Analysis and Its Demonstration on OpenSSL. In Tal Malkin,
editor, CT-RSA, volume 4964 of Lecture Notes in Computer Science, pages 256–273.
Springer, 2008.
[15] Onur Aciiçmez, Werner Schindler, and Çetin Kaya Koç. Cache Based Remote Timing
Attack on the AES. In Abe [6], pages 271–286.
[16] Onur Aciiçmez, Jean-Pierre Seifert, and Çetin Kaya Koç. Micro-Architectural Crypt-
analysis. IEEE Security & Privacy, 5(4):62–64, 2007.
[17] S. Adee. The hunt for the kill switch. IEEE Spectrum, 45(5):34–39, May 2008.
[18] AES (Rijndael) IP-cores. http://www.erst.ch/download/aes_standard_cores.pdf, 2011.
[19] Full datasheet, AES-CCM core family for Actel FPGA. http://www.actel.com/ipdocs/HelionCore_AES-CCM_8bit_Actel_DS.pdf, 2011.
[20] M. Agarwal, B. C. Paul, M. Zhang, and S. Mitra. Circuit failure prediction and its application to transistor aging. In VTS’07: Proceedings of the IEEE VLSI Test Symposium, pages 277–286, 2007.
[21] Michel Agoyan, Jean-Max Dutertre, Amir-Pasha Mirbaha, David Naccache, Anne-Lise Ribotta, and Assia Tria. How to Flip a Bit? In IOLTS, pages 235–239, July 2010.
[22] Michel Agoyan, Jean-Max Dutertre, David Naccache, Bruno Robisson, and Assia Tria. When Clocks Fail: On Critical Paths and Clock Faults. In CARDIS, pages 182–193, 2010.
[23] D. Agrawal, S. Baktir, D. Karakoyunlu, P. Rohatgi, and B. Sunar. Trojan detection
using IC Fingerprinting. In Proc. IEEE Symposium on Security and Privacy, pages
296–310, Washington, DC, USA, 2007.
[24] Gregory C. Ahlquist, Brent E. Nelson, and Michael Rice. Optimal Finite Field
Multipliers for FPGAs. In FPL ’99: Proceedings of the 9th International Work-
shop on Field-Programmable Logic and Applications, pages 51–60, London, UK, 1999.
Springer-Verlag.
[25] Monjur Alam, Sonai Ray, Debdeep Mukhopadhyay, Santosh Ghosh, Dipanwita Roy
Chowdhury, and Indranil Sengupta. An area optimized reconfigurable encryptor for
aes-rijndael. In DATE, pages 1116–1121, 2007.
[26] Subidh Ali and Debdeep Mukhopadhyay. A Differential Fault Analysis on AES Key
Schedule Using Single Fault. In Breveglieri et al. [63], pages 35–42.
[27] Subidh Ali and Debdeep Mukhopadhyay. An Improved Differential Fault Analysis on
AES-256. In Abderrahmane Nitaj and David Pointcheval, editors, AFRICACRYPT,
volume 6737 of Lecture Notes in Computer Science, pages 332–347. Springer, 2011.
[28] Subidh Ali, Debdeep Mukhopadhyay, and Michael Tunstall. Differential Fault Anal-
ysis of AES using a Single Multiple-Byte Fault. Cryptology ePrint Archive, Report
2010/636, 2010. http://eprint.iacr.org/.
[29] Y. M. Alkabani and F. Koushanfar. Active hardware metering for intellectual property
protection and security. In SS’07: Proceedings of USENIX Security Symposium, pages
20:1–20:16, 2007.
[33] B. Ansari and M.A. Hasan. High-performance architecture of elliptic curve scalar multiplication. IEEE Transactions on Computers, 57(11):1443–1453, November 2008.
[36] Australian Government DoD-DSTO. Towards countering the rise of the silicon trojan. http://dspace.dsto.defence.gov.au/dspace/bitstream/1947/9736/1/DSTO-TR-2220%20PR.pdf, 2008.
[38] M. Banga and M. S. Hsiao. A region based approach for the identification of hardware
Trojans. In Proc. IEEE International Workshop on Hardware-Oriented Security and
Trust (HOST’08), pages 40–47, Washington, DC, USA, 2008.
[39] M. Banga and M. S. Hsiao. A novel sustained vector technique for the detection
of hardware Trojans. In VLSID’09: Proceedings of the International Conference on
VLSI Design, pages 327–332, January 2009.
[42] Alberto Battistello and Christophe Giraud. Fault Analysis of Infective AES Com-
putations. In Wieland Fischer and Jörn-Marc Schmidt, editors, Fault Diagnosis and
Tolerance in Cryptography – FDTC 2013, pages 101–107. IEEE Computer Society,
2013.
[43] M. Bednara, M. Daldrup, J. von zur Gathen, J. Shokrollahi, and J. Teich. Recon-
figurable Implementation of Elliptic Curve Crypto Algorithms. In Parallel and Dis-
tributed Processing Symposium., Proceedings International, IPDPS 2002, Abstracts
and CD-ROM, pages 157–164, 2002.
[44] Daniel J. Bernstein. Cache-timing Attacks on AES. Technical report, 2005.
[45] Guido Bertoni, Luca Breveglieri, Israel Koren, Paolo Maistri, and Vincenzo Piuri.
Error Analysis and Detection Procedures for a Hardware Implementation of the Ad-
vanced Encryption Standard. IEEE Trans. Computers, 52(4):492–505, 2003.
[46] Guido Bertoni, Vittorio Zaccaria, Luca Breveglieri, Matteo Monchiero, and Gianluca
Palermo. AES Power Attack Based on Induced Cache Miss and Countermeasure. In
ITCC (1), pages 586–591. IEEE Computer Society, 2005.
[47] Régis Bevan and Erik Knudsen. Ways to Enhance Differential Power Analysis. In
Proceedings of Information Security and Cryptology (ICISC 2002), LNCS Volume
2587, pages 327–342. Springer-Verlag, 2002.
[48] Eli Biham. A Fast New DES Implementation in Software. In FSE [49], pages 260–272.
[49] Eli Biham, editor. Fast Software Encryption, 4th International Workshop, FSE ’97,
Haifa, Israel, January 20-22, 1997, Proceedings, volume 1267 of Lecture Notes in
Computer Science. Springer, 1997.
[50] Eli Biham and Adi Shamir. Differential Fault Analysis of Secret Key Cryptosystems.
Proceedings of Eurocrypt, Lecture Notes in Computer Science, 1233:37–51, 1997.
[51] Alex Biryukov and David Wagner. Slide attacks. In FSE, pages 245–259, 1999.
[52] Johannes Blömer, Jorge Guajardo, and Volker Krummel. Provably secure masking of
aes. In Proceedings of the 11th international conference on Selected Areas in Cryptog-
raphy, SAC’04, pages 69–83, Berlin, Heidelberg, 2005. Springer-Verlag.
[53] Johannes Blömer and Volker Krummel. Analysis of Countermeasures Against Access
Driven Cache Attacks on AES. In Carlisle M. Adams, Ali Miri, and Michael J. Wiener,
editors, Selected Areas in Cryptography, volume 4876 of Lecture Notes in Computer
Science, pages 96–109. Springer, 2007.
[54] Johannes Blömer and Jean-Pierre Seifert. Fault Based Cryptanalysis of the Advanced
Encryption Standard (AES). In Financial Cryptography, pages 162–181, 2003.
[55] Bo Yang, Kaijie Wu, and R. Karri. Secure scan: A design-for-test architecture for crypto-chips. In DAC’05: Proceedings of the 42nd Design Automation Conference, pages 135–140, 2005.
[56] Andrey Bogdanov, Thomas Eisenbarth, Christof Paar, and Malte Wienecke. Differen-
tial Cache-Collision Timing Attacks on AES with Applications to Embedded CPUs. In
Josef Pieprzyk, editor, CT-RSA, volume 5985 of Lecture Notes in Computer Science,
pages 235–251. Springer, 2010.
[57] B. Bollig and I. Wegener. Improving the Variable Ordering of OBDDs is NP-Complete.
IEEE Transactions on Computers, 45:993–1002, September 1996.
[58] Joseph Bonneau and Ilya Mironov. Cache-Collision Timing Attacks Against AES. In
Louis Goubin and Mitsuru Matsui, editors, CHES, volume 4249 of Lecture Notes in
Computer Science, pages 201–215. Springer, 2006.
[59] Giacomo Boracchi and Luca Breveglieri. A Study on the Efficiency of Differential
Power Analysis on AES S-Box. Technical Report, January 15, 2007.
[61] Christoph Bösch, Jorge Guajardo, Ahmad-Reza Sadeghi, Jamshid Shokrollahi, and
Pim Tuyls. Efficient Helper Data Key Extractor on FPGAs. In Cryptographic Hard-
ware and Embedded Systems (CHES), volume 5154 of Lecture Notes in Computer
Science, pages 181–197. 2008.
[62] X. Boyen. Reusable cryptographic fuzzy extractors. In Proc. of the 10th ACM Conference on Computer and Communications Security, pages 82–91, 2004.
[63] Luca Breveglieri, Sylvain Guilley, Israel Koren, David Naccache, and Junko Takahashi,
editors. 2011 Workshop on Fault Diagnosis and Tolerance in Cryptography, FDTC
2011, Tokyo, Japan, September 29, 2011. IEEE, 2011.
[64] Ernie Brickell, Gary Graunke, Michael Neve, and Jean-Pierre Seifert. Software Mit-
igations to Hedge AES Against Cache-based Software Side Channel Vulnerabilities.
Cryptology ePrint Archive, Report 2006/052, 2006.
[65] Billy Bob Brumley and Nicola Tuveri. Remote Timing Attacks are Still Practical. In
Vijay Atluri and Claudia Díaz, editors, ESORICS, volume 6879 of Lecture Notes in
Computer Science, pages 355–371. Springer, 2011.
[66] David Brumley and Dan Boneh. Remote Timing Attacks are Practical. Computer
Networks, 48(5):701–716, 2005.
[67] R.E. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, 35:677–691, August 1986.
[69] C. Rebeiro, S. S. Roy, D. S. Reddy and D. Mukhopadhyay. Revisiting the Itoh Tsujii
Inversion Algorithm for FPGA Platforms. IEEE Transactions on VLSI Systems.,
19(8):1508–1512, 2011.
[71] David Canright. A Very Compact S-Box for AES. In CHES, pages 441–455, 2005.
[72] Anne Canteaut, Cédric Lauradoux, and André Seznec. Understanding Cache Attacks.
Research Report RR-5881, INRIA, 2006.
[74] Ç. K. Koç and B. Sunar. An Efficient Optimal Normal Basis Type II Multiplier. IEEE
Trans. Comput., 50(1):83–87, 2001.
[75] Çetin K. Koç and Tolga Acar. Montgomery Multiplication in GF(2^k). Des. Codes Cryptography, 14(1):57–69, 1998.
[79] R. S. Chakraborty and S. Bhunia. Security against hardware Trojan through a novel
application of design obfuscation. In ICCAD ’09: Proceedings of the International
Conference on CAD, pages 113–116, 2009.
[82] R.S. Chakraborty and S. Bhunia. Security through obscurity: An approach for protect-
ing Register Transfer Level hardware IP. In HOST’08: Proceedings of the International
Workshop on Hardware Oriented Security and Trust, pages 96–99, 2009.
[83] R.S. Chakraborty and S. Bhunia. RTL hardware IP protection using key-based control
and data flow obfuscation. In VLSID ’10: Proceedings of the International Conference
on VLSI Design, pages 405–410, 2010.
[84] H. Chang and M.J. Atallah. Protecting software code by guards. In DRM ’01:
Revised Papers from the ACM CCS-8 Workshop on Security and Privacy in Digital
Rights Management, pages 160–175, 2002.
[85] E. Charbon and I. Torunoglu. Watermarking techniques for electronic circuit design.
In IWDW’02: Proceedings of the International Conference on Digital Watermarking,
pages 147–169, 2003.
[86] Suresh Chari, Josyula R. Rao, and Pankaj Rohatgi. Template attacks. In Burton
S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, CHES, volume 2523 of
Lecture Notes in Computer Science, pages 13–28. Springer, 2002.
[87] W. N. Chelton and M. Benaissa. Fast Elliptic Curve Cryptography on FPGA. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 16(2):198–205, Febru-
ary 2008.
[88] Chien-Ning Chen and Sung-Ming Yen. Differential fault analysis on AES key schedule
and some countermeasures. In G. Goos, J. Hartmanis, and J. van Leeuwen, editors,
ACISP 2003, volume 2727 of LNCS, pages 118–129. Springer, 2003.
[89] Deming Chen, Jason Cong, and Peichen Pan. FPGA Design Automation: A Survey.
Found. Trends Electron. Des. Autom., 1(3):139–169, 2006.
[90] Q. Chen, G. Csaba, P. Lugli, U. Schlichtmann, and U. Rührmair. The Bistable Ring
PUF: A new architecture for strong physical unclonable functions. In Proc. of IEEE
International Symposium on Hardware-Oriented Security and Trust (HOST), pages
134 –141, 2011.
[91] Z. Chen, X. Guo, R. Nagesh, A. Reddy, M. Gora, and A. Maiti. Hardware Trojan
designs on BASYS FPGA board, 2012. http://isis.poly.edu/~vikram/vt.pdf.
[93] T. Chou and K. Roy. Accurate power estimation of CMOS sequential circuits. IEEE
Transactions on VLSI, 4(3):369–380, September 1996.
[94] Christophe Clavier, Jean-Sébastien Coron, and Nora Dabbous. Differential power
analysis in the presence of hardware countermeasures. In Çetin Kaya Koç and Christof
Paar, editors, CHES, volume 1965 of Lecture Notes in Computer Science, pages 252–
263. Springer, 2000.
[96] C. Collberg, C. Thomborson, and D. Low. Manufacturing cheap, resilient, and stealthy
opaque constructs. In POPL ’98: Proceedings of the 25th ACM SIGPLAN-SIGACT
symposium on Principles of programming languages, pages 184–196, 1998.
[98] Bart Coppens, Ingrid Verbauwhede, Koen De Bosschere, and Bjorn De Sutter. Prac-
tical Mitigations for Timing-Based Side-Channel Attacks on Modern x86 Processors.
In IEEE Symposium on Security and Privacy, pages 45–60. IEEE Computer Society,
2009.
[99] Jean-Sébastien Coron and Ilya Kizhvatov. An Efficient Method for Random Delay
Generation in Embedded Software. In Christophe Clavier and Kris Gaj, editors,
CHES, volume 5747 of Lecture Notes in Computer Science, pages 156–170. Springer,
2009.
[100] Jean-Sébastien Coron and Ilya Kizhvatov. Analysis and Improvement of the Random
Delay Countermeasure of CHES 2009. In Mangard and Standaert [244], pages 95–109.
[101] Scott A. Crosby, Dan S. Wallach, and Rudolf H. Riedi. Opportunities and Limits of
Remote Timing Attacks. ACM Trans. Inf. Syst. Secur., 12(3), 2009.
[104] Joan Daemen and Vincent Rijmen. The Design of Rijndael: AES - The Advanced
Encryption Standard. Springer, 2002.
[105] Ds2432 1kb protected 1-wire eeprom with sha-1 engine. http://www.maxim-ic.com/
datasheet/index.mvp/id/2914, 2012.
[106] Ds5002fp secure microprocessor chip. http://www.maxim-ic.com/datasheet/
index.mvp/id/2949, 2012.
[107] DARPA. TRUST in Integrated Circuits (TIC) - Proposer Information Pamphlet.
http://www.darpa.mil/MTO/solicitations/baa07-24/index.html, 2007.
[108] Guerric Meurice de Dormale, Philippe Bulens, and Jean-Jacques Quisquater. An
Improved Montgomery Modular Inversion Targeted for Efficient Implementation on
FPGA. In O. Diessel and J.A. Williams, editors, International Conference on Field-
Programmable Technology - FPT 2004, pages 441–444, 2004.
[109] Defense Science Board. Task force on high performance microchip supply. http:
//www.acq.osd.mil/dsb/reports/200502HPMSReportFinal.pdf, 2005.
[110] John Demme, Robert Martin, Adam Waksman, and Simha Sethumadhavan. Side-
Channel Vulnerability Factor: A Metric for Measuring Information Leakage. In ISCA
[1], pages 106–117.
[111] W. Diffie and M. Hellman. New Directions in Cryptography. In IEEE Transactions
on Information Theory (22), pages 644–654. IEEE, 1976.
[112] Yevgeniy Dodis, Leonid Reyzin, and Adam Smith. Fuzzy Extractors: How to Generate
Strong Keys from Biometrics and Other Noisy Data. In C. Cachin and J.L. Camenisch,
editors, Advances in Cryptology – EUROCRYPT 2004, volume 3027 of Lecture Notes
in Computer Science, pages 523–540. 2004.
[113] Leonid Domnitser, Aamer Jaleel, Jason Loew, Nael B. Abu-Ghazaleh, and Dmitry
Ponomarev. Non-monopolizable caches: Low-complexity Mitigation of Cache Side-
Channel Attacks. TACO, 8(4):35, 2012.
[114] D. Du, S. Narasimhan, R. S. Chakraborty, and S. Bhunia. Self-referencing: a scalable
side-channel approach for hardware Trojan detection. In Proc. of the International
Workshop on Cryptographic Hardware and Embedded Systems (CHES’10), pages 173–
187, Berlin, Heidelberg, 2010.
[115] Zoya Dyka and Peter Langendoerfer. Area Efficient Hardware Implementation of
Elliptic Curve Cryptography by Iteratively Applying Karatsuba’s Method. In DATE
’05: Proceedings of the conference on Design, Automation and Test in Europe, pages
70–75, Washington, DC, USA, 2005. IEEE Computer Society.
[116] Z. Chen et al. Hardware Trojan Designs on BASYS FPGA Board. CSAW Embedded
Systems Challenge, 2008. http://isis.poly.edu/~vikram/vt.pdf.
[117] Federal Information Processing Standards Publication 197. Announcing the Advanced
Encryption Standard (AES), 2001.
[118] Federal Information Processing Standards Publication 46-2. Announcing the Standard
for Data Encryption Standard (DES), 1993.
[119] H. Feistel. Cryptography and Computer Privacy. Scientific American, 228(5):15–23,
May 1973.
[120] M. Feldhofer, J. Wolkerstorfer, and V. Rijmen. AES Implementation on a Grain of
Sand. Information Security, IEE Proceedings, 152(1):13–20, 2005.
[121] Jacques J. A. Fournier and Michael Tunstall. Cache Based Power Analysis Attacks on
AES. In Lynn Margaret Batten and Reihaneh Safavi-Naini, editors, ACISP, volume
4058 of Lecture Notes in Computer Science, pages 17–28. Springer, 2006.
[122] John B. Fraleigh. First Course in Abstract Algebra. Addison-Wesley, Boston, MA,
USA, 2002.
[123] W. F. Friedman. The index of coincidence and its application in cryptography. In
Riverbank Publication, Riverbank Labs. Reprinted by Aegean Park Press, 1920.
[124] Hideo Fujiwara and Marie Engelene J. Obien. Secure and testable scan design using
extended de Bruijn graphs. In ASPDAC ’10: Proceedings of the 2010 Asia and South
Pacific Design Automation Conference, pages 413–418, 2010.
[125] Katsuya Fujiwara, Hideo Fujiwara, Marie Engelene J. Obien, and Hideo Tamamoto.
SREEP: Shift register equivalents enumeration and synthesis program for secure scan
design. Design and Diagnostics of Electronic Circuits and Systems, 0:193–196, 2010.
[126] G. Piret and J. J. Quisquater. A Differential Fault Attack Technique against SPN
Structures, with Application to the AES and Khazad. In CHES 2003, pages 77–88.
LNCS 2779, 2003.
[127] Jean-François Gallais, Ilya Kizhvatov, and Michael Tunstall. Improved Trace-Driven
Cache-Collision Attacks against Embedded AES Implementations. In Yongwha Chung
and Moti Yung, editors, WISA, volume 6513 of Lecture Notes in Computer Science,
pages 243–257. Springer, 2010.
[128] Blaise Gassend, Dwaine Clarke, Marten van Dijk, and Srinivas Devadas. Silicon physi-
cal random functions. In Proc. of ACM Conference on Computer and Communications
Security, pages 148–160, 2002.
[129] Blaise Gassend, Daihyun Lim, Dwaine Clarke, Marten van Dijk, and Srinivas Devadas.
Identification and authentication of integrated circuits: Research Articles. Concur-
rency and Computation: Practice & Experience, 16(11):1077–1098, 2004.
[130] M. J. Geuzebroek, J. Th. van der Linden, and A. J. van de Goor. Test point insertion
that facilitates ATPG in reducing test time and data volume. In ITC’02: Proceedings
of the International Test Conference, pages 138–147, 2002.
[131] Benedikt Gierlichs, Jörn-Marc Schmidt, and Michael Tunstall. Infective Computa-
tion and Dummy Rounds: Fault Protection for Block Ciphers without Check-before-
Output. In Alejandro Hevia and Gregory Neven, editors, Progress in Cryptology –
LATINCRYPT 2012, volume 7533 of Lecture Notes in Computer Science, pages 305–
321. Springer, 2012.
[132] Christophe Giraud. DFA on AES. Cryptology ePrint Archive, Report 2003/008, 2003.
http://eprint.iacr.org/2003/008.
[133] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning.
Addison Wesley, 1989.
[134] O. Goldreich. Foundations of Cryptography, volume 2. Cambridge University Press,
2005.
[135] Oded Goldreich and Rafail Ostrovsky. Software Protection and Simulation on Obliv-
ious RAMs. J. ACM, 43(3):431–473, 1996.
[136] C. Grabbe, M. Bednara, J. Shokrollahi, J. Teich, and J. von zur Gathen. FPGA
Designs of Parallel High Performance GF (2233 ) Multipliers. In Proc. of the IEEE
International Symposium on Circuits and Systems (ISCAS-03), volume II, pages 268–
271, Bangkok, Thailand, May 2003.
[137] Johann Großschädl and Guy-Armand Kamendje. Instruction Set Extension for Fast
Elliptic Curve Cryptography over Binary Finite Fields GF (2m ). In ASAP, pages
455–. IEEE Computer Society, 2003.
[138] Johann Großschädl and Erkay Savas. Instruction Set Extensions for Fast Arithmetic
in Finite Fields GF (p) and GF (2m ). In Marc Joye and Jean-Jacques Quisquater,
editors, CHES, volume 3156 of Lecture Notes in Computer Science, pages 133–147.
Springer, 2004.
[139] Jorge Guajardo, Sandeep S. Kumar, Geert Jan Schrijen, and Pim Tuyls. FPGA
intrinsic PUFs and their use for IP protection. In Proc. of Cryptographic Hardware
and Embedded Systems Workshop (CHES), volume 4727 of LNCS, pages 63–80, 2007.
[140] Jorge Guajardo, Sandeep S. Kumar, Geert Jan Schrijen, and Pim Tuyls. Physical
unclonable functions and public-key crypto for FPGA IP protection. In Field Pro-
grammable Logic and Applications, pages 189–195, August 2007.
[141] Jorge Guajardo and Christof Paar. Itoh-Tsujii Inversion in Standard Basis and Its
Application in Cryptography and Codes. Des. Codes Cryptography, 25(2):207–216,
2002.
[142] David Gullasch, Endre Bangerter, and Stephan Krenn. Cache Games - Bringing
Access-Based Cache Attacks on AES to Practice. In IEEE Symposium on Security
and Privacy, pages 490–505. IEEE Computer Society, 2011.
[143] Xiaofei Guo and R. Karri. Invariance-based Concurrent Error Detection for Advanced
Encryption Standard. In DAC, pages 573–578, Jun 2012.
[144] Nils Gura, Sheueling Chang Shantz, Hans Eberle, Sumit Gupta, Vipul Gupta, Daniel
Finchelstein, Edouard Goupy, and Douglas Stebila. An End-to-End Systems Approach
to Elliptic Curve Cryptography. In CHES ’02: Revised Papers from the 4th Interna-
tional Workshop on Cryptographic Hardware and Embedded Systems, pages 349–365,
London, UK, 2003. Springer-Verlag.
[145] Torben Hagerup and C. Rüb. A guided tour of Chernoff bounds. In Information
Processing Letters, Volume 33, Issue 6, pages 305–308. Elsevier North-Holland, Inc.,
1990.
[146] Harshal Tupsamudre, Shikha Bisht, and Debdeep Mukhopadhyay. Destroying fault
invariant with randomization - a countermeasure for AES against differential fault
attacks. In CHES, 2014.
[147] Ryan Helinski, Dhruva Acharyya, and Jim Plusquellic. A physical unclonable function
defined using power distribution system equivalent resistance variations. In Proc. of
46th Annual Design Automation Conference (DAC), pages 676–681, 2009.
[148] David Hely, Maurin Augagneur, Yves Clauzel, and Jeremy Dubeuf. A physical un-
clonable function based on setup time violation. In Proc. of IEEE 30th International
Conference on Computer Design (ICCD), pages 135–138, 2012.
[149] David Hely, Marie-Lise Flottes, Frederic Bancel, Bruno Rouzeyre, Nicolas Berard,
and Michel Renovell. Scan design and secure chip. In IOLTS ’04: Proceedings of the
International On-Line Testing Symposium, 10th IEEE, page 219, Washington, DC,
USA, 2004. IEEE Computer Society.
[150] Alireza Hodjat and Ingrid Verbauwhede. Area-throughput trade-offs for fully
pipelined 30 to 70 Gbits/s AES processors. IEEE Trans. Comput., 55(4):366–372, April
2006.
[151] Y. Hori, T. Yoshida, T. Katashita, and A. Satoh. Quantitative and Statistical Perfor-
mance Evaluation of Arbiter Physical Unclonable Functions on FPGAs. In Proceedings
of International Conference on Reconfigurable Computing and FPGAs (ReConFig),
pages 298–303, 2010.
[152] Takashi Horiyama, Masaki Nakanishi, Hirotsugu Kajihara, and Shinji Kimura. Fold-
ing of Logic Functions and its Application to Look Up Table Compaction. ICCAD,
00:694–697, 2002.
[153] T.W. Hou, H.Y. Chen, and M.H. Tsai. Three control flow obfuscation methods for
Java software. IEE Proceedings on Software, 153(2):80–86, April 2006.
[154] Wei-Ming Hu. Lattice scheduling and covert channels. In Research in Security and
Privacy, 1992. Proceedings., 1992 IEEE Computer Society Symposium on, pages
52–61, May 1992.
[155] Y. L. Huang, F.S. Ho, H.Y. Tsai, and H.M. Kao. A control flow obfuscation method
to discourage malicious tampering of software codes. In ASIACCS ’06: Proceedings of
the 2006 ACM Symposium on Information, Computer and Communications Security,
pages 362–362, 2006.
[156] T. Huffmire, B. Brotherton, W. Gang, T. Sherwood, R. Kastner, T. Levin, T. Nguyen,
and C. Irvine. Moats and Drawbridges: An isolation primitive for reconfigurable
hardware based systems. In SP ’07: Proceedings of the IEEE Sympusium on Security
and Privacy, pages 281–295, 2007.
[157] T. Huffmire, C. Irvine, T. D. Nguyen, T. Levin, R. Kastner, and T. Sherwood.
Handbook of FPGA Design Security. Springer, Dordrecht, 2010.
[158] Michael Hutton, Jay Schleicher, David M. Lewis, Bruce Pedersen, Richard Yuan,
Sinan Kaptanoglu, Gregg Baeckler, Boris Ratchev, Ketan Padalia, Mark Bourgeault,
Andy Lee, Henry Kim, and Rahul Saini. Improving FPGA Performance and Area
Using an Adaptive Logic Module. In FPL, pages 135–144, 2004.
[159] Xilinx Inc. Virtex-II Platform FPGA User Guide (v 2.2).
http://www.xilinx.com/support/documentation/virtex-ii.htm, 2012.
[160] Toshiya Itoh and Shigeo Tsujii. A Fast Algorithm For Computing Multiplicative
Inverses in GF (2m ) Using Normal Bases. Inf. Comput., 78(3):171–177, 1988.
[161] J. Lee, M. Tehranipoor, C. Patel, and J. Plusquellic. Securing scan design using lock
and key technique. In DFT ’05: Proceedings of 20th IEEE International Symposium on
Defect and Fault Tolerance in VLSI Systems, pages 51–62, 2005.
[162] M.H. Jakubowski, C.W. Saw, and R. Venkatesan. Tamper-tolerant software: Modeling
and implementation. In IWSEC ’09: Proceedings of the International Workshop on
Security: Advances in Information and Computer Security, pages 125–139, 2009.
[163] K. Järvinen and J. Skytta. On parallelization of high-speed processors for elliptic curve
cryptography. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on,
16(9):1162–1175, September 2008.
[164] Kimmo Järvinen. On repeated squarings in binary fields. In Michael Jacobson, Vin-
cent Rijmen, and Reihaneh Safavi-Naini, editors, Selected Areas in Cryptography, vol-
ume 5867 of Lecture Notes in Computer Science, pages 331–349. Springer Berlin /
Heidelberg, 2009.
[165] D. Jayasinghe, J. Fernando, R. Herath, and R. Ragel. Remote Cache Timing Attack
on Advanced Encryption Standard and Countermeasures. In Information and Au-
tomation for Sustainability (ICIAFs), 2010 5th International Conference on, pages
177–182, December 2010.
[166] Jean Da Rolt, Giorgio Di Natale, Marie-Lise Flottes, and Bruno Rouzeyre. New security
threats against chips containing scan chain structures. In HOST ’11: Proceedings of
IEEE Symposium on Hardware-Oriented Security and Trust, pages 105–110, 2011.
[167] Y. Jin and Y. Makris. Hardware Trojan detection using path delay fingerprint. In Proc.
IEEE International Workshop on Hardware-Oriented Security and Trust (HOST’08),
pages 51–57, Washington, DC, USA, 2008.
[168] H.G. Joepgen and S. Krauss. Software by means of the protprog method. Elecktronik,
42:52–56, August 1993.
[170] M. Joye, P. Manet, and JB. Rigaud. Strengthening Hardware AES Implementations
against Fault Attack. IET Information Security, 1:106–110, 2007.
[171] D. Kahn. The codebreakers: The story of secret writing. New York: Macmillan
Publishing Co, 1967.
[172] A.B. Kahng, J. Lach, W.H. Mangione-Smith, S. Mantik, I.L. Markov, M. Potkonjak,
P. Tucker, H. Wang, and G. Wolfe. Constraint-based watermarking techniques for
design IP protection. IEEE Transactions on CAD, 20(10):1236–1252, October 2001.
[173] Burton S. Kaliski. The Montgomery Inverse and its Applications. IEEE Transactions
on Computers, 44(8):1064–1065, 1995.
[174] R. Kapoor. Security vs. test quality: Are they mutually exclusive? In ITC ’04:
Proceedings of the International Test Conference, page 1413, Washington, DC, USA,
2004. IEEE Computer Society.
[175] D. Karakoyunlu and B. Sunar. Differential template attacks on PUF enabled cryp-
tographic devices. In Proceedings of IEEE International Workshop on Information
Forensics and Security (WIFS), 2010.
[177] Mark Karpovsky, Konrad J. Kulikowski, and Alexander Taubin. Differential Fault
Analysis Attack Resistant Architectures for the Advanced Encryption Standard. In
CARDIS, pages 177–192, Aug 2004.
[193] Sagar Khurana, Souvik Kolay, Chester Rebeiro, and Debdeep Mukhopadhyay. Light-
weight Cipher Implementations on Embedded Processors. In Design and Technology
of Integrated Systems (DTIS). IEEE Computer Society, 2013.
[194] C. Kim. Improved Differential Fault Analysis on AES Key Schedule. Information
Forensics and Security, IEEE Transactions on, PP(99):1, 2011.
[195] Chang Hoon Kim, Soonhak Kwon, and Chun Pyo Hong. FPGA Implementation of
High Performance Elliptic Curve Cryptographic processor over GF (2163 ). Journal of
Systems Architecture - Embedded Systems Design, 54(10):893–900, 2008.
[196] Chong Hee Kim. Differential fault analysis against AES-192 and AES-256 with mini-
mal faults. In Luca Breveglieri, Marc Joye, Israel Koren, David Naccache, and Ingrid
Verbauwhede, editors, Fault Diagnosis and Tolerance in Cryptography — FDTC 2010,
pages 3–9. IEEE Computer Society, 2010.
[197] Chong Hee Kim. Differential fault analysis of AES: Toward reducing number of faults.
Cryptology ePrint Archive, Report 2011/178, 2011. http://eprint.iacr.org/.
[198] Chong Hee Kim and Jean-Jacques Quisquater. New Differential Fault Analysis on
AES Key Schedule: Two Faults Are Enough. In Gilles Grimaud and François-Xavier
Standaert, editors, CARDIS, volume 5189 of Lecture Notes in Computer Science,
pages 48–60. Springer, 2008.
[199] S. T. King, J. Tucek, A. Cozzie, C. Grier, W. Jiang, and Y. Zhou. Designing and
implementing malicious hardware. In LEET’08: Proceedings of the Usenix Workshop
on Large-Scale Exploits and Emergent Threats, pages 5:1–5:8, 2008.
[200] Alexander Klimov and Adi Shamir. A New Class of Invertible Mappings. In Burton
S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, CHES, volume 2523 of
Lecture Notes in Computer Science, pages 470–483. Springer, 2002.
[201] Donald E. Knuth. The Art of Computer Programming Volumes 1-3 Boxed Set.
Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1998.
[204] Jingfei Kong, Onur Aciiçmez, Jean-Pierre Seifert, and Huiyang Zhou. Deconstructing
New Cache Designs for Thwarting Software Cache-based Side Channel Attacks. In
Trent Jaeger, editor, CSAW, pages 25–34. ACM, 2008.
[205] Jingfei Kong, Onur Aciiçmez, Jean-Pierre Seifert, and Huiyang Zhou. Hardware-
Software Integrated Approaches to Defend Against Software Cache-Based Side Chan-
nel Attacks. In HPCA, pages 393–404. IEEE Computer Society, 2009.
[206] Jingfei Kong, Onur Aciiçmez, Jean-Pierre Seifert, and Huiyang Zhou. Architecting
Against Software Cache-based Side Channel Attacks. IEEE Transactions on Com-
puters, 99(PrePrints), 2012.
[208] F. Koushanfar. Provably secure active IC metering techniques for piracy avoidance
and digital rights management. IEEE Transactions on Information Forensics and
Security, 7(1):51–63, February 2012.
[211] Aswin Krishna, Seetharam Narasimhan, Xinmu Wang, and Swarup Bhunia. MECCA:
A Robust Low-Overhead PUF Using Embedded Memory Array. In Cryptographic
Hardware and Embedded Systems (CHES), volume 6917 of Lecture Notes in Com-
puter Science, pages 407–420. Springer, 2011.
[212] R. Kumar, V.C. Patil, and S. Kundu. Design of Unique and Reliable Physically
Unclonable Functions Based on Current Starved Inverter Chain. In Proc. of IEEE
Computer Society Annual Symposium on VLSI (ISVLSI), pages 224–229, 2011.
[213] S.S. Kumar, J. Guajardo, R. Maes, G.-J. Schrijen, and P. Tuyls. Extended abstract:
The butterfly PUF protecting IP on every FPGA. In Proc. of IEEE International
Workshop on Hardware-Oriented Security and Trust (HOST), pages 67–70, 2008.
[214] Klaus Kursawe, Ahmad-Reza Sadeghi, Dries Schellekens, Boris Škorić, and Pim Tuyls.
Reconfigurable Physical Unclonable Functions – Enabling Technology for Tamper-
Resistant Storage. In Proc. of 2nd IEEE International Workshop on Hardware-
Oriented Security and Trust (HOST), pages 22–29, 2009.
[215] J. Lach, W.H. Mangione-Smith, and M. Potkonjak. Robust FPGA intellectual prop-
erty protection through multiple small watermarks. In Proceedings of the 36th annual
ACM/IEEE Design Automation Conference, DAC ’99, pages 831–836, New York, NY,
1999. ACM.
[216] Cédric Lauradoux. Collision Attacks on Processors with Cache and Countermeasures.
In Christopher Wolf, Stefan Lucks, and Po-Wah Yau, editors, WEWoRC, volume 74
of LNI, pages 76–85. GI, 2005.
[217] Jae W. Lee, Daihyun Lim, Blaise Gassend, G. Edward Suh, Marten van Dijk, and
Srinivas Devadas. A technique to build a secret key in integrated circuits for iden-
tification and authentication application. In Proceedings of the Symposium on VLSI
Circuits, pages 176–179, 2004.
[218] Ruby B. Lee, Zhijie Shi, Yiqun Lisa Yin, Ronald L. Rivest, and Matthew J. B. Rob-
shaw. On Permutation Operations in Cipher Design. In ITCC (2), pages 569–577.
IEEE Computer Society, 2004.
[219] Wei Li, Dawu Gu, Yong Wang, Juanru Li, and Zhiqiang Liu. An Extension of Dif-
ferential Fault Analysis on AES. In Third International Conference on Network and
System Security, pages 443–446. NSS, 2009.
[220] D. Lim. Extracting secret keys from integrated circuits. Master’s thesis, Massachusetts
Institute of Technology, 2004.
[221] L. Lin, W. Burleson, and C. Parr. MOLES: Malicious off-chip leakage enabled by
side-channels. In ICCAD’09: Proceedings of the International Conference on CAD,
pages 117–122, 2009.
[222] L. Lin, M. Kasper, T. Güneysu, C. Paar, and W. Burleson. Trojan side-channels:
Lightweight Hardware Trojans through side-channel engineering. volume 5747 of
Lecture Notes in Computer Science, pages 382–395, 2009.
[223] C. Linn and S. Debray. Obfuscation of executable code to improve resistance to static
disassembly. In Proceedings of the ACM Conference on Computer and Communica-
tions Security, pages 290–299, 2003.
[224] Keith Lofstrom, W. Robert Daasch, and Donald Taylor. IC Identification Circuit
Using Device Mismatch. In Proc. of ISSCC, pages 372–373, 2000.
[225] Victor Lomné, Thomas Roche, and Adrian Thillard. On the Need of Randomness in
Fault Attack Countermeasures - Application to AES. In Guido Bertoni and Benedikt
Gierlichs, editors, Fault Diagnosis and Tolerance in Cryptography – FDTC 2012, pages
85–94. IEEE Computer Society, 2012.
[226] Julio López and Ricardo Dahab. Fast multiplication on elliptic curves over GF (2m )
without precomputation. In Proceedings of the First International Workshop on Cryp-
tographic Hardware and Embedded Systems, CHES ’99, pages 316–327, London, UK,
UK, 1999. Springer-Verlag.
[227] Julio López and Ricardo Dahab. Improved Algorithms for Elliptic Curve Arithmetic
in GF (2n ). In SAC ’98: Proceedings of the Selected Areas in Cryptography, pages
201–212, London, UK, 1999. Springer-Verlag.
[228] Jonathan Lutz and Anwarul Hasan. High Performance FPGA based Elliptic Curve
Cryptographic Co-Processor. In ITCC ’04: Proceedings of the International Confer-
ence on Information Technology: Coding and Computing (ITCC’04) Volume 2, page
486, Washington, DC, USA, 2004. IEEE Computer Society.
[229] B. Lynn, M. Prabhakaran, and A. Sahai. Positive results and techniques for obfusca-
tion. Cryptology ePrint Archive, Report 2004/060, 2004. http://eprint.iacr.org/.
[230] P. Lysaght. Dynamic reconfiguration of Xilinx FPGAs: enhanced architectures, design
methodologies, & CAD tools. Xilinx, Inc., 2012. http://www.xilinx.com/univ/
FPL06_Invited_Presentation_PLysaght.pdf.
[231] Chinese firms favoring soft IP over hard cores. http://www.eetasia.com/ART_
8800440032_480100_NT_ac94df1c.HTM, 2011.
[232] R. Maes, V. Rozic, I. Verbauwhede, P. Koeberl, E. van der Sluis, and V. van der
Leest. Experimental evaluation of Physically Unclonable Functions in 65 nm CMOS.
In Proc. of the ESSCIRC, pages 486–489, 2012.
[233] Roel Maes, Pim Tuyls, and Ingrid Verbauwhede. Intrinsic PUFs from Flip-flops on
Reconfigurable Devices. In Proc. of 3rd Benelux Workshop on Information and System
Security (WISSec), page 17, 2008.
[234] Roel Maes and Ingrid Verbauwhede. Physically Unclonable Functions: A Study on
the State of the Art and Future Research Directions. In Ahmad-Reza Sadeghi and
David Naccache, editors, Towards Hardware-Intrinsic Security, Information Security
and Cryptography, pages 3–37. Springer, 2010.
[235] V. Maingot and R. Leveugle. Influence of Error Detecting or Correcting Codes on the
Sensitivity to DPA of an AES S-Box. In ICSES, pages 1–5, 2009.
[237] Abhranil Maiti, Vikash Gunreddy, and Patrick Schaumont. A Systematic Method
to Evaluate and Compare the Performance of Physical Unclonable Functions. IACR
Cryptology ePrint Archive, 2011:657, 2011.
[238] A. Maiti and P. Schaumont. Improving the quality of a Physical Unclonable Function
using configurable Ring Oscillators. In FPL’09: International Conference on Field
Programmable Logic and Applications, pages 703–707, 2009.
[239] Mehrdad Majzoobi, Golsa Ghiaasi, Farinaz Koushanfar, and Sani R. Nassif. Ultra-
low power current-based PUF. In International Symposium on Circuits and Systems
(ISCAS), pages 2071–2074. 2011.
[240] Mehrdad Majzoobi, Farinaz Koushanfar, and Miodrag Potkonjak. Lightweight secure
PUFs. In Proc. of the 2008 IEEE/ACM International Conference on Computer-Aided
Design (ICCAD), pages 670–673, 2008.
[242] Stefan Mangard, Thomas Popp, and Berndt M. Gammel. Side-Channel Leakage of
Masked CMOS Gates. In Alfred Menezes, editor, Topics in Cryptology - CT-RSA
2005, The Cryptographers’ Track at the RSA Conference 2005, San Francisco, CA,
USA, February 14-18, 2005, Proceedings, Lecture Notes in Computer Science (LNCS),
pages 351 – 365. Springer, 2005.
[243] Stefan Mangard and Kai Schramm. Pinpointing the side-channel leakage of masked
AES hardware implementations. In CHES, pages 76–90, 2006.
[244] Stefan Mangard and François-Xavier Standaert, editors. Cryptographic Hardware and
Embedded Systems, CHES 2010, 12th International Workshop, Santa Barbara, CA,
USA, August 17-20, 2010. Proceedings, volume 6225 of Lecture Notes in Computer
Science. Springer, 2010.
[245] Robert Martin, John Demme, and Simha Sethumadhavan. TimeWarp: Rethinking
Timekeeping and Performance Monitoring Mechanisms to Mitigate Side-Channel At-
tacks. In ISCA [1], pages 118–129.
[246] B. Mathew and D. G. Saab. Combining multiple DFT schemes with test generation.
IEEE Transactions on CAD, 18(6):685–696, 1999.
[248] Mitsuru Matsui. New Block Encryption Algorithm MISTY. In Biham [49], pages
54–68.
[249] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone. Handbook of Applied
Cryptography. CRC Press, 2001.
[250] Dominik Merli, Dieter Schuster, Frederic Stumpf, and Georg Sigl. Side-channel anal-
ysis of PUFs and Fuzzy extractors. In Proceedings of the 4th International Conference
on Trust and Trustworthy Computing, Pittsburgh, PA, TRUST’11, pages 33–47, 2011.
[251] Thomas S. Messerges, Ezzat A. Dabbish, and Robert H. Sloan. Examining Smart-
Card Security under the Threat of Power Analysis Attacks. IEEE Trans. Comput.,
51(5):541–552, 2002.
[252] P. L. Montgomery. Speeding the Pollard and elliptic curve methods of factorization.
In Mathematics of Computation, volume 48, pages 243–264, January 1987.
[253] Peter L. Montgomery. Five, Six, and Seven-Term Karatsuba-Like Formulae. IEEE
Transactions on Computers, 54(3):362–369, 2005.
[254] W. A. Moore and P. A. Kayfes. US Patent 7213142 - system and method to initialize
registers with an EEPROM stored boot sequence. http://www.patentstorm.us/
patents/7213142/description.html, 2007.
[255] A. Moradi, A. Barenghi, T. Kasper, and C. Paar. On the vulnerability of FPGA bit-
stream encryption against power analysis attacks: extracting keys from Xilinx Virtex-
II FPGAs. In CCS’11: Proceedings of the ACM Conference on Computer and Com-
munications Security, pages 111–123, 2011.
[256] Amir Moradi, Mohammad T. Manzuri Shalmani, and Mahmoud Salmasizadeh. A
Generalized Method of Differential Fault Attack against AES Cryptosystem. In CHES,
pages 91–100, 2006.
[257] C.J. Morford. BitMaT – Bitstream Manipulation Tool for Xilinx FPGAs. Master’s
thesis, Virginia Polytechnic Institute and State University, 2005.
[258] Mehran Mozaffari-Kermani and Arash Reyhani-Masoleh. Parity-Based Fault Detec-
tion Architecture of S-box for Advanced Encryption Standard. In DFT, pages 572–580,
Oct 2006.
[259] Mehran Mozaffari-Kermani and Arash Reyhani-Masoleh. A Lightweight Concurrent
Error Detection Scheme for the AES S-boxes Using Normal Basis. In Proc. CHES,
pages 113–129, Aug 2008.
[260] Mehran Mozaffari-Kermani and Arash Reyhani-Masoleh. A Lightweight High-
Performance Fault Detection Scheme for the Advanced Encryption Standard Using
Composite Field. IEEE Trans. VLSI Systems, 19(1):85–91, 2011.
[261] Mukesh Agarwal, Sandip Karmakar, Dhiman Saha, and Debdeep Mukhopadhyay. Scan
based side channel attacks on stream ciphers and their counter-measures. In Indocrypt
’08: Proceedings of Progress in Cryptology-Indocrypt, LNCS 5365, pages 226–238,
2008.
[262] D. Mukhopadhyay. An improved fault based attack of the Advanced Encryption
Standard. In AFRICACRYPT’09: Progress in Cryptology, pages 421–434, 2009.
[263] D. Mukhopadhyay, S. Banerjee, D. RoyChowdhury, and B. B. Bhattacharya. Cryp-
toscan: A secured scan chain architecture. In ATS ’05: Proceedings of the 14th Asian
Test Symposium on Asian Test Symposium, pages 348–353, Washington, DC, USA,
2005. IEEE Computer Society.
[264] Debdeep Mukhopadhyay. An Improved Fault Based Attack of the Advanced Encryp-
tion Standard. In AFRICACRYPT, pages 421–434, 2009.
[265] Debdeep Mukhopadhyay. An Improved Fault Based Attack of the Advanced Encryp-
tion Standard. In Bart Preneel, editor, AFRICACRYPT, volume 5580 of Lecture
Notes in Computer Science, pages 421–434. Springer, 2009.
[266] Debdeep Mukhopadhyay and Dipanwita Roy Chowdhury. An efficient end to end
design of Rijndael cryptosystem in 0.18 µm CMOS. In VLSI Design, pages 405–410, 2005.
[267] Julian Murphy. Clockless physical unclonable functions. In Proc. of 5th international
conference on Trust and Trustworthy Computing, TRUST’12, pages 110–121, 2012.
[268] F. N. Najm. Transition Density: a new measure of activity in digital circuits. IEEE
Transactions on CAD, 14(2):310–323, February 1993.
[269] Giorgio Di Natale, Marie-Lise Flottes, and Bruno Rouzeyre. A Novel Parity Bit Scheme
for SBox in AES Circuits. In DDECS, pages 1–5, Apr 2007.
[270] Giorgio Di Natale, Marie-Lise Flottes, and Bruno Rouzeyre. On-Line Self-Test of AES
Hardware Implementation. WDSN, 2007.
[271] Michael Neve, Jean-Pierre Seifert, and Zhenghong Wang. Cache Time-Behavior Anal-
ysis on AES, 2006.
[272] Michael Neve and Jean-Pierre Seifert. Advances on Access-Driven Cache Attacks on
AES. In Eli Biham and Amr M. Youssef, editors, Selected Areas in Cryptography,
volume 4356 of Lecture Notes in Computer Science, pages 147–162. Springer, 2006.
[273] Michael Neve, Jean-Pierre Seifert, and Zhenghong Wang. A Refined Look at Bern-
stein’s AES Side-Channel Analysis. In Ferng-Ching Lin, Der-Tsai Lee, Bao-Shuh Lin,
Shiuhpyng Shieh, and Sushil Jajodia, editors, ASIACCS, page 369. ACM, 2006.
[274] J. Note and E. Rannaud. From the bitstream to the netlist. In FPGA’08: Proceedings
of the International ACM/SIGDA Symposium on Field Programmable Gate Arrays,
pages 264–271, 2008.
[277] A.L. Oliveira. Techniques for the creation of digital watermarks in sequential circuit
designs. IEEE Transactions on CAD, 20(9):1101–1117, September 2001.
[279] Gerardo Orlando and Christof Paar. A High Performance Reconfigurable Elliptic
Curve Processor for GF (2m ). In CHES ’00: Proceedings of the Second International
Workshop on Cryptographic Hardware and Embedded Systems, pages 41–56, London,
UK, 2000. Springer-Verlag.
[280] Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache attacks and Countermeasures:
the Case of AES. Cryptology ePrint Archive, Report 2005/271, 2005.
[281] Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache Attacks and Countermeasures:
The Case of AES. In David Pointcheval, editor, CT-RSA, volume 3860 of Lecture
Notes in Computer Science, pages 1–20. Springer, 2006.
[282] Maria Elisabeth Oswald, Stefan Mangard, Norbert Pramstaller, and Vincent Rijmen.
A Side-Channel Analysis Resistant Description of the AES S-box. In Proceedings of
Fast Software Encryption (FSE 2005), LNCS Volume 3557, pages 413–423. Springer-
Verlag, 2005.
[283] E. Ozturk, G. Hammouri, and B. Sunar. Physical unclonable function with tristate
buffers. In Proc. of IEEE International Symposium on Circuits and Systems (ISCAS),
pages 3194–3197, 2008.
[284] P. C. Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and
Other Systems. In Proceedings of CRYPTO, LNCS 1109, pages 104–113, 1996.
[285] P. Dusart, G. Letourneux, and O. Vivolo. Differential Fault Analysis on AES. In
Cryptology ePrint Archive, pages 293–306, Oct 2003.
[286] Christof Paar. Efficient VLSI Architectures for Bit-Parallel Computation in Galois
Fields. PhD thesis, Institute for Experimental Mathematics, Universität Essen, Ger-
many, June 1994.
[287] Christof Paar. A New Architecture for a Parallel Finite Field Multiplier with Low
Complexity Based on Composite Fields. IEEE Transactions on Computers, 45(7):856–
861, 1996.
[288] D. Page. Theoretical Use of Cache Memory as a Cryptanalytic Side-Channel. Cryptology
ePrint Archive, Report 2002/169, 2002.
[289] D Page. Defending Against Cache-Based Side-Channel Attacks. Information Security
Technical Report, 8(1):30–44, 2003.
[290] Dan Page. Partitioned Cache Architecture as a Side-Channel Defence Mechanism.
IACR Cryptology ePrint Archive, 2005:280, 2005.
[291] A. Papoulis and S. U. Pillai. Probability, Random Variables and Stochastic Processes
(4th ed.). McGraw–Hill, 2002.
[292] Predictive Technology Model, 2012. http://www.eas.asu.edu/~ptm/.
[293] Ravikanth S. Pappu. Physical one-way functions. PhD thesis, Massachusetts Institute
of Technology, March 2001.
[294] Ravikanth S. Pappu, Ben Recht, Jason Taylor, and Niel Gershenfeld. Physical one-way
functions. Science, 297:2026–2030, 2002.
[295] S. Paul, H. Mahmoodi, and S. Bhunia. Low-overhead Fmax calibration at multiple
operating points using delay sensitivity based path selection. ACM Transactions on
Design Automation of Electronic Systems, 15(2):19:1–19:34, February 2010.
[296] Paulo S. L. M. Barreto. The AES Block Cipher in C++.
[297] David Peacham and Byron Thomas. “A DFA attack against the AES key schedule”.
SiVenture White Paper 001, 26 October, 2006.
[298] Colin Percival. Cache Missing for Fun and Profit. In Proc. of BSDCan 2005, 2005.
[299] Steffen Peter and Peter Langendörfer. An efficient polynomial multiplier in GF (2m )
and its application to ECC designs. In DATE ’07: Proceedings of the conference on
Design, automation and test in Europe, pages 1253–1258, San Jose, CA, USA, 2007.
EDA Consortium.
[300] G. Piret and J.J. Quisquater. A Differential Fault Attack Technique against SPN
Structures, with Application to the AES and Khazad. In CHES, pages 77–88, Sept
2003.
[301] Rishabh Poddar, Amit Datta, and Chester Rebeiro. A Cache Trace Attack on
CAMELLIA. In Marc Joye, Debdeep Mukhopadhyay, and Michael Tunstall, edi-
tors, InfoSecHiComNet, volume 7011 of Lecture Notes in Computer Science, pages
144–156. Springer, 2011.
[302] I. Pomeranz and S. M. Reddy. A measure of quality for n-detection test sets. IEEE
Transactions on Computers, 53(11):1497–1503, 2004.
[303] Reinhard Posch. Protecting Devices by Active Coating. Journal of Universal Com-
puter Science, 4(7):652–668, 1998.
[304] Norbert Pramstaller, Stefan Mangard, Sandra Dominikus, and Johannes Wolkerstor-
fer. Efficient AES implementations on ASICs and FPGAs. In Hans Dobbertin, Vincent
Rijmen, and Aleksandra Sowa, editors, AES Conference, volume 3373 of Lecture Notes
in Computer Science, pages 98–112. Springer, 2004.
[306] Qiong Pu and Jianhua Huang. A Microcoded Elliptic Curve Processor for GF (2m )
Using FPGA Technology. In Communications, Circuits and Systems Proceedings,
2006 International Conference on, volume 4, pages 2771–2775, June 2006.
[307] R. Rivest, A. Shamir and L. Adleman. A Method for Obtaining Digital Signatures
and Public-Key Cryptosystems. Communications of the ACM, Previously released as
an MIT "Technical Memo" in April 1977, 21(2):120–126, 1978.
[309] R. M. Rad, X. Wang, M. Tehranipoor, and J. Plusquellic. Power supply signal cali-
bration techniques for improving detection resolution to hardware Trojans. In Proc.
IEEE/ACM International Conference on Computer-Aided Design (ICCAD’08), pages
632–639, Piscataway, NJ, USA, 2008.
[310] D. Rai and J. Lach. Performance of delay-based Trojan detection techniques under
parameter variations. In Proc. IEEE International Workshop on Hardware-Oriented
Security and Trust (HOST’09), pages 58–65, Washington, DC, USA, 2009.
[312] C. Rebeiro and D. Mukhopadhay. Boosting Profiled Cache Timing Attacks with Apri-
ori Analysis. Information Forensics and Security, IEEE Transactions on, PP(99):1,
2012.
[313] Chester Rebeiro, Mainack Mondal, and Debdeep Mukhopadhyay. Pinpointing Cache
Timing Attacks on AES. In VLSI Design, pages 306–311. IEEE Computer Society,
2010.
[314] Chester Rebeiro and Debdeep Mukhopadhyay. High speed compact elliptic curve
cryptoprocessor for fpga platforms. In INDOCRYPT, pages 376–388, 2008.
[315] Chester Rebeiro and Debdeep Mukhopadhyay. Power attack resistant efficient fpga
architecture for karatsuba multiplier. In VLSI Design, pages 706–711, 2008.
[316] Chester Rebeiro and Debdeep Mukhopadhyay. Power Attack Resistant Efficient
FPGA Architecture for Karatsuba Multiplier. In VLSID ’08: Proceedings of the 21st
International Conference on VLSI Design, pages 706–711, Washington, DC, USA,
2008. IEEE Computer Society.
[317] Chester Rebeiro and Debdeep Mukhopadhyay. Cryptanalysis of CLEFIA Using Dif-
ferential Methods with Cache Trace Patterns. In Aggelos Kiayias, editor, CT-RSA,
volume 6558 of Lecture Notes in Computer Science, pages 89–103. Springer, 2011.
[318] Chester Rebeiro, Debdeep Mukhopadhyay, Junko Takahashi, and Toshinori Fuku-
naga. Cache Timing Attacks on CLEFIA. In Bimal Roy and Nicolas Sendrier, editors,
INDOCRYPT, volume 5922 of Lecture Notes in Computer Science, pages 104–118.
Springer, 2009.
[319] Chester Rebeiro, Rishabh Poddar, Amit Datta, and Debdeep Mukhopadhyay. An
Enhanced Differential Cache Attack on CLEFIA for Large Cache Lines. In Daniel J.
Bernstein and Sanjit Chatterjee, editors, INDOCRYPT, volume 7107 of Lecture Notes
in Computer Science, pages 58–75. Springer, 2011.
[320] Chester Rebeiro, Sujoy Sinha Roy, and Debdeep Mukhopadhyay. Pushing the limits
of high-speed gf(2 m ) elliptic curve scalar multiplication on fpgas. In CHES, pages
494–511, 2012.
[321] Chester Rebeiro, Sujoy Sinha Roy, Sankara Reddy, and Debdeep Mukhopadhyay.
Revisiting the itoh-tsujii inversion algorithm for fpga platforms. IEEE Trans. VLSI
Syst., 19(8):1508–1512, 2011.
[324] H.G. Rice. Classes of recursively enumerable sets and their decision problems. Trans.
Am. Math. Soc,, 74:358–366, 1953.
[327] Thomas Ristenpart, Eran Tromer, Hovav Shacham, and Stefan Savage. Hey, you, get
off of my cloud: Exploring Information Leakage in Third-Party Compute Clouds. In
Ehab Al-Shaer, Somesh Jha, and Angelos D. Keromytis, editors, ACM Conference on
Computer and Communications Security, pages 199–212. ACM, 2009.
[329] Francisco Rodríguez-Henríquez and Çetin Kaya Koç. On Fully Parallel Karatsuba
Multipliers for GF (2m ). In Proc. of the International Conference on Computer Science
and Technology (CST), pages 405–410.
[334] A. Roy, F. Koushanfar, and I.L. Markov. Extended abstract: Circuit CAD tools as
a security threat. In HOST’08: Proceedings of the IEEE International Workshop on
Hardware Oriented Security and Trust, pages 65–66, 2008.
[335] J. A. Roy, F. Kaushanfar, and I. L. Markov. Extended abstract: circuit CAD tools
as a security threat. In HOST’08: Proceedings of the International Workshop on
Hardware-oriented Security and Trust, pages 61–62, 2008.
[336] J. A. Roy, F. Koushanfar, and I. L.Markov. EPIC : ending piracy of integrated cir-
cuits. In DATE’08: Proceedings of the Conference on Design, Automation and Test
in Europe, pages 1069–1074, 2008.
[337] Sujoy Sinha Roy, Chester Rebeiro, and Debdeep Mukhopadhyay. Theoretical modeling
of the itoh-tsujii inversion algorithm for enhanced performance on k-lut based fpgas.
In DATE, pages 1231–1236, 2011.
[338] Sujoy Sinha Roy, Chester Rebeiro, and Debdeep Mukhopadhyay. Theoretical Model-
ing of the Itoh-Tsujii Inversion Algorithm for Enhanced Performance on k-LUT based
FPGAs. In Design, Automation, and Test in Europe DATE-2011, 2011.
[341] Dhiman Saha, Debdeep Mukhopadhyay, and Dipanwita Roy Chowdhury. A Diago-
nal Fault Attack on the Advanced Encryption Standard. IACR Cryptology ePrint
Archive, page 581, 2009.
[342] Dhiman Saha, Debdeep Mukhopadhyay, and Dipanwita Roy Chowdhury. Pkdpa: An
enhanced probabilistic differential power attack methodology. In INDOCRYPT, pages
3–21, 2011.
[344] T. Sakurai and A. R. Newton. Alpha-power law MOSFET model and its applications
to CMOS inverter delay and other formulas. IEEE Journal of Solid-State Circuits,
25(2):584–594, apr 1990.
[347] Akashi Satoh, Sumio Morioka, Kohji Takano, and Seiji Munetoh. A Compact Ri-
jndael Hardware Architecture with S-Box Optimization. In Colin Boyd, editor,
ASIACRYPT, volume 2248 of Lecture Notes in Computer Science, pages 239–254.
Springer, 2001.
[348] Akashi Satoh, Sumio Morioka, Kohji Takano, and Seiji Munetoh. A compact rijndael
hardware architecture with s-box optimization. pages 239–254. Springer-Verlag, 2001.
[349] Akashi Satoh, Takeshi Sugawara, Naofumi Homma, and Takafumi Aoki. High-
Performance Concurrent Error Detection Scheme for AES Hardware. In CHES, pages
100–112, Aug 2008.
[350] Werner Schindler, Kerstin Lemke, and Christof Paar. A stochastic model for differ-
ential side channel cryptanalysis. In CHES, pages 30–46, 2005.
[351] B. Schneier. Applied Cryptography: Protocols, Algorithms and Source Code in C. John
Wiley & Sons, 2001.
[352] A. Schulman. Examining the Windows AARD detection code. Dr. Dobb’s Journal,
18, September 1993.
[354] Frank Sehnke, Christian Osendorfer, Jan Sölter, Jürgen Schmidhuber, and Ulrich
Rührmair. Policy Gradients for Cryptanalysis. In Proc. of 20th International Con-
ference on Artificial Neural Networks (ICANN), volume 6354, pages 168–177, 2010.
[355] N. Selmane, S. Guilley, and J. L. Danger. Practical Setup Time Violation Attacks on
AES. In EDCC’08: Proceedings of the European Dependable Computing Conference,
pages 91–96, 2008.
[356] Nidhal Selmane, Sylvain Guilley, and Jean-Luc Danger. Practical Setup Time Vio-
lation Attacks on AES. pages 91–96. European Dependable Computing Conference,
2008.
[357] Gaurav Sengar, Debdeep Mukhopadhyay, and Dipanwita Roy Chowdhury. Secured
flipped scan-chain model for crypto-architecture. IEEE Trans. on CAD of Integrated
Circuits and Systems, 26(11):2080–2084, 2007.
[358] R. Sever, A.N. Ismailoglu, Y.C. Tekmen, and M. Askar. A high speed asic implementa-
tion of the rijndael algorithm. In Circuits and Systems, 2004. ISCAS ’04. Proceedings
of the 2004 International Symposium on, volume 2, pages II–541–4 Vol.2, 2004.
[359] C. E. Shannon. Communication theory of secrecy systems, vol 28, no 4. In Bell System
Technical Journal, pages 656–715. Bell, 1949.
[360] Shay Gueron. Intelő Advanced Encryption Standard (AES) Instructions Set (Rev :
3.0), 2010.
[362] Y. Shi, N. Togawa, M. Yanagisawa, and T. Ohtsuki. Robust secure scan design against
scan-based differential cryptanalysis. Very Large Scale Integration (VLSI) Systems,
IEEE Transactions on, PP(99):1–15, 2011.
[363] Koichi Shimizu, Daisuke Suzuki, and Tomomi Kasuya. Glitch PUF: Extracting In-
formation from Usually Unwanted Glitches. IEICE Transactions on Fundamentals of
Electronics, Communications and Computer Sciences, E95.A(1):223–233, 2012.
[364] Daniel P. Siewiorek and Robert S. Swarz. Reliable Computer Systems: Design and
Evaluation. A K Peters/CRC Press; 3 edition, 1998.
[365] P. Simons, E. van der Sluis, and V. van der Leest. Buskeeper PUFs, a promising
alternative to D Flip-Flop PUFs. In Proc. of IEEE International Symposium on
Hardware-Oriented Security and Trust (HOST), pages 7 –12, 2012.
[366] Eric Simpson and Patrick Schaumont. Offline hardware/software authentication for
reconfigurable platforms. In Proc. of the 8th international conference on Cryptographic
Hardware and Embedded Systems (CHES), pages 311–323, 2006.
[367] Sergei P. Skorobogatov and Ross J. Anderson. Optical Fault Induction Attacks. In
proceedings of CHES, pages 2–12, Aug 2002.
[368] William Stallings. Cryptography and Network Security: Principles and Practice. Pear-
son Education, 2002.
[369] William Stallings. Cryptography and Network Security (4th Edition). Prentice-Hall,
Inc., Upper Saddle River, NJ, USA, 2005.
[370] Douglas Stinson. Cryptography: Theory and Practice, Second Edition, pages 117–154.
Chapman & Hall, CRC, London, UK, 2002.
[371] Y. Su, J. Holleman, and B. Otis. A 1.6pJ/bit 96% Stable Chip-ID Generating Cir-
cuit using Process Variations. In Proc. of IEEE International Solid-State Circuits
Conference(ISSCC ) , pages 406–611, 2007.
[372] G. Edward Suh and Srinivas Devadas. Physical unclonable functions for device au-
thentication and secret key generation. In Design Automation Conference, pages 9–14,
2007.
[373] G.E. Suh and S. Devadas. Physical unclonable functions for device authentication and
secret key generation. In DAC’07: Proceedings of the ACM/IEEE Design Automation
Conference, pages 9–14, 2007.
[375] W. Chelton T. Good and M. Benaissa. Review of stream cipher candidates from a
low resource hardware perspective.
[376] Junko Takahashi and Toshinori Fukunaga. Differential Fault Analysis on AES with
192 and 256-Bit Keys. Cryptology ePrint Archive, Report 2010/023, 2010. http:
//eprint.iacr.org/.
[377] Junko Takahashi, Toshinori Fukunaga, and Kimihiro Yamakoshi. DFA Mechanism on
the AES Key Schedule. In Luca Breveglieri, Shay Gueron, Israel Koren, David Nac-
cache, and Jean-Pierre Seifert, editors, FDTC, pages 62–74. IEEE Computer Society,
2007.
[380] Kris Tiri, Onur Aciiçmez, Michael Neve, and Flemming Andersen. An analytical
model for time-driven cache attacks. In Alex Biryukov, editor, FSE, volume 4593 of
Lecture Notes in Computer Science, pages 399–413. Springer, 2007.
[382] Elena Trichina. Combinational logic design for aes subbyte transformation on masked
data. IACR Cryptology ePrint Archive, 2003:236, 2003.
[384] Eran Tromer, Dag Arne Osvik, and Adi Shamir. Efficient Cache Attacks on AES, and
Countermeasures. Journal of Cryptology, 23(2):37–71, 2010.
[385] Dan Tsafrir, Yoav Etsion, and Dror G. Feitelson. Secretly Monopolizing the CPU
without Superuser Privileges. In Proceedings of 16th USENIX Security Symposium
on USENIX Security Symposium, SS’07, pages 17:1–17:18, Berkeley, CA, USA, 2007.
USENIX Association.
[386] Yukiyasu Tsunoo, Teruo Saito, Tomoyasu Suzaki, Maki Shigeri, and Hiroshi Miyauchi.
Cryptanalysis of DES Implemented on Computers with Cache. In Colin D. Walter,
Çetin Kaya Koç, and Christof Paar, editors, CHES, volume 2779 of Lecture Notes in
Computer Science, pages 62–76. Springer, 2003.
[387] Yukiyasu Tsunoo, Etsuko Tsujihara, Kazuhiko Minematsu, and Hiroshi Miyauchi.
Cryptanalysis of Block Ciphers Implemented on Computers with Cache. In Interna-
tional Symposium on Information Theory and Its Applications, pages 803–806, 2002.
[388] Yukiyasu Tsunoo, Etsuko Tsujihara, Maki Shigeri, Hiroyasu Kubo, and Kazuhiko
Minematsu. Improving Cache Attacks by Considering Cipher Structure. Int. J. Inf.
Sec., 5(3):166–176, 2006.
[389] Michael Tunstall and Olivier Benoît. Efficient Use of Random Delays in Embedded
Software. In Damien Sauveron, Constantinos Markantonakis, Angelos Bilas, and Jean-
Jacques Quisquater, editors, WISTP, volume 4462 of Lecture Notes in Computer
Science, pages 27–38. Springer, 2007.
[390] Michael Tunstall, Debdeep Mukhopadhyay, and Subidh Ali. Differential Fault Anal-
ysis of the Advanced Encryption Standard Using a Single Fault. In WISTP, pages
224–233, 2011.
[391] Pim Tuyls, Geert-Jan Schrijen, Boris Škorić, Jan van Geloven, Nynke Verhaegh, and
Rob Wolters. Read-proof hardware from protective coatings. In Proc. of Cryptographic
Hardware and Embedded Systems Workshop, volume 4249 of LNCS, pages 369–383,
2006.
[392] Ulrich R ührmair, Frank Sehnke, Jan S "olter, Gideon Dror, Srinivas Devadas, and
J "urgen Schmidhuber. Modeling attacks on physical unclonable functions. In Proc. of
17th ACM conference on Computer and communications security(CCS), pages 237–
249, 2010.
[394] Bhanu C. Vattikonda, Sambit Das, and Hovav Shacham. Eliminating Fine Grained
Timers in Xen. In Christian Cachin and Thomas Ristenpart, editors, CCSW, pages
41–46. ACM, 2011.
[395] Tobias Vejda, Dan Page, and Johann Großschädl. Instruction Set Extensions for
Pairing-Based Cryptography. In Tsuyoshi Takagi, Tatsuaki Okamoto, Eiji Okamoto,
and Takeshi Okamoto, editors, Pairing, volume 4575 of Lecture Notes in Computer
Science, pages 208–224. Springer, 2007.
[396] Ingrid Verbauwhede, Senior Member, Patrick Schaumont, Student Member, and
Henry Kuo. Design and performance testing of a 2.29-gb/s rijndael processor. IEEE
Journal of Solid-State Circuits, 38:569–572, 2003.
[397] Joachim von zur Gathen and Jamshid Shokrollahi. Efficient FPGA-Based Karatsuba
Multipliers for Polynomials over F2 . In Selected Areas in Cryptography, pages 359–369,
2005.
[414] Chenglu Jin Xiaofei Guo, Debdeep Mukhopadhyay and Ramesh Karri. Nrepo:normal
basis recomputing with permuted operands. Cryptology ePrint Archive, Report
2014/497, 2014. http://eprint.iacr.org/.
[415] Debdeep Mukhopadhyay Xiaofei Guo and Ramesh Karri. Provably secure concurrent
error detection against differential fault analysis. Cryptology ePrint Archive, Report
2012/552, 2012. http://eprint.iacr.org/.
[416] Debdeep Mukhopadhyay Xiaofei Guo and Ramesh Karri. Provably secure concurrent
error detection against differential fault analysis. Cryptology ePrint Archive, Report
2012/552, 2012. http://eprint.iacr.org/.
[417] Xilinx. Using Block RAM in Spartan-3 Generation FPGAs. Application Note, XAPP-
463, 2005.
[418] Xilinx. Using Look-Up Tables as Distributed RAM in Spartan-3 Generation FPGAs.
Application Note, XAPP-464, 2005.
[419] Dai Yamamoto, Kazuo Sakiyama, Mitsugu Iwamoto, Kazuo Ohta, Takao Ochiai,
Masahiko Takenaka, and Kouichi Itoh. Uniqueness Enhancement of PUF Responses
Based on the Locations of Random Outputting RS Latches. In Proc. of 13th Interna-
tional Workshop on Cryptographic Hardware and Embedded Systems (CHES), pages
390–406, October 2011.
[420] Dai Yamamoto, Kazuo Sakiyama, Mitsugu Iwamoto, Kazuo Ohta, Masahiko Take-
naka, and Kouichi Itoh. Variety enhancement of PUF responses using the locations of
random outputting RS latches. In Journal of Cryptographic Engineering, pages 1–15,
2012.
[421] B. Yang, K. Wu, and R. Karri. Secure scan: A design-for-test architecture for crypto
chips. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions
on, 25(10):2287–2293, Oct. 2006.
[422] Bo Yang, Kaijie Wu, and Ramesh Karri. Scan based side channel attack on dedicated
hardware implementations of data encryption standard. In ITC ’04: Proceedings of
the International Test Conference, pages 339–344, Washington, DC, USA, 2004. IEEE
Computer Society.
[423] Y. Yao, M. Kim, J. Li, I. L. Markov, and F. Koushanfar. ClockPUF: Physical Unclon-
able Functions based on Clock Networks. In Design, Automation & Test in Europe
(DATE), 2013.
[425] Pengyuan Yu and Patrick Schaumont. Secure FPGA circuits using controlled place-
ment and routing. In Proceedings of International Conference on Hardware Software
Codesign (CODES+ISSS), pages 45–50. ACM, 2007.
[426] L. Yuan and G. Qu. Information hiding in finite state machine. In IH’04: Proceedings
of the International Conference on Information Hiding, IH’04, pages 340–354, 2004.
[427] Erik Zenner. Cache Timing Analysis of HC-256. In 15th Annual International Work-
shop, SAC 2008, 2008.
[428] XinJie Zhao and Tao Wang. Improved Cache Trace Attack on AES and CLEFIA by
Considering Cache Miss and S-box Misalignment. Cryptology ePrint Archive, Report
2010/056, 2010.
[429] Zhijie Jerry Shi and Xiao Yang and Ruby B. Lee. Alternative Application-Specific
Processor Architectures for Fast Arbitrary Bit Permutations. IJES, 3(4):219–228,
2008.
[430] X. Zhuang, T. Zhang, H.S. Lee, and S. Pande. Hardware assisted control flow obfus-
cation for embedded processors. In CASES ’04: Proceedings of the 2004 International
Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages
292–302, 2004.
[431] Xiaotong Zhuang, Tao Zhang, Hsien-Hsin S. Lee, and Santosh Pande. Hardware
Assisted Control Flow Obfuscation for Embedded Processors. In Mary Jane Irwin,
Wei Zhao, Luciano Lavagno, and Scott A. Mahlke, editors, CASES, pages 292–302.
ACM, 2004.
[432] Xiaotong Zhuang, Tao Zhang, and Santosh Pande. HIDE: an Infrastructure for Ef-
ficiently Protecting Information Leakage on the Address Bus. In Shubu Mukherjee
and Kathryn S. McKinley, editors, ASPLOS, pages 72–84. ACM, 2004.
A

Abelian group, 7
Access-driven cache attacks, 194, 275–280
    access-driven attacks on block ciphers, 276–277
    asynchronous access-driven attacks, 278–279
    Completely Fair Scheduler, 279
    context-switching, 279
    evict+time, 276
    fine-grained access-driven attacks on AES, 279–280
    last round access driven attack on AES, 278
    non-elimination method, 278
    pre-emptive OS scheduling, 279
    prime+probe, 276
    second round access-driven attack on AES, 277–278
    ShiftRows operation, 278
    signal to noise ratio, 277
    slices, 279
    spy process, 279
    symmetrical multithreading, 279
    wide-collisions, 277
Acoustic PUF, 479
Additive identity, 4
Advanced Encryption Standard (AES), 36, 351
    block ciphers and, 36
    differential fault attacks on, 211–214
    multi-level attacks, 386
    obfuscation and, 351
    round transformations, 40–43
    scan chain based attacks and, 198
    use of Rijndael as, 269
Advanced Encryption Standard (AES), hardware design of, 85–113
    algorithmic optimizations, 86
    architectural optimizations, 86
    circuit for AES S-Box, 86–98
        affine transformation matrix, 94
        AND gates, 86
        circuit optimizations in normal basis, 91–94
        circuit optimizations in polynomial basis, 87–91
        CMOS cell libraries, 98
        common subexpressions, 96–97
        decomposition, 87
        field isomorphism, 90
        matrix multiplication, 97
        merged S-Box and inverse S-Box, 97–98
        merging of basis change matrix and affine mapping, 97
        most compact realization (polynomial vs. normal basis), 94–96
        NAND gate, 98
        scaling, 88
        subfield mappings, 87
        XOR operations, 86
    experimental results, 109–110
        CMOS technology, 109
        performances of compared cores, 110
        throughput in ASIC, 109
        throughput in FPGA, 109
    MixColumns transformation, implementation of, 98–101
        AES S-Box and MixColumns, 99–100
        implementing the inverse MixColumns, 100–101
        irreducible polynomial, 99
        MixColumns architecture, 100
        XORing, 98
    requirements, 86
    Rijndael cryptosystem, example reconfigurable design for, 101–108
        back to the design, 103–105
        Cipher Block Chain mode, 101
        clock frequency, improvement of, 102
        control unit, 107–109
        DataScheduler block, 101
        design overview, 101–105

    LUT delay, 168
    pipelining paths, 166
    point arithmetic on the ECCP, 157–160
        clock cycle, 158
        point addition, 159–160
        point doubling, 157–158
    point doubling, 154
    register bank, 154
    scheduling of Montgomery algorithm, 168–171
        base point, 168
        clock cycle requirement, 170
        consecutive key bits, 169, 170
        data forwarding to reduce clock cycles, 171
        scalar multiplication, 168
        scheduling for two consecutive bits of the scalar, 169–171
FPGAs (finite field arithmetic on), design of, 115–151
    analyzing Karatsuba multipliers on FPGA platforms, 120–125
        comparison of LUT utilization in multipliers, 123
        general Karatsuba multiplier, 121, 122
        hybrid Karatsuba multiplier, 123–125
        output bits, 122
        recursive Karatsuba multiplier, 121
        XOR gates, 125
    area and delay estimations for 2n ITA, 141–147
        area and delay of x variable Boolean function, 141–142
        area estimate for entire ITA architecture, 147
        area and LUT delay estimates for field multiplier, 142–145
        area and LUT delay estimates of multiplexer, 146
        area and LUT delay estimates for powerblock, 146–147
        area and LUT delay estimates for reduction circuit, 145–146
        area and LUT delay estimates of 2n circuit, 146
        Boolean function, 141
        Combine Multiplication Outputs block, 144
        Karatsuba algorithm, 143
        LUT delay estimate for entire ITA architecture, 147
        number of inputs, 146
        overlapped operands, 144
        threshold multipliers, 144
    basic Karatsuba multiplier, 118
    crypto system implementation, 116
    designing for FPGA architecture, 119–120
        combinational circuit, 120
        compact implementation, 120
        under-utilized LUT, 120
    experimental results, 137–138
        distributed RAM, 138
        graph, 137, 138
        performance, 137
        quad circuits, 137
        slices, 137
    efficient finite field multiplier, 116–117
        elements, 116
        elliptic curve crypto processor, 116
        polynomial product, 117
    finite field multipliers for high performance applications, 117–118
        Karatsuba multiplier, 117
        Massey-Omura multiplier, 117
        Montgomery multiplier, 117
        school book method, 117
        XOR gates, 118
    generalization of ITA for 2n circuit, 138–139
    hardware architecture for 2n circuit based ITA, 140–141
        clock cycles, 140, 141
        Control block, 141
        Finite State Machine, 141
        Powerblock, 140
        register bank, 141
    high-performance finite field inversion architecture for FPGAs, 126–127
        binary Euclidean algorithm, 126
        extended Euclidean algorithm, 126
        Itoh-Tsujii inversion algorithm, 126
        Montgomery inversion algorithm, 126
        NIST Digital Signature Standard, 127
    Itoh-Tsujii inversion algorithm, 127–128
        addition chain, 128
        Brauer chains, 127
        example, 127
    Karatsuba multiplication, 118

    value change dump, 186
    scan chain based attacks, 196–199
        Advanced Encryption Standard, 198
        avalanche criteria, 198
        branch number, 197
        ciphertext, 197
        Design for Testability, 196
        plaintext, 197
        S-boxes, 197
        side channel attacks, 197
        working principle, 196–199
Side channel attacks, 4
Signal to noise ratio (SNR), 277, 315
Silicon PUFs, 478, 481–486
Simple Power Analysis (SPA), 185–186
SNR, see Signal to noise ratio
SoCs, see System-on-chips
Soft macros, 466
Software obfuscation, 352
SPA, see Simple Power Analysis
Spaghetti code, 352
SPNs, see Security Probe Networks
Spy process
    asynchronous attack, 279
    cache attacks, 194
Squaring operation, 25
SRAM PUF, 484–485
State Transition Graph (STG), 351, 419, 454
Steady-stream state, 361
STG, see State Transition Graph
Substitution Permutation Networks, 248
Support Vector Machine (SVM), 492
Survival of the fittest, 496
SVM, see Support Vector Machine
Switch block, 66
Symmetric key algorithms, 30
SYNC signal, 437
Synopsys Design Compiler tools, 109
System-on-chips (SoCs), 466

T

TCP/IP sockets, 281
Testability of cryptographic hardware, 333–346
    controllability, 333
    flip-flops, 33
    scan chain based attacks, 334–336
        clock event, 334
        Configurable Register, 335
        scan chain based attack on stream ciphers, 335–336
        scan enable signal, 334
        seed, 335
        Shift Register, 335
    testability of cryptographic designs, 341–345
        Cipher Block Chaining, 342
        finite state machine, 342
        mirror key register, 341
        reset attack on flipped-scan mechanism, 342–345
        security, 342
        Test Security Controller, 342
        theorem, 345
        VLSI design, 342
    Trivium, scan attack on, 336–340
        ascertaining bit correspondence, 338–339
        attack on Trivium, 337–338
        deciphering the cryptogram, 339–340
        description of Trivium, 336–337
        Initialization Vector, 336
        key and IV setup, 336–337
        keystream generation, 337
        objective of attacker, 337
Test Security Controller, 342
TetraMax, 464, 465
Time-driven cache attacks, 194, 280–286
    analysis of Bernstein's attack, 285
    attack phase, 283
    building a timing profile, 284
    collision, 282, 283
    conflict misses, 285
    distinguishing between a hit and miss with time, 281–282
    ExecutionTime, 281
    extracting keys from timing profiles, 284
    Jitter, 280
    loaded cache variant, 285
    micro-architectural effects, 285
    OpenSSL implementation, 282
    profiled time-driven attack on AES, 283–286
    profiling phase, 283
    remote timing attacks, 280–281
    round-trip time, 280
    scan chain based attacks, 334–336
    second round time-driven attack on AES, 282–283
    TCP/IP sockets, 281
U
Universal Hash Function, 490
V
Value change dump (VCD), 186
VCD, see Value change dump
Velocity saturation index, 426
Verilog code, 364
Virtual time-stamp counters (VTSC), 291
VLSI technology, 31, 342
von Neumann bottleneck, 266