CHAPTER 3
RECONFIGURABLE HIGH PERFORMANCE FAULT DETECTABLE AES BY USING COMPOSITE FIELDS
This chapter presents the fundamental theory of the Advanced Encryption
Standard (AES) algorithm and its specification based on a 128-bit
architecture. It begins with an introduction to the AES algorithm and briefly
compares the specifications of the different AES architectures. It then
discusses the background mathematical concepts associated with the
specification, followed by a detailed description of all functions and
transformations in the AES algorithm.
3.1 INTRODUCTION TO THE ADVANCED ENCRYPTION STANDARD (AES) ALGORITHM
The AES algorithm, adopted by the United States government in 2001, is a
block cipher that transforms 128-bit data blocks under a 128-bit, 192-bit or
256-bit secret key by means of permutation and substitution.
In January 1997, the USA National Institute of Standards and Technology
(NIST) announced the initiation of an effort to develop the AES and made a
formal call for algorithms on September 12, 1997. After a review of the
results of this preliminary research, the algorithms MARS, RC6, Rijndael,
Serpent and Twofish were selected as finalists. On 2nd October 2000, NIST
announced its decision to propose Rijndael as the new Advanced Encryption
Standard (AES). It is expected to replace DES and Triple DES so as to fulfill
stricter data security requirements because of its enhanced security levels.
Rijndael was a refinement of an earlier design by Daemen and Rijmen. Unlike
its predecessor DES, Rijndael is a substitution-permutation network, not a
Feistel network. AES is fast in both software and hardware, is relatively easy
to implement, and requires little memory.
In 2001, AES replaced the aging DES as a Federal Information Processing
Standard (FIPS). DES is seen as reaching the end of its life, as cracking its
cipher has become more tractable on current computer hardware. The AES
algorithm will be used for many applications within the government and in the
private sector. Breaking an AES-encrypted cipher text by trying all possible
keys is currently computationally infeasible.
The AES specifies the Rijndael algorithm, which is a symmetric block cipher
that processes fixed 128-bit data blocks using cipher keys with lengths of
128, 192 and 256 bits. The original Rijndael algorithm had the option of
combining data block sizes of 128, 192 or 256 bits with any of the key
lengths. However, because of the difficulty of verifying that all possible
combinations were secure against cryptographic attacks, only the 128-bit block
size and the 128, 192 and 256-bit keys were recognized in the AES standard
(NIST, 2002). From this section onwards, all discussion is focused on the
256-bit AES algorithm and its implementation.
3.2 DIFFERENT AES ARCHITECTURES
There are only three FIPS-recognized AES architectures: AES128, AES192 and
AES256. They differ in the length of the secret key, which is 128, 192 and
256 bits respectively. In general, a longer secret key requires more
iterations to generate the round keys, as well as to encrypt or decrypt the
input data.
Most of the AES128 design blocks can be re-used directly in AES192 and
AES256, with minor changes in the Key Expansion unit, which needs additional
state blocks to store the longer secret key.
3.3 ALGORITHM NOTATIONS AND CONVENTIONS
In this section, all the notations, symbols and parameters used are based on
the conventions of the NIST FIPS-197 AES Standard. The key parameters used
in NIST FIPS-197 are:
Nb - the length of the Cipher Text / Plain Text block, in 32-bit words
Nk - the length of the Cipher Key, in 32-bit words
Nr - the number of rounds of text transformation
3.4 COMPARISON OF DIFFERENT AES SPECIFICATIONS
The AES specifications differ in the length of the secret key, and the key
transformation process also differs between AES128/192 and AES256. In NIST
FIPS-197, the only Key-Block-Round combinations that conform to the standard
are given in Table 3.1 below.
Table 3.1 AES specifications
            Key length    Block size    Number of
            (Nk words)    (Nb words)    rounds (Nr)
AES-128         4             4             10
AES-192         6             4             12
AES-256         8             4             14
3.5 BACKGROUND MATHEMATICS
This section provides a brief introduction to the fundamental mathematical
concepts of finite fields needed to understand the algorithm; most of the
information in this section is drawn from FIPS-197. For an in-depth
discussion of the subject, one should refer to (Joan Daemen & Vincent
Rijmen 1999) and (FIPS-197 2001). Several operations in AES are defined at
byte level, with bytes representing elements in the finite field $GF(2^8)$.
Other operations are defined in terms of 4-byte words. This section
introduces the basic mathematical concepts needed in the following sections.
3.5.1 The Field $GF(2^8)$
The elements of a finite field can be represented in several different ways.
For any prime power there is a single finite field, hence all representations
of the Galois Field $GF(2^8)$ are isomorphic. Despite this equivalence, the
representation has an impact on the implementation complexity. Joan Daemen &
Vincent Rijmen (1999) chose the classical polynomial representation.
A byte b, consisting of bits $b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0$, is considered
as a polynomial with coefficients in {0,1}:
$b_7x^7 + b_6x^6 + b_5x^5 + b_4x^4 + b_3x^3 + b_2x^2 + b_1x + b_0$ (3.1)
Example 3.1: The byte with hexadecimal value ‘57’ (binary 01010111)
corresponds to the polynomial
$x^6 + x^4 + x^2 + x + 1$ (3.2)
3.5.2 Finite Field Addition
The addition of two finite field elements is achieved by adding the
coefficients for corresponding powers of their polynomial representations,
this addition being performed in $GF(2)$, that is, modulo 2, so that 1 + 1 = 0.
Consequently, addition and subtraction are both equivalent to an exclusive-or
(XOR) operation on the bytes that represent field elements. Addition
operations for finite field elements will be denoted by the symbol ⊕.
Example 3.2: Steps to get the result {57} ⊕ {83} = {D4}
(Polynomial notation) $(x^6 + x^4 + x^2 + x + 1) + (x^7 + x + 1) = x^7 + x^6 + x^4 + x^2$
(Binary notation) {01010111} ⊕ {10000011} = {11010100}
(Hexadecimal notation) {57} ⊕ {83} = {D4}
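The addition rule above can be checked in a few lines of Python. This is only an illustrative sketch; the function name `gf_add` is a hypothetical helper and not part of the design described in this chapter.

```python
def gf_add(a: int, b: int) -> int:
    """Add two GF(2^8) elements held as bytes: coefficient-wise addition
    modulo 2 is the same as a bitwise XOR of the two bytes."""
    return a ^ b


# Example 3.2: x^6+x^4+x^2+x+1 plus x^7+x+1
total = gf_add(0x57, 0x83)
print(f"{total:02X}")  # D4

# Subtraction is the same operation, so adding {83} again restores {57}.
print(f"{gf_add(total, 0x83):02X}")  # 57
```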
3.5.3 Finite Field Multiplication
Finite field multiplication is more difficult than addition and is achieved
by multiplying the polynomials for the two elements concerned and collecting
like powers of x in the result. Since each polynomial can have powers of x up
to 7, the result can have powers of x up to 14 and will no longer fit within
a single byte. This situation is handled by replacing the result with the
remainder polynomial after division by a special irreducible polynomial of
degree eight, which for AES is
$m(x) = x^8 + x^4 + x^3 + x + 1$ (3.3)
Since this polynomial has powers of x up to 8, it cannot be represented by a
single byte; its eight low-order coefficients are written as either
{00011011} or {1B}, with the $x^8$ term implied.
Example 3.3
This process is illustrated in the following example product
{57}·{83} = {C1} (where · is used to represent finite field multiplication):
$(x^6 + x^4 + x^2 + x + 1) \cdot (x^7 + x + 1)$
$(x^6 + x^4 + x^2 + x + 1) \cdot x^7 = x^{13} + x^{11} + x^9 + x^8 + x^7$
$(x^6 + x^4 + x^2 + x + 1) \cdot x = x^7 + x^5 + x^3 + x^2 + x$
$(x^6 + x^4 + x^2 + x + 1) \cdot 1 = x^6 + x^4 + x^2 + x + 1$
Sum: $x^{13} + x^{11} + x^9 + x^8 + x^6 + x^5 + x^4 + x^3 + 1$
This intermediate result is now divided by m(x) above:
$x^{13} + x^{11} + x^9 + x^8 + x^6 + x^5 + x^4 + x^3 + 1$
$(x^8 + x^4 + x^3 + x + 1) \cdot x^5 = x^{13} + x^9 + x^8 + x^6 + x^5$
Subtract to give the intermediate remainder $x^{11} + x^4 + x^3 + 1$
$(x^8 + x^4 + x^3 + x + 1) \cdot x^3 = x^{11} + x^7 + x^6 + x^4 + x^3$
Subtract to give the final remainder $x^7 + x^6 + 1$
The final result is $x^7 + x^6 + 1$ = {C1}
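The multiply-then-reduce procedure of Example 3.3 can be sketched in Python as follows. The names `AES_MODULUS` and `gf_mul` are assumptions made for illustration; the sketch follows the worked example step by step rather than any particular hardware structure.

```python
AES_MODULUS = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1, including the x^8 term


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements: polynomial product, then reduce by m(x)."""
    # Step 1: carry-less (XOR) polynomial multiplication, degree up to 14.
    product = 0
    for bit in range(8):
        if (b >> bit) & 1:
            product ^= a << bit
    # Step 2: long division by m(x), cancelling the highest power first.
    for degree in range(14, 7, -1):
        if (product >> degree) & 1:
            product ^= AES_MODULUS << (degree - 8)
    return product


print(f"{gf_mul(0x57, 0x83):02X}")  # C1
```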
3.5.4 Multiplicative Inverse
In mathematics, the multiplicative inverse of a number a is the number x
which, when multiplied by a, yields 1:
(a · x) = 1
It is denoted by 1/a or $a^{-1}$. In modular arithmetic, the multiplicative
inverse of a modulo n is defined as the number x such that
(a · x) mod n = 1
However, this multiplicative inverse exists only if ‘a’ and ‘n’ are
relatively prime.
Example 3.4
The multiplicative inverse of 3 modulo 11 is 4 because 4 is the solution to
(3 · x) mod 11 = 1. In hexadecimal notation, ({03} · {04}) mod {0B} = 1.
Computing the multiplicative inverse of every 8-bit value yields a set of 256
different byte values. The multiplicative inverse is used later in the
SubByte and InvSubByte transformations.
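As a hedged illustration, the sketch below computes both kinds of inverse: the integer inverse of Example 3.4 using Python's built-in three-argument `pow` (Python 3.8 or later), and the inverse of a byte in $GF(2^8)$ by raising it to the power 254, which works because every non-zero element satisfies $a^{255} = 1$. The helper names `gf_mul` and `gf_inv` are assumptions, and the exponentiation approach is chosen for brevity, not because the chapter prescribes it.

```python
def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in GF(2^8) modulo x^8+x^4+x^3+x+1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:      # degree reached 8: subtract (XOR) the modulus
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8); {00} is mapped to itself."""
    result = 1
    for _ in range(254):   # a^254 = a^-1 for every non-zero a
        result = gf_mul(result, a)
    return result if a else 0


print(pow(3, -1, 11))            # 4, the integer case of Example 3.4
print(f"{gf_inv(0x53):02X}")     # CA
```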
3.5.5 Polynomials with Coefficients in $GF(2^8)$
Four-term polynomials can be defined with coefficients that are finite field
elements as:
$a(x) = a_3x^3 + a_2x^2 + a_1x + a_0$ (3.4)
where the four coefficients, each represented by a byte, will be denoted as a
32-bit word in the form $[a_3, a_2, a_1, a_0]$. With a second polynomial:
$b(x) = b_3x^3 + b_2x^2 + b_1x + b_0$ (3.5)
addition can be performed by adding the finite field coefficients of like
powers of x, which corresponds to an XOR operation between the corresponding
bytes in each of the words or an XOR of the complete 32-bit word values (note
that the variable x here is different to that used in the definition of
individual finite field elements). Multiplication is achieved by
algebraically expanding the polynomial product and collecting like powers of
x to give:
$c(x) = c_6x^6 + c_5x^5 + c_4x^4 + c_3x^3 + c_2x^2 + c_1x + c_0$ (3.6)
where:
$c_0 = a_0 \cdot b_0$
$c_1 = a_1 \cdot b_0 \oplus a_0 \cdot b_1$
$c_2 = a_2 \cdot b_0 \oplus a_1 \cdot b_1 \oplus a_0 \cdot b_2$
$c_3 = a_3 \cdot b_0 \oplus a_2 \cdot b_1 \oplus a_1 \cdot b_2 \oplus a_0 \cdot b_3$
$c_4 = a_3 \cdot b_1 \oplus a_2 \cdot b_2 \oplus a_1 \cdot b_3$
$c_5 = a_3 \cdot b_2 \oplus a_2 \cdot b_3$
$c_6 = a_3 \cdot b_3$
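with · and ⊕ representing finite field multiplication and addition (XOR)
respectively. This result requires seven bytes to represent its coefficients,
but it can be reduced modulo a degree 4 polynomial to produce a result that is
of degree less than 4. In Rijndael the polynomial used is $(x^4 + 1)$ and
reduction produces the following polynomial coefficients:
$d_3 = a_3 \cdot b_0 \oplus a_2 \cdot b_1 \oplus a_1 \cdot b_2 \oplus a_0 \cdot b_3$ (3.7)
$d_2 = a_2 \cdot b_0 \oplus a_1 \cdot b_1 \oplus a_0 \cdot b_2 \oplus a_3 \cdot b_3$ (3.8)
$d_1 = a_1 \cdot b_0 \oplus a_0 \cdot b_1 \oplus a_3 \cdot b_2 \oplus a_2 \cdot b_3$ (3.9)
$d_0 = a_0 \cdot b_0 \oplus a_3 \cdot b_1 \oplus a_2 \cdot b_2 \oplus a_1 \cdot b_3$ (3.10)
If one of the polynomials is fixed, this can conveniently be written in
matrix form as:
$\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \end{bmatrix} = \begin{bmatrix} a_0 & a_3 & a_2 & a_1 \\ a_1 & a_0 & a_3 & a_2 \\ a_2 & a_1 & a_0 & a_3 \\ a_3 & a_2 & a_1 & a_0 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}$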
Because $(x^4 + 1)$ is not an irreducible polynomial, not all polynomial
multiplications are invertible. For Rijndael, however, a polynomial that has
an inverse has been chosen:
$a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$ (3.11)
$a^{-1}(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\}$ (3.12)
This transformation is used in MixColumn and InvMixColumn.
Another polynomial that Rijndael uses has $a_0 = a_2 = a_3 = \{00\}$ and
$a_1 = \{01\}$, which is the polynomial x. Inspection of Equations (3.7) to
(3.10) shows that its effect is to form the output word by rotating the bytes
in the input word so that $[b_3, b_2, b_1, b_0]$ is transformed into
$[b_2, b_1, b_0, b_3]$, with bytes moving to higher index positions and the
top byte wrapping round to the lowest position. Higher powers of x correspond
to the other cyclic permutations of the four bytes within a 32-bit word. The
ROTATE function that is used in the key expander corresponds to $x^3$.
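The reduction modulo $(x^4 + 1)$ amounts to folding the power $x^{i+j}$ back onto $x^{(i+j) \bmod 4}$, which the following Python sketch uses directly. Words are held as lists `[x0, x1, x2, x3]` (lowest coefficient first); this ordering and the names `gf_mul` and `word_mul` are assumptions made for the illustration.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiplication in GF(2^8) modulo x^8+x^4+x^3+x+1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def word_mul(a, b):
    """Multiply two four-term polynomials modulo x^4 + 1.

    a and b are coefficient lists [x0, x1, x2, x3]; since x^4 = 1, the term
    a[i]*b[j] contributes to coefficient (i + j) mod 4, as in (3.7)-(3.10).
    """
    d = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            d[(i + j) % 4] ^= gf_mul(a[i], b[j])
    return d


A = [0x02, 0x01, 0x01, 0x03]        # a(x) of Equation (3.11)
A_INV = [0x0E, 0x09, 0x0D, 0x0B]    # a^-1(x) of Equation (3.12)

print(word_mul(A, A_INV))                                  # [1, 0, 0, 0]
print(word_mul([0, 1, 0, 0], [0xA0, 0xA1, 0xA2, 0xA3]))    # multiply by x
print(word_mul([0, 0, 0, 1], [0xA0, 0xA1, 0xA2, 0xA3]))    # multiply by x^3
```

Multiplying by x moves each byte to the next higher index, and multiplying by $x^3$ gives the rotation [b1, b2, b3, b0] used by the key expander.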
3.6 ADVANCED ENCRYPTION STANDARD (AES) MAIN MODULES
There are two main modules in the AES algorithm:
i. Key Expansion
ii. Cipher / Inverse Cipher
This research implements both modules in the same hardware.
3.6.1 State Array and Cipher Key State Array
Before the AES algorithm is explained, one needs a thorough understanding of
the internal arrays that AES uses to describe all of its functions (Joan
Daemen & Vincent Rijmen 1999). The AES internal operations are performed on
arrays of bytes, which are used in the Cipher (i.e. text transformation) and
the Key Expansion (i.e. round key scheduler).
In the Cipher, the array is called the State (denoted as S), which consists
of 4 rows of bytes; each row has 4 bytes and each byte consists of 8 bits,
thus the total size of S is 4 rows x 4 bytes x 8 bits = 128 bits. Each
individual byte has two indices: row number r in the range $0 \le r < 4$ and
column number c in the range $0 \le c < 4$, hence allowing it to be referred
to as $S_{r,c}$. All transformations in the cipher are made on this State
array.
In the key scheduler, the array used in data processing is called the Key
State (denoted as W). The Key State is a one-dimensional array of 8 rows of
32-bit words, indexed by r in the range $0 \le r < 8$, hence allowing each
word to be referred to as $W_r$. One Key State array with eight W words is
named the round key K.
Figure 3.1 (a) Initial input bytes (b) State Array and (c) Key State Array
As examples in explaining the AES algorithm, the values for the initial input
key and input data are chosen as follows (test patterns from FIPS-197):
Plain text : 32 43 f6 a8 88 5a 30 8d 31 31 98 a2 e0 37 07 34
Cipher key : 60 3d eb 10 15 ca 71 be 2b 73 ae f0 85 7d 77 81
Figure 3.2 Example of State Array and Key State Array
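The sketch below shows one way to load the example plain text into the 4 x 4 State. It assumes the column-by-column byte ordering of FIPS-197 (input byte n goes to row n mod 4, column n div 4); the function names are hypothetical.

```python
def bytes_to_state(block: bytes):
    """Return the State as 4 rows of 4 bytes, with state[r][c] = block[r + 4*c]."""
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def state_to_bytes(state) -> bytes:
    """Read the State back out column by column."""
    return bytes(state[r][c] for c in range(4) for r in range(4))


plain = bytes.fromhex("3243f6a8885a308d313198a2e0370734")
state = bytes_to_state(plain)
for row in state:
    print(" ".join(f"{value:02x}" for value in row))
```

With this ordering, each group of four consecutive input bytes forms one column of the State, so the first printed row is 32 88 31 e0.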
3.6.2 AES Key Expansion
The AES algorithm takes the Cipher Key, K, and performs a Key Expansion
routine to generate a key schedule. The Key Expansion generates a total of
Nb(Nr + 1) words: the algorithm requires an initial set of Nb words, and each
of the Nr rounds requires Nb words of key data. The resulting key schedule
consists of a linear array of 4-byte words, denoted $[w_i]$, with i in the
range $0 \le i < Nb(Nr + 1)$.
The expansion of the input key into the key schedule proceeds according to
the pseudo code in Figure 3.3. SubWord is a function that takes a four-byte
input word and applies the S-box to each of the four bytes to produce an
output word. The function RotWord takes a word $[a_0, a_1, a_2, a_3]$ as
input, performs a cyclic permutation, and returns the word
$[a_1, a_2, a_3, a_0]$. The round constant word array, Rcon[i], contains the
values given by $[x^{i-1}, \{00\}, \{00\}, \{00\}]$, with $x^{i-1}$ being
powers of x (x is denoted as {02}) in the field $GF(2^8)$ (note that i starts
at 1, not 0). It can be seen that the first Nk words of the expanded key are
filled with the Cipher Key. Every following word, w[i], is equal to the XOR
of the previous word, w[i-1], and the word Nk positions earlier, w[i-Nk]. For
words in positions that are a multiple of Nk, a transformation is applied to
w[i-1] prior to the XOR, followed by an XOR with a round constant, Rcon[i/Nk].
This transformation consists of a cyclic shift of the bytes in a word
(RotWord), followed by the application of a table lookup to all four bytes of
the word (SubWord). It is important to note that the Key Expansion routine
for 256-bit Cipher Keys (Nk = 8) is slightly different from that for 128- and
192-bit Cipher Keys. If Nk = 8 and i-4 is a multiple of Nk, then SubWord is
applied to w[i-1] prior to the XOR.
Figure 3.3 Pseudo Code for Key Expansion
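A minimal Python sketch of the key expansion described above is given here for illustration; it is not the Verilog design of this work. The S-box is generated from the inverse and affine steps rather than typed in as a table, the relation Nr = Nk + 6 is taken from Table 3.1 for Nb = 4, and all function names are assumptions.

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo x^8+x^4+x^3+x+1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """S-box = affine transformation of the multiplicative inverse."""
    sbox = []
    for value in range(256):
        inv = 0
        if value:
            inv = next(c for c in range(1, 256) if gf_mul(value, c) == 1)
        out = 0
        for i in range(8):
            bit = ((inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8))
                   ^ (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)) ^ (0x63 >> i))
            out |= (bit & 1) << i
        sbox.append(out)
    return sbox


SBOX = build_sbox()


def sub_word(word):
    return [SBOX[b] for b in word]


def rot_word(word):
    return word[1:] + word[:1]          # [a0,a1,a2,a3] -> [a1,a2,a3,a0]


def key_expansion(key: bytes):
    """Return the key schedule as a list of 4-byte words (Nb = 4)."""
    nk = len(key) // 4                   # 4, 6 or 8
    nr = nk + 6                          # 10, 12 or 14 rounds
    words = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 0x01                          # x^0; multiplied by {02} each time
    for i in range(nk, 4 * (nr + 1)):
        temp = words[i - 1]
        if i % nk == 0:
            temp = sub_word(rot_word(temp))
            temp = [temp[0] ^ rcon] + temp[1:]
            rcon = gf_mul(rcon, 0x02)
        elif nk > 6 and i % nk == 4:     # extra SubWord for 256-bit keys
            temp = sub_word(temp)
        words.append([a ^ b for a, b in zip(words[i - nk], temp)])
    return words


schedule = key_expansion(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(len(schedule), bytes(schedule[4]).hex())
```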

Figure 3.4 Round key expansion
3.6.2.1 ROT transformation
ROT rotates each byte in a word one position to the left. Suppose a word
consists of four bytes $\{a_0, a_1, a_2, a_3\}$; after the ROT
transformation, the new word is $\{a_1, a_2, a_3, a_0\}$. For example, in
hexadecimal, if the word is [{EA}, {31}, {D4}, {F0}], ROT returns the word
[{31}, {D4}, {F0}, {EA}]. Figure 3.5 illustrates the ROT process.
Figure 3.5 ROT Transformation
3.6.2.2 SubByte transformation (invSubByte)
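The SubByte transformation is a non-linear byte substitution applied to every
byte of the State; it is also called the SBOX.
The SBOX is a common function used in both the Key Expansion and the Cipher.
In the Key Expansion, only the encryption mode is used, even when generating
round keys for the decryption transformation.
The SBOX is constructed by composing two transformations:
1. First, the multiplicative inverse in the finite field described earlier in
Section 3.5.4 is taken (with element {00} mapped to itself).
2. Second, the affine transformation over $GF(2)$ is applied, defined by:
$b'_i = b_i \oplus b_{(i+4) \bmod 8} \oplus b_{(i+5) \bmod 8} \oplus b_{(i+6) \bmod 8} \oplus b_{(i+7) \bmod 8} \oplus c_i$ (3.13)
for $0 \le i < 8$, where $b_i$ is bit i of the byte and $c_i$ is bit i of a
byte c with value {63} or {01100011}. This transformation can be expressed in
matrix form as:
$\begin{bmatrix} b'_0 \\ b'_1 \\ b'_2 \\ b'_3 \\ b'_4 \\ b'_5 \\ b'_6 \\ b'_7 \end{bmatrix} = \begin{bmatrix} 1&0&0&0&1&1&1&1 \\ 1&1&0&0&0&1&1&1 \\ 1&1&1&0&0&0&1&1 \\ 1&1&1&1&0&0&0&1 \\ 1&1&1&1&1&0&0&0 \\ 0&1&1&1&1&1&0&0 \\ 0&0&1&1&1&1&1&0 \\ 0&0&0&1&1&1&1&1 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix} \oplus \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$
For invSubByte, the affine transformation of Step 2 is replaced with its
inverse:
$b'_i = b_{(i+2) \bmod 8} \oplus b_{(i+5) \bmod 8} \oplus b_{(i+7) \bmod 8} \oplus d_i$ (3.14)
where $d_i$ is bit i of a byte d with value {05}.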
Figure 3.6 illustrates the effect of the SubBytes transformation on the
State.
Figure 3.6 SubByte and InvSubByte Transformation
LUT Method
The S-box used in the SubBytes transformation is presented in hexadecimal
form in Figure 3.7. If $S_{1,1}$ = {53}, then the substitution value would be
determined by the intersection of the row with index ‘5’ and the column with
index ‘3’ in Figure 3.7. This would result in $S'_{1,1}$ having a value of
{ed}.
Figure 3.7 S-box: Substitution values for the byte xy (in hexadecimal
format)
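The lookup table of Figure 3.7 can be regenerated from the two construction steps, which gives a convenient cross-check of the {53} to {ed} example. The following Python sketch is illustrative only; the helper names are assumptions, and the inverse table is obtained here simply by inverting the forward table rather than by Equation (3.14).

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo x^8+x^4+x^3+x+1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Brute-force multiplicative inverse; {00} maps to itself."""
    for candidate in range(1, 256):
        if gf_mul(a, candidate) == 1:
            return candidate
    return 0


def affine(b):
    """Affine transformation of Equation (3.13) with c = {63}."""
    out = 0
    for i in range(8):
        bit = ((b >> i) ^ (b >> ((i + 4) % 8)) ^ (b >> ((i + 5) % 8))
               ^ (b >> ((i + 6) % 8)) ^ (b >> ((i + 7) % 8)) ^ (0x63 >> i))
        out |= (bit & 1) << i
    return out


SBOX = [affine(gf_inv(value)) for value in range(256)]
INV_SBOX = [0] * 256
for value, substituted in enumerate(SBOX):
    INV_SBOX[substituted] = value


def sub_byte(value):
    """LUT access: high nibble selects the row, low nibble the column."""
    row, column = value >> 4, value & 0x0F
    return SBOX[16 * row + column]


print(f"{sub_byte(0x53):02x}")   # ed
```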
Combinational Logic Method
In the LUT-based approach, the delay of the lookup tables cannot be broken up
and is greater than that of the other logic. With the LUT method, it is
difficult to use a sub-pipelined structure with two pipeline stages, which
prevents further speedup. An alternative method is to use combinational
logic, which is faster than the LUT and can also be divided into two pipeline
stages, allowing further speedup.
Figure 3.8 Subbyte / inversesubbyte implementation

In the non-LUT method, SubBytes can be implemented by finding the
multiplicative inverse followed by the affine transform. Similarly, inverse
SubBytes is implemented by the inverse affine transform followed by the
multiplicative inverse. The multiplicative inverse is common to both; by
taking advantage of this, a single structure can be implemented for both
SubBytes and inverse SubBytes, as shown in Figure 3.8. The hardware
implementation (Xinmiao Zhang 2004) of SubBytes is shown in Figure 3.9.
Figure 3.9 Hardware implementation of subbytes
The multiplicative inversion in $GF(2^8)$ involved in SubBytes/InvSubBytes is
a hardware-demanding operation; it takes at least 620 gates to implement by
repeated multiplications in $GF(2^8)$. However, the gate count can be greatly
reduced by using composite field arithmetic. In the SubBytes transformation,
using substructure sharing, the isomorphic mapping function can be
implemented by 12 XOR gates with 4 XOR gates in the critical path. Meanwhile,
the combined inverse isomorphic mapping and affine transformation can be
implemented by 19 XOR gates, and the critical path also consists of 4 XOR
gates. In the composite field $GF((2^4)^2)$, an element can be expressed as
$s_h x + s_l$, where $s_h, s_l \in GF(2^4)$ and x is a root of $P_2(x)$.
Using the extended Euclidean algorithm, the multiplicative inverse of
$s_h x + s_l$ modulo $P_2(x)$ can be computed as
$(s_h x + s_l)^{-1} = s_h \Theta x + (s_h + s_l)\Theta$ (3.15)
where $\Theta = (s_h^2 \lambda + s_h s_l + s_l^2)^{-1}$ and $\lambda$ is the
constant term of $P_2(x) = x^2 + x + \lambda$. The multiplicative inversion
in $GF(2^8)$ can therefore be carried out in $GF((2^4)^2)$ by the
architecture illustrated in Figure 3.10.
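Equation (3.15) can be exercised numerically with a small Python model of the composite field. The field polynomial $x^4 + x + 1$ for $GF(2^4)$ and the value $\lambda = \{1001\}_2$ are assumptions chosen for this sketch (the chapter text does not state them), and the isomorphic mapping to and from $GF(2^8)$ is deliberately left out; the code only checks that the formula yields an inverse inside $GF((2^4)^2)$.

```python
GF16_MODULUS = 0b10011   # assumed field polynomial x^4 + x + 1 for GF(2^4)
LAMBDA = 0b1001          # assumed constant term of P2(x) = x^2 + x + lambda


def gf16_mul(a, b):
    """Multiplication in GF(2^4) under the assumed field polynomial."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x10:
            a ^= GF16_MODULUS
        b >>= 1
    return result


def gf16_inv(a):
    """Square-multiply inversion: a^-1 = a^14 = a^2 * a^4 * a^8."""
    a2 = gf16_mul(a, a)
    a4 = gf16_mul(a2, a2)
    a8 = gf16_mul(a4, a4)
    return gf16_mul(gf16_mul(a2, a4), a8)


def composite_mul(p, q):
    """(ph*x + pl)(qh*x + ql) reduced with x^2 = x + lambda."""
    (ph, pl), (qh, ql) = p, q
    hh = gf16_mul(ph, qh)
    high = hh ^ gf16_mul(ph, ql) ^ gf16_mul(pl, qh)
    low = gf16_mul(hh, LAMBDA) ^ gf16_mul(pl, ql)
    return high, low


def composite_inv(s):
    """Inverse of sh*x + sl following Equation (3.15)."""
    sh, sl = s
    theta = gf16_inv(gf16_mul(gf16_mul(sh, sh), LAMBDA)
                     ^ gf16_mul(sh, sl) ^ gf16_mul(sl, sl))
    return gf16_mul(sh, theta), gf16_mul(sh ^ sl, theta)


element = (0x3, 0xA)
print(composite_mul(element, composite_inv(element)))   # (0, 1), i.e. the element 1
```

The design point this illustrates is that the only true inversion left is the one in $GF(2^4)$; everything else is multiplication, squaring and XOR on 4-bit values.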
The multipliers in $GF(2^4)$ can be further decomposed into multipliers in
$GF(2^2)$ and then into $GF(2)$, in which a multiplication is simply an AND
operation.
Figure 3.10 Implementations of individual blocks: (a) multiplier in
$GF(2^4)$; (b) multiplier in $GF(2^2)$; (c) squarer in $GF(2^4)$;
(d) constant multiplier; and (e) constant multiplier
Figure 3.10 illustrates this decomposition, together with the other blocks
used in Figure 3.9 except the inversion in $GF(2^4)$ block. A multiplier in
$GF(2^4)$ can be implemented by 21 XOR gates and 9 AND gates, with 4 XOR
gates and 1 AND gate in the critical path (Satoh et al 2003). Table 3.2
summarizes the gate count and critical path of each block in the SubBytes
transformation in Figure 3.9, except the block of inversion in $GF(2^4)$.
Table 3.2 Gate counts and critical paths of functional blocks in the
SubBytes transformation
Figure 3.11 Implementations of inversion in $GF(2^4)$: (a) square-multiply
approach; (b) multiple decomposition approach
The inversion in $GF(2^4)$ can be implemented by different approaches. The
square-and-multiply approach is illustrated in Figure 3.11(a). Another
method, the multiple decomposition approach, is illustrated in
Figure 3.11(b).
3.7.1 AES Round Transformation (Cipher / InvCipher)
At the start of the encryption, the cipher input is copied into the internal
State array. An initial round key is then added and the State is transformed
by iterating a Round Transformation for a number of cycles. The number of
cycles varies with the key length and block size. There are 4 functions
involved in the Round Transformation:
i. AddRoundKey
ii. SubByte (invSubByte)
iii. ShiftRows (invShiftRows)
iv. MixColumn (invMixColumn)
The name in brackets is the reverse function used in decryption; the prefix
inv indicates the inverse function.
Figure 3.12 AES Round Transformation Algorithm
3.7.1.1 AddRoundKey transformation
In the AddRoundKey transformation, a round key is added to the State by a
bitwise Exclusive-OR (XOR) operation. Figure 3.13 below illustrates
AddRoundKey. This transformation is the same for both encryption and
decryption.
Figure 3.13 AddRoundKey Transformation
Example 3.6: AddRoundKey Transformation
Figure 3.14 Example of AddRoundkey Transformation
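As a small numerical illustration of AddRoundKey, the sketch below XORs a 16-byte block with a 16-byte round key. The data values are the FIPS-197 Appendix B input and cipher key, used here as an assumed example; they are not necessarily the values shown in Figure 3.14.

```python
def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """XOR every State byte with the matching round-key byte."""
    return bytes(s ^ k for s, k in zip(state, round_key))


block = bytes.fromhex("3243f6a8885a308d313198a2e0370734")
key0 = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")

mixed = add_round_key(block, key0)
print(mixed.hex())                        # 193de3bea0f4e22b9ac68d2ae9f84808
print(add_round_key(mixed, key0).hex())   # applying it again restores the block
```

Because XOR is its own inverse, the same function serves encryption and decryption.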

3.7.1.2 ShiftRows transformation (invShiftRows)
ShiftRows is a cyclic shift operation on each row of the State. In this
operation, the bytes in the first row of the State do not change. The second,
third and fourth rows shift cyclically to the left by one, two and three
bytes respectively, as illustrated in Figure 3.15. The reverse process,
invShiftRows, shifts the same rows cyclically to the right by the same
offsets.
Figure 3.15 ShiftRows and invShiftRows Transformation
Example 3.7: ShiftRows Transformation
Figure 3.16 Example of ShiftRows Transformation
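The row shifts can be written compactly on a flat 16-byte block. The sketch assumes the column-by-column storage of FIPS-197, in which the byte of row r and column c sits at index r + 4c; the function names are hypothetical.

```python
def shift_rows(block):
    """Row r moves left by r positions: new (r, c) comes from old (r, c + r)."""
    return [block[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def inv_shift_rows(block):
    """Row r moves right by r positions, undoing shift_rows."""
    return [block[r + 4 * ((c - r) % 4)] for c in range(4) for r in range(4)]


indices = list(range(16))            # each byte holds its own index
shifted = shift_rows(indices)
print(shifted)
print(inv_shift_rows(shifted) == indices)
```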

3.7.1.3 MixColumn transformation (invMixColumn)
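The MixColumn transformation is performed independently on the State, column
by column. Each column is considered as a four-term polynomial over
$GF(2^8)$ and multiplied by
$a(x)$ modulo $(x^4 + 1)$ (3.16)
where $a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$ (3.17)
This transformation can be expressed in matrix form as
$\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}$
For invMixColumn, replace a(x) with
$a^{-1}(x) = \{0B\}x^3 + \{0D\}x^2 + \{09\}x + \{0E\}$ (3.18)
Figure 3.17 MixColumn and invMixColumn Transformation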
Example 3.8: MixColumn Transformation
Figure 3.18 Example of MixColumn Transformation
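A short Python sketch of MixColumn and invMixColumn on a single column follows. The matrices are built as circulants from the coefficients of a(x) and its inverse; the names used and the test column (taken from the FIPS-197 Appendix B trace) are assumptions for illustration and may differ from the values in Figure 3.18.

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo x^8+x^4+x^3+x+1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def circulant(coeffs):
    """Matrix form of multiplication by a0 + a1*x + a2*x^2 + a3*x^3 mod x^4+1."""
    return [[coeffs[(r - c) % 4] for c in range(4)] for r in range(4)]


MIX = circulant([0x02, 0x01, 0x01, 0x03])       # a(x): {03}x^3+{01}x^2+{01}x+{02}
INV_MIX = circulant([0x0E, 0x09, 0x0D, 0x0B])   # inverse: {0B}x^3+{0D}x^2+{09}x+{0E}


def mix_column(column, matrix=MIX):
    """Multiply one State column by the given 4x4 matrix over GF(2^8)."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out


column = [0xD4, 0xBF, 0x5D, 0x30]
mixed = mix_column(column)
print([f"{b:02x}" for b in mixed])                       # ['04', '66', '81', 'e5']
print(mix_column(mixed, INV_MIX) == column)
```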
3.8 PROPOSED ARCHITECTURE
The basic VLSI architecture of the four transformations in AES is modified to
increase throughput and reliability. The overall throughput is increased
through pipelining registers placed at the end of each round, as shown in
Figure 3.20, and the speed of the architecture is further increased by
introducing sub-pipelining between the transformations within each round, as
shown in Figure 3.21. Loop unrolling has been adopted to reduce the power and
increase the speed.
Figure 3.19 (a) Existing and (b) proposed reconfigurable AES structure
3.9 PIPELINED AES
In computing, a pipeline is a set of data processing elements connected in
series, so that the output of one element is the input of the next one. The
elements of a pipeline are often executed in parallel or in time-sliced
fashion; in that case, some amount of buffer storage is often inserted
between elements.
Figure 3.20 AES encryption with pipelining (blocks shown: KEY, KEY EXPANSION
UNIT, INPUT, CLK, ROUND 1 to ROUND 10, OUTPUT)
Pipelining does not decrease the time taken to process a single data block;
it only increases the throughput of the system when processing a stream of
data.
A pipelined system typically requires more resources (circuit elements,
processing units, computer memory, etc.) than one that executes one batch at
a time, because its stages cannot reuse the resources of a previous stage.
Moreover, pipelining may increase the time it takes for an individual
operation to finish.
The pipelined AES encryption design is shown in Figure 3.20. Here pipeline
registers are included between every round so as to increase the throughput.
As this is a pipelined structure, it takes 10 clock cycles to obtain the
first output, and subsequent outputs are available on each following clock
cycle.
3.10 SUB PIPELINED AES
Similar to pipelining, sub-pipelining is implemented by inserting registers
in the combinational logic, but the registers are inserted both between and
inside each round. By using the pipelining and sub-pipelining concepts,
multiple blocks of data can be processed simultaneously.
Figure 3.21 Sub pipelining architecture
Among these architectural optimizations, sub-pipelining gives maximum speed
and better throughput. Figure 3.21 shows the sub-pipelined architecture with
r sub-stages. Each round unit is divided into r sub-stages with equal delays.
Figure 3.22 shows the sub-pipelined cutsets for a single round unit in
encryption mode. Here every round is equally divided into three parts and
registers are included to obtain higher throughput.
Figure 3.22 Sub pipelining cutsets in a single round encryption
In the LUT method, sub-pipelining is limited to only two sub-stages, whereas
combinational logic can be divided into more sub-stages with equal delays. In
these pipelined or sub-pipelined architectures, the plain text is received at
each clock cycle through the input register. The number of clock cycles
needed to complete a single round of the algorithm depends on the number of
sub-stages. Round keys are generated by the key expansion module and supplied
to each round. At each clock cycle the data is shifted to the next stage, and
the final output appears only after the end of the ((10*r)+10)th clock cycle,
where ‘r’ represents the number of sub-pipeline stages. The advantage of this
structure is that the second output can be obtained in the clock cycle
immediately after the first output. The internal design of each round
contains SubBytes, ShiftRows, MixColumns and AddRoundKey, which are explained
in previous sections. Here 3-stage sub-pipelining is used in every round,
together with outer round pipelining. An initial latency of 40 clock cycles
is therefore needed to obtain the first output, and subsequent outputs are
collected on each following clock cycle.
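The latency figures quoted above follow from simple arithmetic, sketched here in Python. The function names are hypothetical, and the assumption that one further block emerges per clock cycle once the pipeline is full is taken from the description in the text.

```python
ROUNDS = 10  # AES-128 round units in the unrolled pipeline


def first_output_cycle(sub_stages: int) -> int:
    """Clock cycle after which the first result appears: (10 * r) + 10."""
    return ROUNDS * sub_stages + ROUNDS


def cycles_for_blocks(num_blocks: int, sub_stages: int) -> int:
    """Total cycles for a stream, assuming one new output per cycle
    once the pipeline has filled."""
    return first_output_cycle(sub_stages) + (num_blocks - 1)


print(first_output_cycle(3))        # 40 cycles for 3-stage sub-pipelining
print(cycles_for_blocks(100, 3))    # 139 cycles for a stream of 100 blocks
```

For long streams the initial latency is amortized, so the throughput approaches one 128-bit block per clock cycle.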
3.11 SYSTEM LEVEL MODELING
With the SoC design consideration in mind, the AES-128 processor core was
first modeled at system level, thus easing the integration of AES-128 with
other cores in SoC designs.
At system level, the AES-128 processor core is viewed as a black box; the
only consideration in the design is to identify the primary signals that this
core needs in order to communicate with the other cores. From Chapter 3, it
is clear that the AES-128 core must have the following data ports:
i. dataIn - the plaintext to be encrypted or ciphertext to be decrypted
ii. CipherKey - the secret key to encrypt/decrypt the data
iii. dataOut - the encrypted plaintext or decrypted ciphertext
Besides, there must be input signals to select the mode of crypto operation
(encrypt/decrypt), as well as output signals to indicate the completion of
the process.
Lastly, as this processor core is a synchronous design, it requires an input
clock as well as an option for system reset. Figure 3.23 shows the system
level model of the AES-128 core that fulfills the above system level
requirements:
Figure 3.23 System Level Modeling of the AES-128 Processing Core (AES-128
CRYPTO block with inputs CLK, RST, ENC, KEY_I and DATA_I, and output
DATA_OUT)
3.11.1 Input and Output Signals
Table 3.3 lists the primary input and output signals for the AES-128 core,
which are essential to select the AES specification and operation mode, to
supply the data and key inputs, and to deliver the generated output.
Table 3.3 Input and output signals
Signal Name   Width (bit)   Type     Description
Clk           1             Input    Processor main clock signal
Rst           1             Input    Processor main reset signal
                                     0 – normal operation; 1 – System reset
Enc           1             Input    Processor mode of operation signal
                                     0 – Decryption mode; 1 – Encryption mode
Key_in        128           Input    The Secret Key to be used by AES Key
                                     Expander to expand all round keys
Data_in       128           Input    The initial data block to be encrypted
                                     or decrypted
Data_out      128           Output   Final result of AES transformation
3.12 FAULT TOLERANT XOR GATE
There are five major components which determine the throughput, area and
power of AES encryption and decryption: inverters, 2:1 multiplexers, XOR
gates, D flip-flops, and totally self-checking two-rail checkers. All of
these components must be fault free and must produce two-rail outputs for a
valid two-rail input. When the inputs are valid the output is valid and
correct, and when an input is non-valid the output is non-valid. The truth
table in Table 3.4 shows that, for every possible stuck-at fault, there is an
input set that yields a non-valid output; hence, the XOR cell is totally
self-checking for all single stuck-at faults and non-valid inputs.
Pseudo-nMOS technology is normally avoided because its static power
consumption is higher than that of CMOS technology, but it is preferred here
because the devices are fast and the short between power and ground makes the
output predictable in the presence of a fault, as shown in Figure 3.24.
Figure 3.24 Fault Tolerant XOR gate
Figure 3.25 Functional waveform of XOR gate
Figure 3.26 Functional waveform of 2:1 MUX gate
Table 3.4 Truth Table for Fault Detection
Sl.No  Inputs          A   A’  B   B’  A XOR B  A XNOR B
       Valid inputs    0   1   0   1   0        1
                       0   1   1   0   1        0
                       1   0   0   1   1        0
                       1   0   1   0   0        1
       Non-valid       0   0   X   x   1        1
       inputs          1   1   X   x   0        0
                       X   x   0   0   1        1
                       X   x   1   1   0        0
       Faults
1      Stuck at ‘0’    X   x   0   1   1        1
       Stuck at ‘1’    X   x   1   0   0        0
2      Stuck at ‘0’    X   x   1   0   1        1
       Stuck at ‘1’    X   x   0   1   0        0
3      Stuck at ‘0’    0   1   0   1   1        1
       Stuck at ‘1’    1   0   0   1   0        0
4      Stuck at ‘0’    1   0   0   1   1        1
       Stuck at ‘1’    0   1   0   1   0        0
5      Stuck at ‘0’    1   0   1   0   1        1
       Stuck at ‘1’    0   1   1   0   0        0
6      Stuck at ‘0’    0   1   1   0   1        1
       Stuck at ‘1’    1   0   1   0   0        0
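The valid and non-valid input rows of Table 3.4 can be mimicked by a small behavioural model, sketched below in Python. This is only a software illustration fitted to those table rows; it does not model the transistor-level pseudo-nMOS cell or the individual stuck-at faults, and the function names and the priority given to the A rails when both pairs are non-valid are assumptions.

```python
def is_valid(rail: int, rail_bar: int) -> bool:
    """A two-rail pair is valid only when the two rails are complementary."""
    return rail != rail_bar


def two_rail_xor(a: int, a_bar: int, b: int, b_bar: int):
    """Return (A XOR B, A XNOR B) for two-rail inputs.

    Valid inputs give a complementary (valid) output pair; a non-valid
    input pair forces both outputs to the same value, as in Table 3.4.
    """
    if not is_valid(a, a_bar):
        return 1 - a, 1 - a
    if not is_valid(b, b_bar):
        return 1 - b, 1 - b
    result = a ^ b
    return result, 1 - result


xor_out, xnor_out = two_rail_xor(0, 1, 1, 0)
print(xor_out, xnor_out, is_valid(xor_out, xnor_out))     # 1 0 True

xor_out, xnor_out = two_rail_xor(0, 0, 1, 0)              # A rails both 0
print(xor_out, xnor_out, is_valid(xor_out, xnor_out))     # 1 1 False
```

A two-rail checker placed after the gate only has to test whether the output pair is complementary to detect the error.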
Table 3.5 Delay and Power report of the gates used in Fault Detection
Gate       Path             Delay (ps)    Power (Watts)
XOR        A to xor o/p     9.994E-9      95.652 x 10^-5
           B to xor o/p     14.99E-9
XNOR       A to xnor o/p    -4.965E-9
           B to xnor o/p    -9.965E-9
DFF        D to Q           20.0E-9       26.44 x 10^-5
           Clk to Q         4.706E-9
MUX 2:1    A to Y           19.35E-12     89.208 x 10^-5
           B to Y           19.35E-12
3.13 TOOLS USED
1. For simulating the Verilog code, ModelSim Altera 6.3g is used.
2. For synthesizing the design, Cadence RTL Compiler v9.10 is used.
3. Cadence SoC Encounter is used for layout extraction of the complete
design.
4. Xilinx ISE Design Suite 12.1 is used for FPGA implementation.
Figure 3.27 Block diagram for efficient AES architecture
3.14 RESULTS AND COMPARISON
This section discusses the ASIC and FPGA implementation methodologies. In the
work proposed and reported in the thesis, two types of design are
implemented: one uses a pipelined architecture and the other uses a
sub-pipelined architecture. In the pipelined architecture the SubBytes
sub-module is implemented by the LUT method, whereas in the sub-pipelined
architecture SubBytes is implemented by the combinational method to reduce
the area requirements.
For VLSI (hardware) implementation, two different methodologies were used,
namely ASIC design and FPGA design. For the ASIC design, the architecture is
modeled in Verilog HDL, the functional simulation is done in ModelSim,
synthesis is carried out in Cadence RTL Compiler and the physical design is
carried out in Cadence SoC Encounter.
For the FPGA design, the architecture is modeled in Verilog HDL, functional
simulation is done in ModelSim, synthesis is carried out in Xilinx ISE Design
Suite 12.1 and the target board is Xilinx XC5VLX110T-1. The following
sections describe the implementation methodology and the results obtained in
ASIC and FPGA.
3.14.1 ASIC Design Methodology
Application Specific Integrated Circuit (ASIC) design, as the name suggests,
focuses on the development of a hardware module which is completely dedicated
to one particular application or process. This type of design helps in the
economical usage of silicon and also offers good speed compared to other
implementations such as FPGA and CPLD devices. The ASIC design flow can be
seen in Figure 3.28, and each step is discussed in the following sections.
Figure 3.28 ASIC Design Flow (blocks shown: system partitioning, design
entry, simulation, synthesis, floor plan, placement, routing, extraction,
layout simulation)
3.14.1.1 Simulation results
Subbytes
Figure 3.29 Subbytes - simulation result

63
Analysis: The above waveform shows the simulation result of the SubBytes
module. The signal ‘in’ is the 8-bit input of the module and the signal
‘out’ is its 8-bit output. In the SubBytes operation each 8-bit value is
substituted by another 8-bit value with the help of a look-up table.
Invsubbytes
Figure 3.30 Invsubbytes - simulation results
Analysis: The above waveform shows the simulation result of the
InvSubBytes module. The signal ‘a’ is the 8-bit input of the module and
the signal ‘d’ is its 8-bit output. This is the inverse of the SubBytes
operation: each 8-bit value is substituted by another 8-bit value with
the help of the inverse look-up table.
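The contents of both look-up tables can be regenerated from their definition, which is a convenient cross-check for the waveforms. The following Python sketch is not the Verilog code of the thesis. It assumes the standard AES construction (multiplicative inverse in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1, followed by the affine transformation), and the names gf_mul, gf_inv, rotl8 and build_sbox are illustrative only.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:          # reduce as soon as the degree reaches 8
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8); 0 is mapped to 0, as in AES."""
    if a == 0:
        return 0
    return next(y for y in range(1, 256) if gf_mul(a, y) == 1)


def rotl8(x, n):
    """Rotate an 8-bit value left by n positions."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """Forward S-box: field inversion followed by the affine transformation."""
    sbox = []
    for x in range(256):
        inv = gf_inv(x)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()

# The inverse table is obtained by swapping index and value.
INV_SBOX = [0] * 256
for index, value in enumerate(SBOX):
    INV_SBOX[value] = index

print(hex(SBOX[0x53]), hex(INV_SBOX[SBOX[0x53]]))
```

A look-up table implementation stores these 256 entries directly, whereas a combinational implementation computes the inversion and the affine transformation with logic gates.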
Key Expansion
Analysis: The waveform in Figure 3.31 shows the simulation result of the
key expansion module. The signal ‘key’ is the 128-bit input and the
signals ‘w0 to w43’ are the outputs. Words w0 to w3 hold the input key
itself, and the remaining 40 words form the 10 round keys, one for each
round operation.
Figure 3.31 Key expansion - simulation result
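The generation of the words w0 to w43 can be sketched in software as follows. This is a minimal Python model that assumes the standard AES-128 key schedule (RotWord, SubWord and round constant on every fourth word). It is not the thesis's Verilog module, and the function names are illustrative. The key is the one used in the encryption simulation later in this section.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def sbox_entry(x):
    """One S-box entry: field inverse followed by the affine transformation."""
    inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
    rot = lambda n: ((inv << n) | (inv >> (8 - n))) & 0xFF
    return inv ^ rot(1) ^ rot(2) ^ rot(3) ^ rot(4) ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]


def expand_key(key):
    """Expand a 16-byte key into the 44 words w0..w43 (4 bytes each)."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = list(w[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]        # RotWord
            temp = [SBOX[b] for b in temp]    # SubWord
            temp[0] ^= rcon                   # add the round constant
            rcon = gf_mul(rcon, 2)
        w.append([a ^ b for a, b in zip(w[i - 4], temp)])
    return w


key = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
words = expand_key(key)
round_key_1 = bytes(b for word in words[4:8] for b in word)
print(len(words), round_key_1.hex())
```

Round key r is the concatenation of the words w(4r) to w(4r+3), so round key 0 is the input key itself.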

Single round Encryption operation
Figure 3.32 Single round Encryption operation – simulation result
Analysis: The above waveform shows the simulation result of a single
round of the encryption operation. The signal ‘round_in’ is the 128-bit
input, the words ‘w0, w1, w2, w3’ together form the 128-bit round key and
‘round_out’ is the 128-bit output.
Single round Decryption operation
Figure 3.33 Single round decryption operation – simulation results
Analysis: The above waveform shows the simulation result of a single
round of the decryption operation. The signal ‘round_in’ is the 128-bit
input, the words ‘w0, w1, w2, w3’ together form the 128-bit round key and
‘round_out’ is the 128-bit output.
Encryption operation
Figure 3.34 Encryption operation – simulation results
Analysis: The above figure shows the simulation result of the encryption
operation. The signals ‘clk’, ‘key’, ‘enc’ and ‘in’ are the inputs and
‘out’ is the output signal. Because this is a pipelined design, a new
input can be applied at every clock cycle and outputs can be taken
continuously from the 11th clock cycle onwards.
Key     = 128'h000102030405060708090a0b0c0d0e0f; enc = 1;
Input1  = 128'h00112233445566778899aabbccddeeff;
Input2  = 128'h10112233445566778899aabbccddeeff;
Output1 = 128'h69c4e0d86a7b0430d8cdb78070b4c55a;
Output2 = 128'h0761adfd2febd4d105b1ac2ff88171b3;
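The first key, input and output triple above is the AES-128 example vector of FIPS-197 (Appendix C.1), so it can be reproduced with a software reference model. The following Python sketch assumes the standard AES-128 round structure and is written for readability, not speed. It does not model the pipelined hardware, and all function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def sbox_entry(x):
    inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
    rot = lambda n: ((inv << n) | (inv >> (8 - n))) & 0xFF
    return inv ^ rot(1) ^ rot(2) ^ rot(3) ^ rot(4) ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]


def round_keys(key):
    """Return the 11 round keys (16 bytes each) of a 16-byte key."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = list(w[i - 1])
        if i % 4 == 0:
            temp = [SBOX[b] for b in temp[1:] + temp[:1]]  # RotWord, SubWord
            temp[0] ^= rcon
            rcon = gf_mul(rcon, 2)
        w.append([a ^ b for a, b in zip(w[i - 4], temp)])
    return [[b for word in w[4 * r:4 * r + 4] for b in word] for r in range(11)]


def mix_columns(s):
    """MixColumns on a column-major state of 16 bytes."""
    out = []
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        out += [gf_mul(a[r], 2) ^ gf_mul(a[(r + 1) % 4], 3)
                ^ a[(r + 2) % 4] ^ a[(r + 3) % 4] for r in range(4)]
    return out


def encrypt_block(plaintext, key):
    """Encrypt one 16-byte block with AES-128."""
    rks = round_keys(key)
    s = [p ^ k for p, k in zip(plaintext, rks[0])]          # initial AddRoundKey
    for rnd in range(1, 11):
        s = [SBOX[b] for b in s]                             # SubBytes
        s = [s[4 * ((c + r) % 4) + r]                        # ShiftRows
             for c in range(4) for r in range(4)]
        if rnd < 10:                                         # no MixColumns in round 10
            s = mix_columns(s)
        s = [b ^ k for b, k in zip(s, rks[rnd])]             # AddRoundKey
    return bytes(s)


key = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
input1 = bytes.fromhex("00112233445566778899aabbccddeeff")
print(encrypt_block(input1, key).hex())
```

Input2 can be passed to encrypt_block in the same way and compared with the Output2 value listed above.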
Decryption operation
Figure 3.35 Decryption operation – simulation results

Analysis: The above figure shows the simulation result of the decryption
operation. The signals ‘clk’, ‘key’, ‘enc’ and ‘in’ are the inputs and
‘out’ is the output signal. Because this is a pipelined design, a new
input can be applied at every clock cycle and outputs can be taken
continuously from the 11th clock cycle onwards.
Key     = 128'h000102030405060708090a0b0c0d0e0f; enc = 0;
Input1  = 128'h0761adfd2febd4d105b1ac2ff88171b3;
Input2  = 128'h69c4e0d86a7b0430d8cdb78070b4c55a;
Output1 = 128'h10112233445566778899aabbccddeeff;
Output2 = 128'h00112233445566778899aabbccddeeff;
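The timing behaviour described in the analysis (one new block per clock cycle, first result in the 11th cycle) can be illustrated with a small cycle-level model. The sketch below assumes an ideal pipeline with 10 register stages, one per round. The stage count is an assumption chosen to match the stated 11-cycle latency, not a detail taken from the design, and the function name simulate_pipeline is hypothetical.

```python
from collections import deque


def simulate_pipeline(blocks, stages=10):
    """Return (cycle, block) pairs for every block leaving an ideal pipeline.

    A new block enters in each clock cycle; cycle numbering starts at 1.
    """
    registers = deque([None] * stages)   # registers[-1] is the last stage
    pending = list(blocks)
    outputs = []
    cycle = 0
    while pending or any(r is not None for r in registers):
        cycle += 1
        leaving = registers.pop()        # content of the last stage leaves
        if leaving is not None:
            outputs.append((cycle, leaving))
        # Everything shifts by one stage and the next input (if any) enters.
        registers.appendleft(pending.pop(0) if pending else None)
    return outputs


for cycle, block in simulate_pipeline(["Input1", "Input2"]):
    print(f"cycle {cycle}: result for {block}")
```

Under these assumptions the latency is fixed at 11 cycles, while the throughput is one block per cycle once the pipeline is full.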
Schematic obtained in Cadence
Figure 3.36 Schematic of AES

3.14.1.2 Synthesis results
This section lists the synthesis reports obtained for the designs in
Cadence RTL Compiler.
1. Area Report of pipelined design
2. Area Report of sub-pipelined design
3. Power Report of pipelined design

4. Power Report of sub-pipelined design
5. Timing Report of pipelined design
6. Timing Report of sub-pipelined design
3.14.1.3 ASIC synthesis summary
Table 3.6 Synthesis results (ASIC)
Design               AES               Proposed AES       Proposed AES
                     (Look-Up Table)   (Sub-Pipelining)   (Sub-Pipelining)
Technology           90 nm             90 nm              180 nm
Area (um2)           740870            564036             2258469
Power (mW)           136.995           147.78             655.5
Critical path (ns)   3.9               2.2                4.2
Fmax (MHz)           256.4             454.5              238
Throughput (Gbps)    32.82             58.18              30.47
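The Fmax and throughput rows of Table 3.6 appear consistent with Fmax = 1 / critical path and throughput = 128 bits x Fmax, that is, one 128-bit block completed per clock cycle. The short Python sketch below recomputes both rows from the critical path under that assumption. The helper names are illustrative and the relation is inferred from the table values.

```python
BLOCK_BITS = 128  # AES block size in bits


def fmax_mhz(critical_path_ns):
    """Maximum clock frequency in MHz for a critical path given in ns."""
    return 1000.0 / critical_path_ns


def throughput_gbps(critical_path_ns, blocks_per_cycle=1):
    """Throughput in Gbps, assuming blocks_per_cycle blocks finish per clock."""
    return BLOCK_BITS * blocks_per_cycle * fmax_mhz(critical_path_ns) / 1000.0


designs = [
    ("Look-up table, 90 nm", 3.9),
    ("Sub-pipelined, 90 nm", 2.2),
    ("Sub-pipelined, 180 nm", 4.2),
]
for name, critical_path in designs:
    print(f"{name}: {fmax_mhz(critical_path):.1f} MHz, "
          f"{throughput_gbps(critical_path):.2f} Gbps")
```

The computed values agree with the table to within rounding of the last digit.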
3.14.2 FPGA Methodology
Before the ASIC was developed, the AES design was prototyped and
validated on an FPGA. The design was written in the Verilog Hardware
Description Language (Verilog HDL) at the Register Transfer Level (RTL),
implemented (placed and routed) with Xilinx ISE and validated on a
Xilinx XC5VLX110T-1 FPGA.
The FPGA hardware implementation is performed in the following way:
1. The AES Verilog code (RTL) is synthesized in Xilinx ISE Design
Suite 12.1 for the Xilinx XC5VLX110T-1 FPGA.
2. The synthesized netlist is placed and routed by Xilinx ISE.
3. The bit file is generated by Xilinx ISE.
4. The bit file is downloaded into the XC5VLX110T-1 FPGA.
5. The output is verified on the host monitor with the ChipScope Pro
Analyzer software.
3.14.2.1 FPGA results
1. Validating design on XC5VLX110T
Figure 3.37 FPGA validation screenshot

Analysis: The above figure shows the validation of the AES processor on
the XC5VLX110T-1 FPGA using the ChipScope Pro Analyzer. Because the
number of switches and LEDs available on FPGA boards is limited, the
ChipScope Pro Analyzer is needed to apply and observe the 128-bit
signals. Here the signals ‘SyncIn’, ‘AsyncOut’ and ‘AsyncOut1’ are the
decrypted output, the input text and the key respectively.
2. Device utilization summary
Table 3.7 Device utilization summary of pipelined architecture
Selected device: xc5vlx110t-1
Number of slices              4611
Number of slice flip-flops    1096
Number of LUTs                14358
Number of BRAMs               60
Number of bonded IOBs         386
Maximum frequency             103.42 MHz
Table 3.8 Device utilization summary of sub-pipelined architecture
Selected device: xc5vlx110t-1
Number of slices              8896
Number of slice flip-flops    12409
Number of LUTs                26808
Number of BRAMs               0
Number of bonded IOBs         386
Maximum frequency             202.26 MHz
3.15 SUMMARY AND CONCLUSION
The hardware implementation of an efficient pipelined AES architecture
with reconfigurability covers both the encryption and the decryption
process. The sub-pipelined architecture gives a higher throughput than
earlier implementations. The proposed VLSI architecture is further
enhanced with fault-detectable basic gates for use in the cryptographic
architecture. Most previous works implement SubBytes with the look-up
table method only, whereas the proposed architecture uses both the
look-up table method and the combinational logic method. Compared with
the look-up table method, the combinational method occupies less area.
Furthermore, combinational logic makes inner-round pipelining
(sub-pipelining) efficient.
The design is modeled in Verilog HDL and simulated with ModelSim and
Cadence NCSim. Synthesis is done with RTL Compiler and the physical
design with SoC Encounter. The transistor-level design is done in Cadence
ADE and simulated with Spectre. The proposed architecture reaches a
throughput of 32.32 Gbps with the 180 nm TSMC technology library. The
design has also been targeted to an FPGA, where it achieves a throughput
of 31.9 Gbps on the Xilinx xc5vlx110t-1 device, which is faster than the
fastest previous FPGA implementations known to date.
