
CHAPTER 3

RECONFIGURABLE HIGH PERFORMANCE FAULT DETECTABLE AES BY USING COMPOSITE FIELDS

This chapter presents the fundamental theory of the Advanced Encryption
Standard (AES) algorithm and its specification for the 128-bit architecture.
It begins with an introduction to the AES algorithm and briefly compares the
specifications of the different AES architectures. It then discusses the
background mathematical concepts associated with the specification, followed
by a detailed description of all functions and transformations in the AES
algorithm.

3.1 INTRODUCTION TO THE ADVANCED ENCRYPTION STANDARD (AES) ALGORITHM

The AES algorithm, adopted by the United States government in 2001, is a
block cipher that transforms 128-bit data blocks under a 128-bit, 192-bit or
256-bit secret key by means of permutation and substitution.

In January 1997, the US National Institute of Standards and Technology (NIST)
announced the initiation of an effort to develop the AES and made a formal
call for algorithms on September 12, 1997. After reviewing the results of this
preliminary research, the algorithms MARS, RC6™, Rijndael, Serpent and Twofish
were selected as finalists. On 2 October 2000, NIST announced its decision to
propose Rijndael as the new Advanced Encryption Standard (AES). It is expected
to replace DES and Triple DES, since its enhanced security level fulfils
stricter data security requirements.

Rijndael was a refinement of an earlier design by Daemen and Rijmen. Unlike
its predecessor DES, Rijndael is a substitution-permutation network, not a
Feistel network. AES is fast in both software and hardware, is relatively easy
to implement, and requires little memory.

In 2001, AES replaced the aging DES as a Federal Information Processing
Standard (FIPS). DES is seen as reaching the end of its life, as cracking its
cipher has become more tractable on current computer hardware. The AES
algorithm will be used for many applications within the government and in the
private sector. Breaking an AES-encrypted cipher text by trying all possible
keys is currently computationally infeasible.

The AES specifies the Rijndael algorithm, which is a symmetric block cipher
that processes fixed 128-bit data blocks using cipher keys with lengths of
128, 192 and 256 bits. The original Rijndael algorithm had the option of
combining data block sizes of 128, 192 or 256 bits with any of these key
lengths. However, because of the hard task of verifying that all possible
combinations were secure against cryptographic attacks, only the 128-bit block
size with 128, 192 and 256-bit keys was recognized in the AES standard
(NIST, 2002). From this section onward, all discussion is focused on the
256-bit AES algorithm and its implementation.

3.2 DIFFERENT AES ARCHITECTURES

There are only three FIPS-recognized AES architectures: AES128, AES192 and
AES256. They differ in the length of the secret key, which is 128, 192 and
256 bits respectively. In general, a longer secret key requires more
iterations to generate the round keys, as well as to encrypt or decrypt the
input data.

Most of the AES128 design blocks can be directly reused in AES192 and AES256,
with minor changes in the Key Expansion unit, which needs more state blocks to
store the longer secret key.

3.3 ALGORITHM NOTATIONS AND CONVENTIONS

In this section, all the notations, symbols and parameters used are based on
the convention used in the NIST FIPS-197 AES Standard. The key parameters used
in NIST FIPS-197 are:

Nb - the length of the Plain Text / Cipher Text block, in 32-bit words

Nk - the length of the Cipher Key, in 32-bit words

Nr - the number of rounds of text transformation

3.4 COMPARISON OF DIFFERENT AES SPECIFICATIONS

The AES specifications differ in the length of the secret key, and the key
transformation process also differs between AES128/192 and AES256. In NIST
FIPS-197, the only Key-Block-Round combinations that conform to the standard
are those given in Table 3.1.

Table 3.1 AES specifications

            Key length    Block size    Number of
            (Nk words)    (Nb words)    rounds (Nr)
AES-128         4             4             10
AES-192         6             4             12
AES-256         8             4             14

3.5 BACKGROUND MATHEMATICS

This section provides a brief introduction to the fundamental mathematical
concepts of finite fields needed to understand AES; most of the information in
this section is drawn from FIPS-197. For an in-depth discussion of the
subject, one should refer to (Joan Daemen & Vincent Rijmen 1999) and
(FIPS-197 2001). Several operations in AES are defined at byte level, with
bytes representing elements in the finite field GF(2^8). Other operations are
defined in terms of 4-byte words. This section introduces the basic
mathematical concepts needed in the following sections.

3.5.1 The Field GF(2^8)

The elements of a finite field can be represented in several different ways.
For any prime power there is a single finite field, hence all representations
of the Galois Field GF(2^8) are isomorphic. Despite this equivalence, the
representation has an impact on the implementation complexity. Joan Daemen &
Vincent Rijmen (1999) chose the classical polynomial representation.

A byte b, consisting of bits b7 b6 b5 b4 b3 b2 b1 b0, is considered as a
polynomial with coefficients in {0,1}:

b7 x^7 + b6 x^6 + b5 x^5 + b4 x^4 + b3 x^3 + b2 x^2 + b1 x + b0 (3.1)

Example 3.1: The byte with hexadecimal value ‘57’ (binary 01010111)
corresponds to the polynomial

x^6 + x^4 + x^2 + x + 1. (3.2)

3.5.2 Finite Field Addition

The addition of two finite field elements is achieved by adding the
coefficients for corresponding powers of their polynomial representations,
this addition being performed in GF(2), that is, modulo 2, so that 1 + 1 = 0.

Consequently, addition and subtraction are both equivalent to an exclusive-or
(XOR) operation on the bytes that represent field elements. Addition of finite
field elements will be denoted by the symbol ⊕.

Example 3.2: Steps to obtain {57} ⊕ {83} = {D4}

(Polynomial notation) (x^6 + x^4 + x^2 + x + 1) + (x^7 + x + 1) = x^7 + x^6 + x^4 + x^2

(Binary notation) {01010111} ⊕ {10000011} = {11010100}

(Hexadecimal notation) {57} ⊕ {83} = {D4}
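
The byte-to-polynomial mapping and the XOR form of addition can be checked with a few lines of Python. This is only an illustrative sketch; the function names `to_polynomial` and `gf_add` are chosen here and do not come from FIPS-197. The operands are the two bytes of Example 3.2 (binary 01010111 and 10000011).

```python
def to_polynomial(b: int) -> str:
    """Render a byte as its polynomial over GF(2), highest power first."""
    terms = []
    for power in range(7, -1, -1):
        if (b >> power) & 1:
            if power == 0:
                terms.append("1")
            elif power == 1:
                terms.append("x")
            else:
                terms.append(f"x^{power}")
    return " + ".join(terms) if terms else "0"


def gf_add(a: int, b: int) -> int:
    """Add two GF(2^8) elements: coefficient-wise addition modulo 2 is XOR."""
    return a ^ b


print(to_polynomial(0x57))                 # x^6 + x^4 + x^2 + x + 1
print(f"{gf_add(0x57, 0b10000011):02X}")   # D4
```

Because XOR is its own inverse, applying `gf_add` with the same operand a second time returns the original byte, which is why addition and subtraction coincide in this field.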

3.5.3 Finite Field Multiplication

Finite field multiplication is more difficult than addition and is achieved by
multiplying the polynomials for the two elements concerned and collecting like
powers of x in the result. Since each polynomial can have powers of x up to 7,
the result can have powers of x up to 14 and will no longer fit within a
single byte. This situation is handled by replacing the result with the
remainder polynomial after division by a special irreducible polynomial of
degree 8, which for AES is

m(x) = x^8 + x^4 + x^3 + x + 1 (3.3)

Since this polynomial has powers of x up to 8, it cannot be represented by a
single byte; its eight low-order coefficients are written as either {00011011}
or {1B}, with the x^8 term implied.

Example 3.3

This process is illustrated by the following example product (where · is used
to represent finite field multiplication):

{57} · {83} = {C1}

The product (x^6 + x^4 + x^2 + x + 1) · (x^7 + x + 1) is expanded term by term:

(x^6 + x^4 + x^2 + x + 1) · x^7 = x^13 + x^11 + x^9 + x^8 + x^7

(x^6 + x^4 + x^2 + x + 1) · x = x^7 + x^5 + x^3 + x^2 + x

(x^6 + x^4 + x^2 + x + 1) · 1 = x^6 + x^4 + x^2 + x + 1

Sum: x^13 + x^11 + x^9 + x^8 + x^6 + x^5 + x^4 + x^3 + 1

This intermediate result is now divided by m(x) above:

(x^8 + x^4 + x^3 + x + 1) · x^5 = x^13 + x^9 + x^8 + x^6 + x^5

Subtract to give the intermediate remainder x^11 + x^4 + x^3 + 1

(x^8 + x^4 + x^3 + x + 1) · x^3 = x^11 + x^7 + x^6 + x^4 + x^3

Subtract to give the final remainder x^7 + x^6 + 1

The final result is x^7 + x^6 + 1 = {C1}
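
The two steps of Example 3.3, polynomial multiplication followed by long division by m(x), can be sketched in Python as follows. The function name `gf_mul` and the constant name `AES_MODULUS` are assumptions made for this illustration; a hardware design would typically use a shift-and-reduce structure rather than this literal transcription.

```python
AES_MODULUS = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1, including the x^8 term


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements and reduce the product modulo m(x)."""
    product = 0
    # Polynomial multiplication: XOR in a shifted copy of a for each set bit of b.
    for bit in range(8):
        if (b >> bit) & 1:
            product ^= a << bit
    # Long division by m(x): cancel every term of degree 14 down to 8.
    for degree in range(14, 7, -1):
        if (product >> degree) & 1:
            product ^= AES_MODULUS << (degree - 8)
    return product


print(f"{gf_mul(0x57, 0x83):02X}")  # C1
```

For the operands of Example 3.3 the reduction loop subtracts m(x)·x^5 and then m(x)·x^3, which are exactly the two division steps written out above.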

3.5.4 Multiplicative Inverse

In mathematics, the multiplicative inverse of a number a is the number x
which, when multiplied by a, yields 1, or

(a · x) = 1

It is denoted by 1/a or a^-1. In modular arithmetic, the multiplicative
inverse of a modulo n is defined as the number x such that

(a · x) mod n = 1

However, this multiplicative inverse exists only if ‘a’ and ‘n’ are
relatively prime.

Example 3.4

The multiplicative inverse of 3 modulo 11 is 4, because 4 is the solution to
(3 · x) mod 11 = 1. In hexadecimal notation, ({03} · {04}) mod {0B} = 1.

For 8-bit values, the multiplicative inverse is taken over the set of 256
different byte values. The multiplicative inverse is used later in the SubByte
and InvSubByte transformations.
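
Both kinds of inverse mentioned above can be sketched in Python. The integer case is solved by direct search, and the GF(2^8) case uses the fact that the 255 nonzero field elements form a multiplicative group, so a^254 = a^-1. The helper names `mod_inverse`, `gf_mul` and `gf_inv` are assumptions for this sketch, and mapping {00} to itself follows the convention used later for the S-box.

```python
def mod_inverse(a: int, n: int) -> int:
    """Return x with (a * x) mod n == 1; requires a and n relatively prime."""
    for x in range(1, n):
        if (a * x) % n == 1:
            return x
    raise ValueError("no inverse: a and n are not relatively prime")


def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:          # degree reached 8: subtract (XOR) m(x)
            a ^= 0x11B
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8) as a^254; {00} maps to {00}."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


print(mod_inverse(3, 11))         # 4
print(f"{gf_inv(0x53):02X}")      # CA
```

Repeated multiplication is chosen here only for clarity; it is not meant to reflect an efficient software or hardware implementation.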

3.5.5 Polynomials with Coefficients in GF(2^8)

Four-term polynomials can be defined with coefficients that are finite field
elements as:

a(x) = a3 x^3 + a2 x^2 + a1 x + a0 (3.4)

where the four coefficients, each represented by a byte, will be denoted as a
32-bit word in the form [a3, a2, a1, a0]. With a second polynomial:

b(x) = b3 x^3 + b2 x^2 + b1 x + b0 (3.5)

addition can be performed by adding the finite field coefficients of like
powers of x, which corresponds to an XOR operation between the corresponding
bytes in each of the words or an XOR of the complete 32-bit word values (note
that the variable x here is different from that used in the definition of
individual finite field elements). Multiplication is achieved by algebraically
expanding the polynomial product and collecting like powers of x to give:

c(x) = c6 x^6 + c5 x^5 + c4 x^4 + c3 x^3 + c2 x^2 + c1 x + c0 (3.6)

where:

c0 = a0 · b0
c1 = a1 · b0 ⊕ a0 · b1
c2 = a2 · b0 ⊕ a1 · b1 ⊕ a0 · b2
c3 = a3 · b0 ⊕ a2 · b1 ⊕ a1 · b2 ⊕ a0 · b3
c4 = a3 · b1 ⊕ a2 · b2 ⊕ a1 · b3
c5 = a3 · b2 ⊕ a2 · b3
c6 = a3 · b3

with · and ⊕ representing finite field multiplication and addition (XOR)
respectively. This result requires seven bytes to represent its coefficients,
but it can be reduced modulo a degree 4 polynomial to produce a result of
degree less than 4. In Rijndael the polynomial used is (x^4 + 1), and
reduction produces the following polynomial coefficients:

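d3 = a3 · b0 ⊕ a2 · b1 ⊕ a1 · b2 ⊕ a0 · b3 (3.7)

d2 = a2 · b0 ⊕ a1 · b1 ⊕ a0 · b2 ⊕ a3 · b3 (3.8)

d1 = a1 · b0 ⊕ a0 · b1 ⊕ a3 · b2 ⊕ a2 · b3 (3.9)

d0 = a0 · b0 ⊕ a3 · b1 ⊕ a2 · b2 ⊕ a1 · b3 (3.10)

If one of the polynomials is fixed, this can conveniently be written in matrix
form as:

| d0 |   | a0 a3 a2 a1 |   | b0 |
| d1 | = | a1 a0 a3 a2 | · | b1 |
| d2 |   | a2 a1 a0 a3 |   | b2 |
| d3 |   | a3 a2 a1 a0 |   | b3 |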

Because (x^4 + 1) is not an irreducible polynomial, not all polynomial
multiplications are invertible. For Rijndael, however, a polynomial that has
an inverse has been chosen:

a(x) = {03}x^3 + {01}x^2 + {01}x + {02} (3.11)

a^-1(x) = {0b}x^3 + {0d}x^2 + {09}x + {0e} (3.12)

This transformation is used in MixColumn and InvMixColumn.

Another polynomial that Rijndael uses has a0 = a2 = a3 = {00} and a1 = {01},
which is the polynomial x. Inspection of Equations (3.7) to (3.10) shows that
its effect is to form the output word by rotating the bytes in the input word
so that [b3, b2, b1, b0] is transformed into [b2, b1, b0, b3], with bytes
moving to higher index positions and the top byte wrapping round to the lowest
position. Higher powers of x correspond to the other cyclic permutations of
the four bytes within a 32-bit word. The ROTATE function that is used in the
key expander corresponds to x^3.
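
The reduction modulo (x^4 + 1) amounts to adding the indices of the two coefficients modulo 4, which the following Python sketch makes explicit. The name `poly4_mul` and the list layout [a0, a1, a2, a3] (lowest power first) are assumptions of this illustration. It checks that the product of a(x) and a^-1(x) is {01} and that multiplying by x rotates the bytes.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def poly4_mul(a, b):
    """Multiply [a0, a1, a2, a3] by [b0, b1, b2, b3] modulo x^4 + 1.

    Since x^4 = 1 under this modulus, the term a_i * b_j contributes to
    coefficient d_((i + j) mod 4).
    """
    d = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            d[(i + j) % 4] ^= gf_mul(a[i], b[j])
    return d


a = [0x02, 0x01, 0x01, 0x03]       # a(x), coefficients a0..a3
a_inv = [0x0E, 0x09, 0x0D, 0x0B]   # a^-1(x), coefficients a0..a3
print(poly4_mul(a, a_inv))         # [1, 0, 0, 0]

x_poly = [0x00, 0x01, 0x00, 0x00]  # the polynomial x
print(poly4_mul(x_poly, [0x10, 0x20, 0x30, 0x40]))  # [64, 16, 32, 48]
```

In the second call the result [b3, b0, b1, b2] (lowest index first) is the same rotation that the text writes as [b2, b1, b0, b3] with the highest index first.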

3.6 ADVANCED ENCRYPTION STANDARD (AES) MAIN MODULES

There are two main modules in the AES algorithm:

i. Key Expansion

ii. Cipher / Inverse Cipher

In this research, both modules are implemented on the same hardware.

3.6.1 State Array and Cipher Key State Array

Before explaining the AES algorithm, one needs a thorough understanding of the
internal array that AES uses to describe all of its functions (Joan Daemen &
Vincent Rijmen 1999). The AES internal operations are performed on an array of
bytes, which is used in both the Cipher (i.e. text transformation) and the Key
Expansion (i.e. round key scheduler).

In the Cipher, the array is called the State (denoted S). It consists of 4
rows of bytes, each row holding 4 bytes of 8 bits each, so the State holds
4 rows x 4 bytes x 8 bits = 128 bits in total. Each individual byte has two
indices, the row number r with range 0 ≤ r < 4 and the column number c with
range 0 ≤ c < 4, allowing it to be referred to as S[r,c]. All transformations
in the cipher are made on this State array.

In the key scheduler, the array used in data processing is called the Key
State (denoted W). The Key State is a one-dimensional array of eight 32-bit
words, indexed by r with range 0 ≤ r < 8, so that each word can be referred to
as W[r]. One Key State array of eight words is named the round key K.

Figure 3.1 (a) Initial input bytes (b) State Array and (c) Key State Array

To explain the AES algorithm by example, the values of the initial input key
and input data are chosen as follows (test patterns from FIPS-197):

Plain text : 32 43 f6 a8 88 5a 30 8d 31 31 98 a2 e0 37 07 34

Cipher key : 60 3d eb 10 15 ca 71 be 2b 73 ae f0 85 7d 77 81

Figure 3.2 Example of State Array and Key State Array

3.6.2 AES Key Expansion

The AES algorithm takes the Cipher Key, K, and performs a Key Expansion
routine to generate a key schedule. The Key Expansion generates a total of
Nb(Nr + 1) words: the algorithm requires an initial set of Nb words, and each
of the Nr rounds requires Nb words of key data. The resulting key schedule
consists of a linear array of 4-byte words, denoted w[i], with i in the range
0 ≤ i < Nb(Nr + 1).

The expansion of the input key into the key schedule proceeds according to the
pseudo code in Figure 3.3. SubWord is a function that takes a four-byte input
word and applies the S-box to each of the four bytes to produce an output
word. The function RotWord takes a word [a0,a1,a2,a3] as input, performs a
cyclic permutation, and returns the word [a1,a2,a3,a0]. The round constant
word array, Rcon[i], contains the values given by [x^(i-1),{00},{00},{00}],
with x^(i-1) being powers of x (x is denoted as {02}) in the field GF(2^8)
(note that i starts at 1, not 0).

It can be seen that the first Nk words of the expanded key are filled with the
Cipher Key. Every following word, w[i], is equal to the XOR of the previous
word, w[i-1], and the word Nk positions earlier, w[i-Nk]. For words in
positions that are a multiple of Nk, a transformation is applied to w[i-1]
prior to the XOR, followed by an XOR with a round constant, Rcon[i]. This
transformation consists of a cyclic shift of the bytes in a word (RotWord),
followed by the application of a table lookup to all four bytes of the word
(SubWord). It is important to note that the Key Expansion routine for 256-bit
Cipher Keys (Nk = 8) is slightly different from that for 128- and 192-bit
Cipher Keys. If Nk = 8 and i-4 is a multiple of Nk, then SubWord is applied to
w[i-1] prior to the XOR.

Figure 3.3 Pseudo Code for Key Expansion
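
The key expansion described above can be sketched in Python as follows. This is not the thesis's Verilog design and not a copy of the FIPS-197 pseudo code; it is an illustrative software model in which the S-box is computed from the multiplicative inverse and the affine transformation rather than stored as a table, and in which the round constant is updated incrementally instead of being indexed. All function names are assumptions of this sketch.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def sbox_entry(value: int) -> int:
    """Multiplicative inverse (value^254, {00} -> {00}) followed by the affine map."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, value)
    out = 0
    for i in range(8):
        bit = ((inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8))
               ^ (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)) ^ (0x63 >> i))
        out |= (bit & 1) << i
    return out


SBOX = [sbox_entry(v) for v in range(256)]


def sub_word(word):
    """Apply the S-box to each of the four bytes of a word."""
    return [SBOX[b] for b in word]


def rot_word(word):
    """[a0, a1, a2, a3] -> [a1, a2, a3, a0]."""
    return word[1:] + word[:1]


def key_expansion(key: bytes):
    """Return the key schedule as a list of Nb*(Nr+1) four-byte words."""
    nk = len(key) // 4          # 4, 6 or 8 words
    nb, nr = 4, nk + 6          # matches Table 3.1
    w = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 0x01                 # x^0; multiplied by {02} after each use
    for i in range(nk, nb * (nr + 1)):
        temp = list(w[i - 1])
        if i % nk == 0:
            temp = sub_word(rot_word(temp))
            temp[0] ^= rcon
            rcon = gf_mul(rcon, 0x02)
        elif nk > 6 and i % nk == 4:
            temp = sub_word(temp)   # extra SubWord step for 256-bit keys
        w.append([x ^ y for x, y in zip(w[i - nk], temp)])
    return w


schedule = key_expansion(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(len(schedule), bytes(schedule[4]).hex())   # 44 a0fafe17
```

The 128-bit key used in the demonstration is the key expansion example key from FIPS-197, chosen so that the first derived word can be compared against the published value.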



[Diagram: key schedule words W0 to W11, with inputs X and Y and the block Sbox(Rot(Y))]

Figure 3.4 Round key expansion

3.6.2.1 ROT transformation

ROT rotates the bytes in a word one position to the left. If a word consists
of four bytes {a0, a1, a2, a3}, then after the ROT transformation the new word
is {a1, a2, a3, a0}. For example, in hexadecimal, for the word
[{EA}, {31}, {D4}, {F0}], ROT returns the word [{31}, {D4}, {F0}, {EA}].
Figure 3.5 illustrates the ROT process.

Figure 3.5 ROT Transformation

3.6.2.2 SubByte transformation (invSubByte)

The SubByte transformation is a non-linear byte substitution applied to every
byte of the State; it is also called the SBOX.

The SBOX is a common function used in both the Key Expansion and the Cipher.
In the Key Expansion, only the encryption-mode SBOX is used, even when
generating round keys for the decryption transformation.

The SBOX is constructed by composing two transformations:

1. First, the multiplicative inverse in the finite field GF(2^8) described
earlier in Section 3.5.4 (with element {00} mapped to itself).

2. Second, the affine transformation over GF(2), defined by:

b'_i = b_i ⊕ b_((i+4) mod 8) ⊕ b_((i+5) mod 8) ⊕ b_((i+6) mod 8) ⊕ b_((i+7) mod 8) ⊕ c_i (3.13)

for 0 ≤ i < 8, where b_i is bit i of the byte and c_i is bit i of a byte c
with value {63} or {01100011}. This transformation can equivalently be
expressed in matrix form.

For invSubByte, the order is reversed: the inverse of the affine
transformation is applied first, followed by the multiplicative inverse. The
inverse affine transformation is defined by:

b'_i = b_((i+2) mod 8) ⊕ b_((i+5) mod 8) ⊕ b_((i+7) mod 8) ⊕ d_i (3.14)

where d_i is bit i of a byte d with value {05}.
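
The two-step construction of the SBOX and its inverse can be sketched in Python as follows. The function names are assumptions of this illustration, and the value {05} used for the constant d in the inverse affine transformation is stated here as the constant that makes Equation (3.14) undo Equation (3.13). The sketch reproduces the substitution {53} to {ed} and checks that invSubByte restores the original byte.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse as a^254, with {00} mapped to itself."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def bit(value: int, index: int) -> int:
    """Bit (index mod 8) of a byte."""
    return (value >> (index % 8)) & 1


def affine(b: int) -> int:
    """Equation (3.13) with c = {63}."""
    out = 0
    for i in range(8):
        v = (bit(b, i) ^ bit(b, i + 4) ^ bit(b, i + 5)
             ^ bit(b, i + 6) ^ bit(b, i + 7) ^ bit(0x63, i))
        out |= v << i
    return out


def inv_affine(b: int) -> int:
    """Equation (3.14) with d = {05}."""
    out = 0
    for i in range(8):
        v = bit(b, i + 2) ^ bit(b, i + 5) ^ bit(b, i + 7) ^ bit(0x05, i)
        out |= v << i
    return out


def sub_byte(b: int) -> int:
    """SubByte: multiplicative inverse, then affine transformation."""
    return affine(gf_inv(b))


def inv_sub_byte(b: int) -> int:
    """invSubByte: inverse affine transformation, then multiplicative inverse."""
    return gf_inv(inv_affine(b))


print(f"{sub_byte(0x53):02x}")       # ed
print(f"{inv_sub_byte(0xED):02x}")   # 53
```

Both directions share `gf_inv`, which mirrors the hardware observation that a single inversion block can serve SubBytes and InvSubBytes.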

Figure 3.6 illustrates the effect of the SubBytes transformation on the State.

Figure 3.6 SubByte and InvSubByte Transformation

LUT Method

The S-box used in the SubBytes transformation is presented in hexadecimal form
in Figure 3.7. If S[1,1] = {53}, then the substitution value is determined by
the intersection of the row with index ‘5’ and the column with index ‘3’ in
Figure 3.7. This results in S'[1,1] having a value of {ed}.

Figure 3.7 S-box: Substitution values for the byte xy (in hexadecimal format)

Combinational Logic Method

In the LUT-based approach, the delay of a lookup table cannot be broken up and
is greater than that of the other logic. With the LUT method it is difficult
to use a sub-pipelined structure with two pipeline stages, which prevents
further speedup. An alternative is to use combinational logic, which is faster
than the LUT and can also be divided into two pipeline stages, allowing
further speedup.

Figure 3.8 Subbyte / inversesubbyte implementation

In the non-LUT method, SubBytes can be implemented by finding the
multiplicative inverse followed by the affine transform. Similarly,
InvSubBytes is implemented by the inverse affine transform followed by the
multiplicative inverse. The multiplicative inverse is common to both; taking
advantage of this, a single structure can be implemented for both SubBytes and
InvSubBytes, as shown in Figure 3.8. The hardware implementation (Xinmiao
Zhang 2004) of SubBytes is shown in Figure 3.9.

Figure 3.9 Hardware implementation of subbytes

The multiplicative inversion in GF(2^8) involved in SubBytes/InvSubBytes is a
hardware-demanding operation; it takes at least 620 gates to implement by
repeated multiplications in GF(2^8). However, the gate count can be reduced
greatly by using composite field arithmetic. In the SubBytes transformation,
using substructure sharing, the isomorphic mapping function can be implemented
by 12 XOR gates with 4 XOR gates in the critical path. Meanwhile, the combined
inverse isomorphic mapping and affine transformation can be implemented by 19
XOR gates, and its critical path also consists of 4 XOR gates. In the
composite field GF((2^4)^2), an element can be expressed as s_h x + s_l, where
s_h, s_l ∈ GF(2^4) and x is a root of P2(x). Using the Extended Euclidean
algorithm, the multiplicative inverse of (s_h x + s_l) modulo P2(x) can be
computed as

(s_h x + s_l)^-1 = s_h Θ x + (s_h + s_l) Θ (3.15)

where Θ = (s_h^2 λ + s_h s_l + s_l^2)^-1 and λ is a constant in GF(2^4) fixed
by the choice of P2(x).

Accordingly, the multiplicative inversion in GF(2^8) can be carried out in
GF((2^4)^2) by the architecture illustrated in Figure 3.10.

The multipliers in GF(2^4) can be further decomposed into multipliers in
GF(2^2) and then GF(2), in which a multiplication is simply an AND operation.

Figure 3.10 Implementations of individual blocks: (a) multiplier in GF(2^4);
(b) multiplier in GF(2^2); (c) squarer in GF(2^4); (d) constant multiplier;
and (e) constant multiplier

Figure 3.10 illustrates this decomposition, together with the other blocks
used in Figure 3.9 except the inversion in GF(2^4) block. A multiplier in
GF(2^4) can be implemented by 21 XOR gates and 9 AND gates, with 4 XOR gates
and 1 AND gate in the critical path (Satoh et al 2003). Table 3.2 summarizes
the gate count and critical path of each block in SubBytes, except the
GF(2^4) inversion block in Figure 3.9.

Table 3.2 Gate counts and critical paths of functional blocks in the
subbytes transformation

Figure 3.11 Implementations of inversion in GF(2^4). (a) Square–multiply
approach. (b) Multiple decomposition approach

The inversion in GF(2^4) can be implemented by different approaches. One is
the square-multiply approach, illustrated in Figure 3.11(a). Another is the
multiple decomposition approach, shown in Figure 3.11(b).

3.7.1 AES Round Transformation (Cipher / InvCipher)

At the start of encryption, the cipher input is copied into the internal State
array. An initial round key is then added, and the State is transformed by
iterating a Round Transformation for a number of cycles. The number of cycles
varies with the key length and block size. There are four functions involved
in the Round Transformation:

i. AddRoundKey

ii. SubByte (invSubByte)

iii. ShiftRows (invShiftRows)

iv. MixColumn (invMixColumn)

The name in brackets is the inverse function used in decryption; the prefix
inv indicates the inverse.

Figure 3.12 AES Round Transformation Algorithm

3.7.1.1 AddRoundKey transformation

In the AddRoundKey transformation, a round key is added to the State by a
bitwise Exclusive-OR (XOR) operation. Figure 3.13 below illustrates
AddRoundKey. This transformation is the same for both encryption and
decryption.
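
As a small illustration, AddRoundKey on a 16-byte block is a byte-wise XOR, sketched below in Python. The function name `add_round_key` is an assumption of this sketch; the operands are the plain text and the 16 cipher key bytes listed in Section 3.6.1, used here as the initial round key.

```python
def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """XOR each State byte with the round key byte at the same position."""
    if len(state) != len(round_key):
        raise ValueError("state and round key must have the same length")
    return bytes(s ^ k for s, k in zip(state, round_key))


plain_text = bytes.fromhex("3243f6a8885a308d313198a2e0370734")
key_bytes = bytes.fromhex("603deb1015ca71be2b73aef0857d7781")

mixed = add_round_key(plain_text, key_bytes)
print(mixed.hex())
# Applying the same round key again restores the input, which is why the
# transformation is identical for encryption and decryption.
assert add_round_key(mixed, key_bytes) == plain_text
```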

Figure 3.13 AddRoundKey Transformation

Example 3.6: AddRoundKey Transformation

Figure 3.14 Example of AddRoundkey Transformation



3.7.1.2 ShiftRows transformation (invShiftRows)

ShiftRows is a cyclic shift operation on each row of the State. In this
operation, the bytes in the first row of the State do not change. The second,
third and fourth rows shift cyclically to the left by one, two and three bytes
respectively, as illustrated in Figure 3.15. The reverse process,
invShiftRows, shifts the same rows cyclically to the right by the same
offsets.
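
The row rotations can be sketched in Python as follows, with the State held as a list of four rows. The function names and the row-list representation are assumptions made for this illustration only.

```python
def shift_rows(state):
    """Rotate row r of a 4x4 State cyclically to the left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state):
    """Rotate row r of a 4x4 State cyclically to the right by r bytes."""
    return [row[-r:] + row[:-r] if r else list(row) for r, row in enumerate(state)]


state = [
    [0x00, 0x01, 0x02, 0x03],
    [0x10, 0x11, 0x12, 0x13],
    [0x20, 0x21, 0x22, 0x23],
    [0x30, 0x31, 0x32, 0x33],
]
shifted = shift_rows(state)
for row in shifted:
    print(" ".join(f"{b:02x}" for b in row))
assert inv_shift_rows(shifted) == state
```

The first row is left unchanged because its rotation offset is zero; the explicit `if r` guard in `inv_shift_rows` avoids the empty slice that `row[-0:]` would otherwise misinterpret.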

Figure 3.15 ShiftRows and invShiftRows Transformation

Example 3.7: ShiftRows Transformation

Figure 3.16 Example of ShiftRows Transformation



3.7.1.3 MixColumn transformation (invMixColumn)

The MixColumn transformation is performed independently on the State, column
by column. Each column is considered as a four-term polynomial over GF(2^8)
and multiplied by

a(x) modulo (x^4 + 1) (3.16)

where a(x) = {03}x^3 + {01}x^2 + {01}x + {02} (3.17)

This transformation can be expressed in matrix form as

| s'0 |   | 02 03 01 01 |   | s0 |
| s'1 | = | 01 02 03 01 | · | s1 |
| s'2 |   | 01 01 02 03 |   | s2 |
| s'3 |   | 03 01 01 02 |   | s3 |

For invMixColumn, a(x) is replaced by its inverse

a^-1(x) = {0B}x^3 + {0D}x^2 + {09}x + {0E} (3.18)
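
A single-column MixColumn and its inverse can be sketched in Python as a multiplication by a fixed four-term polynomial modulo (x^4 + 1). The function name `mix_single_column` and the coefficient lists are assumptions of this sketch; the inverse coefficients follow Equation (3.12). The test column db 13 53 45 is a commonly quoted MixColumn check value and is not taken from this chapter.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


MIX = [0x02, 0x01, 0x01, 0x03]       # [a0, a1, a2, a3] of a(x)
INV_MIX = [0x0E, 0x09, 0x0D, 0x0B]   # [a0, a1, a2, a3] of a^-1(x)


def mix_single_column(column, coeffs):
    """Multiply the column [s0, s1, s2, s3] by the polynomial modulo x^4 + 1.

    Row r of the circulant matrix holds coefficient a_((r - c) mod 4) in
    column c.
    """
    out = []
    for r in range(4):
        acc = 0
        for c in range(4):
            acc ^= gf_mul(coeffs[(r - c) % 4], column[c])
        out.append(acc)
    return out


column = [0xDB, 0x13, 0x53, 0x45]
mixed = mix_single_column(column, MIX)
print(" ".join(f"{b:02x}" for b in mixed))   # 8e 4d a1 bc
assert mix_single_column(mixed, INV_MIX) == column
```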

Figure 3.17 MixColumn and invMixColumn Transformation



Example 3.8: MixColumn Transformation

Figure 3.18 Example of MixColumn Transformation

3.8 PROPOSED ARCHITECTURE

The basic VLSI architecture of the four AES transformations is modified to
increase throughput and reliability. The overall throughput is increased by
the pipelining registers placed at the end of each round, as shown in
Figure 3.20, and the architecture speed is increased further by introducing
sub-pipelining between the transformations within each round, as shown in
Figure 3.21. Loop unrolling has been followed to reduce the power and increase
the speed.

Figure 3.19 (a) Existing and (b) Proposed reconfigurable AES structure

3.9 PIPELINED AES

In computing, a pipeline is a set of data processing elements connected in
series, so that the output of one element is the input of the next one. The
elements of a pipeline are often executed in parallel or in time-sliced
fashion; in that case, some amount of buffer storage is often inserted between
elements.

[Block diagram: INPUT passes through ROUND 1 to ROUND 10 to OUTPUT, with KEY feeding the KEY EXPANSION UNIT and a common CLK]

Figure 3.20 AES encryption with pipelining

Pipelining does not decrease the time taken to process a single data block; it
only increases the throughput of the system when processing a stream of data.

A pipelined system typically requires more resources (circuit elements,
processing units, computer memory, etc.) than one that executes one batch at a
time, because its stages cannot reuse the resources of a previous stage.
Moreover, pipelining may increase the time it takes for an individual
operation to finish.

The AES encryption for the pipelined design is shown in Figure 3.20. Here
pipeline registers are inserted between every round so as to increase the
throughput. As this is a pipelined structure, it takes 10 clock cycles to
obtain the first output, and each subsequent output is available on the next
clock cycle.

3.10 SUB PIPELINED AES

Similar to pipelining, sub-pipelining is implemented by inserting registers in
the combinational logic, but the registers are inserted both between and
inside each round. By using pipelining and sub-pipelining, multiple blocks of
data can be processed simultaneously.

Figure 3.21 Sub pipelining architecture

Among these architectural optimizations, sub-pipelining gives maximum speed
and better throughput. Figure 3.21 shows the sub-pipelined architecture with r
sub-stages. Each round unit is divided into r sub-stages with equal delays.
Figure 3.22 shows the sub-pipelining cutsets for a single round unit in
encryption mode. Here every round is divided equally into three parts, and
registers are inserted to obtain higher throughput.

Figure 3.22 Sub pipelining cutsets in a single round encryption

In the LUT method, sub-pipelining is limited to only two sub-stages, whereas
combinational logic can be divided into more sub-stages with equal delays. In
these pipelined and sub-pipelined architectures, the plain text is received at
each clock cycle through the input register. The number of clock cycles needed
to complete a single round depends on the number of sub-stages. Round keys are
generated by the key expansion module and supplied to each round. At each
clock cycle the data is shifted to the next stage, and the first output
appears only after the end of the ((10*r)+10)th clock cycle, where ‘r’ is the
number of sub-pipeline stages. The advantage of this structure is that the
second output is obtained in the clock cycle immediately after the first.
Internally, each round contains SubBytes, ShiftRows, MixColumns and
AddRoundKey, which are explained in the previous sections. Here, three-stage
sub-pipelining is used in every round together with outer round pipelining, so
the first output has an initial latency of 40 clock cycles and subsequent
outputs are collected on each following clock cycle.
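
The latency figures quoted above follow directly from the expression (10*r)+10 and can be tabulated with a short Python sketch. The function names are assumptions of this illustration, and treating r = 0 as the round-level pipeline without sub-stages is an interpretation made here so that the 10-cycle figure of the pipelined design is reproduced.

```python
ROUNDS = 10  # round units in the unrolled AES-128 pipeline


def first_output_latency(sub_stages: int) -> int:
    """Clock cycles before the first block emerges: (10 * r) + 10."""
    return ROUNDS * sub_stages + ROUNDS


def total_cycles(blocks: int, sub_stages: int) -> int:
    """Cycles to finish `blocks` blocks when, after the initial latency,
    one result emerges on every clock cycle."""
    return first_output_latency(sub_stages) + (blocks - 1)


print(first_output_latency(0))   # 10: round-level pipelining only (assumed r = 0)
print(first_output_latency(3))   # 40: three sub-stages per round
print(total_cycles(1000, 3))     # 1039
```

For long data streams the initial latency becomes negligible against the block count, which is why the deeper sub-pipelined design can still deliver higher throughput despite its longer first-output delay.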

3.11 SYSTEM LEVEL MODELING

With SoC design considerations in mind, the AES-128 processor core was first
modeled at system level, which eases the integration of AES-128 with other
cores in SoC designs.

At system level, the AES-128 processor core is viewed as a black box; the only
design consideration is to identify the primary signals through which this
core communicates with the other cores. From the preceding sections of this
chapter, it is clear that the AES-128 core must have the following data ports:

i. dataIn - the plaintext to be encrypted or ciphertext to be decrypted

ii. CipherKey - the secret key to encrypt/decrypt the data

iii. dataOut - the encrypted plaintext or decrypted ciphertext

In addition, there must be input signals to select the mode of crypto
operation (encrypt/decrypt), as well as output signals to indicate the
completion of the process.

Lastly, as this processor core is a synchronous design, it requires an input
clock as well as a system reset. Figure 3.23 shows the system level model of
the AES-128 core that fulfils the above requirements:

[Block diagram: AES-128 CRYPTO core with signals CLK, RST, ENC, KEY_I, DATA_I and DATA_OUT]

Figure 3.23 System Level Modeling of the AES-128 Processing Core

3.11.1 Input and Output Signals

Table 3.3 lists the primary input and output signals for the AES-128
core, which are essential to select the AES specification, operation mode, data
/ key input as well as generated output.

Table 3.3 Input and output signals

Signal Name   Width (bit)   Type     Description
Clk           1             Input    Processor main clock signal
Rst           1             Input    Processor main reset signal
                                     0 – normal operation; 1 – System reset
Enc           1             Input    Processor mode of operation signal
                                     0 – Decryption mode; 1 – Encryption mode
Key_in        128           Input    The Secret Key to be used by AES Key
                                     Expander to expand all round keys.
Data_in       128           Input    The initial data block to be encrypted or
                                     decrypted
Data_out      128           Output   Final result of AES transformation

3.12 FAULT TOLERANT XOR GATE

There are five major components which decide the throughput, area and power of
AES encryption and decryption: inverters, 2:1 multiplexers, XOR gates, D
flip-flops, and totally self-checking two-rail checkers. All of these
components must be fault free and must produce two-rail outputs for a valid
two-rail input. When the inputs are valid the output is valid and correct, and
when an input is non-valid the output is non-valid. The truth table in
Table 3.4 shows that, for every possible stuck-at fault, there is an input set
that yields a non-valid output; hence, the XOR cell is totally self-checking
for all single stuck-at faults and non-valid inputs. Pseudo-nMOS technology is
not normally used, because its static power consumption is higher than that of
CMOS technology, but it is preferred here because the devices are fast and the
short between power and ground makes the output predictable in the presence of
a fault, as shown in Figure 3.24.

Figure 3.24 Fault Tolerant XOR gate

Figure 3.25 Functional waveform of XOR gate



Figure 3.26 Functional waveform of 2:1 MUX gate

Table 3.4 Truth Table for Fault Detection

Sl.No  Inputs          A  A’  B  B’  A XOR B  A XNOR B
       Valid inputs    0  1   0  1   0        1
                       0  1   1  0   1        0
                       1  0   0  1   1        0
                       1  0   1  0   0        1
       Non-valid       0  0   X  X   1        1
       inputs          1  1   X  X   0        0
                       X  X   0  0   1        1
                       X  X   1  1   0        0
       Faults
1      Stuck at ‘0’    X  X   0  1   1        1
       Stuck at ‘1’    X  X   1  0   0        0
2      Stuck at ‘0’    X  X   1  0   1        1
       Stuck at ‘1’    X  X   0  1   0        0
3      Stuck at ‘0’    0  1   0  1   1        1
       Stuck at ‘1’    1  0   0  1   0        0
4      Stuck at ‘0’    1  0   0  1   1        1
       Stuck at ‘1’    0  1   0  1   0        0
5      Stuck at ‘0’    1  0   1  0   1        1
       Stuck at ‘1’    0  1   1  0   0        0
6      Stuck at ‘0’    0  1   1  0   1        1
       Stuck at ‘1’    1  0   1  0   0        0
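
The fault-free rows of Table 3.4 can be captured in a small behavioural model in Python. This sketch is only a reading of the table, not a model of the pseudo-nMOS circuit: the function names are hypothetical, the rule that a non-valid pair forces both outputs to the complement of the repeated value is inferred from the four non-valid rows, and giving input A priority when both pairs are non-valid is an assumption made here.

```python
def is_valid(rail: int, rail_n: int) -> bool:
    """A two-rail pair is valid when the two rails are complementary."""
    return rail != rail_n


def two_rail_xor(a: int, a_n: int, b: int, b_n: int):
    """Return (A XOR B, A XNOR B) as listed in the fault-free rows of Table 3.4."""
    if not is_valid(a, a_n):      # non-valid A pair: outputs are equal
        return (1 - a, 1 - a)
    if not is_valid(b, b_n):      # non-valid B pair: outputs are equal
        return (1 - b, 1 - b)
    x = a ^ b
    return (x, 1 - x)             # valid inputs: complementary outputs


for inputs in [(0, 1, 0, 1), (0, 1, 1, 0), (0, 0, 0, 1), (1, 0, 1, 1)]:
    outputs = two_rail_xor(*inputs)
    status = "valid" if is_valid(*outputs) else "non-valid"
    print(inputs, "->", outputs, status)
```

A two-rail checker placed after the cell only needs the `is_valid` test on the output pair: equal rails signal either a non-valid input or, as in the fault rows of the table, a stuck-at fault inside the cell.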

Table 3.5 Delay and Power report of the gates used in Fault Detection

Gate       Path             Delay ( ps)    Power (Watts)
XOR        A to xor o/p     9.994E-9       95.652 X 10^-5
           B to xor o/p     14.99E-9       (single value listed for
XNOR       A to xnor o/p    -4.965E-9      XOR and XNOR)
           B to xnor o/p    -9.965E-9
DFF        D to Q           20.0E-9        26.44 X 10^-5
           Clk to Q         4.706E-9
MUX 2:1    A to Y           19.35E-12      89.208 X 10^-5
           B to Y           19.35E-12

3.13 TOOLS USED

1. For simulating the Verilog code, ModelSim Altera 6.3g is used.

2. For synthesizing the design, Cadence RTL Compiler v9.10 is used.

3. Cadence SoC Encounter is used for layout extraction of the complete design.

4. Xilinx ISE Design Suite 12.1 is used for FPGA implementation.

Figure 3.27 Block diagram for efficient AES architecture

3.14 RESULTS AND COMPARISON

This section discusses the ASIC and FPGA implementation methodologies. In the
work proposed and reported in this thesis, two types of design are
implemented: one using a pipelined architecture and the other using a
sub-pipelined architecture. In the pipelined architecture the SubBytes
sub-module is implemented by the LUT method, whereas in the sub-pipelined
architecture SubBytes is implemented by the combinational method to reduce the
area requirements.

For the VLSI (hardware) implementation, two different methodologies were used,
namely ASIC design and FPGA design. For the ASIC design, the architecture is
modeled in Verilog HDL, functional simulation is done in ModelSim, synthesis
is carried out in Cadence RTL Compiler and the physical design is carried out
in Cadence SoC Encounter.

For the FPGA design, the architecture is modeled in Verilog HDL, functional
simulation is done in ModelSim, synthesis is carried out in Xilinx ISE Design
Suite 12.1 and the target board is the Xilinx XC5VLX110T-1. The following
sections describe the implementation methodology and the results obtained for
ASIC and FPGA.

3.14.1 ASIC Design Methodology

Application Specific Integrated Circuit (ASIC) design, as the name suggests,
focuses on the development of a hardware module that is completely dedicated
to a particular application or process. This type of design makes economical
use of silicon and also achieves good speed compared to other implementations
such as FPGA and CPLD devices. The ASIC design flow is shown in Figure 3.28,
and each step is discussed in the following sections.

[Flow chart stages: System Partitioning, Design Entry, Simulation, Synthesis, Floor Plan, Placement, Routing, Extraction, with Layout Simulation]

Figure 3.28 ASIC Design Flow

3.14.1.1 Simulation results

Subbytes

Figure 3.29 Subbytes - simulation result



Analysis: The above waveform shows the simulation results of the Subbytes module. Here signal ‘in’ is the 8-bit input of the module and signal ‘out’ is the 8-bit output. In the Subbytes operation, each 8-bit value is substituted with another 8-bit value with the help of lookup tables.
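
As a software cross-check for the lookup table contents, the substitution can be reproduced from its mathematical definition. The following Python sketch is an illustration only, not the Verilog code of this work: it assumes the standard AES S-box definition (multiplicative inverse in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1, followed by the affine transformation), and the names gf_mul, gf_inv, rotl8 and sbox are hypothetical helpers.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p


def gf_inv(a: int) -> int:
    """Multiplicative inverse computed as a^254; 0 maps to 0 by convention."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n positions."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(x: int) -> int:
    """Subbytes for one byte: field inverse followed by the affine map."""
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


# The 256-entry table that a LUT-based Subbytes module would store.
SBOX = [sbox(x) for x in range(256)]
```

A table generated this way can be compared entry by entry with the values observed on ‘out’ for each value of ‘in’ in the waveform.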

Invsubbytes

Figure 3.30 Invsubbytes - simulation results

Analysis: The above waveform shows the simulation results of the Invsubbytes module. Here signal ‘a’ is the 8-bit input of the module and signal ‘d’ is the 8-bit output. This is the inverse of the Subbytes operation. In the Invsubbytes operation, each 8-bit value is substituted with another 8-bit value with the help of lookup tables.
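
The inverse substitution can be checked in the same spirit. The sketch below is an assumption-labelled illustration in Python rather than the module itself: it applies the standard inverse affine transformation first and then the GF(2^8) inverse, and the helper names (gf_mul, gf_inv, rotl8, inv_sbox) are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p


def gf_inv(a: int) -> int:
    """Multiplicative inverse computed as a^254; 0 maps to 0."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n positions."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def inv_sbox(s: int) -> int:
    """Invsubbytes for one byte: inverse affine map, then field inverse."""
    b = rotl8(s, 1) ^ rotl8(s, 3) ^ rotl8(s, 6) ^ 0x05
    return gf_inv(b)


# The 256-entry table that a LUT-based Invsubbytes module would store.
INV_SBOX = [inv_sbox(s) for s in range(256)]
```

Feeding a Subbytes output back through inv_sbox should return the original byte, which is the property the two waveforms are expected to exhibit together.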

Key Expansion

Analysis: The waveform in Figure 3.31 shows the simulation results of the key expansion module. Here signal ‘key’ is the 128-bit input and signals ‘w0’ to ‘w43’ are the 32-bit output words. Words w0 to w3 hold the input key itself, and from it a total of 10 round keys (w4 to w43) are generated, one for each round operation.
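
The word sequence w0 to w43 can be reproduced in software to compare against the waveform. The following Python sketch assumes the standard AES-128 key schedule (RotWord, SubWord and round constant applied to every fourth word); it is not taken from the Verilog source, and the names build_sbox and expand_key are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p


def build_sbox() -> list:
    """Generate the AES S-box from the field inverse and affine map."""
    table = []
    for x in range(256):
        inv = 1
        for _ in range(254):          # x^254 is the inverse; 0 stays 0
            inv = gf_mul(inv, x)
        rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
        table.append(inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3)
                     ^ rot(inv, 4) ^ 0x63)
    return table


SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def expand_key(key: bytes) -> list:
    """Return the 44 32-bit words w[0..43] for a 128-bit key."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    for i in range(4, 44):
        t = w[i - 1]
        if i % 4 == 0:
            t = ((t << 8) | (t >> 24)) & 0xFFFFFFFF           # RotWord
            t = int.from_bytes(bytes(SBOX[b] for b in t.to_bytes(4, "big")),
                               "big")                          # SubWord
            t ^= RCON[i // 4 - 1] << 24                        # round constant
        w.append(w[i - 4] ^ t)
    return w


words = expand_key(bytes.fromhex("000102030405060708090a0b0c0d0e0f"))
for r in range(11):
    # Words 4r..4r+3 form the key used in round r (r = 0 is the input key).
    print(r, "".join(f"{x:08x}" for x in words[4 * r:4 * r + 4]))
```

The printed rows can be matched against the groups w0..w3, w4..w7 and so on in the simulation.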

Figure 3.31 Key expansion - simulation result



Single round Encryption operation

Figure 3.32 Single round Encryption operation – simulation result

Analysis: The above waveform shows the simulation results of the single round encryption operation. Here signal ‘round_in’ is the 128-bit input, ‘w0, w1, w2, w3’ are together treated as the round key, and ‘round_out’ is the 128-bit output.

Single round Decryption operation

Figure 3.33 Single round decryption operation – simulation results

Analysis: The above waveform shows the simulation results of the single round decryption operation. Here signal ‘round_in’ is the 128-bit input, ‘w0, w1, w2, w3’ are together treated as the round key, and ‘round_out’ is the 128-bit output.

Encryption operation

Figure 3.34 Encryption operation – simulation results

Analysis: The above figure shows the simulation results of the encryption operation. Signals clk, key, enc and in are the inputs, and ‘out’ is the output signal. As this is a pipelined design, an input can be applied at every clock cycle and outputs can be taken continuously from the 11th clock cycle onwards.

Key = 128'h000102030405060708090a0b0c0d0e0f; enc = 1;

Input1 = 128'h00112233445566778899aabbccddeeff;

Input2 = 128'h10112233445566778899aabbccddeeff;

Output1 = 128'h69c4e0d86a7b0430d8cdb78070b4c55a;

Output2 = 128'h0761adfd2febd4d105b1ac2ff88171b3;
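
The Key and Input1 values above are the AES-128 example vector of FIPS-197 (Appendix C.1), so Output1 can be checked against an independent software model. The Python sketch below is a plain reference model written under the assumption of standard AES-128 encryption; it does not mirror the pipelined hardware structure, and all function names are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p


def build_sbox() -> list:
    """Generate the AES S-box from the field inverse and affine map."""
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    table = []
    for x in range(256):
        inv = 1
        for _ in range(254):          # x^254 is the inverse; 0 stays 0
            inv = gf_mul(inv, x)
        table.append(inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3)
                     ^ rot(inv, 4) ^ 0x63)
    return table


SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def round_keys(key: bytes) -> list:
    """Return 11 round keys, each a list of 16 bytes."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    for i in range(4, 44):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = t[1:] + t[:1]                 # RotWord
            t = [SBOX[b] for b in t]          # SubWord
            t[0] ^= RCON[i // 4 - 1]          # round constant
        w.append([x ^ y for x, y in zip(w[i - 4], t)])
    return [sum(w[4 * r:4 * r + 4], []) for r in range(11)]


def mix_columns(s: list) -> list:
    """Multiply each 4-byte column by the fixed MixColumns matrix."""
    out = []
    for c in range(0, 16, 4):
        a = s[c:c + 4]
        for r in range(4):
            out.append(gf_mul(a[r], 2) ^ gf_mul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return out


def encrypt_block(key: bytes, block: bytes) -> bytes:
    """Encrypt one 128-bit block; the state is stored column by column."""
    rk = round_keys(key)
    s = [b ^ k for b, k in zip(block, rk[0])]                # initial AddRoundKey
    for rnd in range(1, 11):
        s = [SBOX[b] for b in s]                             # Subbytes
        s = [s[(i + 4 * (i % 4)) % 16] for i in range(16)]   # ShiftRows
        if rnd != 10:
            s = mix_columns(s)                               # skipped in last round
        s = [b ^ k for b, k in zip(s, rk[rnd])]              # AddRoundKey
    return bytes(s)


KEY = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
INPUT1 = bytes.fromhex("00112233445566778899aabbccddeeff")
print(encrypt_block(KEY, INPUT1).hex())
```

The same function can be applied to Input2 to cross-check Output2 independently of the simulator.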

Decryption operation

Figure 3.35 Decryption operation – simulation results



Analysis: The above figure shows the simulation results of the decryption operation. Signals clk, key, enc and in are the inputs, and ‘out’ is the output signal. As this is a pipelined design, an input can be applied at every clock cycle and outputs can be taken continuously from the 11th clock cycle onwards.

Key = 128'h000102030405060708090a0b0c0d0e0f; enc = 0;

Input1 = 128'h0761adfd2febd4d105b1ac2ff88171b3;

Input2 = 128'h69c4e0d86a7b0430d8cdb78070b4c55a;

Output1 = 128'h10112233445566778899aabbccddeeff;

Output2 = 128'h00112233445566778899aabbccddeeff;

Schematic obtained in Cadence

Figure 3.36 Schematic of AES



3.14.1.2 Synthesis results

This section presents the synthesis reports obtained for the design in Cadence RTL Compiler.

1. Area Report of pipelined design

2. Area Report of sub-pipelined design

3. Power Report of pipelined design



4. Power Report of sub-pipelined design

5. Timing Report of pipelined design

6. Timing Report of sub-pipelined design

3.14.1.3 ASIC synthesis summary

Table 3.6 Synthesis results (ASIC)

Design               AES (Lookup Table)   Proposed AES (Sub-pipelining)   Proposed AES (Sub-pipelining)
Technology           90 nm                90 nm                           180 nm
Area (µm²)           740870               564036                          2258469
Power (mW)           136.995              147.78                          655.5
Critical path        3.9 ns               2.2 ns                          4.2 ns
Fmax (MHz)           256.4                454.5                           238
Throughput (Gbps)    32.82                58.18                           30.47
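
The Fmax and throughput rows of Table 3.6 are consistent with the critical path if one 128-bit block is assumed to complete on every clock cycle once the pipeline is full. The short Python check below works under that assumption; the function names are hypothetical and the inputs are the critical path values from the table.

```python
def fmax_mhz(critical_path_ns: float) -> float:
    """Maximum clock frequency in MHz for a given critical path in ns."""
    return 1e3 / critical_path_ns


def throughput_gbps(critical_path_ns: float, block_bits: int = 128) -> float:
    """Throughput in Gbps, assuming one block per clock cycle (bits per ns)."""
    return block_bits / critical_path_ns


for name, path_ns in [("LUT, 90 nm", 3.9),
                      ("Sub-pipelined, 90 nm", 2.2),
                      ("Sub-pipelined, 180 nm", 4.2)]:
    print(f"{name}: {fmax_mhz(path_ns):.1f} MHz, "
          f"{throughput_gbps(path_ns):.2f} Gbps")
```

The values computed this way agree with the table to within rounding of the last digit.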

3.14.2 FPGA Methodology

Before developing the ASIC, AES was prototyped and validated on FPGA. For this purpose, AES was described in Verilog Hardware Description Language (Verilog HDL) at the Register Transfer Level (RTL), placed and routed with Xilinx ISE, and validated on a Xilinx XC5VLX110T-1 FPGA.

The FPGA hardware implementation is performed in the following way:

1. The AES Verilog code (RTL) is synthesized in Xilinx ISE Design Suite 12.1 for the Xilinx XC5VLX110T-1 FPGA.

2. The synthesis netlist is placed and routed by Xilinx ISE.

3. The bit file is generated by Xilinx ISE.

4. The bit file is downloaded into the XC5VLX110T-1 FPGA.

5. The output is verified on the monitor with the help of the ChipScope Pro Analyzer software.

3.14.2.1 FPGA results

1. Validating design on XC5VLX110T

Figure 3.37 FPGA validation screenshot



Analysis: The above figure shows the validation of the AES processor on the XC5VLX110T-1 FPGA using the ChipScope Pro Analyzer. Because of the limited number of switches and LEDs available on FPGA boards, it is necessary to use the ChipScope Pro Analyzer. Here signals ‘SyncIn’, ‘AsyncOut’ and ‘AsyncOut1’ are the decrypted output, the input text and the key respectively.

2. Device utilization summary

Table 3.7 Device utilization summary of pipelined architecture.

Selected device: xc5vlx110t-1

Number of slices             4611
Number of slice flip-flops   1096
Number of LUTs               14358
Number of BRAMs              60
Number of bonded IOBs        386
Maximum frequency            103.42 MHz

Table 3.8 Device utilization summary of sub-pipelined architecture.

Selected device: xc5vlx110t-1

Number of slices             8896
Number of slice flip-flops   12409
Number of LUTs               26808
Number of BRAMs              0
Number of bonded IOBs        386
Maximum frequency            202.26 MHz

3.15 SUMMARY AND CONCLUSION

The hardware implementation of the efficient pipelined AES architecture with reconfigurability includes both the encryption and decryption processes. The sub-pipelined architecture gives higher throughput than earlier implementations. The proposed VLSI architecture is enhanced with fault-detectable basic gates used for the cryptographic architecture. In most previous works the Subbytes implementation uses the lookup table method, but in the proposed architecture both the lookup table and the combinational logic methods are used. Compared with the lookup table method, the combinational method occupies less area. Furthermore, combinational logic makes it possible to apply inner-round pipelining (sub-pipelining) efficiently.

The design is modeled in Verilog HDL and simulated with ModelSim and Cadence NCSim. Synthesis is done with RTL Compiler and the physical design with SoC Encounter. The transistor-level design is done in Cadence ADE and simulated using Spectre. The proposed architecture achieves a throughput of 32.32 Gbps with the 180 nm TSMC technology library. The design has also been targeted to FPGA, where it achieved a throughput of 31.9 Gbps on the Xilinx xc5vlx110t-1 device, which is faster and more effective than the fastest previous FPGA implementations known to date.
