
AES block cipher implementations with AMBA-AHB interface
Paola Ceminari          Ariel Arelovich       Martín Di Federico
UNS / INTI - CMNB       INTI - CMNB           INTI - CMNB
Bahía Blanca            Bahía Blanca          Bahía Blanca
pceminari@inti.gob.ar   ariela@inti.gob.ar    martind@inti.gob.ar

978-1-5090-3963-0/17/$31.00 © 2017 IEEE

Abstract—This work describes three architectural designs for the AES cipher, a symmetric block encryption standard. The three architectures are oriented to different applications and are designed using different approaches, such as pipeline structures and resource sharing. They also include an AMBA AHB interface, an open standard that defines the interconnection of blocks in a System-on-Chip (SoC).

I. INTRODUCTION

The Advanced Encryption Standard (AES) resulted from an initiative carried out by the National Institute of Standards and Technology (NIST) in 2000 [1]. The algorithm selection was an open process, and the winning algorithm was Rijndael [2], a substitution-permutation block cipher. Originally, AES was intended to protect sensitive information in United States governmental institutions, but some years later it became a global de facto standard. One of the main features of AES is that it can be implemented efficiently on different hardware and software platforms [3] [4] [5].

This paper presents three architectures for the AES cipher. They represent solutions for different applications and include a standard communication protocol that allows their incorporation in more complex systems. The paper is organized as follows: the AES algorithm is described in Section II, considering all the transformations that take place during the encryption and key expansion processes. Section III presents the design strategies for each architecture. Section IV details the AMBA AHB protocol and its interface with the AES ciphers. Finally, Sections V and VI show simulation and synthesis results.

II. AES ALGORITHM DESCRIPTION

AES operates on 128-bit data blocks, while the secret key size can be 128, 192 or 256 bits. The encryption process consists of the iterated application of invertible transforms, called rounds, as shown in Fig. 1. First, an initial round is performed, followed by Nr − 1 iterations of a general round and one iteration of a final round. The number of round iterations, Nr, depends on the secret key length: Nr = 10 for a 128-bit key, Nr = 12 for a 192-bit key and Nr = 14 for a 256-bit key. Each round requires a sub-key (also called round key), which is generated from the secret key by an expansion algorithm. The initial round consists of an XOR operation between the plain text block and the first round key, while general rounds consist of four transformations called SubBytes (SB), ShiftRows (SR), MixColumns (MC) and AddRoundKey (ARK).

The final round is similar to the general rounds, except that it does not include the MixColumns transformation. All input and output data blocks, as well as the intermediate results during encryption, can be visualized as a 4 × 4 matrix, called the state matrix, whose elements correspond to the sixteen bytes of a 128-bit data block. The next sections describe the transformations used during encryption.

Fig. 1. AES encryption diagram.

A. SubBytes

This transformation corresponds to the substitution layer in the algorithm and is the only nonlinear operation in the AES standard. It consists of the application of a transformation
(called S-BOX) to each of the state matrix elements. Mathematically, the S-BOX is an inversion in the Galois field $GF(2^8)$, using the irreducible polynomial $x^8 + x^4 + x^3 + x + 1$, followed by an affine mapping. This mapping increases the complexity of the transformation and avoids fixed and opposite fixed points, i.e., no byte is mapped to itself or to its bitwise complement.

The S-BOX transformation can be implemented by arithmetic operations in finite fields [6] [7], or by lookup tables (LUTs).

B. ShiftRows

This transformation consists of a shift of the state matrix rows. The first row is not shifted, the second row is shifted circularly one byte to the left, the third is shifted circularly two bytes to the left, and the last one is shifted circularly three bytes to the left.

C. MixColumns

This transformation is a linear operation that is applied independently to each column of the state matrix. Mathematically, it corresponds to a multiplication between polynomials over the finite field $GF(2^8)$. Each column is considered as a polynomial with coefficients in $GF(2^8)$ and is multiplied, modulo the polynomial $x^4 + 1$, by a constant polynomial defined in the standard as $03x^3 + 01x^2 + 01x + 02$. Equation 1 shows this operation for one column. A prime (′) is used to represent a state matrix element after the transformation is applied.

$$
\begin{bmatrix} S'_{0,j} \\ S'_{1,j} \\ S'_{2,j} \\ S'_{3,j} \end{bmatrix}
=
\begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix}
\times
\begin{bmatrix} S_{0,j} \\ S_{1,j} \\ S_{2,j} \\ S_{3,j} \end{bmatrix}
\qquad (1)
$$

D. Key Expansion

A recursive algorithm is used to obtain the sub-keys needed during encryption. This algorithm depends on the secret key size. For a 128-bit secret key, the first sub-key is identical to the secret key. Then, every remaining sub-key word is computed by the algorithm shown in Eqs. 2 and 3. Function g() is nonlinear and consists of a circular shift followed by a substitution, using the same S-BOX as in the encryption process, and an XOR operation between the most significant byte and an 8-bit constant, called RC, whose value is different for each iteration.

$$ W[4i] = W[4(i-1)] + g(W[4i-1]) \qquad (2) $$

$$ W[4i+j] = W[4i+j-1] + W[4(i-1)+j] \qquad (3) $$

Here W[k] denotes the k-th 32-bit word of the expanded key, + denotes bitwise XOR, and j = 1, 2, 3.

Independently of the expansion algorithm, there are basically two ways to handle round keys. In the first one, called precompute, all the sub-keys are generated before the encryption process starts. In contrast, the on-the-fly approach generates the sub-keys as they are required. The architectures presented in this work are based on a 128-bit secret key with a precompute approach, because it allows parallelism in the encryption process without the need for parallelism in the key expansion.

E. Operation modes

The operation modes define how a plain text whose length is longer than the block size is sent to the cipher. The most used operation modes are detailed in recommendations issued by NIST [8]. The basic modes are Electronic Codebook (ECB), Cipher Block Chaining (CBC), Cipher Feedback (CFB), Output Feedback (OFB) and Counter (CTR). The architectures presented in this work are designed for ECB mode, in which the plain text is divided into 128-bit segments that are encrypted independently. This operation mode is weak from a cryptographic point of view, because it preserves the statistical properties of the plain text. However, it is implemented in this work because it is the basis for the other operation modes and, as it is not a feedback mode, it admits the encryption of multiple data blocks simultaneously (parallelism).

III. DESIGN

The architectures developed in this work have a modular structure that consists of three blocks: key expansion, encryption and control. The expansion block generates round keys from the secret key, following the expansion algorithm described in the standard. The control block, on the other hand, consists of an FSM that manages the communication between the expansion and encryption blocks, avoiding errors and ensuring that data encryption starts once the key expansion is completed. The control block also signals when the module can receive a new plain text block and when the output data is valid. The three architectures presented are called basic, pipeline and compact. The difference between them lies in the implementation of their blocks, mainly the encryption one.

A. Basic architecture

All the internal buses in this architecture are 128 bits wide, and one round is executed per clock cycle. Fig. 2 shows a diagram in which it can be seen that the encryption block consists of the hardware implementation of one general round, whose output is fed back to the input through a register. The general round consists of sixteen instances of the LUT that implements the SubBytes transformation (one for each byte), four instances of the MixColumns transformation (one for each word), and sixteen 8-bit XORs that implement AddRoundKey. The ShiftRows transformation is carried out by the proper addressing between the SubBytes output and the MixColumns input. The multiplexer is used to differentiate between a general round and a final round.

Fig. 2. Encryption block implementation in basic architecture.

Because the sub-keys are pre-calculated, there are no strong constraints on the delay of the key expansion. In the basic
architecture, the key expansion block calculates one sub-key per clock cycle, resulting in a trade-off between area and throughput. The resulting block is shown in Fig. 3. Once this block calculates the keys, it sends them to the encryption block, where the keys are stored in an array.

Fig. 3. Key expansion implementation in basic architecture.

B. Pipeline architecture

The main goal of this architecture is to obtain a higher throughput than the basic module. This is accomplished by carrying out multiple rounds simultaneously. The encryption block consists of nine general rounds and a final round, as shown in Fig. 4. Every general round has sixteen SubBytes and AddRoundKey instances, as well as four MixColumns instances. Rounds are connected by registers, allowing the processing of ten data blocks at the same time (fully-pipelined structure). The throughput is maximum in this architecture once its internal structure is complete, obtaining one cipher text block at the output at each clock cycle. The key expansion block in this architecture is identical to the one presented for the basic architecture.

Fig. 4. Encryption block implementation in pipeline architecture.

C. Compact architecture

The goal of the compact architecture is to reduce the hardware resources needed to encrypt data. It has 32-bit internal buses, so 128-bit data enters the module as four 32-bit blocks. The intermediate results during encryption are stored in four memory blocks, one for each state matrix row [9]. The encryption block also has four SubBytes and AddRoundKey instances and one MixColumns instance, as shown in Fig. 5. The memory blocks are divided into two banks, one for round input data and the other for round output data. The encryption is carried out by reading a word from the input bank and processing the read data with the combinatorial logic that implements one round. The result is stored in the output bank in the next clock cycle. When a round ends, the input and output banks interchange their roles before the next round begins. The ShiftRows transformation in this architecture is carried out by the proper addressing when reading the memories. Each color in Fig. 5 identifies the bytes read and written in both memory banks at each clock cycle during one round.

Fig. 5. Encryption block implementation in compact architecture.

In the compact architecture, the key expansion block also has 32-bit internal buses [9]. One sub-key word is calculated per clock cycle, as shown in Fig. 6. Each calculated word is sent to the encryption block, where round keys are stored in four memory blocks (one for each key matrix row). These blocks are accessed every time the AddRoundKey transformation is carried out.

Fig. 6. Key expansion implementation in compact architecture.

IV. AMBA INTERFACE

The AMBA AHB interface was implemented by wrappers that can be added to the proposed AES architectures. Wrappers for the basic and compact architectures have a 32-bit datapath (HRDATA and HWDATA buses in AMBA). For pipeline, 128-bit buses were implemented. This bus width is accepted by the AMBA AHB specification and supports sending one plain text block per clock cycle to the cipher. If narrower data paths are used in the pipeline architecture, its maximum throughput can never be reached.

The AHB Master sends data (plain text and secret key) to the cipher through registers. In the 32-bit datapath wrappers, these registers must be written by the AHB Master in a specific order (from most significant word to least significant word), because the data is transferred to the AES cipher once the least significant word register is written.

In basic and compact architectures, the HREADYOUT signal is set to a low value while the encryption or key expansion
processes are being carried out, indicating that they cannot receive new data. In the pipeline architecture, HREADYOUT is set to a low value only while the key expansion process is being carried out, because its pipeline structure encrypts multiple data blocks simultaneously.

Once the encryption process finishes, the three architectures store the resulting cipher text in registers (four 32-bit registers for basic and compact and one 128-bit register for pipeline). These registers should be read by the AHB Master before they are written with a new cipher text block. Because an AMBA AHB Master cannot write and read Slaves at the same time, to obtain maximum throughput in pipeline it may be necessary to store cipher text blocks in a FIFO or another memory structure to avoid data loss.

V. SIMULATION RESULTS

The proposed architectures were simulated using Xilinx ISim 14.6. The test bench acts as an AHB Master that has access to text files containing 1000 plain text blocks and their 1000 corresponding cipher text blocks, obtained from a reference model developed in C. The master sends the secret key and, once the key expansion process ends, it sends plain text blocks when the cipher is ready to receive them. When the encryption process ends, the test bench compares the value returned by the cipher with the corresponding cipher text in the file.

Latency and throughput values were documented for each architecture. The results are shown in Table I.

TABLE I
SIMULATION RESULTS
Architecture   Latency      Throughput
Basic          11 × Tclk    128 bits / (12 × Tclk)
Pipeline       12 × Tclk    128 bits / Tclk
Compact        51 × Tclk    128 bits / (55 × Tclk)

The increase in latency for the pipeline architecture when compared with basic is due to the presence of a register between the initial and first general round in pipeline, while in the basic architecture the initial and first general round are executed in the same clock cycle. It can also be seen that the latency value for the compact architecture is approximately five times higher than that of the other two, when four was expected. The extra cycle corresponds to a wait state in which the memory banks are interchanged.

VI. SYNTHESIS RESULTS

The proposed architectures were synthesized for two different platforms: FPGA and ASIC. The logic synthesis was carried out with the Xilinx ISE 14.6 and Synopsys Design Compiler tools. The target technologies were the Spartan 6 XC6SLX45 FPGA and the Tower TSL 0.18 µm library, respectively. Speed and area results are shown in Table II.

TABLE II
SYNTHESIS RESULTS
Spartan 6 XC6SLX45
Architecture   Fclk          Slices   LUTs   Registers
Basic AHB      125 MHz       962      2238   2495
Pipeline AHB   166.6 MHz     2633     8211   3404
Compact AHB    113.6 MHz     404      1008   446
Design Compiler
Architecture   Fclk          Total Area
Basic AHB      333 MHz       51456.08
Pipeline AHB   357.14 MHz    177309.27
Compact AHB    400 MHz       38428.78

When the results are compared, it can be noticed that pipeline does not need ten times more resources than basic, as was expected. This is a consequence of the structure of the ciphers: besides the encryption blocks shown in Figs. 2 and 4, both modules also store the key table, which is the same for both basic and pipeline, and represents an important percentage of the hardware resources.

VII. CONCLUSIONS

In this work three architectures for AES cipher implementation in FPGA were presented. The obtained results show that the pipeline architecture presents a notable increase in throughput, as long as its internal structure is complete. Thus, it presents benefits in applications in which high volumes of data need to be encrypted consecutively, where non-feedback modes are used and the secret key does not change frequently (the internal pipeline structure must be cleared by reset in order to accept a new key).

The inclusion of a standard interface protocol like AMBA AHB adds versatility and allows the integration of the cipher modules in more complex systems, like SoCs.

REFERENCES

[1] NIST, “Federal Information Processing Standards (FIPS) Publication 197: Advanced Encryption Standard (AES),” Nov. 2001. Available: http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf
[2] J. Daemen and V. Rijmen, “AES Proposal: Rijndael,” Aug. 1998.
[3] P. Chodowiec, “Comparison of the Hardware Performance of the AES Candidates Using Reconfigurable Hardware,” Master’s thesis, George Mason University, 2002.
[4] A. J. Elbirt, W. Yip, and C. Paar, “An FPGA Implementation and Performance Evaluation of the AES Block Cipher Candidate Algorithm Finalists,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, no. 4, pp. 545–557, Aug. 2001.
[5] T. Ichikawa, T. Kasuya, and M. Matsui, “Hardware Evaluation of the AES Finalists,” in The Third Advanced Encryption Standard Candidate Conference, 2000, pp. 279–285.
[6] A. Satoh, M. Kohji, K. Takano, and S. Munetoh, “A Compact Rijndael Hardware Architecture with S-Box Optimization,” in Advances in Cryptology - ASIACRYPT 2001, 2001, pp. 239–254.
[7] D. Canright, “A Very Compact Rijndael S-box,” Naval Postgraduate School Monterey, Tech. Rep., 2005.
[8] NIST, “Special Publication 800-38A: Recommendation for Block Cipher Modes of Operation,” 2001. Available: http://csrc.nist.gov/publications/nistpubs/800-38a/sp800-38a.pdf
[9] P. Chodowiec and K. Gaj, “Very Compact FPGA Implementation of the AES Algorithm,” in Cryptographic Hardware and Embedded Systems - CHES 2003, 2003, pp. 319–333.
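
As an illustration of the algorithm described in Section II, the following is a minimal software sketch of AES-128 encryption in ECB mode. It is not the hardware described in this paper nor the C reference model used in Section V. All function names are assumptions introduced here, the state is held as a flat list of 16 bytes in column order, and the code makes no attempt at side-channel protection. The S-BOX is built from the field inversion and affine mapping of Section II-A, and the key expansion follows Eqs. 2 and 3.

```python
"""Illustrative AES-128 encryption sketch (software only, not constant-time)."""

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def gf_inv(a):
    """Multiplicative inverse in GF(2^8), computed as a^254; 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0

def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF

def sbox_entry(a):
    """Field inversion followed by the affine mapping (constant 0x63)."""
    inv = gf_inv(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63

SBOX = [sbox_entry(a) for a in range(256)]

def sub_bytes(s):
    return [SBOX[b] for b in s]

def shift_rows(s):
    # Flat index is 4*column + row; row r is rotated r bytes to the left.
    return [s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]

def mix_columns(s):
    out = []
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        out += [
            gf_mul(a[0], 2) ^ gf_mul(a[1], 3) ^ a[2] ^ a[3],
            a[0] ^ gf_mul(a[1], 2) ^ gf_mul(a[2], 3) ^ a[3],
            a[0] ^ a[1] ^ gf_mul(a[2], 2) ^ gf_mul(a[3], 3),
            gf_mul(a[0], 3) ^ a[1] ^ a[2] ^ gf_mul(a[3], 2),
        ]
    return out

def add_round_key(s, round_key):
    return [a ^ b for a, b in zip(s, round_key)]

def expand_key(key):
    """Precompute the 11 round keys of AES-128 following Eqs. 2 and 3."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rc = 1
    for i in range(1, 11):
        prev = words[4 * i - 1]
        g = [SBOX[b] for b in prev[1:] + prev[:1]]  # circular shift, then S-BOX
        g[0] ^= rc                                  # XOR RC into the first byte
        rc = gf_mul(rc, 2)
        words.append([a ^ b for a, b in zip(words[4 * (i - 1)], g)])
        for j in range(1, 4):
            words.append([a ^ b for a, b in
                          zip(words[4 * i + j - 1], words[4 * (i - 1) + j])])
    return [[b for w in words[4 * r:4 * r + 4] for b in w] for r in range(11)]

def encrypt_block(block, key):
    """Initial round, nine general rounds, and a final round without MixColumns."""
    round_keys = expand_key(key)
    state = add_round_key(list(block), round_keys[0])
    for rnd in range(1, 10):
        state = add_round_key(mix_columns(shift_rows(sub_bytes(state))),
                              round_keys[rnd])
    state = add_round_key(shift_rows(sub_bytes(state)), round_keys[10])
    return bytes(state)

def encrypt_ecb(data, key):
    """ECB mode: every 16-byte block is encrypted independently."""
    assert len(data) % 16 == 0
    return b"".join(encrypt_block(data[i:i + 16], key)
                    for i in range(0, len(data), 16))
```

Calling `encrypt_ecb` on two identical plain text blocks returns two identical cipher text blocks, which is the weakness of ECB noted in Section II-E. The same block independence is what lets the pipeline architecture process several blocks at once.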

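
The register write order required by the 32-bit wrappers in Section IV can be pictured with a small behavioral model. This is only a sketch under stated assumptions: the class `WordRegisterBank`, its `delivered` list and the helper `send_block` are hypothetical names introduced here, and the model ignores AHB timing and HREADYOUT entirely. It only shows why the least significant word has to be written last.

```python
class WordRegisterBank:
    """Hypothetical model of four 32-bit input registers.

    Index 0 holds the most significant word and index 3 the least
    significant one. Writing index 3 hands the assembled 128-bit
    block over to the (not modeled) cipher core.
    """

    WORDS = 4

    def __init__(self):
        self.regs = [0] * self.WORDS
        self.delivered = []  # 128-bit blocks transferred to the cipher

    def write(self, index, value):
        self.regs[index] = value & 0xFFFFFFFF
        if index == self.WORDS - 1:
            block = 0
            for word in self.regs:
                block = (block << 32) | word
            self.delivered.append(block)


def send_block(bank, block):
    """Write a 128-bit block from most to least significant word."""
    for index in range(WordRegisterBank.WORDS):
        shift = 32 * (WordRegisterBank.WORDS - 1 - index)
        bank.write(index, (block >> shift) & 0xFFFFFFFF)
```

If a master wrote the least significant word first, the model would transfer a block whose upper words still hold stale values, which is the situation the prescribed order avoids.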
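
The cycle counts of Table I and the clock frequencies of Table II can be combined into throughput figures in Mbit/s. The sketch below does that arithmetic. The resulting numbers are derived here and are not reported in the paper. They assume that the simulated cycle counts also hold at the synthesized clock frequency, and that, for pipeline, the internal structure is full and the bus delivers one block per cycle. The names `throughput_mbps` and `summary` are illustrative.

```python
BLOCK_BITS = 128

# Clock cycles between consecutive cipher text blocks (Table I).
CYCLES_PER_BLOCK = {"basic": 12, "pipeline": 1, "compact": 55}

# Maximum clock frequency in MHz for each target (Table II).
FCLK_MHZ = {
    "Spartan 6": {"basic": 125.0, "pipeline": 166.6, "compact": 113.6},
    "Design Compiler": {"basic": 333.0, "pipeline": 357.14, "compact": 400.0},
}


def throughput_mbps(fclk_mhz, cycles_per_block):
    """Bits per block times blocks per second; MHz in gives Mbit/s out."""
    return BLOCK_BITS * fclk_mhz / cycles_per_block


def summary():
    """Throughput in Mbit/s for every (target, architecture) pair."""
    return {
        (target, arch): throughput_mbps(fclk, CYCLES_PER_BLOCK[arch])
        for target, by_arch in FCLK_MHZ.items()
        for arch, fclk in by_arch.items()
    }


if __name__ == "__main__":
    for (target, arch), mbps in summary().items():
        print(f"{target:16s} {arch:9s} {mbps:10.1f} Mbit/s")
```

Dividing these figures by the slice counts or total area of Table II would give a rough area efficiency for comparing the three architectures.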