You are on page 1of 9

1751861x, 2020, 6, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/iet-cdt.2019.0179 by INASP/HINARI - PAKISTAN, Wiley Online Library on [10/03/2023].

See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
IET Computers & Digital Techniques

Research Article

High throughput and area-efficient FPGA ISSN 1751-8601


Received on 4th July 2019

implementation of AES for high-traffic


Revised 15th May 2020
Accepted on 9th June 2020
E-First on 10th September 2020
applications doi: 10.1049/iet-cdt.2019.0179
www.ietdl.org

Karim Shahbazi1, Seok-Bum Ko1


1Department of Electrical and Computer Engineering, University of Saskatchewan, Saskatoon, Canada
E-mail: seokbum.ko@usask.ca

Abstract: This study presents a high throughput field-programmable gate array (FPGA) implementation of advanced encryption
standard-128 (AES-128). AES is a well-known symmetric key encryption algorithm with high security against different attacks
that are widely used in different applications. The main goal of this study is to design a high throughput and FPGA efficiency
(FPGA-Eff) cryptosystem for high-traffic applications. To achieve high throughput, loop-unrolling, inner and outer pipelining
techniques are employed. In AES, substitution bytes (Sub-Bytes) is one of the costly functions that occupy a large number of
resources and has a large delay. To reduce the area of Sub-Bytes, new-affine-transformation, which is the combination of
inverse isomorphic and affine transformation, is proposed and employed. Besides that, AES has been modified according to the
proposed architecture. For the first nine rounds, Shift-Rows and Sub-Bytes have been exchanged, and Shift-Rows is merged
with Add-Round-Key. To make an equal latency between stages, Mix-Columns is divided into two different stages. AES is
implemented in counter mode on Xilinx Virtex-5 using VHDL. The proposed implementation achieves a throughput of 79.7
Gbps, FPGA-Eff of 13.3 Mbps/slice, and frequency of 622.4 MHz. Compared to the state-of-the-art work, the proposed design
has improved data throughput by 8.02% and FPGA-Eff by 22.63%.

1 Introduction platforms, hardware implementation of cryptography algorithms,


which are application-specific integrated circuit (ASIC) and field-
The advent of technologies including the internet and smartphones programmable gate arrays (FPGAs), provides higher security and
has made people's lives easier. Nowadays, people get used to throughput. FPGA implementation has some advantages over
digital applications for e-business, communicating with others, and ASIC implementation, such as the re-configurability and low
sending or receiving sensitive messages. Sending secure data design cost. AES is used in broad applications and different
across the private network or the internet is an open concern for security networks, such as Wi-Fi protected access 2 system, secure
every person. Cryptography plays an important role in privacy, sockets layer, automated teller machines, internet of things, and
security, and confidentiality against adversaries. Cryptography digital video recorders. In addition to the security of data
algorithms are divided into two different categories, symmetric and transmission in many high-traffic applications that transfer a huge
asymmetric. Asymmetric cryptography algorithms, such as elliptic amount of data, fast encryption/decryption and transmission are
curve cryptography and Rivest–Shamir–Adleman, use two also required, such as high-traffic servers. In this work, a high-
different keys for encryption and decryption data. In comparison speed cryptosystem for high-traffic application is presented.
with the asymmetric cryptography algorithms, the symmetric When designing a high-speed cryptosystem, there is a
cryptography algorithms using the same keys for encryption and relationship between security of the design, speed that is
decryption data and have less computation and occupy a fewer area determined by throughput, and cost in terms of area and
that makes them very hardware friendly. The advanced encryption fabrication, which is shown in Fig. 1. It is worth to mention that for
standard (AES) is one of the most secure, fast, and most used high-speed cryptosystems, the high priority is security and speed.
symmetric algorithms that have a good performance in both In terms of security, the cryptographic algorithm and architecture
software and hardware platforms. In comparison with software play an important role. Three important factors that affect the
security of an algorithm are key length, number of rounds, and the
degree of confusion and diffusion. More rounds and longer key
lengths provide a safer algorithm. Confusion and diffusion make a
complex relation between plaintext, keys, and cipher. By
employing the diffusion and confusion, changing one bit of the
plaintext will completely change the cipher-text, which the attacker
cannot pre-calculate what the plaintext is. In AES, an excellent
confusion and diffusion, Mix-Columns and substitution bytes (Sub-
Bytes) are employed and repeated in several rounds that make AES
secure against different attacks. The speed of the cryptosystem is
mainly determined by throughput. For a pipeline design,
throughput is measured by frequency and number of proceed bits
in every clock cycle, and for a serial architecture, the total number
of clock cycles has a high impact on throughput. A pipeline
architecture, compared to serial one, with a loop-unrolling
technique, boosts speed. Finally, the cost of a cryptosystem is
directly related to the security and the speed of the design. For
Fig. 1 Illustration of the relationship between priorities when designing a
instance, AES-256 provides more security than AES-128 that
high-speed cryptosystem and the design factors that affect them
needs more rounds for encryption plaintext that leads to an increase
in the cost of the design. Also, the serial designs occupy fewer

IET Comput. Digit. Tech., 2020, Vol. 14 Iss. 6, pp. 344-352 344
© The Institution of Engineering and Technology 2020
1751861x, 2020, 6, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/iet-cdt.2019.0179 by INASP/HINARI - PAKISTAN, Wiley Online Library on [10/03/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
• The AES algorithm is modified in which Sub-Bytes and Shift-
Rows are exchanged for the first nine rounds; Add-Round-Key
and Shift-Rows are merged into one stage.
• The Sub-Bytes is optimised in which inverse isomorphic and
affine transformation (Af) are combined together, and new-
affine-transformation, hereafter referred to as New-Aff, is
proposed.
• The critical path of the composite field arithmetic for
multiplicative inverters in GF(28) is considered; and the delay of
Sub-Bytes is reduced by inserting registers in optimal places.
• Mix-Columns are implemented based on logic gates and into
two different stages. To make an equal delay between these two
stages, the first stage is added to Sub-Bytes.
• Full loop unrolling, inner (inserting registers between rounds)
and outer (inserting registers among functions) pipeline stages
are used.

The rest of the paper is organised as follows: Section 2 discusses a


brief introduction to AES; Section 3 presents cryptography modes;
AES functions’ implementations and hardware implementation of
Fig. 2 Block diagram of AES-128 encryption the AES are in Sections 4 and 5; the implementation results and
comparison are explained in Section 6; and finally the conclusion
gates than parallel and pipeline ones. Needless to say, stronger and will be in Section 7.
faster cryptosystems would require more costs.
Cryptography algorithms are implemented by different types of 2 Introduction to AES algorithm
modes. There are two different cryptography modes, in which each
cryptography algorithm is executed in different modes. In general, AES is one of the high secure symmetric cryptography algorithms
the cryptography modes are divided into two categories: with 128-bit input, 128-bit output, and 128, 192, and 256 key size.
independent and dependent. The main difference between them is a The input is called the state and contains 16 bytes. AES has four
dependency of the previous cipher-text; for encrypting the block main functions that are executed in several rounds over the states.
‘n’ in the dependent mode, the cipher-text block ‘n − 1’ should be The number of rounds is according to the key size that would be
available. On the other hand, there is no relation for encrypting 10, 12, and 14 for key size 128, 192, and 256, respectively. The
blocks in the independent mode, which includes counter (CTR) AES's functions are Add-Round-Key, Shift-Rows, Mix-Columns,
mode and electronic codebook (ECB). Thus, this mode is suitable and Sub-Bytes. Add-Round-Key is a bitwise XOR between input
for a high throughput pipeline design. Between these two and keys. Shift-Rows is a cyclically shift to left through the row's
independence modes, CTR and ECB, which are used for high- state that provides diffusion property to the AES algorithm. The
speed design, CTR is more secure than ECB and is recommended first row of the state does not shift; the second, third, and fourth
for employing in various applications. rows are shifted by one, two, and three, respectively. Mix-Columns
There are various methods and ways to implement AES. In is a matrix multiplication through the column of the state. Sub-
general, the works on the AES algorithm contain fast and high Bytes is a non-linear transition on each byte of the state. The
throughput implementation, optimisation of the hardware design, diagram of the AES is shown in Fig. 2. As is observed from Fig. 2,
low area, and low-power implementation. The main function of the the first round contains only Add-Round-Key; the last round does
AES algorithm in both area occupation and latency is the Sub- not execute the Mix-Columns; all of the four functions are
Bytes function that uses an 8-bit substitution box (S-box). There executed in the remaining rounds.
are different ways to implement the Sub-Bytes function. One of the
straightforward ways for implementing Sub-Bytes is to use look-up 3 Cryptography mode
table (LUT) [1–3]. Since LUT is an easy way, it has a high delay
To process the data and encrypt them, there are different
and requires an enormous amount of memory that is not suitable
cryptography modes. In general, the cryptography modes can be
for high throughput design. The other way is to simplify the S-box
divided into two different categories: independent and dependent
table by using techniques such as sum of products (SoPs). In [4],
modes. The dependent mode contains cipher block chaining
the Sub-Bytes has 16 modules that are derived by using Boolean
(CBC), propagating CBC (PCBC), cipher feedback (CFB), and
simplification based on the Karnaugh map with a 16 to 1
output feedback (OFB). In the dependent cryptography modes, the
multiplexer. The first 4-bit input is the input of the modules, and
current block depends on the previous cipher-text. In some cases,
another 4-bit is the selection input of the multiplexer. Since 16 to 1
the previous cipher-text is used as the input of the current block or
multiplexer is big, the Sub-Bytes suffers a high and unbreakable
one simple operation, most of the time XOR should be executed
delay. In [5], instead of using a 16 to 1 multiplexer, five 4 to 1
with the current plaintext and the previous cipher-text. Fig. 3 shows
multiplexers were used. Another way for calculating Sub-Bytes is
the dependent cryptography modes. For increasing the security,
to use composite field arithmetic [6–8]. Besides that, there are
CFB and OFB use an initial vector as input.
various architecture designs for AES. AESs based on co-processor
On the other hand, the independent cryptography modes contain
and specific processor are implemented [9–12]. The authors of [13]
CTR mode and ECB. Fig. 4 shows the CTR and ECB
described pipelined implementations of AES and International
configuration. These two cryptography modes are very suitable for
Data Encryption Algorithm (IDEA), the other symmetric
pipelines and high throughput design [17]. ECB is the easiest way
cryptography algorithm, with CTR mode and LUT for Sub-Bytes
to use because every plaintext is encrypted without any need for
on FPGA. In [5], different pipelined AES implementations in ECB
previous data. However, ECB does not have enough security in
and CTR mode by using combinational logic circuits and the
comparison with other cryptography modes and is not suitable for
simplified of the S-box were presented. Some other pipelined
sequential data blocks. The CTR mode, instead of using a constant
designs are available in [14–16].
initial vector, uses a sequence vector as input for each block that is
The main goal of this paper is to design and implement a high
increased by one for every block. Using a different initial value for
throughput AES-128 algorithm for high-traffic applications. The
each plaintext increases the security of the design. One advantage
following methods and techniques are considered to achieve the
of the CTR mode is that the encryption and decryption can be
aforementioned goal:
performed by the same structure. The difference between using
CTR and ECB modes for image encryption is shown in Fig. 5. As

IET Comput. Digit. Tech., 2020, Vol. 14 Iss. 6, pp. 344-352 345
© The Institution of Engineering and Technology 2020
1751861x, 2020, 6, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/iet-cdt.2019.0179 by INASP/HINARI - PAKISTAN, Wiley Online Library on [10/03/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
calculation is not very complicated, it is possible to calculate the
values by incoming data according to (1)

a′ 02 03 01 01 a
b′ 01 02 03 01 b
= ×
c′ 01 01 02 03 c
d′ 03 01 01 02 d
a′ = 2 ⋅ a + 3 ⋅ b + c + d
b′ = a + 2 ⋅ b + 3 ⋅ c + d
→ (1)
c′ = a + b + 2 ⋅ c + 3 ⋅ d
d′ = 3 ⋅ a + b + c + 2 ⋅ d
a′ = 2 ⋅ a + (2 ⋅ b + b) + c + d
Fig. 3 Different dependent cryptography modes
(a) CBC, (b) PCBC, (c) CFB, (d) OFB (IV stands for initialisation vector) b′ = a + 2 ⋅ b + (2 ⋅ c + c) + d

c′ = a + b + 2 ⋅ c + (2 ⋅ d + d)
d′ = (2 ⋅ a + a) + b + c + 2 ⋅ d

In (1), ‘·’ is a multiplication modulo, the irreducible polynomial


m x = x8 + x4 + x3 + x + 1, and ‘+’ is XOR. By simplifying 3·α to
2·α + α, the goal is to calculate the multiplication with 2. Here
there are two different ways to calculate the 2·α. The multiplication
by 2 can be written by
Fig. 4 Independent cryptography modes
(a) ECB, (b) CTR
(2 ⋅ a) mod (x8 + x4 + x3 + x + 1)
a7a6a5a4a3a2a1a0
⟶ (a7 x7 + a6 x6 + a5 x5 + a4 x4
+a3 x3 + a2 x2 + a1 x1 + a0) × (x1)
(2)
mod (x8 + x4 + x3 + x + 1)
→ (a7 x8 + a6 x7 + a5 x6 + a4 x5 + a3 x4 + a2 x3 + a1 x2 + a0 x)
mod (x8 + x4 + x3 + x + 1)

Now, the reduction over the module m(x) should be calculated. As


the element of the x8 is α7, m(x) is multiplied by α7 so
Fig. 5 The difference between encryption of an image using CTR and ECB
modes
(a) Main image, (b) Encrypted image in CTR, (c) Encrypted image in ECB (a7 x8 + a6 x7 + a5 x6 + a4 x5 + a3 x4 + a2 x3 + a1 x2 + a0 x)
+a7(x8 + x4 + x3 + x + 1)
→ (a7 + a7)x8 + a6 x7 + a5 x6 + a4 x5
(3)
+(a3 + a7)x4 + (a2 + a7)x3 + a1 x2 + (a0 + a7)x + a7
→ a6 x7 + a5 x6 + a4 x5 + (a3 + a7)x4 + (a2 + a7)x3
+a1 x2 + (a0 + a7)x + a7

Thus, 2·α can be calculated by

x7 = a6
x6 = a5
Fig. 6 Proposed Mix-Columns-1 with Shift, NOT, and Mix 2 × 1
x5 = a4
is observed in Fig. 5, the edge of the encrypted image in ECB can x4 = a7 ⊕ a3
be easily detected; and by doing that the main features of the plain- x = (2 ⋅ a) mod (x8 + x4 + x3 + x + 1) → (4)
image are distinguishable. In comparison with ECB, the encrypted x3 = a7 ⊕ a2
image in CTR is uniform and indistinguishable. x2 = a1
x1 = a7 ⊕ a0
4 AES functions implementation x0 = a7
This section describes efficient techniques for implementation of
the Mix-Columns and Sub-Bytes, and express the employed As 2·α is a shift bit to left, the other way to calculate 2·α using
method in the proposed design. one-bit shift to left, one Mux 2 × 1, and a NOT logic gate. In (4),
when a7 = 0, 2·α is equal to the one-bit shift to the left of α, and
4.1 Mix-Columns when a7 = 1, 2·α is calculated by one-bit shift to the left of α XOR
by x4 + x3 + x + 1 , which is equal to ‘1b’ in hex. As ‘1b’ is fixed,
There are different ways and methods for calculating the Mix-
the result can be transferred to NOT gate. The diagram of
Columns. As the Mix-Columns has fixed values, one way is to pre-
calculating 2·α by Mux, Shift, and NOT is shown in Fig. 6. In the
calculate values and store them as a LUT. This method, however, is
proposed design, the Mix-Columns is executed in two different
not suitable because of high area consumption and high latency and
stages: in one stage, which is called ‘Mix-Columns-1’, 2·α is
is not efficient for area limited applications. As the Mix-Columns
calculated; in another stage, which is called ‘Mix-Columns-2’, the
final values of (1) is computed.
346 IET Comput. Digit. Tech., 2020, Vol. 14 Iss. 6, pp. 344-352
© The Institution of Engineering and Technology 2020
1751861x, 2020, 6, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/iet-cdt.2019.0179 by INASP/HINARI - PAKISTAN, Wiley Online Library on [10/03/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
1 0 1 0 0 0 0 0 x7
1 1 0 1 1 1 1 0 x6
1 0 1 0 1 1 0 0 x5
1 0 1 0 1 1 1 0 x4
δ×x= × (6)
1 1 0 0 0 1 1 0 x3
1 0 0 1 1 1 1 0 x2
Fig. 7 Implementation of Sub-Bytes using composite field with the
0 1 0 1 0 0 1 0 x1
combination of Af and δ−1
0 1 0 0 0 0 1 1 x0
4.2 Sub-Bytes
1 1 1 0 0 0 1 0 x7
Sub-Bytes is the most costly transformation in AES, in both time
0 1 0 0 0 1 0 0 x6
and area aspects. The first step for finding the values of this
function is to calculate the multiplicative inverse of inputs over 0 1 1 0 0 0 1 0 x5
GF(28). The multiplicative inverse of f(x) is g(x) that 0 1 1 1 0 1 1 0 x4
(7)
−1
f x ⋅ g x mod x8 + x4 + x3 + x + 1 = 1. The second step is δ ×x= ×
0 0 1 1 1 1 1 0 x3
computation over an Af. The important effect of applying an 1 0 0 1 1 1 1 0 x2
invertible Af before and after the Sub-Bytes is that the resistance of
S-boxes against most attacks remains unchanged [18]. Equation (5) 0 0 1 1 0 0 0 0 x1
is the Af, which g x = (a7′a6′a5′a4′a3′a2′a1′a0′) is the multiplicative 0 1 1 1 0 1 0 1 x0
inverse of f(x), and Y is the Sub-Bytes of the f(x)
GF(22) → GF(2)
Y = Af(g(x)) + Φ
GF((22)2) → GF(22) →
1 1 1 1 1 0 0 0 a′7 0 2 2 2 2
GF(((2 ) ) ) → GF((2) )
0 1 1 1 1 1 0 0 a′6 1 (8)
0 0 1 1 1 1 1 0 a′5 1 P0(x): x2 + x + 1
0 0 0 1 1 1 1 1 a′4 0 (5) P1(x): x2 + x + ϕ, where φ = {10}2
= × ⊕ 2
1 0 0 0 1 1 1 1 a′3 0 P2(x): x + x + λ, where λ = {1100}2
1 1 0 0 0 1 1 1 a′2 0
One of the advantages of using the composite field arithmetic is to
1 1 1 0 0 0 1 1 a′1 1
reduce the area of Sub-Bytes as well as it is an efficient method for
1 1 1 1 0 0 0 1 a′0 1 pipeline design, while the hardware complexity is its disadvantage.
As it is mentioned, the Sub-Bytes consists of two steps: in the first
There are different hardware implementations of Sub-Bytes step, the multiplicative inverse in GF(28) is calculated; the second
function. Direct calculation of the multiplicative inverse is not step is to apply the Af. Let ‘Mul’ and ‘S’ define as multiplicative
easy. As all of the values are available, one of the straightforward inverse and Sub-Bytes. Thus, Sub-Bytes can be written as follows:
implementations of the S-box is pre-calculating the values and
storing them in a LUT read only memory. The most significant part S( f (x)) = Af(g(x)) + Φ
of the input determines the row of the table and the least significant
part determines the column of the table. The read and write from S( f (x)) = Af(δ−1(Mul(δ ⋅ f (x)))) + Φ
(9)
the LUT to registers need high latency and more areas. Thus, LUT S( f (x)) = Af ⋅ δ−1(Mul(δ ⋅ f (x))) + Φ
is not an efficient method for high-speed application and pipeline
S( f (x)) = γ(Mul(δ ⋅ f (x))) + Φ
design.
The other way is the Boolean simplification map for the pre-
calculation of Sub-Bytes using truth table based on the Karnaugh The ‘γ’ is calculated by: ‘γ = Af·δ−1’; thus, instead of using two
map to make a direct relation between the parameter of the Sub- different matrices in two different stages, one matrix is used in the
Bytes function. The simplification of the whole S-box table is not design. γ is called New-Aff. New-Aff is written as
an easy way. The authors of [4] divided the S-box table into 16
tables and for each sub-table the SoP is calculated as a module 1 0 0 0 1 1 0 0 x7
logic function. The 16 to 1 multiplexer was used to achieve the 1 1 1 1 0 0 0 0 x6
output. The most significant 4-bit are the inputs of the modules and 1 0 0 0 0 1 0 0 x5
the least significant 4-bit are used as the selector of the multiplexer.
Although this method reduces the latency of the S-box by LUT, it 1 0 0 1 0 0 1 1 x4
γ = Af ⋅ δ−1 = × (10)
needs a large area too. 0 0 0 0 0 1 1 1 x3
Instead of working with the S-box, the multiplicative inverse is 0 1 1 1 1 1 0 1 x2
calculated by composite field arithmetic and followed by the Af.
1 0 0 0 0 0 0 1 x1
The composite field arithmetic is based on GF(21), GF(22), and
GF((22)2). As a result, the element that is on GF(28) should be 1 1 0 0 0 1 1 1 x0
mapped to its composite field representation. The isomorphic
function, ‘δ’, and its inverse, ‘δ−1’ are used to transfer between Fig. 7 shows the employed multiplicative inverse GF(28) in the
GF(28) and its composite field and vice versa ((6) and (7)). After proposed architecture. In Fig. 7, ‘δ’ is the isomorphic mapping to
mapping to the composite field, the elements in GF(28) are composite fields, ‘x2’ is the squarer in GF(24), ‘λ’ is the
decomposed to the lower order fields of GF((22)2), GF(22), and multiplication with constant ‘λ’ in the GF(24), ‘x−1’ is the
GF(21) by the irreducible polynomials (8) multiplicative inversion in GF(24), and ‘X’ is the multiplication
operation in GF(24).
In Fig. 7, λx2 is squarer in GF(24) followed by multiplication
with constant λ in GF(24). Every binary number in the Galois field
‘q’ can be represented by bx + c, where ‘b’ is the upper half-term
(‘qH’) and ‘c’ is the lower half-term (‘qL’). By considering this idea

IET Comput. Digit. Tech., 2020, Vol. 14 Iss. 6, pp. 344-352 347
© The Institution of Engineering and Technology 2020
1751861x, 2020, 6, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/iet-cdt.2019.0179 by INASP/HINARI - PAKISTAN, Wiley Online Library on [10/03/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
Fig. 8 Implementations of inversion in GF(24) by square and multiplication approach

Table 1 Multiplicative inverse over GF(24)


x 0 1 2 3 4 5 6 7 8 9 a b c d e f
x−1 0 1 3 2 f c 9 b a 6 8 7 5 e d 4

gα + β. The multiplicative inversion in GF(24) for ‘x’ is x−1 that


x ⋅ x−1 = x15 → x−1 = x14. Also, x14 can be written by
((x2)2)2 ⋅ (x2)2 ⋅ x2. Thus, the inversion in GF(24) is achieved by
using the square of ‘x’ and multiplication. Fig. 8 shows the circuit
for multiplicative inverse in GF(24).
The other way to compute x−1 over GF(24) is by decomposing
the GF(24) to the lower order fields, such as GF(28), of GF(22) and
GF(21) by the irreducible polynomials. The third way is to pre-
calculate the elements and store them. In comparison with GF(28),
GF(24) has only 16 elements and does not occupy a lot of
resources, which is used for the design. The pre-calculation results
of GF(24) are shown in Table 1.

5 Proposed architecture and security analysis


In the previous section, the methods of the hardware
implementation of the Sub-Bytes and Mix-Columns are explained.
In this part, the modified AES algorithm, the employed
architecture, and security analysis are discussed. For the proposed
architecture, the CTR mode is employed, which increases security
and is the best option for the pipeline implementation of sequential
data blocks.
Fig. 9 The process of modifying the AES algorithm based on the proposed
design
(a) Main AES algorithm, (b) AES algorithm with exchanging Sub-Bytes and Shift-
5.1 Modified AES algorithm
Rows, (c) Modified AES algorithm with merging Add-Round-Key and Shift-Rows According to Fig. 1, the AES algorithm has: Add-Round-key,
followed by iteratively Sub-Bytes, Shift-Rows, Mix-Columns, and
and (8), parameters in Fig. 7 can be derived. For example, to Add-Round-Key in sequence and execute for nine rounds, and
compute the squaring in GF(24), A = B2, which ‘A’ and ‘B’ are finally, Sub-Bytes, Shift-Rows, and Add-Round-key are executed
elements in GF(24), is equal to the (AH x + AL) = BH x + BL 2, and (Fig. 9a). Since Shift-Rows and Sub-Bytes are executed on each
by computation (11) is achieved byte of the input, these two functions can be exchanged without
any effect on final results (Fig. 9b). By moving Shift-Rows to the
A3 = B3 previous round, the new design of the AES algorithm is obtained.
In the modified version, which is shown in Fig. 9c, Add-Round-
A2 = B3 ⊕ B2 Key and Shift-Rows are merged to one block, which is called
(11)
A1 = B2 ⊕ B1 ARK/Shift, for the first nine rounds; and the last round does not
A0 = B3 ⊕ B1 ⊕ B0 have Shift-Rows function. The advantages of performing this
modification are as follows:
Multiplication with constant λ can be performed by the same way
• Considering the Shift-Rows and Add-Round-Keys in a separate
as squarer. By using the achieved equations for multiplication and
pipeline stage occupies more area, by merging these two
squarer, the parameter λx2 is simplified and calculated by functions into one block, the number of pipeline stages and
occupied area have been reduced.
y3 = x0 + x1 + x2
• In comparison with Mix-Columns and Sub-Bytes, Add-round-
y2 = x0 + x3 Key and Shift-Rows are low latency functions of the algorithm,
y = λx2 → (12) and also merging these two functions does not have any effect
y1 = x3
on the critical path of the design.
y0 = x2 + x3 • One part of Mix-Columns, Mix-Columns-1, can be transferred
to Sub-Bytes. By doing that the latency of the Mix-Columns is
Similar to the multiplication inverse in GF(28), there are different reduced.
ways of implementing the multiplication inverse for GF(24). In a
Galois field, each non-zero element ‘x’ in GF(qm) can be 5.2 Proposed hardware structure
represented by x = gn, where ‘g’ is a primitive element and ‘n’ is a
The main contribution of this design is achieving high throughput
unique number with 0 ≤ n ≤ qm − 2. Furthermore, gq − 1 is equal to
m
and efficient hardware implementation. To achieve this goal, loop
{01}. For the two elements a = g and b = g , a ⋅ b is equal to
α β
unrolling, inner pipelining, and outer pipelining are employed. The

348 IET Comput. Digit. Tech., 2020, Vol. 14 Iss. 6, pp. 344-352
© The Institution of Engineering and Technology 2020
1751861x, 2020, 6, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/iet-cdt.2019.0179 by INASP/HINARI - PAKISTAN, Wiley Online Library on [10/03/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
Fig. 10 The proposed block diagram of AES with pipeline and loop unrolled technique
(a) Loop unrolled and outer pipelined stages, (b) Inner pipelined stages

is the total number of iterations (i.e. for AES-128, the number of


rounds is 10). Thus, the number of rounds that need to be iterated
to encrypt one block of data is ‘R = T/r’. If ‘R = 1’, the design is a
full loop unrolling; if ‘R > 1’, it is a partial loop unrolling [19].
The proposed architecture is a full loop unrolling design (‘r = T
= 10’). Two advantages of using loop unrolling technique are: first,
it mitigates and modifies the critical path by removing all rounds
and loops in the design; second, the implementation can easily be
extended to outer-round pipelining. The loop unrolling technique
converts the design to the pipeline one by inserting registers
between the unrolled rounds. The pipeline design is a high-
performance hardware implementation and a wise way to
implement high throughput and fast architecture. In comparison
with the non-pipeline design, such as serial, that the plaintext
should feed to the design after getting the previous plaintext, the
pipeline design feeds by plaintext in each clock cycle. Thus, the
Fig. 11 Pipeline stage through the multiplication operation
rate of encryption will be increased. Besides that, the pipeline
(a) Multiplication operation in GF(24), (b) Multiplication operation in GF(22) design increases the security against classical cryptanalysis such as
differential power analysis [20]. The block diagram of AES with
loop unrolled, outer pipelined, and inner pipelined is shown in
Fig. 10.
Besides the outer pipeline, the inner pipeline is used for each
round of the algorithm, which decreases the latency of the design.
The diagram for one round of the inner pipelined AES
implementation is shown in Fig. 10b. The inner pipeline is used in
inside Sub-Bytes and Mix-Columns. As is mentioned, Mix-
Columns and Sub-Bytes have high latency for the design. To
achieve the low latency and high frequency of the design, these
two functions are divided into some sub-stages with approximately
equal delay.
Ever since using the composite field for Sub-Bytes, the Sub-
Bytes function has been a high critical path; thus, this unit may
need to be divided into more sub-pipeline stages to achieve low
latency. The critical path for Sub-Bytes is multiplication operation
in GF(24). The multiplication operation in GF(24) is shown in
Fig. 11a (‘×’ in Fig. 11a is the multiplication operation in GF(22)).
The multiplication operation in GF(22) is shown in Fig. 11b.
Execution of the Mix-Columns in one clock has a high latency.
As is explained in the previous section, the Mix-Columns contains
two parts, in one part 2·α is calculated and then the inputs are
XORed to each other according to the Mix-Columns matrix. To
reduce the delay, Mix-Columns, which is shown in Fig. 12, is
executed into two different stages: in one stage 2·α of each byte of
the state is calculated by (4), in the other stage, the output is
Fig. 12 Execution of Mix-Columns in two stages
computed according to (1). Mix-Columns-1 is transferred to the
last stage of Sub-Bytes.
main idea of loop unrolling is to iterate the design in a number of
In AES-128, the keys are 128-bit (such as input state, keys have
times. For the loop unrolling, the loop unrolling factor, ‘r’, is the
16 bytes for each round), which is expanded for the next rounds.
number of loops that are unrolled to implement the design, and ‘T’
IET Comput. Digit. Tech., 2020, Vol. 14 Iss. 6, pp. 344-352 349
© The Institution of Engineering and Technology 2020
1751861x, 2020, 6, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/iet-cdt.2019.0179 by INASP/HINARI - PAKISTAN, Wiley Online Library on [10/03/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
Table 2 Implementation results and comparison
References Devices Frequency, Slices Slices Occupied BRAM Throughput, FPGA-Eff, Mbps/slice
MHz register LUT Slices Gbps LUT Occupied
this work Virtex-5 XC5VLX85 622.4 19,123 14,966 5974 0 79.7 5.3 13.3
[5]a Virtex-5 XC5VLX85 348.8 30,806 0 178.62 5.8
[23] Virtex-5 XC5VLX85 576.0 — 22,994 — 0 73.7 3.2 —
[24] Virtex-5 xc5vfx70t 460.0 — — 9756 0 60.0 — 6.1b
[25] Virtex-5 XC5VLX85 528.4 — 3557 — 0 67.6 19.0 —
[13] XC2V6000-6 194.7 — — 3720 24.9 — 6.7
[26] Virtex-5 91.6 — 2030 — 28 0.9 0.5 —
XC5VFX70T
[27] Virtex-5 XC5VLX50 425.0 922 564 303 10 1.3 2.3b 4.4
[28] Virtex-5 XC5VLX50 242.2 — 5256 1745 0 3.1 0.6b 1.8b
[29] Virtex-5 XC5VLX50 339.1 — 1338 399 0 4.3 3.2b 10.8b
[30] Virtex-5 xc5vlx110t 250.0 — — — 0 31.2 — —
aIt has four cores; to make a fair comparison and according to [5], the throughput should be divided by four. The throughput for a single core is 178.62/4 = 44.7 Gbps.
bManually calculated.

AES-128 is executed ten times that need 4 × 10 = 44 columns of some other comparable FPGA implementations. The first five
keys. In the proposed design, keys are expanded before running the implementations, including our work, are on the same platform.
algorithm and are stored into registers. The experiments performed in [5, 13, 23] are in CTR mode. In
[5], four 128-bit AES cores are executed in parallel; as the authors
5.3 Security analysis of the proposed cryptosystem of [5] have mentioned that the throughput of a single core is a
quarter of their four designs, the throughput for one core is equal to
The security of a cryptosystem contains two parts: the algorithm 44.7 Gbps. The authors of [5] did not mention which number of
and the implementation. In the proposed architecture, AES is slices they had used for calculation FPGA-Eff. Since the reported
implemented without any alteration on the algorithm's functions. slice number in [5] is closed to the total number of Slices-LUT in
Since the content of AES has not been changed, it is expected the FPGA, it is assumed that the reported number would be the number
design keeps the same level of confidentiality and security. of Slices-LUT. The encryption part of the AES algorithm was
In the implementation side, there are some attacks and threats designed by Farashahi et al. [25] that included a feedback loop and
that depend on the implementation design. A side-channel attack is iterative architecture, which occupied fewer areas and required
implementation-dependent. The target of side-channel attacks is more clock cycles. To make a fair comparison with [25], AT = area
exploiting the relationship between physical attributes, such as × time has been calculated, which is a common and wide
power consumption and timing behaviour, and the manipulated measurement in literature, including [31]. As the proposed
data. Loop unrolling is a recommended technique for increasing cryptosystem has been tested by 688 × 576 pixels images, the
the security against side-channel attacks [21]. Loop unrolling, encryption time in [25] has been manually calculated, and it is
inner-round pipelining, and outer-round pipelining are used for the equal to 507 µs (for the proposed design it is equal to 40 µs). Then,
design, which guarantees reasonable protection from side-channel AT is 1.8 and 0.6 s × slices for [25] and the proposed design,
attacks. Also, the proposed architecture is secure against timing respectively, which has 66% improvement through encryption over
attacks because (i) there is no conditional branches and dependency the same task. As a result, although the design in [25] occupied a
between the inputs and cipher-text; (ii) for encrypting each few LUTs, it is not suitable for encrypting sequential and
plaintext, the proposed architecture executes in a constant number dependent large data, such as images. Although the method
of clock cycles; and (iii) the critical path delay of the proposed proposed in [13] was implemented in different FPGAs, it is
architecture is constant during the execution [22]. pipelined with CTR mode, which is very similar to our work. As is
observed from Table 2, the proposed design has the highest
6 Implementation results, simulation, and throughput, 79.7 Gbps. The throughput of the proposed design is
comparison improved by 8.02%, compared with the best previous work [23],
for implementing on the mentioned FPGA and same cryptography
The results of the implementation are reported in this section. The mode. By considering the number of occupied slices, the proposed
target FPGA is Xilinix Vertix-5 (XC5VLX85-FF676-3) by using design's FPGA-Eff is 13.3 Mbps/slice by 22.63% improvement
ISE 14.7 and ISim 14.7 for synthesis, simulation, and post- over [29].
placement timing results with VHDL. In the pipelined AES The proposed cryptosystem can encrypt and decrypt different
implementations, plaintext blocks are accepted in each clock cycle. plaintexts and plaint-images. Owing to two reasons, the proposed
As the design is pipelined, the throughput is calculated by the design is tested on some medical images: first, images have
number of processed bits per cycle, which is 128-bit for the design, sequential and dependent data blocks; second, the transmission of
multiplied by the frequency. Thanks to the pipelined medical images over the network has been increased rapidly. To
implementation, reducing the latency of the design by inserting the show the independency of inputs, however, coloured images with
pipeline stage in optimised placement, and exchanging and different sizes have been tested by the proposed cryptosystem.
merging some part of the AES, the achieved clock cycle is 1.607 ns Block random access memories (BRAM) are used to store image
and the frequency is 622.4 MHz. The throughput of the design is data when testing encryption or decryption using the proposed
79.7 Gbps. Moreover, the FPGA efficiency (FPGA-Eff) is a design. Afterwards, the data is sent to the cryptosystem, and the en/
parameter to consider the hardware resources of a design and it is decrypted results are then sent back to the host PC. Table 2 lists the
calculated by the design's throughput divided by the number of results of our proposed AES cryptosystem and offers a direct
slices (in Mb/s per slice). However, different articles calculated comparison to related works.
FPGA-Eff by a different number of slices that contains number of Histogram of an image is a statistical feature that represents the
slices LUT and number of occupied slices (the detail of total distribution of pixel values in a digital image. A secure and
implementation results are written in Table 2), in which the total effective cryptosystem should produce a uniform cipher histogram
number in Vertix-5 (XC5VLX85) is 51,840 and 12,960, distribution, which guarantees the resistance against statistical
respectively. Table 2 shows the result of the proposed work and attacks. Fig. 13 depicts the main tested images, the encryption (in

350 IET Comput. Digit. Tech., 2020, Vol. 14 Iss. 6, pp. 344-352
© The Institution of Engineering and Technology 2020
1751861x, 2020, 6, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/iet-cdt.2019.0179 by INASP/HINARI - PAKISTAN, Wiley Online Library on [10/03/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
Fig. 13 Encryption, decryption, and histogram of coloured images with different size by using the proposed cryptosystem
(a) Main images, (b) Histograms of main images, (c) Encrypted images, (d) Histograms of encrypted images

noiseless status) of images, and the histogram of the images and terms of throughput, maximum frequency, and efficiency provides
their corresponding ciphers. the best performance.
The histogram of a coloured image is produced by taking the
histograms over each of its three channels (red, green, and blue). 8 Acknowledgments
As can be seen in Fig. 13, the histograms of the encrypted images
by the proposed cryptosystem are uniform for every encrypted This work was supported by the University of Saskatchewan
image and different from the input images. The proposed Dean's Scholarship and NSERC.
cryptosystem encrypts a grey-scale image with 688 × 576 pixels in
40 µs. Every coloured image has three channels (red, green, and 9 References
blue); and to encrypt (or decrypt) a coloured image, every channel
[1] Granado-Criado, J.M., Vega-Rodríguez, M.A., Sánchez-Pérez, J.M., et al.: ‘A
should be encrypted (or decrypted) independently and finally new methodology to implement the AES algorithm using partial and dynamic
merged together. A coloured image with 688 × 576 × 3 pixels is reconfiguration’, Integration, 2010, 43, (1), pp. 72–80
encrypted in 120 µs. [2] Shahbazi, K., Eshghi, M.: ‘Design of a specific instructions set processor for
AES algorithm’, Int. J. Comput. Appl., 2014, 93, (4), pp. 36–40
[3] Ali, L., Aris, I., Hossain, F.S., et al.: ‘Design of an ultra high speed AES
7 Conclusion processor for next generation IT security’, Comput. Electr. Eng., 2011, 37, (6),
pp. 1160–1170
In this work, a high throughput and high FPGA-Eff [4] Ahmad, N., Hasan, R., Jubadi, W.M.: ‘Design of AES S-box using
implementation of AES-128 algorithm in CTR mode for high- combinational logic optimization’. IEEE Symp. on Industrial Electronics &
traffic applications have been presented. In this design, loop Applications (ISIEA), Penang, Malaysian, October 2010, pp. 696–699
[5] Soltani, A., Sharifian, S.: ‘An ultra-high throughput and fully pipelined
unrolling, inner pipelining, and outer pipelining techniques are implementation of AES algorithm on FPGA’, Microprocess. Microsyst., 2015,
employed. Using the pipeline and loop unrolling techniques have 39, (7), pp. 480–493
increased the security against hardware-dependent attack. The [6] Oukili, S., Bri, S., Kumar, A.V.S.: ‘High speed efficient FPGA
main important factor for achieving the high throughput and implementation of pipelined AES S-box’. 4th IEEE Int. Colloquium on
Information Science and Technology (CiSt), Marrakesh, Morocco, October
efficiency are: reducing the design's delay by exchanging and 2016, pp. 901–905
merging AES functions; optimising the Sub-Bytes, by employing [7] Rachh, R.R., Anami, B.S., Mohan, P.V.A.: ‘Efficient implementations of S-
New-Aff, and Mix-Columns; reducing the critical path of the box and inverse S-box for AES algorithm’. TENCON, IEEE Region 10 Int.
composite field arithmetic. To achieve the low delay followed by a Conf., Singapore, January 2009, pp. 1–6
[8] Shastry, P.V.S., Agnihotri, A., Kachhwaha, D., et al.: ‘A combinational logic
high frequency of the design, the efforts have been devoted to implementation of S-box of AES’. IEEE 54th Int. Midwest Symp. on Circuits
make an approximate equal delay between different sub-pipeline and Systems (MWSCAS), Seoul, South Korea, August 2011, pp. 1–4
stages. The target FPGA was Vertix-5 (XC5VLX85-FF676-3) and [9] Shahbazi, K., Eshghi, M., Mirzaee, R.F.: ‘Design and implementation of an
the achieved frequency, throughput, and FPGA-Eff were 622.4 ASIP-based cryptography processor for AES, IDEA, and MD5’, Eng. Sci.
Technol. Int. J., 2017, 20, (4), pp. 1308–1317
MHz, 79.7 Gbps, and 13.3 Mbps/slice, respectively. In comparison
with others, the results show that this implementation of AES in

IET Comput. Digit. Tech., 2020, Vol. 14 Iss. 6, pp. 344-352 351
© The Institution of Engineering and Technology 2020
1751861x, 2020, 6, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/iet-cdt.2019.0179 by INASP/HINARI - PAKISTAN, Wiley Online Library on [10/03/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
[10] Soliman, M.I., Abozaid, G.Y.: ‘FPGA implementation and performance [21] Bhasin, S., Graba, T., Danger, J.-L., et al.: ‘A look into SIMON from a side-
evaluation of a high throughput crypto coprocessor’, J. Parallel Distrib. channel perspective’. IEEE Int. Symp. on Hardware-Oriented Security and
Comput., 2011, 71, (8), pp. 1075–1084 Trust (HOST), Arlington, USA, May 2014, pp. 56–59
[11] Li, Z., Zhuang, Y., Zhang, C., et al.: ‘Low-power and area-optimized VLSI [22] Kocher, P.C.: ‘Timing attacks on implementations of Diffie-Hellman, RSA,
implementation of AES coprocessor for Zigbee system’, J. China Univ. Posts DSS, and other systems’. Annual Int. Cryptology Conf., Santa Barbara, USA,
Telecommun., 2009, 16, (3), pp. 89–94 August 1996, pp. 104–113
[12] Anwar, H., Daneshtalab, M., Ebrahimi, M., et al.: ‘FPGA implementation of [23] Qu, S., Shou, G., Hu, Y., et al.: ‘High throughput, pipelined implementation
AES-based crypto processor’. IEEE 20th Int. Conf. on Electronics, Circuits, of AES on FPGA’. Int. Symp. on Information Engineering and Electronic
and Systems (ICECS), Abu Dhabi, United Arab Emirates, December 2013, Commerce, Ternopil, Ukraine, May 2009, pp. 542–545
pp. 369–372 [24] Kouzehzar, H., Moghadam, M.N., Torkzadeh, P.: ‘A high data rate pipelined
[13] Granado-Criado, J.M., Vega-Rodríguez, M.A.: ‘Hardware coprocessors for architecture of AES encryption/decryption in storage area networks’. Iranian
high-performance symmetric cryptography’, J. Supercomput., 2017, 73, (6), Conf. on Electrical Engineering (ICEE), Mashhad, Iran, May 2018, pp. 23–28
pp. 2456–2482 [25] Farashahi, R.R., Rashidi, B., Sayedi, S.M.: ‘FPGA based fast and high-
[14] Yoo, S.-M., Kotturi, D., Pan, D.W., et al.: ‘An AES crypto chip using a high- throughput 2-slow retiming 128-bit AES encryption algorithm’,
speed parallel pipelined architecture’, Microprocess. Microsyst., 2005, 29, (7), Microelectron. J., 2014, 45, (8), pp. 1014–1025
pp. 317–326 [26] Li, H., Ding, J., Pan, Y.: ‘Cell array reconfigurable architecture for high-
[15] Guo, Z., Li, G., Liu, Y.: ‘Dynamic reconfigurable implementations of AES efficiency AES system’, Microelectron. Reliab., 2012, 52, (11), pp. 2829–
algorithm based on pipeline and parallel structure’. The 2nd Int. Conf. on 2836
Computer and Automation Engineering (ICCAE), Singapore, February 2010, [27] El Maraghy, M., Hesham, S., El Ghany, M.A.A.: ‘Real-time efficient FPGA
pp. 257–260 implementation of AES algorithm’. IEEE Int. SOC Conf., Erlangen,
[16] Wang, C., Heys, H.M.: ‘Using a pipelined S-box in compact AES hardware Germany, September 2013, pp. 203–208
implementations’. Proc. 8th IEEE Int. NEWCAS Conf., Montreal, Canada, [28] Rais, M.H., Qasim, S.M.: ‘FPGA implementation of Rijndael algorithm using
June 2010, pp. 101–104 reduced residue of prime numbers’. IEEE 4th Int. Design and Test Workshop
[17] Canright, D.: ‘A very compact S-box for AES’. Int. Workshop on (IDT), Riyadh, Saudi Arabia, November 2009, pp. 1–4
Cryptographic Hardware and Embedded Systems, Edinburgh, United [29] Rais, M.H., Qasim, S.M.: ‘Efficient hardware realization of advanced
Kingdom, August 2005, pp. 441–455 encryption standard algorithm using Virtex-5 FPGA’, Int. J. Comput. Sci.
[18] Aldabbagh, S.S.M., Al Shaikhli, I.F.T.: ‘Security of present S-box’. Int. Conf. Netw. Secur., 2009, 9, (9), pp. 59–63
on Advanced Computer Science Applications and Technologies (ACSAT), [30] Banu, J.S., Vanitha, M., Vaideeswaran, J., et al.: ‘Loop parallelization and
Kuala Lumpur, Malaysia, November 2012, pp. 219–222 pipelining implementation of AES algorithm using OpenMP and FPGA’.
[19] Abed, S., Jaffal, R., Mohd, B.J., et al.: ‘FPGA modeling and optimization of a IEEE Int. Conf. on Emerging Trends in Computing, Communication and
SIMON lightweight block cipher’, Sensors, 2019, 19, (4), pp. 913–940 Nanotechnology (ICECCN), Tirunelveli, India, March 2013, pp. 481–485
[20] Mazumdar, B., Ali, S.S., Sinanoglu, O.: ‘A compact implementation of [31] Ebrahimi, S., Bayat-Sarmadi, S., Mosanaei-Boorani, H.: ‘Post-quantum
Salsa20 and its power analysis vulnerabilities’, ACM Trans. Des. Autom. cryptoprocessors optimized for edge and resource-constrained devices in
Electron. Syst., 2016, 22, (1), pp. 1–26 IoT’, IEEE Internet of Things J., 2019, 6, (3), pp. 5500–5507

352 IET Comput. Digit. Tech., 2020, Vol. 14 Iss. 6, pp. 344-352
© The Institution of Engineering and Technology 2020

You might also like