This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2019.2911278, IEEE Transactions on Parallel and Distributed Systems

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, MANUSCRIPT ID

Fast AES Implementation: A High-throughput Bitsliced Approach

*Omid Hajihassani, *Saleh Khalaj Monfared, Seyed Hossein Khasteh, Saeid Gorgin

Abstract— In this work, a high-throughput bitsliced AES implementation is proposed, which builds upon a new data representation scheme that exploits the parallelization capability of modern multi/many-core platforms. This representation scheme is employed as a building block to redesign all of the AES stages and tailor them for multi/many-core AES implementation. With the proposed bitsliced approach, each parallelization unit processes an unprecedented number of thirty-two 128-bit input data blocks; hence, a high order of parallelization is achieved by the proposed implementation technique. Based on the characteristics of this new implementation model, the ShiftRows stage can be implicitly handled through input rearrangement and is simplified to the point where its computing cost can be neglected. In this implementation, costly Byte-wise operations are performed through register shift and swapping. In addition, the need for the look-up table-based I/O operations used by the Substitute Bytes stage is eliminated by using the S-box logic circuit, which is optimized to simultaneously process 32 chunks of 128-bit input data. We develop high-throughput CTR and ECB AES encryption/decryption on 6 CUDA-enabled GPUs, which achieve 1.47 and 1.38 Tbps of encryption throughput on the Tesla V100 GPU, respectively.

Index Terms— AES, CTR, ECB, GPU, Data Representation, CUDA, High-performance


1 INTRODUCTION

THE emergence of high-throughput parallel processing units offers incentives, such as better execution performance, to different domains of compute-intensive applications. The trend of applications migrating from crude sequential execution on conventional hardware to high-performance implementation on massively parallel platforms has grown in popularity among various research domains. Among these, medical imaging [1], machine learning [2], [3], [4], and data science [5] are a few of the domains that attract the attention of scholars and researchers to parallel programming.

Data encryption is an indispensable part of today's digital world, guaranteeing the protection and confidentiality of data transfer and storage amongst a diverse user group ranging from conventional users to renowned enterprises and companies. Throughout the years, different encryption techniques and methods have been introduced to meet the aforementioned need for security, among which are several symmetric and asymmetric data encryption algorithms [6], [7], [8], [9]. The Advanced Encryption Standard (AES) [7] is a complex symmetric block cipher designed to succeed the Data Encryption Standard (DES & 3DES) [6]. AES is designed specifically not to be prone to the attacks that DES has shown vulnerability to, such as the brute-force attack [10].

Inseparable features of today's digital world are bulky and large data sets having the four traits of volume, variety, velocity, and veracity, referred to as big data [11], [12]. Unfortunately, most of the encryption standards and algorithms at hand were not in any way designed or implemented to meet the demands of the aforementioned features of big data [13], [14]. Hereby, we strive to propose a data representation scheme that is integrated into the bitsliced implementation of Electronic Codebook (ECB) mode and Counter (CTR) mode AES [15], enabling them to score the highest published and reported encryption throughput achievable on conventional GPUs, while also scaling to high-performing GPUs. Through this, we meet the needs of a wide range of users by accommodating tools of high-throughput AES cryptography.

In order to achieve this goal, we set off to design a bitsliced AES implementation that eliminates the computationally intensive operations in all the stages involved in AES. Moreover, it empowers each processing unit to process more input data at the same time. Firstly, all of the AES stages embed in themselves numerous instructions that require frequent shift and mask operations at the Byte granularity. In platforms that do not support fine-grain Byte-wise operations, these instructions greatly reduce the encryption and decryption throughput. In our proposed implementation model, we introduce a data representation scheme and use it in the alteration of the AES stages so that they have far fewer costly Byte-wise operations. This data representation model exploits register shuffle and

————————————————
• O. Hajihassani is with the Institute for Research in Fundamental Sciences (IPM) and University of Alberta, No. 1, Shahid Farbin Alley, Shahid Lavasani St., Tehran, Iran, P.O. Box 19395-5531. E-mail: hajihass@ualberta.ca.
• S. Khalaj Monfared is with the Institute for Research in Fundamental Sciences (IPM) and K. N. Toosi University of Technology, No. 1, Shahid Farbin Alley, Shahid Lavasani St., Tehran, Iran, P.O. Box 19395-5531. E-mail: monfared@email.kntu.ac.ir.
• S.H. Khasteh is an assistant professor at the K. N. Toosi University of Technology, Faculty of Computer Engineering, Seyed Khandan, Shariati Ave., Tehran, Iran, P.O. Box 16315-1355. E-mail: Khasteh@kntu.ac.ir.
• S. Gorgin is affiliated with the Electrical Engineering and Information Technology Department of the Iranian Research Organization for Science and Technology (IROST) and the Institute for Research in Fundamental Sciences (IPM), Tehran, Iran, P.O. Box 19395-5531. E-mail: gorgin@ipm.ir.
* These authors contributed equally to this work.

1045-9219 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

swapping instead of costly Byte-wise shift and mask operations. This is done by changing the representation of the data from the commonly used row-major to the column-major representation to achieve our predefined goal of high-throughput encryption. Moreover, in our bitsliced implementation, the AES ShiftRows stage is implicitly handled through argument rearrangement in the MixColumn stage inputs. Secondly, with the previously mentioned data representation model, the proposed bitsliced implementation enables each executing thread to operate on 32 blocks of 128-bit input data at the same time. By combining this feature with the diminished need for excessive Byte-wise operations, the proposed bitsliced AES implementation achieves unprecedented orders of speedup.

Here, the ECB and CTR modes of operation for AES are implemented with our proposed bitsliced technique. The proposed representation model is similar to what we have used to implement the bitsliced DES cryptanalysis [10]. Our GPU implementation, which is discussed in detail, fruitfully employs the proposed data representation model and all of the GPU resources, such as the shared memory, resulting in peak encryption throughput. In our work, each GPU thread simultaneously operates on a batch of 32×128-bit data. It is worth noting that the proposed data representation model is not limited to a predefined number of GPUs or to the Compute Unified Device Architecture (CUDA) [16] and can be fully incorporated in implementations aimed at other available parallel hardware platforms.

The remainder of this paper is structured as follows: We give a background on AES in Section 2. Section 3 discusses the related literature on the proposed methods to increase the AES encryption throughput and their published and reported cryptography evaluation results. In Section 4, our proposed bitsliced technique is discussed in depth and its incorporation into all of the AES stages is elaborated on. In Section 5, different AES modes of operation are reviewed and the underlying reason for implementing only the ECB and the CTR modes is outlined. Section 6 gives the ECB and CTR modes' implementation results on different CUDA-capable GPUs and compares these results to previous works; furthermore, evaluation and comments are given for the AES decryption procedure. Finally, Section 7 concludes the paper and outlines the intended future work.

2 BACKGROUND

The basics of the Advanced Encryption Standard were introduced by Rijmen and Daemen in their proposal to the National Institute of Standards and Technology (NIST) in a call for a superseder of the Data Encryption Standard (DES & 3DES) [6], which has proved to be prone and vulnerable to many attacks, such as the brute-force attack [10]. NIST published AES in 2001, and since then AES has been implemented and used by many different cryptography libraries and packages [7].

AES is a symmetric block cipher algorithm, which takes in a 128-bit plain/ciphertext input and encrypts/decrypts it with differently sized encryption keys of 16, 24, or 32 Bytes (i.e. 128, 192, or 256 bits). AES operates based on the size of the encryption key and is referred to as AES-128, AES-192, and AES-256 for the 128-bit, 192-bit, and 256-bit encryption key sizes, respectively. In the original AES documentation and implementation, the 128-bit input is divided into a 4×4 array of 8-bit elements. This original two-dimensional representation scheme is employed in references to illustrate the AES internal stages and is referred to as the state array. The organization of the input data in the state array is illustrated in Figure 1, where S_{m,n} equals the (4×m+n)-th data Byte segment. AES, unlike DES, is not a Feistel structure, in the sense that all of the input is processed at each round of AES instead of the common one-half-at-a-time processing of the Feistel-like block ciphers.

Fig. 1. Organization of the 128-bit input (16 Bytes) into the 4×4 state array S_{0,0} … S_{3,3}.

The encryption in AES includes the expansion of the key and a series of permutations and substitutions at the Byte level. In the Substitute Bytes stage, an S-box Look-Up Table (LUT) is used for Byte-wise substitution of the blocks. Many implementations normally use the S-box LUT. However, since the original substitution stage is based on a firm mathematical foundation, one can calculate the substitute Byte for any given input Byte by using the mathematical formulae and operations in GF(2^8). In this paper, instead of employing the I/O-bound S-box LUT, we set off to employ the mathematical approach proposed in [17] by incorporating our proposed data representation into it. By doing so, our parallelization units are no longer subjected to the burden of costly I/O instructions. This approach to the S-box calculation also makes our AES implementation immune to timing attacks. Another stage of AES is ShiftRows, which includes simple Byte-level shifts and follows simple permutation steps. The permutation employed by ShiftRows is illustrated in Figure 2.

Fig. 2. ShiftRows stage of AES (row r of the state array is rotated left by r Byte positions).

The MixColumn stage operates on each column of the state array. In each column, every single element (i.e. Byte) is substituted by a value that is a function of all the elements in that column. The MixColumn function is handled by multiplication of the specific MixColumn matrix with the state array, which is illustrated in Figure 3. The multiplication operations are in the Galois Field GF(2^8).
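The ShiftRows permutation of Figure 2 can be made concrete with a short C sketch; this fragment is our illustration (array layout and function name are ours), with the state laid out as in Figure 1:

```c
#include <stdint.h>
#include <string.h>

/* ShiftRows on the 4x4 AES state: row r is rotated left by r
 * Byte positions, matching Figure 2. state[r][c] holds S_{r,c}. */
static void shift_rows(uint8_t state[4][4]) {
    for (int r = 1; r < 4; r++) {
        uint8_t row[4];
        for (int c = 0; c < 4; c++)
            row[c] = state[r][(c + r) % 4];  /* rotate left by r */
        memcpy(state[r], row, 4);
    }
}
```

Row 0 is left untouched; note that in a bitsliced setting this whole permutation reduces to register renaming, which is why the stage can later be absorbed into input rearrangement.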


    [2 3 1 1]   [S0,0 S0,1 S0,2 S0,3]   [S'0,0 S'0,1 S'0,2 S'0,3]
    [1 2 3 1] × [S1,0 S1,1 S1,2 S1,3] = [S'1,0 S'1,1 S'1,2 S'1,3]
    [1 1 2 3]   [S2,0 S2,1 S2,2 S2,3]   [S'2,0 S'2,1 S'2,2 S'2,3]
    [3 1 1 2]   [S3,0 S3,1 S3,2 S3,3]   [S'3,0 S'3,1 S'3,2 S'3,3]

Fig. 3. MixColumn stage performed by a matrix multiplication in GF(2^8).

The AddRoundKey stage simply XORs each round's key with its input block. The round key size is the same in all the variants of AES and is built from the expanded encryption key. AES encryption begins with a single AddRoundKey stage, followed by 9, 11, or 13 rounds (for AES-128, AES-192, and AES-256, respectively) that include all the AES internal stages (Substitute Bytes, ShiftRows, MixColumn, and AddRoundKey). This is followed by a single final round including only three stages, excluding the MixColumn stage.

3 RELATED WORKS

Since its introduction, AES has been the subject of different improvement and enhancement techniques targeted at its design and implementation. In the second version of the Rijndael document [18], Rijmen and Daemen proposed the use of a T-box look-up table instead of the initially improvised S-box look-up table employed in the Substitute Bytes stage of AES. The proposed T-box approach fuses together the Substitute Bytes and the MixColumn stages of the original AES and undertakes them by the use of preprocessed look-up tables. Moreover, instead of using LUTs, one can implement and employ mathematical formulae in the Galois finite field to directly calculate the tables.

In the direct mathematical implementation of the Substitute Bytes stage, many have strived to reduce the size of the 8-bit Galois field S-box to smaller 4- and 2-bit calculations [19], [20]. The approach that we built upon and modified to fit our proposed data representation scheme is discussed in [17]. This work gives a compact implementation of the S-box by employing the "tower field" approach from [21]. This approach to the S-box, using subfield arithmetic and logic gate optimization, can be implemented with fewer logic gates and hence requires fewer instructions in a software design.

AES is implementable on different platforms such as microcontrollers, CPUs, GPUs, and FPGAs. For instance, OpenSSL is a fully commercial cryptography toolkit for the TLS and SSL protocols [22], which accommodates different AES modes of operation, such as electronic codebook (ECB), cipher block chaining (CBC), and output feedback (OFB), supporting a wide range of platforms including 32 and 64-bit ARM and Intel CPUs. Due to the popularity of AES, Intel designed a specific instruction set dedicated to AES cryptography [23]. This set of instructions is referred to as AES-NI. On a 2.7 GHz Intel Core i5-5257U CPU, OpenSSL using AES-NI outperforms the naïve OpenSSL implementation by roughly six-fold. AMD offers Secure Memory Encryption (SME) [24], which performs encryption on the data stored in DRAM memories. SME secures data from different attacks, offering fast in-memory AES encryption to a vast range of applications.

The authors of [25] proposed a technique for an efficient design of the MixColumn stage on FPGA. This technique reduces the multiplication by 3 of each element to a single multiplication by 2 and an addition in GF(2^8), instead of calculating a separate multiplication by 3 for every element in the state array. In our implementation, we utilize and modify this MixColumn technique.

For the GPU, there have been many works and implementations that utilize the high-throughput parallel computation power for the AES algorithm. Among these works, for instance, [26] gives an implementation of AES in CBC mode, relying on the fact that the decryption procedure in this mode can be implemented in parallel, which results in a much faster run (roughly 35 Gbps) on the GPU compared to AES-NI [23]. Furthermore, [27] gives an implementation based on the T-box look-up table, with round keys and T-boxes stored in the shared memory. This work investigates the impact of different input sizes on the overall performance of the GPU. [28] presents an optimization of AES by accommodating 32 Bytes per thread on a GTX 1080, achieving a high throughput of 280 Gbps in ECB encryption.

The bitsliced implementation was employed by Eli Biham in DES cryptography [29]. In this approach, a 64-bit CPU was viewed as 64×1-bit CPUs working in SIMD manner on the input data. Therefore, the costly shift and mask instructions in permutation and expansion operations were replaced by simply referencing different registers. Many adapted this representation view and proposed bitsliced AES implementations [30], [31]. Due to the evolution of GPU hardware, it is now possible to implement the bitsliced AES on different GPUs. Reference [30] proposed a bitsliced ECB-AES implementation on GPU with a different and flexible number of plaintext blocks operated on by a single thread. The aforementioned work roughly achieves 606 Gbps in the configuration where each thread packs four plaintext chunks into eight 64-bit variables, referred to as Bs64. Works [32] and [33] have implemented the CTR operation mode for AES on modern GPUs. In Table 1, a compilation of all these works, alongside other look-up table-based implementations [34], [35], is given. Furthermore, Appendix G gives more detailed information on each of these works.

TABLE 1
PROPOSED WORKS ON AES

Ref.   Mode   Throughput (Gbps)   Platform   Year
[26]   CBC    34                  GTX 580    2011
[35]   ECB    60                  C2050      2012
[32]   CTR    72                  C2050      2012
[34]   ECB    205                 HD 7970    2014
[33]   CTR    149.5               GTX 980    2016
[31]   ECB    78.6                GTX 480    2016
[30]   ECB    605.9               P100       2017
[27]   ECB    79                  K40m       2017
[28]   ECB    280                 GTX 1080   2017

The problem with the already proposed bitsliced approaches is their view on representing the plaintext input.
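The MixColumn decomposition of [25] described above (computing 3·a as 2·a ⊕ a in GF(2^8)) can be sketched in C as follows; this fragment is our illustration, not the authors' FPGA design, and the helper name `xtime` is ours:

```c
#include <stdint.h>

/* Multiply by 2 in GF(2^8) with the AES reduction polynomial
 * x^8 + x^4 + x^3 + x + 1 (0x11b). */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* MixColumn on one 4-Byte column of the state array (Fig. 3),
 * with every multiplication by 3 decomposed as xtime(a) ^ a
 * instead of a separate multiply-by-3, as in [25]. */
static void mix_column(uint8_t c[4]) {
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    c[0] = (uint8_t)(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3);
    c[1] = (uint8_t)(a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3);
    c[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3));
    c[3] = (uint8_t)((xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3));
}
```

On the worked MixColumns column from FIPS-197, (db, 13, 53, 45) maps to (8e, 4d, a1, bc).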


The representation utilized in these works demands costly Byte-level shift and mask operations at different stages, including ShiftRows. Our proposed representation, on the other hand, eliminates the need for the aforesaid costly operations throughout all of the AES stages. In addition, stages like Substitute Bytes and MixColumn are fully tailored to fit the maximum-throughput criteria of our work. Each GPU thread, in our work, encrypts on the order of 32×128-bit chunks of input data in SIMD manner. Combined with the fact that, in our suggested method, the number of instructions does not increase linearly with the number of data elements, this greatly increases the implementation throughput.

4 PROPOSED BITSLICED APPROACH

In this section, we introduce our novel bitsliced AES implementation by first proposing our bitsliced column-major representation and explaining how it can be applied to potential applications through a simple example. Then, the new data representation model for AES and its advantages compared to the previous techniques are presented. Afterwards, we incorporate the introduced representation scheme into the implementation of all the AES stages to show its efficiency.

4.1 Bitsliced SIMD Vectorization

The bitslicing technique was initially introduced for cryptographic implementations to accelerate the encryption/decryption procedure of different cryptographic algorithms [29]. Recently, bitslicing has been utilized to simplify the circuit architecture and reduce hardware cost for quantum microprocessors [36]. Armed with the fact that the costly shift and rotation operations can be reduced to simple swapping in the bitsliced representation, the column-major representation can increase the utilization of computation units in specific algorithms.

Before getting into the detailed proposed implementation of AES, in this section we show that our proposed bitslicing approach can be successfully applied to applications beyond cryptography. To emphasize this, we investigate a simple example by applying our bitsliced approach to a Linear Feedback Shift Register (LFSR).

A simple 16-bit LFSR is illustrated in Figure 4. A new pseudo-random bit is generated at every clock (the MSB of register R). Assuming a 16-bit-wide architecture (16-bit registers) for a hypothetical circuit, at each clock, to produce a new bit, the register needs to be shifted and the Least Significant Bit (LSB, i.e. R_0) is to be generated at the bit level according to the feedback polynomial function g(x) = x^15 + x^13 + x^12 + x^10. The LSB is updated at every clock by accessing R at bit-level granularity, as shown in Equation (1).

R_0 = R_15 ⊕ R_13 ⊕ R_12 ⊕ R_10     (1)

Note that ⊕ represents the bit-wise XOR. In this architecture, to generate 2^10 = 1024 pseudo-random bits, R should be shifted 1024 times, and at each clock the LSB should be updated.

Fig. 4. The 16-bit LFSR (registers R0 … R15) for pseudo-random number generation.

This, of course, is the naïve implementation of the LFSR. In the more sophisticated parallel SIMD implementation, which is illustrated in Figure 5, by employing 16 threads with one set of registers each (which are carefully initialized), the number of clock cycles and shift operations can be reduced to 64 per thread, which adds up to 16×64 (i.e. 1024). In this parallel implementation approach, the need for shifts and costly bit-level accesses still remains.

Fig. 5. Parallel LFSR implementation (16 threads, each with its own registers R0 … R15).

Figure 6 represents the bitsliced implementation approach to the highly vectorized LFSR. In this approach, R_i represents the i-th register, containing 16 bits from 16 different data chunks at the identical bit position. Changing the representation in this manner (column-major) results in full utilization of the XOR operator, as all 16 bits of the registers are XOR-ed in a single instruction (in the 16-bit architecture). Now, each execution thread operates on 16 bits at each clock. In this way, instead of 16 threads XOR-ing and operating on 16×1-bit registers at a time, just one thread operates on a 1×16-bit register, which improves the processor datapath utilization.

Fig. 6. The bitsliced SIMD approach with the column-major representation for the LFSR circuit.

In our bitsliced approach, the previously stated parallel SIMD scenario is reduced to 16 registers with no need for costly bit-level operations and bit-level shifting in each thread. Also, by employing our proposed approach, the calculation of the LSB at each cycle is reduced to a single register-level operation for all 16 LSB bits (R_0). This results in a huge speed-up over the conventional SIMD parallel implementation. By employing 16 threads, to have a comparison with the previous SIMD scenario, the 1024 pseudo-random bits can be generated in only 4 clock cycles. Note that in this approach 16 registers are used in a single thread, which amounts to a total of 16×16 registers. As a side note, to generate pseudo-random bits, the registers should be initialized carefully with non-identical and preferably random values (initial states) to prevent the possibility of collision.
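The scalar and bitsliced LFSR variants above can be contrasted in a short C sketch. This is our illustration, under our reading of Equation (1) (feedback taps at bit positions 15, 13, 12, and 10); the function names are ours:

```c
#include <stdint.h>

#define LANES 16   /* one bit per lane in each 16-bit register */
#define NBITS 16   /* LFSR width */

/* Scalar reference: one step of the 16-bit LFSR of Fig. 4.
 * Returns the output (MSB) bit; the feedback bit enters at the LSB. */
static int lfsr_step_scalar(uint16_t *s) {
    int out = (*s >> 15) & 1;
    int fb  = ((*s >> 15) ^ (*s >> 13) ^ (*s >> 12) ^ (*s >> 10)) & 1;
    *s = (uint16_t)((*s << 1) | (uint16_t)fb);
    return out;
}

/* Bitsliced step (Fig. 6): R[i] holds bit i of all 16 lanes.
 * One XOR sequence computes the feedback bit of every lane at
 * once; the "shift" is mere register renaming. Returns the 16
 * output (MSB) bits, one per lane. */
static uint16_t lfsr_step_bitsliced(uint16_t R[NBITS]) {
    uint16_t out = R[15];
    uint16_t fb  = R[15] ^ R[13] ^ R[12] ^ R[10];
    for (int i = NBITS - 1; i > 0; i--)  /* renaming in hardware */
        R[i] = R[i - 1];
    R[0] = fb;
    return out;
}
```

In the bitsliced form, the per-step shift disappears into register renaming, and a single XOR sequence produces the new LSB of all 16 lanes at once, which is the speed-up argued above.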


4.2 Data Representation Scheme

Based on the related works on the implementation of AES, emerging GPU technology makes it possible for the bitsliced AES to be accelerated on the massively parallel platform of GPUs. The previous bitsliced methods of AES utilize a data representation model that is in need of numerous shift and mask operations in different AES stages. In our suggested technique, by introducing a new representation scheme to store the plaintext or ciphertext data chunks, we strive to eliminate all of the shift and mask instructions in all of the AES stages. By doing so, we reach an unprecedented order of AES encryption and decryption throughput. By altering the representation of data from the conventional row-major to the column-major representation, 32 blocks of 128-bit data are encrypted/decrypted at any instance of time by each parallelization unit. By the row-major representation, we refer to the data representation in which each 128-bit chunk of data is stored in four 32-bit registers. By the conventional representation, we are not referring to the representation used in [30], [31]; the conventional representation indicates the way that the data is stored in common programming practice.

We illustrate the effect of the change in representation on 32×128-bit input data. In the conventional representation, 32×4×32-bit registers are used; this means that each of the thirty-two 128-bit data chunks is stored in 4×32-bit registers. In the new column-major representation, 128×32-bit registers are needed to store the same input data. In this implementation, the proposed column-major representation suggests a new way to store the data chunks in a set of registers in which 128 32-bit registers are employed. This way, R_0, instead of holding b_{0,0}, b_{0,1}, b_{0,2}, b_{0,3}, …, b_{0,127} from the first data chunk (b_{n,m} indicates the m-th bit of the n-th data chunk), holds b_{0,0}, b_{1,0}, …, b_{31,0}, which are the least significant bits of all of the 32 data chunks. Now, the first register, having an index of zero, stores the least significant bits of all the 32 chunks of 128-bit data.

In this representation, each register stores bits that have an identical position in all the 32 data chunks. In other words, R_0 stores the least significant bits, R_1 stores the second bits of all the 32 data chunks, and so on, until R_127 stores the 127-th bits of all the 32 data chunks. The alteration from the row-major to the column-major representation is illustrated in Figure 7. The left part of Figure 7 gives the representation of the data in the row-major format, and the set of registers on the right side of the figure indicates the column-major representation.

Fig. 7. The change in representation from the conventional row-major (left: data chunks D0 … D31, bits b127 … b0) to the column-major representation (right: registers R0 … R127, one bit per data chunk).

The benefit of the proposed data representation scheme is that a single GPU thread processes 32 chunks of 128-bit data at each instance of time in SIMD manner. In addition, the storage of all the bits of data in the column-major representation eliminates the need for the iterative shift and mask operations involved in different stages of AES. For example, if we want to replace and shift b_0 with b_1, we simply change the references of the R_0 and R_1 registers. By doing so, and knowing that R_0 stores the LSBs of all the 32 data chunks, the shift operation is performed on all the 32 data chunks in a single instance of time. Moreover, by only changing the references of the registers through the input arguments of function calls, the need for extra temporary registers is diminished.

4.3 Substitute Bytes

In this section, based on the proposed data representation model, we implement the Substitute Bytes stage of AES so that it simultaneously operates on 32 chunks of 128-bit data. The original Substitute Bytes stage of AES takes in an 8-bit segment of the 128-bit data and substitutes that 8-bit segment with a different Byte. First, we illustrate the use of LUTs to undertake the Substitute Bytes stage. Then we thoroughly show that, with the proposed representation, if we had utilized LUTs to handle the Substitute Bytes stage of AES, we would have had to use numerous shift and mask operations to both extract and substitute the data.

In Figure 8, Substitute Bytes is handled by the utilization of a preprocessed look-up table. For the LUT-based implementation, the input data size is equal to 8 bits.

Fig. 8. Using the look-up table-based S-box with the proposed column-major representation: 8-bit segments are (1) extracted from the registers, passed through the S-box look-up table, and (2) substituted back.

Note that the LUT in this figure takes in and returns 8-bit data. Moreover, the transformation of the representation is handled in the same way as in Figure 7. In this approach to Substitute Bytes:
1. The data needs to be extracted from all of the corresponding registers.
2. The old 8-bit value is replaced with the new value.
These two functionalities require costly shift and mask


operations. For example, in Figure 8, as the S-box look-up table takes in 8-bit inputs each time, the data of D30 is read from eight registers out of the total of 128 registers, 16 times, through several Byte-level shift and mask operations, and the new value is written back to the corresponding eight registers each time.

Generally, for all the 32 data chunks, the 8-bit value needs to be extracted and then replaced with its new value. Therefore, with LUTs, our proposed representation imposes the heavy cost of the required shift and mask operations on the execution time. Moreover, the use of LUTs is costly due to the extensive I/O-bound instructions needed for accessing the LUTs for each 8-bit segment of the total 128-bit input data. This means that to perform the Substitute Bytes stage on 32 data chunks, the number of I/O operations increases 32 times as well, and many more shift and mask operations need to be used. Hence, by employing this approach, the SIMD execution feature of our implementation is undermined, because the number of instructions in the code scales linearly with the number of presented data elements. As the number of operations increases linearly, there is no gain from the increased number of processed data elements.

To eliminate the cost of the I/O-bound operations and the cost of the extensive shift and mask operations in the aforementioned extraction and replacement tasks, we seek an alternative approach to implement the Substitute Bytes stage. Thus, we build upon and modify the compact approach to Substitute Bytes introduced in [17]. That work gives a very compact algorithm for the S-box calculation with 253 logic gates. The algorithm given in [17] uses subfields of 4 and 2 bits for the 8-bit S-box calculations.

The implementation of the S-box stage by means of [17] and our proposed representation is given in Figure 9. Here, the S-box takes in 8-bit data values. Unlike the LUT approach, instead of inputting a single 8-bit value 16 times per 128-bit data chunk, R0, R1, R2, …, and R127 are passed into the S-box logic circuit eight registers at a time, and the logic gates affect all the 32 data chunks at once. This way, without a linear increase in the number of instructions corresponding to the number of data elements, 32 chunks of data are operated on in parallel.

Fig. 9. Using the logic circuit-based S-box with the column-major representation.

In sum, instead of inputting 8 bits at a time to the S-box function, eight 32-bit registers are passed in and operated on. This way, 32 data chunks are operated on in parallel. It should be mentioned that, instead of passing 8 bits out of a single 128-bit data chunk, eight registers out of the total of 128 registers representing 32×128-bit data are passed into the S-box logic circuit. The CUDA implementation of the S-box with our bitsliced approach is given in Appendix A.

4.4 MixColumn
The MixColumn stage of AES is handled by a matrix multiplication of the state array with the MixColumn matrix. This matrix multiplication is in GF(2^8) and is shown in Figure 3. In the conventional representation of the state array, each element represents 8 bits from a single 128-bit data. In Figure 3, S_{0,0} represents b0, b1, b2, …, b7. This can be expressed by the following formula. In Equation (2), S_{n,m} indicates the element in the m-th column and n-th row of the state matrix, and index equals 32 × m + 8 × n.

S_{n,m} = b_{index+0}, b_{index+1}, …, b_{index+7}    (2)

In our implementation, we change the state array with our proposed representation in a way that each element of the state array, instead of representing 8 bits from a single data chunk, represents 8 registers accommodating identical bits from 32×128-bit data chunks. By identical bits, we mean that each register stores and represents bits with identical indices from all the 32 data chunks. Figure 10 demonstrates the organization of all the 128 registers into the new state array.

Equation (3) gives the relationship between m, n, and the batch of corresponding registers comprising each R_{n,m}. index is parameterized in the same way as it is initialized in Equation (2).

R_{n,m} = R_{index+0}, R_{index+1}, …, R_{index+7}    (3)

Fig. 10. The organization of all the 128 registers into the register-based state array.

What this new state array means is that instead of representing only 8 bits from a single data chunk, each state array element represents eight registers embedding 32 chunks of data in themselves. Now, we consider the multiplication of the MixColumn matrix with the proposed state array. The new value of R_{i,j} (i.e., R'_{i,j}) is calculated as indicated by Equation (4).

R'_{i,j} = 2 × R_{i,j} ⊕ 3 × R_{(i+1) mod 4, j} ⊕ R_{(i+2) mod 4, j} ⊕ R_{(i+3) mod 4, j}    (4)

In Equation (4), a number of register shift operations handle the multiplication by 2. Moreover, in our representation, there is no need for Byte-level shift operations, as


one can substitute the registers with one another to perform the needed shift operations, as mentioned in Section 4.2. It is important to notice that the multiplication is in GF(2^8), and in the case of an overflow, the result of the multiplication is reduced modulo the polynomial g(x) = x^8 + x^4 + x^3 + x + 1. When the registers are shifted to the next register for the purpose of multiplication, the overflow of the multiplication can be detected when the MSB of the original 8-bit value is equal to 1. For example, the result of the multiplication of X by 2 is given in Algorithm 1.

Algorithm 1: Multiplication of X by 2 in GF(2^8)
Input: X = x0, x1, …, and x7
Output: 2 × X
    Carry = x7
    x7 = x6
    x6 = x5
    x5 = x4
    x4 = x3 ⊕ Carry
    x3 = x2 ⊕ Carry
    x2 = x1
    x1 = x0 ⊕ Carry
    x0 = Carry

The multiplication by 3 can be reduced to a multiplication by 2 and an addition (XOR) in GF(2^8). Hence, in our implementation code, Equation (4), which represents the calculation of one element of the state array in the MixColumn stage, is expanded into Equation (5).

R'_{i,j} = 2 × R_{i,j} ⊕ 3 × R_{(i+1) mod 4, j} ⊕ R_{(i+2) mod 4, j} ⊕ R_{(i+3) mod 4, j}    (5)
        = 2 × (R_{i,j} ⊕ R_{(i+1) mod 4, j}) ⊕ R_{(i+1) mod 4, j} ⊕ R_{(i+2) mod 4, j} ⊕ R_{(i+3) mod 4, j}

For example, based on Equation (5), R'_{0,0} is calculated as follows.

R'_{0,0} = 2 × R_{0,0} ⊕ 3 × R_{1,0} ⊕ R_{2,0} ⊕ R_{3,0}
         = 2 × (R_{0,0} ⊕ R_{1,0}) ⊕ R_{1,0} ⊕ R_{2,0} ⊕ R_{3,0}

The CUDA implementation of the MixColumn stage is available in Appendix B.

4.5 ShiftRows
The ShiftRows stage of AES embeds the shift and permutation of state array elements. As already indicated, in the conventional representation, each element of the state array accommodates 8 bits from a 128-bit data. However, in our proposed representation, as mentioned in the MixColumn stage, each element of the state array represents eight registers out of the total of 128 registers.

The ShiftRows function performed on the register-based state array is the same as the approach in Figure 2 (with R_{n,m} considered instead of S_{n,m}). In this representation, ShiftRows is implemented by register shift and swapping instead of the costly Byte shift and swapping. In the actual implementation of the ShiftRows stage, the register shift and swapping is implicitly expressed in the code by changing the input references of the MixColumn stage. Hence, the ShiftRows stage is embedded into the arrangement of the MixColumn arguments, and there is no need to perform any explicit shift operations at any granularity.

4.6 AddRoundKey
In the AES implementation on the GPU, all of the round keys for all of the AES rounds are processed once on the host CPU. By processing the value of the round keys once, no computational burden is imposed on the GPU execution. However, storing the preprocessed round keys imposes I/O operations and memory complexity on the execution.

As mentioned before, by using the proposed column-major representation, each GPU thread simultaneously operates on 32 data chunks. Hence, in order to apply the round key to all of the 32 data chunks in parallel, each round key is expanded into 128 registers and then XORed with the 128 registers from the MixColumn stage. Therefore, in the processing of the round keys, the column-major representation should be adopted in order to represent the value of all the round keys. The value of each of the 128 bits of a single round key should be copied 32 times to fill each register out of the total of 128 registers. As the round key bits are duplicated, each of the 128 round key registers is either 0xFFFFFFFF or 0x00000000.

The state array representation of the 128 registers accommodating a single round key is similar to Figure 10. In this case, each element, instead of being R_{n,m}, is K_{n,m}, which represents 8 registers out of the total of 128 registers, K0, K1, …, and K127.

Figure 11 illustrates the result of the XOR operation on the elements of the key state array and the register-based state array from the MixColumn stage output. Here, the elements having identical indices from both of the state arrays are XORed together.

We have employed two different techniques in our AES implementation to store and handle the keys. One way is to store the round keys and keep the values of all the round keys in the GPU's global memory. Then, the 128 registers representing each round key are moved to the GPU's shared memory to be used. This should be done one round key at a time, due to the fact that the size of the generated round keys exhausts the available shared memory space. For example, in the case of AES with n rounds, n × 128 32-bit registers are needed. This number of registers easily exceeds the available shared memory space for each thread. By storing the round keys in the global memory and fetching them into the shared memory at the end of each round, several I/O accesses must be made, which hinders the achievable throughput.

In order to diminish the damaging effect of the required I/O accesses in the previous approach and to fine-tune our execution performance, we propose a technique in which the round key values are etched into the code in a scripting manner before the compilation of the code. To do so, we propose the use of PyCUDA to generate the unique execution code accommodating the round key values from a generic AES code. In this approach, to eliminate the use of "if" statements to select the round key values each time, we have employed logical multiplexer-like code with a one-hot encoded selector input. This way, there is no divergence present in the code.
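The bitsliced GF(2^8) arithmetic above, Algorithm 1 in particular, can be sketched in plain host-side C++ (the paper's actual kernels are CUDA; the helper names here are ours, not the authors'). Each of the eight slice registers holds one bit position of the same byte location across 32 data chunks, so multiplication by 2 reduces to renaming registers and XORing three of them with the carry slice:

```cpp
#include <array>
#include <cstdint>

// s[k] holds bit k of one byte position across 32 chunks (lane i = chunk i).
using Slice8 = std::array<uint32_t, 8>;

// Pack byte b[i] of chunk i into bit i of each of the eight slices.
Slice8 pack(const std::array<uint8_t, 32>& b) {
    Slice8 s{};
    for (int k = 0; k < 8; ++k)
        for (int i = 0; i < 32; ++i)
            s[k] |= uint32_t((b[i] >> k) & 1u) << i;
    return s;
}

std::array<uint8_t, 32> unpack(const Slice8& s) {
    std::array<uint8_t, 32> b{};
    for (int k = 0; k < 8; ++k)
        for (int i = 0; i < 32; ++i)
            b[i] |= uint8_t(((s[k] >> i) & 1u) << k);
    return b;
}

// Algorithm 1: multiplication by 2 in GF(2^8) on 32 bytes at once.
// Reduction modulo x^8 + x^4 + x^3 + x + 1 XORs the carry into bits 4, 3, 1.
Slice8 xtime_sliced(const Slice8& x) {
    const uint32_t carry = x[7];   // MSB slice of all 32 bytes
    Slice8 r{};
    r[7] = x[6];
    r[6] = x[5];
    r[5] = x[4];
    r[4] = x[3] ^ carry;
    r[3] = x[2] ^ carry;
    r[2] = x[1];
    r[1] = x[0] ^ carry;
    r[0] = carry;
    return r;
}
```

Multiplication by 3, as used when expanding Equation (4) into Equation (5), is then xtime_sliced(x) XORed slice-by-slice with x; the register renamings cost nothing in the real kernels, since they only amount to choosing different operand registers.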


Fig. 11. Register-wise XORing of the register-based state array and the round key state array.

5 AES MODES OF OPERATION
AES is a block cipher cryptography algorithm, which takes in a single 128-bit plaintext/ciphertext alongside a key value and returns a 128-bit output. By applying AES encryption to inputs greater than 128 bits in size, a couple of security issues arise. If the same key is used to encrypt each 128-bit segment, then in the case in which two or more of the 128-bit input segments are alike, their encryption renders the same results. This gives an attacker the opportunity to exploit this feature to analyze the plaintext data.

Different modes of operation were devised to apply block cipher algorithms to data streams. Four out of the five modes of operation used by AES are immune to the mentioned security risk. Here, we do not intend to cover these modes of operation in detail; other resources have done a great job of that [9], [15]. What we are concerned with is the parallelization level of these operation modes.

The simplest and most straightforward mode of operation, which is prone to the previously mentioned security risk, is the electronic codebook (ECB) mode. Other modes, such as cipher block chaining (CBC), cipher feedback (CFB), output feedback (OFB), and counter (CTR), are designed to be immune to the discussed risk on data streams. Among all these five modes of operation, only two, the ECB and CTR modes, are capable of being implemented in a parallel manner [9]. The lack of parallelism in the other modes, namely CBC, CFB, and OFB, is due to the chained encryption scheme embedded in their core design. In all of these modes, each block's encryption relies on the encryption results of its previous blocks. This renders the encryption of the 128-bit segments in a data stream sequential in nature.

In this paper, we have applied the proposed bitsliced representation scheme to all the stages of the AES-128 algorithm. We have used the proposed representation in the implementation of the ECB and CTR operation modes of the AES algorithm. It is worth noting that in the CTR mode, the encryption results from the incremental counter inputs can be processed in advance and kept for the XORing phase with the original plaintext input. However, where possible, we intend to recompute these intermediate results for different data stream inputs. In the following section, the achievable throughput results from the implementation of the ECB and the CTR modes are given.

6 EVALUATION RESULTS
In this section, we present the evaluation results of the throughput achieved from the execution of the ECB and the CTR modes of operation for AES on a number of CUDA-enabled GPUs. In this study, six Nvidia GPUs are deliberately selected for evaluation purposes. The employed GPUs have different build characteristics, such as different single and double precision throughput and memory bandwidth. These features are carefully selected to represent a wide range of execution platforms. Firstly, the range of the selected GPUs represents the platforms available to a wide range of users, spanning home and enterprise users. Secondly, these GPUs give us a fair comparison with the previously proposed methods.

6.1 Setup
The systems that we have employed in order to evaluate the proposed methods on many-core platforms are discussed here. The GPUs employed for the evaluation of the proposed solutions include the GTX 980, GTX 1050 Ti, GTX 1080, and GTX 1080 Ti, which are all on machines with two Intel Xeon E5 2697 V3 CPUs clocked at 2.6 GHz. Each machine has 128 GB of DDR3 RAM. Also, we evaluate our algorithm on Tesla P100 and V100 accelerators on cloud platforms to prove the scalability of our work and also to present a fair comparison with the best known work [30]. We used CUDA version 9 in the programming of the GPUs. The specifications of all these GPUs are given in Table 2. The employed GPUs are based on state-of-the-art microarchitectures such as Maxwell and Pascal.

TABLE 2
SPECIFICATION OF THE EMPLOYED GPUS

Name         Single Precision (Gflops)   Double Precision (Gflops)   Memory Bandwidth (GB/s)
GTX 980      4612                        144                         224
GTX 1050 Ti  1981                        62                          112
GTX 1080     8228                        257                         320
GTX 1080 Ti  10609                       332                         484
Tesla P100   8071                        4036                        732
Tesla V100   14028                       7014                        900

6.2 Performance Profiling
By using the Nvidia Visual Profiler and the Nvidia CUDA Profiling Tools Interface (CUPTI), the performance of the proposed implemented algorithm is analyzed in detail. Moreover, the hardware cost and the overall throughput of the implemented sub-functions are determined. Appendix C gives the detailed report for our implementation. This data is extracted by the Nvidia Visual Profiler and Nvidia Nsight Compute. Note that the results are for the implementation of AES-ECB encryption on the GTX 1050 Ti. It has not escaped our notice that, due to the huge data-intensive operation in AES,


shared memory is used to minimize the I/O delay. Our experiments show that the overall resource utilization is quite acceptable for our implementation, which results in a huge performance gain.

6.3 ECB Implementation
Here, the performance results from the execution of the ECB mode of operation implemented with our proposed bitslicing approach are given. In this implementation, we use shared memory to store and fetch the input data for each thread from the global memory. Moreover, each GPU thread processes and operates on 32 chunks of 128-bit data. Table 3 gives the performance achieved by the implementation of the proposed ECB mode on all of the GPUs given in Table 2. The performance per cost metric of each GPU platform is also provided to give a fair evaluation across different GPUs, since the achievable performance is highly related to the employed platform. This metric is obtained by normalizing the performance value of each GPU platform (by dividing the performance value by its approximate GPU cost). The detailed procedure of the performance per cost calculation is given in Appendix G. Note that GPU prices may change over time; hence, the values should be updated in case prices are altered.

TABLE 3
ECB ENCRYPTION THROUGHPUT

Name         Throughput (Gbits per sec.)   Performance/GPU cost (Gbits per sec./$)
GTX 980      489.44                        1.529
GTX 1050 Ti  296.74                        1.484
GTX 1080     805.16                        1.008
GTX 1080 Ti  1156.07                       0.896
Tesla P100   832.58                        0.171
Tesla V100   1384.51                       0.209

The peak performance is achieved by the Nvidia Tesla V100, which equals 1.38 Terabits per second of encryption throughput. This is achieved by a GPU grid of 524,288 threads. These threads are organized into 8192 blocks, each having 64 threads. Here, 2 Gigabits of data that was already stored in the GPU's global memory is encrypted in 0.00144 seconds.

The top achieving bitslicing approach proposed in [30] achieves roughly 606 Gbits of encryption throughput on a Tesla P100 Nvidia GPU. In comparison with this solution, our implementation on the same platform (Tesla P100) improves the AES-ECB throughput by 1.37x. Our proposed bitsliced technique on the GTX 1080 Ti is observed to outperform the work from [30] by 7.1x in terms of the performance per cost metric, which is a considerable achievement. This achievement in the performance per cost metric is due to the fact that our implementation can be applied at single precision (32-bit) granularity, which is the main processing power available in a wide range of cheap and general-purpose GPUs like the GTX 1080 and GTX 1080 Ti, whereas the Tesla P100 is known for its powerful double precision (64-bit) computation and is suitable for high-tech scientific applications and R&D purposes.

A compilation of leading previous works proposed on the ECB implementation is given in Figure 12. The GPUs used by each work are outlined in Table 1. In Figure 12, we compare our work with the most recently proposed high-performance works given by [30] and [28] on the same reported GPU platforms. Upon our request, the authors of [31] provided us with the bitsliced ECB implementation code from [31], which we test on the GTX 1080 Ti for a fair comparison. Moreover, due to the lack of access to the outdated GPUs employed by others and to the implementation code of other works, the performance per cost metric is proposed, in Figure 12, to give a fair comparison with the other referenced works. The detailed evaluation of the reported performance per cost metric for these works is given in Appendix G. In Figure 12, the left vertical axis shows the performance range in Gbps, and the right vertical axis shows the performance per cost range in terms of Gbps/$. Moreover, our work is also presented by the performance achieved on the GTX 1080 Ti and Tesla V100 GPUs, which prove to be our best evaluated implementations in terms of performance.

The work from [34] is benchmarked on a Radeon HD 7970, which is close to the GTX 1050 Ti in single precision processing power. The GTX 1050 Ti outperforms the implementation from [34] by 1.44x. However, the GPU employed in [34] is not based on the CUDA architecture.

The stage involved with the change of the representation is two orders of magnitude faster than the AES encryption phase. Furthermore, the PCI-e connections at hand cannot handle the performance of our AES implementations. However, the available NVLink 2.0 [37] interconnection speed can handle a data communication rate of up to 300 GBytes/s.

Fig. 12. The AES-ECB Encryption performance and performance/cost achieved by our work and other related efforts.

6.4 CTR Implementation
In this section, the achievable AES-CTR encryption throughput of all the GPUs given in Table 2 is thoroughly discussed. Here, the counter mode of operation for AES is implemented with the proposed bitslicing approach in our work. In this implementation, each thread operates on and encrypts 32×128-bit data chunks. Here, each GPU grid is comprised of 524,288 threads organized into 8192 blocks, each having 64 threads. 32,768 Bytes of shared memory is used by each GPU block, accommodating the input data for its 64 threads. Table 4 illustrates the throughput achieved by the implementation of the proposed CTR mode.
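The conversion between the row-major input and the column-major working form that the ECB path performs is, at heart, a bit-matrix transpose: 32 chunks of 128 bits become 128 registers of 32 bits, i.e., four independent 32×32 bit transposes per thread, one per 32-bit word position. The sketch below is our host-side C++ illustration of one such 32×32 transpose using the classic masked-swap method, not the authors' CUDA code:

```cpp
#include <array>
#include <cstdint>

// Transpose a 32x32 bit matrix in place: a[r] is row r, bit c of a[r] is
// column c (LSB-first). Afterwards, bit r of a[c] equals the old bit c of a[r].
// Masked-swap method: swap off-diagonal blocks at scales 16, 8, 4, 2, 1.
void transpose32(std::array<uint32_t, 32>& a) {
    uint32_t m = 0x0000FFFFu;
    for (int j = 16; j != 0; j >>= 1, m ^= m << j) {
        // Visit every row whose bit j is clear; pair it with the row j above.
        for (int k = 0; k < 32; k = (k + j + 1) & ~j) {
            uint32_t t = ((a[k] >> j) ^ a[k + j]) & m;
            a[k]     ^= t << j;
            a[k + j] ^= t;
        }
    }
}
```

Transposing back after encryption is the same routine, which is why the CTR path, whose counters are generated directly in column-major form, needs only one representation change.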

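Since each thread's 32 counters are consecutive values base, base + 1, …, base + 31, with base derived from the thread's unique grid position (and therefore, under the natural per-thread assignment, a multiple of 32), the counters can be materialized directly in column-major form: the low five bit slices are fixed lane-index patterns, and every higher slice is a broadcast of one bit of base. The following host-side C++ sketch of this idea covers the low 64 bits of the counter block; it reflects our assumptions, not the authors' kernel code:

```cpp
#include <array>
#include <cstdint>

// Slices for a 64-bit counter field across 32 lanes:
// s[k] holds bit k of the counter value in every lane (lane i = bit i).
using CtrSlices = std::array<uint32_t, 64>;

// Lane i carries the counter value base + i. With base a multiple of 32, no
// carry ever crosses bit 4, so the low five slices are the fixed bit patterns
// of the lane index 0..31 and each higher slice broadcasts one bit of base.
CtrSlices make_counters_sliced(uint64_t base) {
    static const uint32_t lane_bits[5] = {
        0xAAAAAAAAu,  // bit 0 of lane index
        0xCCCCCCCCu,  // bit 1
        0xF0F0F0F0u,  // bit 2
        0xFF00FF00u,  // bit 3
        0xFFFF0000u   // bit 4
    };
    CtrSlices s{};
    for (int k = 0; k < 5; ++k) s[k] = lane_bits[k];
    for (int k = 5; k < 64; ++k)
        s[k] = ((base >> k) & 1u) ? 0xFFFFFFFFu : 0x00000000u;
    return s;
}
```

If base were not a multiple of 32, a bitsliced ripple-carry increment would be needed instead; the fixed-pattern shortcut is what makes counter generation in the proposed representation essentially free.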

In the CTR implementation, the peak performance is achieved by the Nvidia Tesla V100, which equals 1.47 Terabits per second of encryption throughput. Here, 2 Gigabits of data is encrypted within 0.001352 seconds.

TABLE 4
CTR ENCRYPTION THROUGHPUT

Name         Throughput (Gbits per sec.)   Performance/GPU cost (Gbits per sec./$)
GTX 980      541.81                        1.693
GTX 1050 Ti  324.04                        1.660
GTX 1080     843.17                        1.053
GTX 1080 Ti  1213.60                       0.941
Tesla P100   854.9                         0.176
Tesla V100   1478.42                       0.223

In our work, unlike other AES implementations on GPU, there is no outstanding difference between the CTR and ECB modes of operation in throughput.

A comparison with all of the previous works proposed on the CTR implementation is given in Figure 13. Again, our work is presented by the performance achieved on the GTX 1080 Ti and Tesla V100 GPUs. Also, we implement our algorithm on the GTX 980 to compare our work to the best known CTR implementation on GPU [33]. Appendix G gives the details of the performance per cost normalized metric evaluation.

Fig. 13. The AES-CTR Encryption performance and performance/cost achieved by our work and other related works.

In this evaluation, each GPU thread encrypts 32 counter values. Instead of reading from a file or the memory, the counters are generated in the CUDA code by each of the GPU threads. The 32 counters are generated from an initial seed equal to the global and unique position of each thread in the GPU grid. The counters are initially generated in the proposed representation and do not need to be altered from the row-major to the column-major representation. After the encryption, each output should be changed in representation from the column-major back to the row-major representation. Hence, there is no need for the actual input data to be represented in the column-major representation. This way, only one phase of representation change is needed in the code. Therefore, the AES-CTR code execution achieves higher throughput compared to our AES-ECB implementation.

6.5 Decryption Implementation
The decryption of the AES algorithm in the proposed bitsliced approach, in both the ECB and CTR modes, is implemented based on the same proposed data representation. The inverse S-box and inverse MixColumn are implemented in the proposed bitsliced manner as the building blocks of the decryption algorithm. However, as could be predicted, due to the fact that the inverse MixColumn requires a much heavier matrix multiplication (in GF(2^8)) compared to the MixColumn in encryption, the circuit in the decryption algorithm suffers more complexity compared to the encryption procedure. The detailed proposed implementations of the inverse S-box and inverse MixColumn are available in Appendices D and E, respectively. A comparison between the decryption and encryption throughput in the ECB mode of the proposed bitslicing approach is given in Figure 14.

It is worth mentioning that in some applications, the efficiency and throughput of the encryption procedure are much more important than those of the decryption phase. For instance, in satellite communication [38], high-speed data should be encrypted and then sent or carried by the satellite communication system.

Fig. 14. The AES-ECB Encryption and Decryption comparison.

6.6 Multi-GPU
As it is feasible to utilize more than one GPU in a single machine, we here propose a solution that can further scale up the AES CTR and ECB execution throughput on a greater number of GPUs within a single standalone machine. Another important factor that empowers us to use multiple GPUs is the fact that AES encryption in both the ECB and CTR modes of operation shows a strong trait of data parallelism.

Firstly, the input data is apportioned amongst all of the available GPUs. In the partitioning phase, if the GPUs are alike in the processing power metric, the input data is equally broken down into same-sized partitions. Moreover, if the GPUs are not identical, the input data is apportioned amongst them with regard to their processing power. Then, each GPU goes on by encrypting its appointed share of the input data.

In this approach, one CPU thread is needed to invoke the kernel code on each of the available GPUs. The GPUs can be managed by OpenMP threads in parallel. The rest of the encryption process is handled in the same way as discussed in the single-GPU implementation of the CTR and ECB modes of operation. In our implementation, our evaluation setup is comprised of two GTX 980 GPUs. In the


setup in which two identical GPUs are employed, the employed to span a wide range of computing platforms, is
achieved performance shows a near linear trend, however, given in Section 6. Also, we have considered the cost of
it is known that by increasing the number of GPUs to 4 and platforms to give a fair comparison with other works.
8 in this technique, the overall performance doesn’t follow Hence, we have incorporated the performance/cost metric
a linear increase due to the cost of data scheduling latency, into the evaluation section.
data concatenation, and other parameters [39]. Our exper- In addition, the scalability of our implementation on
iment with two GTX 980 GPUs resulted in 92% perfor- two GPUs shows close to linear scalability. Our suggested
mance of theoretical separated 2 GPU (100%). The results work outperforms the top achieving throughput from [30]
for two GPU implementation are available in Appendix F. by 2x and 11.3x in terms of performance and performance/
In the case where the GPUs are not identical, the input cost metrics, respectively. Based on the best of our
data needs to be dynamically or statically apportioned knowledge, this is the highest achieved encryption
amongst the GPUs. One can dynamically partition the in- throughput. We intend to evaluate our proposed bit slicing
put data by executing sample micro encryption bench- technique’s side-channel resistance feature in our future
marks on each GPU. Then, by comparing the execution work.
throughput of all of the available GPUs, a precise insight ACKNOWLEDGMENT
into each GPU’s processing power is gained. Furthermore,
We are grateful to Prof. Hamid Sarbazi Azad, head of the
one can statically partition the data set by taking into ac-
school of computer science, for his support and useful
count the processing power reported by the GPU vendors. guidance. We also would like to thank all the other mem-
The execution scheme in this approach is given in Figure bers of HPC lab at IPM Institute. Furthermore, we would
15. In this figure, first, the input data is divided equally in like to acknowledge authors of [31] who generously pro-
size by the quantity of available GPU resources. Then, each vided their implementation code. Finally, we also wish to
portion is fed to one of the identical GPUs for encryption. sincerely appreciate the efforts put by the anonymous re-
[Figure: the host (CPU) partitions the input data; a kernel call dispatches each portion to one of Device 0 through Device n (GPU), each running AES; the host (CPU) then concatenates the output data.]

Fig. 15. Multi-GPU execution scheme using multiple GPUs within a single machine.
7 CONCLUSION
With the emergence of the notion of big data, a strong demand for high-throughput security and encryption mechanisms has arisen. In this paper, we propose a novel bit slicing approach for AES cryptography, which achieves high-throughput execution through the introduction of a new data representation scheme. This data representation model alters the representation of the input data from the conventional row-major representation to the proposed column-major representation, illustrated in Figure 7, Section 4.
Moreover, through the incorporation of the proposed representation scheme into all of the stages of the AES algorithm, we have eliminated the need for the costly Byte-level shift and mask operations extensively used throughout all of the AES functions. We have employed the proposed bit slicing approach on GPUs to implement the CTR and ECB modes of operation for AES. Unlike all the previously proposed bit slicing techniques, in our work each GPU thread simultaneously operates on 32 chunks of 128-bit data.
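As a concrete illustration of this 32-way column-major layout, the following host-side sketch packs 32 independent 128-bit blocks into 128 32-bit slice words, so that one 32-bit logical operation acts on the same bit position of all 32 blocks at once. The function names `bitslice`/`unbitslice` and the byte-level bit ordering are our assumptions for illustration, not the paper's kernel code.

```cpp
#include <array>
#include <cstdint>

using Block  = std::array<std::uint8_t, 16>;   // one 128-bit AES block
using Sliced = std::array<std::uint32_t, 128>; // 128 slice words, one per bit

// Transpose 32 blocks into the sliced form: bit j of slice[i] holds
// bit i of block j, so each slice word spans all 32 input blocks.
Sliced bitslice(const std::array<Block, 32>& blocks) {
    Sliced s{};                                              // zero-initialized
    for (int i = 0; i < 128; ++i)                            // bit position
        for (int j = 0; j < 32; ++j) {                       // block index
            std::uint32_t bit = (blocks[j][i / 8] >> (i % 8)) & 1u;
            s[i] |= bit << j;
        }
    return s;
}

// Inverse transform: recover the 32 original 128-bit blocks.
std::array<Block, 32> unbitslice(const Sliced& s) {
    std::array<Block, 32> blocks{};
    for (int i = 0; i < 128; ++i)
        for (int j = 0; j < 32; ++j) {
            std::uint8_t bit = static_cast<std::uint8_t>((s[i] >> j) & 1u);
            blocks[j][i / 8] |= static_cast<std::uint8_t>(bit << (i % 8));
        }
    return blocks;
}
```

In this form, a round of S-box logic or MixColumns expressed as XOR/AND/OR over slice words processes all 32 blocks per thread, which is the source of the parallelism described above.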
Our work, on a Tesla V100, achieves 1.47 and 1.38 Terabits per second of encryption throughput for the CTR and ECB modes of operation of AES, respectively. The evaluation results on five other GPUs, which are deliberately […] throughput. We intend to evaluate our proposed bit slicing technique's side-channel resistance feature in our future work.

ACKNOWLEDGMENT
We are grateful to Prof. Hamid Sarbazi Azad, head of the school of computer science, for his support and useful guidance. We also would like to thank all the other members of the HPC lab at the IPM Institute. Furthermore, we would like to acknowledge the authors of [31], who generously provided their implementation code. Finally, we also wish to sincerely thank the anonymous reviewers for their efforts to pinpoint all the shortcomings of the original manuscript.

REFERENCES
[1] M.A. Alsmirat, Y. Jararweh, M. Al-Ayyoub, M.A. Shehab, and B.B. Gupta, "Accelerating Compute Intensive Medical Imaging Segmentation Algorithms Using Hybrid CPU-GPU Implementations," Multimedia Tools and Applications, vol. 76, no. 3, pp. 3537-3555, Feb. 2017.

[2] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, and S. Ghemawat, "TensorFlow: Large-scale Machine Learning on Heterogeneous Distributed Systems," arXiv preprint arXiv:1603.04467, 2016.

[3] S. Rahmani, A. Ahmadzadeh, O. Hajihassani, S. Mirhosseini, and S. Gorgin, "An Efficient Multi-core and Many-core Implementation of K-means Clustering," ACM-IEEE International Conference on Formal Methods and Models for System Design (MEMOCODE), pp. 128-131, 2016.

[4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, "MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems," arXiv preprint arXiv:1512.01274, 2015.

[5] Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J.D. Owens, "Gunrock: A High-performance Graph Processing Library on the GPU," ACM SIGPLAN Notices, vol. 51, no. 8, p. 11, 2016.

[6] US National Bureau of Standards, "Data Encryption Standard," Federal Information Processing Standard (FIPS) Publication 46-1, Jan. 1988.

[7] F.I.P. Standard, "Advanced Encryption Standard (AES)," National Institute of Standards and Technology (NIST), 2001.

[8] B. Schneier, "Description of a New Variable-length Key, 64-bit Block Cipher (Blowfish)," International Workshop on Fast Software Encryption, Springer, Berlin, Heidelberg, pp. 191-204, 1993.

[9] W. Stallings, "Cryptography and Network Security: Principles and Practice," Pearson, 2016.

[10] A. Ahmadzadeh, O. Hajihassani, and S. Gorgin, "A High-performance and Energy-efficient Exhaustive Key Search Approach via GPU on DES-like Cryptosystems," The Journal of Supercomputing, vol. 74, no. 1, pp. 160-182, 2018.

1045-9219 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

[11] S. John Walker, "Big Data: A Revolution that Will Transform How We Live, Work, and Think," International Journal of Advertising, vol. 33, no. 1, pp. 181-183, 2014.

[12] N. Marz and J. Warren, "Big Data: Principles and Best Practices of Scalable Realtime Data Systems," Manning Publications Co., 2015.

[13] M. Chen, S. Mao, and Y. Liu, "Big Data: A Survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171-209, 2014.

[14] N. Khan, I. Yaqoob, I.A.T. Hashem, Z. Inayat, M. Ali, W. Kamaleldin, M. Alam, M. Shiraz, and A. Gani, "Big Data: Survey, Technologies, Opportunities, and Challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[15] M. Dworkin, "Recommendation for Block Cipher Modes of Operation: Methods and Techniques," NIST SP 800-38A, National Institute of Standards and Technology, Gaithersburg, MD, 2001.

[16] D. Kirk, "NVIDIA CUDA Software and GPU Parallel Computing Architecture," ISMM, vol. 7, pp. 103-104, 2007.

[17] D. Canright, "A Very Compact S-box for AES," International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Berlin, Heidelberg, pp. 441-455, 2005.

[18] J. Daemen and V. Rijmen, "AES Proposal: Rijndael," 1999.

[19] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, "A Compact Rijndael Hardware Architecture with S-box Optimization," International Conference on the Theory and Application of Cryptology and Information Security, Springer, Berlin, Heidelberg, pp. 239-254, 2001.

[20] N. Mentens, L. Batina, B. Preneel, and I. Verbauwhede, "A Systematic Evaluation of Compact Hardware Implementations for the Rijndael S-box," Cryptographers' Track at the RSA Conference, Springer, Berlin, Heidelberg, pp. 323-333, 2005.

[21] C. Paar, "Efficient VLSI Architectures for Bit-parallel Computation in Galois Fields," PhD Thesis, Inst. for Experimental Math., Univ. of Essen, 1994.

[22] J. Viega, M. Messier, and P. Chandra, "Network Security with OpenSSL: Cryptography for Secure Communications," O'Reilly Media, Inc., 2002.

[23] R. Jeffrey, "Intel Advanced Encryption Standard Instructions (AES-NI)," Technical Report, Intel, 2010.

[24] D. Kaplan, J. Powell, and T. Woller, "AMD Memory Encryption," White paper, 2016.

[25] S. Ghaznavi, C. Gebotys, and R. Elbaz, "Efficient Technique for the FPGA Implementation of the AES MixColumns Transformation," International Conference on Reconfigurable Computing and FPGAs (ReConFig'09), IEEE, pp. 219-224, 2009.

[26] K. Jang, S. Han, S. Han, and S.B. Moon, "SSLShader: Cheap SSL Acceleration with Commodity Processors," Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI), pp. 1-14, 2011.

[27] J. Ma, X. Chen, R. Xu, and J. Shi, "Implementation and Evaluation of Different Parallel Designs of AES Using CUDA," IEEE Second International Conference on Data Science in Cyberspace (DSC), 2017.

[28] A.A. Abdelrahman, M.M. Fouad, H. Dahshan, and A.M. Mousa, "High Performance CUDA AES Implementation: A Quantitative Performance Analysis Approach," 2017 Computing Conference, London, pp. 1077-1085, 2017.

[29] E. Biham, "A Fast New DES Implementation in Software," International Workshop on Fast Software Encryption, Springer, Berlin, Heidelberg, pp. 260-272, 1997.

[30] N. Nishikawa, H. Amano, and K. Iwai, "Implementation of Bitsliced AES Encryption on CUDA-Enabled GPU," International Conference on Network and System Security, Springer, Cham, pp. 273-287, 2017.

[31] R.K. Lim, L.R. Petzold, and Ç.K. Koç, "Bitsliced High-performance AES-ECB on GPUs," The New Codebreakers, Springer, Berlin, Heidelberg, pp. 125-133, 2016.

[32] W.M.N. Zola and L.C.E. De Bona, "Parallel Speculative Encryption of Multiple AES Contexts on GPUs," Innovative Parallel Computing (InPar), IEEE, pp. 1-9, 2012.

[33] W.K. Lee, H.S. Cheong, R.C.W. Phan, and B.M. Goi, "Fast Implementation of Block Ciphers and PRNGs in Maxwell GPU Architecture," Cluster Computing, vol. 19, no. 1, pp. 335-347, 2016.

[34] N. Nishikawa, K. Iwai, H. Tanaka, and T. Kurokawa, "Throughput and Power Efficiency Evaluation of Block Ciphers on Kepler and GCN GPUs Using Micro-benchmark Analysis," IEICE Transactions on Information and Systems, vol. 97, no. 6, pp. 1506-1515, 2014.

[35] Q. Li, C. Zhong, K. Zhao, X. Mei, and X. Chu, "Implementation and Analysis of AES Encryption on GPU," IEEE 14th International Conference on High Performance Computing and Communication & IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), Liverpool, pp. 843-848, 2012.

[36] G. Tang, K. Takata, M. Tanaka, A. Fujimaki, K. Takagi, and N. Takagi, "4-bit Bit-Slice Arithmetic Logic Unit for 32-bit RSFQ Microprocessors," IEEE Transactions on Applied Superconductivity, vol. 26, no. 1, pp. 1-6, Jan. 2016.

[37] D. Foley, "NVLink, Pascal and Stacked Memory: Feeding the Appetite for Big Data," Nvidia.com, 2014.

[38] J.R. Vacca, "Satellite Encryption," Academic Press Inc., 1998.

[39] L. Chen, O. Villa, S. Krishnamoorthy, and G.R. Gao, "Dynamic Load Balancing on Single- and Multi-GPU Systems," 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Atlanta, GA, pp. 1-12, 2010.

Omid Hajihassani is a master's student at the University of Alberta. He is an active member of the High-performance Computing Laboratory at the Institute for Research in Fundamental Sciences (IPM), Tehran, Iran. He has conducted high-quality research on porting problems to high-performance solutions, big data, and cloud computing.

Saleh Khalaj Monfared is a bachelor's student at the K. N. Toosi University of Technology, Tehran, Iran. He is an active member of the High-performance Computing Laboratory at the Institute for Research in Fundamental Sciences (IPM), Tehran, Iran. His research interests focus on wireless networks, cryptography, and network security protocols.

Seyed Hossein Khasteh received the B.Sc. degree in electrical engineering and the M.Sc. and Ph.D. degrees in artificial intelligence, all from the Sharif University of Technology, Tehran, Iran. He is currently an assistant professor with the Computer Engineering Department, K. N. Toosi University of Technology, Tehran, Iran. His current research interests include social network analysis, machine learning, and big data analysis.

Saeid Gorgin received B.S. and M.S. degrees in computer engineering from the South branch and the Science and Research branch of the Azad University of Tehran in 2001 and 2004, respectively. He received his Ph.D. degree in computer system architecture from Shahid Beheshti University, Tehran, Iran in 2010. He is currently an assistant professor of computer engineering in the Department of Information Technology and Intelligent Systems of the Iranian Research Organization for Science and Technology (IROST), Tehran, Iran. His research interests include computer arithmetic and VLSI design.

