Manuscript received: 14 Sep. 2011; revised: 25 Jan. 2012; accepted: 5 Mar. 2012; published: 15 Sep. 2012.
Keywords: SHA-3, KECCAK hash function, unrolling method, pipeline register, high-speed implementation.
Abstract: Because of the weakening of the widely used SHA-1 hash algorithm and concerns over the similarly structured algorithms of the SHA-2 family, the US NIST initiated the SHA-3 contest in order to select a suitable drop-in replacement. In this paper we review the KECCAK hash function and apply several methods to improve its performance with respect to throughput, frequency and timing. Improving any one of these parameters may adversely affect the others. Different architectures are coded in VHDL, implemented on FPGAs, and compared in terms of speed.
1. Introduction
In today's modern world of email, internet banking, online shopping, and other sensitive digital communications, cryptography has become a vital tool for ensuring the privacy of data transfers. Hash functions operate at the root of many popular cryptographic methods in current use, such as the Digital Signature Standard (DSS), the Transport Layer Security (TLS) and Internet Protocol Security (IPSec) protocols, numerous random number generation algorithms, encryption algorithms, all-or-nothing transforms, and password storage mechanisms [1].
As cryptographic algorithms become more widely used, the need for high-speed implementations of these algorithms increases. Software-based implementations fall short in performance in many applications, e.g. on heavily loaded servers, so dedicated high-speed implementations are needed.
In many of these cryptographic schemes, the throughput of the incorporated hash function determines the throughput of the whole system. Especially in applications where transmission and reception rates are high, any latency or delay in calculating the digital signature of a data packet degrades the network's quality of service [2].
A. Gholipour is with the Iran University of Science and Technology (e-mail: agholipour@elec.iust.ac.ir).
S. Mirzakuchaki is with the Department of Electrical Engineering, Iran University of Science and Technology (e-mail: mkuchaki@iust.ac.ir).
Reprogrammable hardware is an almost ideal choice for cryptographic implementations because high speed can be achieved without a significant reduction in flexibility.
Flexibility, meaning that the design can be easily
changed or modified, is of especially great importance in
cryptographic implementations for the following reasons.
First, a cryptographic algorithm can be considered secure
only until proven otherwise. If a severe flaw in an algorithm
is found, the algorithm must be replaced with a more secure
one. Second, in many applications, a large variety of
different algorithms are in use, and therefore, it should be
easy to change from one algorithm to another.
Following the weakening of the widely used SHA-1 hash algorithm and concerns over the similarly structured algorithms of the SHA-2 family, NIST set up the SHA-3 competition with the goal of identifying one (or more) modern hash functions that can act as a drop-in replacement for the SHA-2 family [3].
The KECCAK hash function is one of the candidates accepted by NIST for the SHA-3 hash function competition. In this paper we describe the implementation of KECCAK on FPGAs.
The paper is organized as follows. Section 2 presents the KECCAK algorithm, Section 3 describes some techniques that increase the speed of the implementation, and Section 4 reports the results. Finally, conclusions are offered in Section 5.
2. KECCAK Algorithm
KECCAK is a family of hash functions that are based on the sponge construction and use as a building block a permutation from a set of 7 permutations. There are 7 KECCAK-f permutations, indicated by KECCAK-f[b], where $b = 25 \times 2^{l}$ and $l$ ranges from 0 to 6. KECCAK-f[b] is a permutation over $s \in \mathbb{Z}_2^{b}$, where the bits of $s$ are numbered from 0 to $b - 1$; $b$ is the width of the permutation. These KECCAK-f permutations are iterated constructions consisting of a sequence of almost identical rounds. The number of rounds $n_r$ depends on the permutation width and is given by $n_r = 12 + 2l$, where $2^{l} = b/25$. This gives 24 rounds for KECCAK-f[1600].
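The relations between $l$, the width $b$ and the round count $n_r$ can be checked with a few lines of Python. This is only an illustrative sketch; the function name `keccak_f_parameters` and the returned lane size $w = 2^{l}$ (that is, $b/25$) are naming choices made here, not part of the paper.

```python
def keccak_f_parameters(l):
    """Return (b, w, n_r) for KECCAK-f with parameter l in 0..6.

    b   = 25 * 2**l  : permutation width in bits
    w   = 2**l       : lane size in bits (b / 25)
    n_r = 12 + 2*l   : number of rounds
    """
    if not 0 <= l <= 6:
        raise ValueError("l must be in the range 0..6")
    w = 2 ** l
    return 25 * w, w, 12 + 2 * l


# Widths of the seven permutations, smallest to largest.
widths = [keccak_f_parameters(l)[0] for l in range(7)]
```

For $l = 6$ this gives the 1600-bit permutation with 64-bit lanes and 24 rounds used in the rest of the paper.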
High-Speed Implementation of the KECCAK Hash Function on FPGA
Atefeh Gholipour & Sattar Mirzakuchaki
International Journal of Advanced Computer Science, Vol. 2, No. 8, Pp. 303-307, Aug., 2012.
International Journal Publishers Group (IJPG)
The KECCAK hash function produces a final message digest of 256 bits, which depends on the input message, composed of multiple blocks of 1024 bits each. Each input message block is XORed onto a part of the current state and the result is passed through the KECCAK-f permutation. The KECCAK algorithm consists of three stages: (i) initialization and padding; (ii) the absorbing phase; and (iii) the squeezing phase. Pseudocode for this algorithm is given below [4, 5].
KECCAK[r, c, d](M)
  Initialization and padding
    S[x, y] = 0, for (x, y) in (0...4, 0...4)
    P = M || 0x01 || byte(d) || byte(r/8) || 0x01 || 0x00 || ... || 0x00
  Absorbing phase
    for every block P_i in P
      S[x, y] = S[x, y] XOR P_i[x + 5y], for (x, y) such that x + 5y < r/w
      S = KECCAK-f[r + c](S)
  Squeezing phase
    Z = empty string
    while output is requested
      Z = Z || S[x, y], for (x, y) such that x + 5y < r/w
      S = KECCAK-f[r + c](S)
  return Z
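The three stages can be sketched in Python as follows. This is a minimal software model written for illustration, not the hardware design of this paper: the names `pad` and `sponge` are chosen here, the permutation is passed in as a hypothetical `permute` callable acting on a list of 25 lanes, lane S[x, y] is assumed to be stored at index x + 5y, and lanes are assumed to be read from the message in little-endian byte order.

```python
def pad(message: bytes, r: int, d: int) -> bytes:
    """P = M || 0x01 || byte(d) || byte(r/8) || 0x01 || 0x00 ... 0x00."""
    block_bytes = r // 8
    p = message + bytes([0x01, d, block_bytes, 0x01])
    # Fill with zero bytes up to a multiple of the block size.
    return p + b"\x00" * (-len(p) % block_bytes)


def sponge(message: bytes, r: int, d: int, out_bytes: int, permute, w: int = 64) -> bytes:
    lane_bytes = w // 8
    lanes_per_block = r // w          # lanes with x + 5y < r/w
    block_bytes = r // 8
    S = [0] * 25                      # initialization: all-zero state
    P = pad(message, r, d)

    # Absorbing phase: XOR each block into the state, then permute.
    for off in range(0, len(P), block_bytes):
        block = P[off:off + block_bytes]
        for i in range(lanes_per_block):
            lane = block[lane_bytes * i:lane_bytes * (i + 1)]
            S[i] ^= int.from_bytes(lane, "little")
        S = permute(S)

    # Squeezing phase: read r bits per iteration while output is requested.
    Z = b""
    while len(Z) < out_bytes:
        Z += b"".join(S[i].to_bytes(lane_bytes, "little")
                      for i in range(lanes_per_block))
        S = permute(S)
    return Z[:out_bytes]
```

With r = 1024 and a 256-bit digest, `sponge(message, 1024, d, 32, permute)` absorbs 16 lanes per block, which matches the 1024-bit block size stated above. Any function mapping 25 lanes to 25 lanes can stand in for `permute` when experimenting with the control flow.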
The state is logically grouped into a 5×5 matrix of 64-bit words. The KECCAK-f permutation consists of 24 rounds, which are identical except for the addition of a round-dependent constant. Each round has five steps ($\theta$, $\rho$, $\pi$, $\chi$ and $\iota$), which feature simple logical operations and permutations of the state bits. The initial state is all zero, and each message block that is introduced is mixed with the current state before the rounds are applied.
$$R = \iota \circ \chi \circ \pi \circ \rho \circ \theta$$
$$\theta: \quad C[x] = A[x,0] \oplus A[x,1] \oplus A[x,2] \oplus A[x,3] \oplus A[x,4], \quad x \in 0 \ldots 4$$
$$D[x] = C[x-1] \oplus \mathrm{ROT}(C[x+1], 1), \quad x \in 0 \ldots 4$$
$$A[x,y] = A[x,y] \oplus D[x], \quad (x,y) \in (0 \ldots 4, 0 \ldots 4)$$
$$\rho: \quad A[x,y] = \mathrm{ROT}(A[x,y], r(x,y)), \quad (x,y) \in (0 \ldots 4, 0 \ldots 4)$$
$$\pi: \quad B[X,Y] = A[x,y], \quad \begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 2 & 3 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$$
$$\chi: \quad A[x,y] = B[x,y] \oplus \left( (\mathrm{NOT}\ B[x+1,y])\ \mathrm{AND}\ B[x+2,y] \right), \quad (x,y) \in (0 \ldots 4, 0 \ldots 4)$$
$$\iota: \quad A[0,0] = A[0,0] \oplus RC[i]$$
Here the following conventions are in use. All operations on the indices are done modulo 5. A denotes the complete permutation state array and A[x, y] denotes a particular lane in that state. B[x, y], C[x] and D[x] are intermediate variables. The symbol $\oplus$ denotes the bitwise exclusive OR, NOT the bitwise complement and AND the bitwise AND operation. Finally, ROT(W, r) denotes the bitwise cyclic shift operation, moving the bit at position i into position i + r (modulo the lane size).
The constants r(x,y) are the cyclic shift offsets and are
specified in the following table.
TABLE 1
VALUE OF OFFSET IN $\rho$ STEP
        x=3    x=4    x=0    x=1    x=2
y=2     153    231    3      10     171
y=1     55     276    36     300    6
y=0     28     91     0      1      190
y=4     120    78     210    66     253
y=3     21     136    105    45     15
The constants RC[i] are the round constants. TABLE 2 specifies their values in hexadecimal notation for lane size 64.
TABLE 2
VALUE OF RC[I] CONSTANT
RC[0]    0x0000000000000001
RC[1]    0x0000000000008082
RC[2]    0x800000000000808A
RC[3]    0x8000000080008000
RC[4]    0x000000000000808B
RC[5]    0x0000000080000001
RC[6]    0x8000000080008081
RC[7]    0x8000000000008009
RC[8]    0x000000000000008A
RC[9]    0x0000000000000088
RC[10]   0x0000000080008009
RC[11]   0x000000008000000A
RC[12]   0x000000008000808B
RC[13]   0x800000000000008B
RC[14]   0x8000000000008089
RC[15]   0x8000000000008003
RC[16]   0x8000000000008002
RC[17]   0x8000000000000080
RC[18]   0x000000000000800A
RC[19]   0x800000008000000A
RC[20]   0x8000000080008081
RC[21]   0x8000000000008080
RC[22]   0x0000000080000001
RC[23]   0x8000000080008008
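As an illustration of the round description, the following Python sketch is one possible software model of KECCAK-f[1600]; it is not the VHDL design evaluated in this paper. The function names are chosen here for illustration, and the state is assumed to be indexed as `A[x][y]`. The offsets are generated as the triangular numbers that appear in the offset table (they are reduced modulo 64 when rotating), and the round constants are regenerated from the LFSR given in the KECCAK specification [4] rather than copied from a table.

```python
MASK = (1 << 64) - 1  # lane size w = 64


def rot(lane, r):
    """Cyclic shift of a 64-bit lane: bit i moves to position i + r (mod 64)."""
    r %= 64
    return ((lane << r) | (lane >> (64 - r))) & MASK


def rho_offsets():
    """Offsets r(x, y): triangular numbers along the walk (x, y) -> (y, 2x + 3y)."""
    r = [[0] * 5 for _ in range(5)]
    x, y = 1, 0
    for t in range(24):
        r[x][y] = (t + 1) * (t + 2) // 2
        x, y = y, (2 * x + 3 * y) % 5
    return r


def round_constants(n_rounds=24):
    """Round constants RC[i] from the LFSR x^8 + x^6 + x^5 + x^4 + 1."""
    rcs, lfsr = [], 1
    for _ in range(n_rounds):
        rc = 0
        for j in range(7):
            if lfsr & 1:
                rc |= 1 << ((1 << j) - 1)  # bit position 2^j - 1
            lfsr <<= 1
            if lfsr & 0x100:
                lfsr ^= 0x171
        rcs.append(rc)
    return rcs


def keccak_round(A, rc, r):
    # theta
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rot(C[(x + 1) % 5], 1) for x in range(5)]
    A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
    # rho and pi: rotate each lane, then move it to (X, Y) = (y, 2x + 3y)
    B = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = rot(A[x][y], r[x][y])
    # chi
    A = [[B[x][y] ^ (~B[(x + 1) % 5][y] & B[(x + 2) % 5][y])
          for y in range(5)] for x in range(5)]
    # iota
    A[0][0] ^= rc
    return A


def keccak_f1600(A):
    """Apply all 24 rounds to a 5x5 state of 64-bit lanes and return the new state."""
    r = rho_offsets()
    for rc in round_constants():
        A = keccak_round(A, rc, r)
    return A
```

In this model one call to `keccak_round` corresponds to what a single round block computes in hardware; the loop in `keccak_f1600` is the part that the architectures in the following sections iterate, pipeline or unroll.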
3. Speed Optimization Techniques
This section discusses methods for architectural speed optimization in an FPGA. There are three primary definitions of speed, depending on the context of the problem: throughput, latency, and timing [6].
In the context of processing data in an FPGA, throughput refers to the amount of data that is processed per clock cycle. A common metric for throughput is bits per second. Latency refers to the time between data input and processed data output. The typical metric for latency is time or clock cycles. Timing refers to the logic delays between sequential elements.
Several techniques have been proposed to improve the
implementation. The most relevant are:
A. Unrolling Technique
The unrolling technique targets the data dependency between rounds. An unrolled architecture implements multiple rounds of the core compression function in combinational logic, thereby reducing the number of clock cycles required to compute the hash. This comes at the cost of an increase in area. The number of rounds unrolled, k, must be a divisor of the total number of rounds, n, of the algorithm, so the number of clock cycles needed to execute the algorithm decreases by a factor of k. The goal is for the minimum clock period to increase by a factor smaller than k, thus allowing for shorter latency and higher throughput [7].
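The trade-off can be made concrete with a small Python sketch. The function names and the clock periods used in the example are hypothetical and chosen only for illustration; they are not measurements from this work.

```python
def unrolled_cycles(n_rounds: int, k: int) -> int:
    """Clock cycles per block when k rounds are computed per cycle."""
    if k <= 0 or n_rounds % k != 0:
        raise ValueError("k must be a positive divisor of n_rounds")
    return n_rounds // k


def block_latency_ns(n_rounds: int, k: int, period_ns: float) -> float:
    """Time to process one block at the given minimum clock period."""
    return unrolled_cycles(n_rounds, k) * period_ns


def unrolling_pays_off(k: int, period_growth: float) -> bool:
    """True when the clock period grows by a factor smaller than k."""
    return period_growth < k


# Hypothetical example: 24 rounds, 4.0 ns per cycle without unrolling,
# 6.0 ns per cycle (a growth factor of 1.5) with k = 2.
base = block_latency_ns(24, 1, 4.0)      # 24 cycles * 4.0 ns = 96.0 ns
unrolled = block_latency_ns(24, 2, 6.0)  # 12 cycles * 6.0 ns = 72.0 ns
```

In this hypothetical case the period grows by 1.5, which is less than k = 2, so the unrolled design finishes a block sooner even though its clock is slower.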
B. Embedded Memories
Embedded memories can be used to store the constant values required by the algorithm.
C. Pipelining Techniques
A pipelined design works much like an assembly line: the raw material or data input enters the front end, is passed through various stages of manipulation and processing, and then exits as a finished product or data output. The advantage of a pipelined design is that new data can begin processing before the prior data has finished. In hash functions, however, each computation depends heavily on the previous result, so the resulting throughput is usually not improved and more complex control logic is required.
D. Add Register Layers
An architectural technique for improving timing is to add intermediate layers of registers to the critical path. This technique should be used in highly pipelined designs where the additional clock cycle latency does not violate the design specifications and the overall functionality is not affected by the added registers.
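The effect of register layers on a chain of dependent rounds can be estimated with the sketch below. It assumes, as a simplification made here, that a single message stream is processed and that each register layer adds one clock cycle to every round, because a round cannot start before the previous one has finished; the function names are illustrative.

```python
def cycles_per_block(n_rounds: int, register_layers: int) -> int:
    """Cycles per block when every round is split by the given number of register layers."""
    return n_rounds * (register_layers + 1)


def throughput_ratio(freq_gain: float, register_layers: int) -> float:
    """Throughput relative to the unregistered design for one dependent stream.

    freq_gain is the ratio of the new maximum frequency to the original one.
    """
    return freq_gain / (register_layers + 1)
```

With 24 rounds and one register layer this model gives 48 cycles per block. Under these assumptions throughput only improves if the maximum frequency more than doubles, which illustrates why added registers help timing but not necessarily throughput.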
4. Implementation Result
Different architectures of KECCAK can be designed. We first describe the high-speed core design depicted in Fig. 1 [5]. In this configuration the core is capable of processing 128 bytes in 24 clock cycles.
The core is composed of three main components: the round function, the state register and the input/output buffer. The I/O buffer allows the core to compute the absorbing phase while the words of the next block are being transferred through the bus. An area-saving alternative is to store the words composing the block directly in the state register.
We also consider two further architectures: one using a pipeline register and one using the unrolling technique. The first processes 128 bytes in 48 clock cycles, and the second takes 16 clock cycles to process the same 128 bytes.
Fig. 1 The high-speed
The last architecture of the core, based on unrolling, is illustrated in Fig. 2. The I/O buffer allows the core to compute the absorbing phase while the words of the next block are transferred through the bus. The round function is implemented by two R blocks, each of which works in a different clock cycle.
Fig. 2 Unrolling architecture of the Hash
R1 and R2 together perform three rounds in two clock cycles. The control signals are not shown in Fig. 1; these signals determine whether the buffer is in input or output mode and specify which of the R blocks are active in a given clock cycle. The processing of a complete message block requires 16 clock cycles.
The presented hashing cores were captured in VHDL and were fully simulated and verified using Model Technology's ModelSim simulator.
We used Altera Quartus II and Xilinx ISE to evaluate the VHDL designs for the target FPGAs. These tools provide estimates of the amount of resources needed and the maximum clock frequency reached [8, 9].
The throughput is calculated by:
$$\text{Throughput} = \frac{\text{Block size} \times \text{Max frequency}}{\text{Clock cycles}} \qquad \text{(Equ. 1)}$$
The block size is 1024 bits.
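Equ. 1 can be evaluated directly, as in the short Python sketch below. The function name and the choice of units (MHz in, Gbit/s out) are conventions adopted here for illustration; the example values are one reported configuration, 382.85 MHz with 48 clock cycles per block.

```python
def throughput_gbit_s(block_size_bits: int, max_freq_mhz: float, clock_cycles: int) -> float:
    """Equ. 1: block size times maximum frequency, divided by cycles per block."""
    bits_per_second = block_size_bits * max_freq_mhz * 1e6 / clock_cycles
    return bits_per_second / 1e9


# 1024-bit block, 382.85 MHz, 48 clock cycles per block.
example = throughput_gbit_s(1024, 382.85, 48)  # about 8.17 Gbit/s
```

The same function makes the trade-off between frequency and cycle count visible: halving the number of cycles doubles the throughput only if the maximum frequency stays the same.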
To apply the unrolling method, we reduce the number of clock cycles needed to complete the round function, which is 24 in the original implementation. In this case we reduce it to 16 clock cycles, which equals the number of clock cycles used for reading the inputs.
Adding registers in the combinational path increases the maximum frequency of the circuit. In this implementation we use one register layer.
The results of these implementations on the two FPGAs are shown in Tables 3 and 4.
TABLE 3
PERFORMANCE ESTIMATION ON ALTERA STRATIX III EP3SE50F484C2
Architecture        Max Freq. (MHz)    Required Clock Cycles    Throughput (Gbit/s)
Original            230.57             24                       9.83
Register Layers     382.85             48                       8.17
Unrolling           212.49             16                       13.59
TABLE 4
PERFORMANCE ESTIMATION ON VIRTEX-5 XC5VLX50FF324-3
Architecture        Max Freq. (MHz)    Required Clock Cycles    Throughput (Gbit/s)
Original            111.732            24                       4.67
Register Layers     146.649            48                       3.13
Unrolling           84.21              16                       5.38
Another important issue in hardware implementation is the occupied area, which is measured by the number of registers and logic elements used. These figures are shown for each implementation in Tables 5 and 6.
TABLE 5
NUMBER OF USED REGISTERS FOR EACH ARCHITECTURE
Architecture        Altera Stratix III EP3SE50F484C2    Virtex-5 XC5VLX50FF324-3
Original            2641 (38000)                        2640 (28800)
Register Layers     2641 (38000)                        4242 (28800)
Unrolling           4250 (38000)                        2652 (28800)
TABLE 6
NUMBER OF USED LOGIC ELEMENTS FOR EACH ARCHITECTURE
Architecture        Altera Stratix III EP3SE50F484C2    Virtex-5 XC5VLX50FF324-3
Original            4304 (38000) ALUTs                  1434 (7200) Slices
Register Layers     14402 (38000) ALUTs                 2636 (7200) Slices
Unrolling           5633 (38000) ALUTs                  1562 (7200) Slices
As the tables above show, increasing the throughput or the maximum frequency also increases the occupied area. For the register-layer method, the frequency gain is small relative to the additional hardware it requires, which makes this trade-off unattractive.
5. Conclusion
In this paper we reviewed the KECCAK hash function and applied several methods to improve its performance with respect to throughput, frequency and timing. Improving any one of these parameters may adversely affect the others. Different architectures were coded in VHDL, implemented on FPGAs, and compared in terms of speed.
The most important method for increasing throughput is loop unrolling, which we applied to our architecture; to increase the frequency, we added register layers in the critical path. As explained above, the unrolling method achieves the highest throughput, and the penalty is an increase in area.
References
[1] R. P. McEvoy, F. M. Crowe, C. C. Murphy, and W. P. Marnane, "Optimisation of the SHA-2 family of hash functions on FPGAs," IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures (ISVLSI'06), pp. 317-322, 2006.
[2] Jae-Bong Yoo, Byung-Ki Kim, Ho-Min Jung, Taewan Gu, Chan-Young Park, Young-Woong Ko, "Efficient Pipelined Hardware Implementation of RIPEMD-160 Hash Function," International Journal of Electronics, Circuits and Systems, Volume 2, Number 2, Spring 2008.
[3] National Institute of Standards and Technology (NIST), Cryptographic Hash Algorithm Competition Website. http://csrc.nist.gov/groups/ST/hash/sha3.
[4] G. Bertoni, J. Daemen, M. Peeters, G. Van Assche, Keccak specifications. http://keccak.noekeon.org/Keccakspecifications.pdf
[5] G. Bertoni, J. Daemen, M. Peeters, G. Van Assche, "Keccak sponge function family main document," http://keccak.noekeon.org/Keccakmain2.1.pdf
[6] Steve Kilts, "Advanced FPGA Design: Architecture, Implementation, and Optimization," Wiley-IEEE Press, 2007.
[7] Roar Lien, "FPGA Implementations of SHA-1 Secure Hash Standard," Thesis, 2003.
[8] J. Strombergson, "Implementation of the Keccak hash function in FPGA devices," http://www.strombergson.com/files/Keccak_in_FPGAs.pdf.
[9] G. Bertoni, J. Daemen, M. Peeters, G. Van Assche, Keccak Hardware implementation in VHDL. File archive, December 2008. http://keccak.noekeon.org/KeccakVHDL1.0.zip