Professional Documents
Culture Documents
Abstract—This paper presents a new set of techniques for hard- To enforce the security in general purpose processors (GPPs)
ware implementations of Secure Hash Algorithm (SHA) hash func- and to improve performance, cryptographic algorithms have to
tions. These techniques consist mostly in operation rescheduling
and hardware reutilization, therefore, significantly decreasing the
be applied at the hardware level, for example in the attestation
critical path and required area. Throughputs from 1.3 Gbit/s to of external memory transactions. Specialized hardware cores
1.8 Gbit/s were obtained for the SHA implementations on a Xilinx are typically implemented either as application-specific inte-
VIRTEX II Pro. Compared to commercial cores and previously grated circuit (ASIC) cores [4]–[6] or in reconfigurable devices
published research, these figures correspond to an improvement in
[7]–[10]. Some work has been done to improve the SHA com-
throughput/slice in the range of 29% to 59% for SHA-1 and 54%
to 100% for SHA-2. Experimental results on hybrid hardware/soft- putational throughput by unrolling the calculation structure, but
ware implementations of the SHA cores, have shown speedups up at the expense of more hardware resources [11], [12].
to 150 times for the proposed cores, compared to pure software im- In this paper, we propose an efficient hardware implementa-
plementations.
tion of SHA. Several techniques have been proposed to improve
Index Terms—Crytography, field-programmable gate array the hardware implementation of the SHA algorithm, using the
(FPGA), hardware implementation, hash functions, Secure Hash following design techniques:
Algorithm (SHA).
• parallel counters and balanced carry save adders (CSA), in
order to improve the partial additions [4], [5], [7];
I. INTRODUCTION • unrolling techniques optimize the data dependency and im-
prove the throughput [5], [9], [11], [13];
RYPTOGRAPHIC algorithms can be divided into three
C several classes: public key algorithms, symmetric key al-
gorithms, and hash functions. While the first two are used to
• balanced delays and improved addition units; in this algo-
rithm, additions are the most critical operations [4], [13];
• embedded memories store the required constant values [8];
encrypt and decrypt data, the hash functions are one-way func- • pipelining techniques, allow higher working frequencies
tions that do not allow the processed data to be retrieved. This [5], [14].
paper focuses on hashing algorithms. Currently, the most com- This work extends the ideas originally proposed by the au-
monly used hash functions are the MD5 and the Secure Hash thors in [15] and [16] and presents a significant set of experi-
Algorithm (SHA), with 128- to 512-bit output Digest Messages mental results. Our major contributions to the improvement of
(DMs), respectively. While for MD5, collision attacks are com-
the SHA functions hardware implementation can be summa-
putationally feasible on a standard desktop computer [1], cur-
rized as follows:
rent SHA-1 attacks still require massive computational power
• operation rescheduling for a more efficient pipeline usage;
[2], (around hash operations), making attacks unfeasible for
• hardware reuse in the DM addition;
the time being. For applications that require additional levels of
• a shift-based input/output (I/O) interface;
security, the SHA-2 has been introduced. This algorithm outputs
• memory-based block expansion structures.
a DM with size from 224 to 512 bits.
The SHA-1 was approved by the National Institute of Stan- A discussion on alternative data block expansion structure has
dards and Technology (NIST) in 1995 as an improvement to the also been introduced.
SHA-0. SHA-1 quickly found its way into all major security ap- The fully implemented architectures proposed in this paper,
plications, such as SSH, PGP, and IPSec. In 2002, the SHA-2 [3] achieve a high throughput for the SHA calculation via oper-
was released as an official standard, allowing the compression ation rescheduling. At the same time, the proposed hardware
of inputs up to bits. reuse techniques indicates an area decrease, resulting in a
significant increase of the throughput per slice efficiency
metric. Implementation results on several FPGA technologies
Manuscript received January 29, 2007; revised May 21, 2007. This work
was supported in part by the Portuguese Foundation for Technology, by the of the proposed SHA, show that a throughput of 1.4 Gbit/s is
Dutch Technology Foundation STW, applied science division of NWO (project achievable for both SHA-128 and SHA-256 hash functions.
DCS.7533), and by the European Network of Excellence on High-Performance For SHA-512 this value increases to 1.8 Gbit/s. Moreover,
Embedded Architecture and Compilation.
R. Chaves is with the Instituto Superior Técnico/INESC-ID, Lisbon 1049- a Througput/Slice improvement up to 100% is achieved, re-
001, Portugal, and also with the Computer Engineering Department, EEMCS, garding current state of the art.
Delft University of Technology, 2628 CD Delft, The Netherlands. The proposed SHA cores have also been implemented within
G. Kuzmanov is with the Computer Engineering Department, EEMCS, Delft
University of Technology, 2628 CD Delft, The Netherlands. the reconfigurable co-processor of a Xilinx VIRTEX II Pro
L. Sousa is with the Instituto Superior Técnico/INESC-ID, Lisbon 1049-001, MOLEN prototype [17]. The hybrid implementation results
Portugal. indicate a 150 times speedup against pure software implemen-
S. Vassiliadis, deceased, was with the Computer Engineering, EEMCS, Delft
University of Technology, 2628 CD Delft, The Netherlands. tations, and a 670% Throughput/Slice improvement regarding
Digital Object Identifier 10.1109/TVLSI.2008.2000450 related art.
1063-8210/$25.00 © 2008 IEEE
1000 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 8, AUGUST 2008
or
(2) (5)
(8)
A. Operation Rescheduling
While the variables , , , , , and are obtained di-
rectly from the values of the round, not requiring any computa- (13)
tion, the values and for round cannot be computed until
the values for the same variables have been computed in the pre- with only two addition units
vious round, as shown in (10)
(14)
TABLE I
SHA-1 DATA BLOCK EXPANSION UNIT COMPARISON
TABLE III
SHA-1 CORE PERFORMANCE COMPARISONS
TABLE IV
SHA256 CORE PERFORMANCE COMPARISON
TABLE V
SHA512 CORE PERFORMANCE COMPARISON
slices while achieving a throughput 18% higher. These two re- range of 100% to 170%. When compared with the commercial
sults suggest a gain on the TP/Slice metric of about 29%. SHA256 core from Helion [23], the proposed core suggests an
For the SHA-1 core capable of receiving an IV other than the identical area value (less 7%) while achieving a 40% gain to the
constant specified in [18], a slight increase in the required hard- throughput, resulting in an improvement of 53% to the TP/Slice
ware occurs. This is due to the fact that the IV can no longer metric. The structure proposed by McEvoy [13] also has mes-
be set by the set/reset signals of the registers. This however has sage padding hardware, however, no figures are given for the
a minimal effect in the cores performance, since this loading individual cost of this extra hardware. This message padding is
mechanism is not located in the critical path. The decrease of performed once at the end of the message, and has no significant
the Throughput/Slice metric, from 2.7 to 2.5, caused by the ad- cost when implemented in software. Thus, the majority of the
ditional hardware for the IV loading is counterbalanced by the proposed cores do not include the hardware for this operation.
capability of this SHA-1 core (Our+IV) to process fragmented
messages. C. SHA 512 Core
Table V presents the implementation results for our SHA512
B. SHA 256 Core
core and the most significant related art, to out best knowledge.
The proposed SHA256 hash function core has been also com- When compared with [22], our core requires 25% less recon-
pared with the most recent and most efficient related art. The figurable logic while a throughput increase of 85% is achieved,
comparison figures are presented in Table IV. When compared resulting in a TP/Slice metric improvement of 165%. From all
with the most recent academic work [13], [22] the results show known SHA512 cores, the unrolled core proposed by Lien in
higher throughputs, from 17% up to 98%, while achieving a [11] is the only one capable of achieving a higher throughput.
reduction in area above 25% up to 42%. These figures sug- However, this throughput is only slightly higher (4%), but re-
gest a significant improvement to the TP/Slice metric in the quires twice as much area as our proposal and indicates a 77%
CHAVES et al.: COST-EFFICIENT SHA HARDWARE ACCELERATORS 1007
higher TP/Slice metric. It should also be noticed that, the re- ACKNOWLEDGMENT
sults presented by Lien in [11] do not include the data expansion The authors would like to dedicate this paper to S. Vassil-
module, which would increase the required area even further. iadis, a brilliant researcher, a sunny character, and a dear friend.
His scientific talent will be greatly missed, his legacy—never
D. Integration on the MOLEN Processor
forgotten.
In order to create a practical platform where the SHA cores
can be used and tested, a wrapping interface has been added REFERENCES
to integrate these units in the MOLEN polymorphic processor. [1] V. Klima, “Finding MD5 collisions—A toy for a notebook.” Cryp-
The MOLEN operation [17], [24] is based on the coprocessor tology ePrint Archive, 2005/075, 2005.
architectural paradigm, allowing the usage of reconfigurable [2] X. Wang, Y. L. Yin, and H. Yu, “Finding collisions in the full SHA-1,”
in CRYPTO, V. Shoup, Ed. New York: Springer, 2005, vol. 3621,
custom designed hardware units. The MOLEN computational Lecture Notes in Computer Science, pp. 17–36.
paradigm enables the SHA cores to be embedded in a re- [3] National Institute of Standards and Technology (NIST), MD, “FIPS
configurable coprocessor, tightly coupled with the GPP. The 180–2, secure hash standard (SHS),” 2002.
[4] L. Dadda, M. Macchetti, and J. Owen, “The design of a high speed
considered polymorphic architecture prototype uses the FPGA ASIC unit for the hash function SHA-256 (384, 512),” in Proc. DATE,
with an embedded PowerPC, running at 300 MHz as the core 2004, pp. 70–75.
GPP, and a main data memory running at 100 MHz. The [5] M. Macchetti and L. Dadda, “Quasi-pipelined hash circuits,” in Proc.
IEEE Symp. Comput. Arithmetic, 2005, pp. 222–229.
implementation is identical to the one described in [25]. [6] L. Dadda, M. Macchetti, and J. Owen, D. Garrett, J. Lach, and C. A.
For this coprocessor implementations of the SHA hash Zukowski, Eds., “An ASIC design for a high speed implementation of
functions, the SHA128, SHA256, and SHA512 cores, with the hash function SHA-256 (384, 512),” in Proc. ACM Great Lakes
Symp. VLSI, 2004, pp. 421–425.
IV loading, have been used. Implementations results of the [7] T. Grembowski, R. Lien, K. Gaj, N. Nguyen, P. Bellows, J. Flidr, T.
SHA128 CCU indicate a device occupation of 813 slices, using Lehman, and B. Schott, “Comparative analysis of the hardware imple-
a total of 6% of the available resources on a XC2VP30 FPGA. mentations of hash functions SHA-1 and SHA-512,” in ISC, A. H. Chan
and V. D. Gligor, Eds. New York: Springer, 2002, vol. 2433, Lecture
In this functional test the CCU is running with same clock Notes in Computer Science, pp. 75–89.
frequency as the main data memory, operating at 100 MHz, [8] M. McLoone and J. V. McCanny, “Efficient single-chip implementa-
thus achieving a maximum throughput of 623 Mbit/s. When tion of SHA-384 & SHA-512,” in Proc. IEEE Int. Conf. Field-Program.
Technol., 2002, pp. 311–314.
compared with the pure software implementations, capable of [9] N. Sklavos and O. Koufopavlou, “Implementation of the SHA-2 hash
achieving a maximum throughput of 4 Mbit/s and 5 Mbit/s for family standard using FPGAs,” J. Supercomput., vol. 31, pp. 227–248,
SHA128 and SHA256, respectively, the usage of this hybrid 2005.
[10] K. K. Ting, S. C. L. Yuen, K.-H. Lee, and P. H. W. Leong, “An FPGA
HW/SW approach allows for a speedup up to 150 times. based SHA-256 processor,” in FPL, M. Glesner, P. Zipf, and M. Ren-
The CCUs for the SHA256 and SHA512 cores require 994 ovell, Eds. New York: Springer, 2002, vol. 2438, Lecture Notes in
and 1806 Slices using in total 7% and 13% of the available re- Computer Science, pp. 577–585.
[11] R. Lien, T. Grembowski, and K. Gaj, “A 1 Gbit/s partially unrolled
sources, respectively. At 100 MHz, the SHA-2 CCUs are ca- architecture of hash functions SHA-1 and SHA-512,” in Proc. CT-RSA,
pable of achieving a maximum throughput of 785 Mbit/s for 2004, pp. 324–338.
SHA-256 and 1.2 Gbit/s for the SHA-512 hash function. [12] N. Sklavos, E. Alexopoulos, and O. G. Koufopavlou, “Networking data
integrity: High speed architectures and hardware implementations,”
Int. Arab J. Inf. Technol., vol. 1, pp. 54–59, 2003.
VII. CONCLUSION [13] R. P. McEvoy, F. M. Crowe, C. C. Murphy, and W. P. Marnane,
“Optimisation of the SHA-2 family of hash functions on FPGAs,” in
We have proposed hardware rescheduling and reutilization Proc. IEEE Comput. Soc. Annu. Symp. Emerging VLSI Technol. Arch.
techniques to improve SHA algorithm realizations, both in (ISVLSI), 2006, pp. 317–322.
[14] H. E. Michail, A. P. Kakarountas, G. N. Selimis, and C. E. Goutis, “Op-
speed and in area. With operation rescheduling, the critical path timizing SHA-1 hash function for high throughput with a partial un-
can be reduced in a similar manner to structures with loop un- rolling study,” in PATMOS, V. Paliouras, J. Vounckx, and D. Verkest,
rolling, without increasing the required hardware, also leading Eds. New York: Springer, 2005, vol. 3728, Lecture Notes in Com-
puter Science, pp. 591–600.
to the usage of a well balanced pipeline structure. An efficient [15] R. Chaves, G. Kuzmanov, L. A. Sousa, and S. Vassiliadis,
technique for the addition of the DM is also proposed. This “Rescheduling for optimized SHA-1 calculation,” in Proc. SAMOS
technique allows for a substantial reduction on the required Workshop Comput. Syst. Arch. Model. Simulation, Jul. 2006, pp.
425–434.
reconfigurable resources, while concealing the extra clock [16] R. Chaves, G. Kuzmanov, L. A. Sousa, and S. Vassiliadis, “Improving
cycle delay introduced by the pipeline. SHA-2 hardware implementations,” in Proc. Workshop Cryptograph.
Implementation results clearly indicate significant perfor- Hardw. Embedded Syst. (CHES), Oct. 2006, pp. 298–310.
[17] S. Vassiliadis, S. Wong, G. N. Gaydadjiev, K. Bertels, G. Kuzmanov,
mance and hardware gains for the proposed cores when com- and E. M. Panainte, “The MOLEN polymorphic processor,” IEEE
pared to the existing commercial cores and related academia Trans. Comput., vol. 53, no. 11, pp. 1363–1375, Nov. 2004.
art. Experimental results for hybrid, hardware/software, im- [18] National Institute of Standards and Technology (NIST), MD, “An-
nouncing the standard for secure hash standard,” FIPS 180, 1993.
plementations of the SHA algorithms suggest a speed up of [19] National Institute of Standards and Technology (NIST), “The keyed-
150 times for both hash computations regarding pure software hash message authentication code (HMAC),” FIPS 198, 2002.
implementations. [20] CAST, Woodcliff Lake, NJ, “SHA-1 secure hash algorithm cryptopro-
cessor core,” 2005 [Online]. Available: http://http://www.cast-inc.com/
[21] HELION, Cambridge, U.K., “Fast SHA-1 hash core for Xilinx FPGA,”
A. SHA Evaluation Prototypes 2005 [Online]. Available: http://www.heliontech.com/
[22] N. Sklavos and O. Koufopavlou, “On the hardware implementation of
SHA evaluation prototypes are available for download at: the SHA-2 (256,384,512) hash functions,” in Proc. IEEE Int. Symp.
http://ce.et.tudelft.nl/MOLEN/applications/SHA/. Circuits Syst. (ISCAS), 2003, pp. 25–28.
1008 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 8, AUGUST 2008
[23] HELION, Cambridge, U.K., “Fast SHA-2 (256) hash core for Xilinx Leonel Sousa (M’01–SM’03) received the Ph.D. de-
FPGA,” , 2005 [Online]. Available: http://www.heliontech.com/ gree in electrical and computer engineering from the
[24] S. Vassiliadis, S. Wong, and S. D. Cotofana, “The MOLEN -coded Instituto Superior Tecnico (IST), Universidade Tec-
processor,” in Proc. 11th Int. Conf. Field-Program. Logic Appl. (FPL), nica de Lisboa, Lisbon, Portugal, in 1996.
Aug. 2001, vol. 2147, pp. 275–285. He is currently an Associate Professor with the
[25] R. Chaves, G. Kuzmanov, S. Vassiliadis, and L. A. Sousa, “Reconfig- Department of Electrical and Computer Engineering,
urable memory based AES co-processor,” in Proc. 13th Reconfigurable IST, and a Senior Researcher with R&D, Instituto
Arch. Workshop (RAW), Apr. 2006, pp. 192–192. de Engenharia de Sistemas e Computadores. His
research interests include VLSI architectures, com-
puter architectures, computer arithmetic, parallel and
distributed computing, multimedia systems. He has
Ricardo Chaves (S’05) received the graduate and contributed to more than 100 papers to journals and international conferences.
Masters degree in electronics and computer science Dr. Sousa is member the IEEE Computer Society and a member of ACM.
from the Instituto Superior Técnico (IST), Portugal,
Spain, in 2001 and 2003, respectively. Currently, he
is pursuing the Ph.D. degree in electrical engineering
from IST and TUDelft. Stamatis Vassiliadis (M’86–SM’92–F’97) was born
He is a Researcher with R&D, Instituto de Engen- in Manolates, Samos, Greece, in 1951. Regrettably,
haria de Sistemas e Computadores (INESC-ID). His Prof. Vassiliadis deceased in April 2007.
research interests include reconfigurable computing, He was a Chair Professor in the Electrical En-
cryptography, computer architecture, computer arith- gineering, Mathematics, and Computer Science
metic, and design of VLSI circuits. (EEMCS) Department, Delft University of Tech-
nology (TU Delft), Delft, The Netherlands. He
previously served in the Electrical Engineering fac-
ulties of Cornell University, Ithaca, NY, and the State
Georgi Kuzmanov (S’95–M’05) was born in Sofia, University of New York (S.U.N.Y.), Binghamton,
Bulgaria, in 1974. He received the M.Sc. degree in NY. For a decade, he worked with IBM, where he
computer systems from the Technical University was involved in a number of advanced research and development projects.
of Sofia, Bulgaria, in 1998, and the Ph.D. degree Dr. Vassiliadis was a recipient of numerous awards for his work, including 24
in computer engineering from Delft University of publication awards, 15 invention awards, and an outstanding innovation award
Technology (TU Delft), Delft, The Netherlands, in for engineering/scientific hardware design. His 72 US patents rank him as the
2004. top all-time IBM inventor. He received an honorable mention Best Paper Award
He is currently with the Computer Engineering at the ACM/IEEE MICRO25 in 1992, Best Paper Awards in the IEEE CAS
Laboratory, TU Delft. Between 1998 and 2000, he (1998), IEEE ICCD (2001), PDCS (2002), and the Best Poster Award in the
was with ”Info MicroSystems” Ltd., Sofia, where he IEEE NANO (2005). He was an ACM fellow and a member of the Royal Dutch
was involved in several reconfigurable computing and Academy of Arts and Science (KNAW).
ASIC projects as a Research and Development Engineer. His research interests
include reconfigurable computing, media processing, computer arithmetic, com-
puter architecture and organization, vector processors, and embedded systems.
Dr. Kuzmanov was a recipient of the “Veni” research grant from the Dutch
Technology Foundation STW, in 2006.