
Fast Implementation of MD5: Hardware and Software

Austin Costley and Chase Kunz
Department of Electrical and Computer Engineering
Utah State University
Logan, Utah 84321

Abstract: This paper explores the MD5 hashing algorithm as implemented in hardware and software. Optimisations are suggested and benchmarked. A six-character password using uppercase and lowercase alphabetic characters was hashed and provided as a challenge to find. The authors found the preimage to be HeyHey by using a password generator and an MD5 hashing function.

I. INTRODUCTION

Cryptographic hashing is used in many ways; most commonly it is used to verify data integrity and to store passwords. The MD5 algorithm was developed as a method for generating 128-bit hash values. Due to the nature of the algorithm, the hash value cannot be used directly to recover the original candidate, or preimage. This property makes the MD5 hash algorithm a prime candidate for password storage on a computer. When a user creates a password on a system, the password is hashed and stored in a password file. If an attacker gains access to the password file, he or she would only be able to see the password hashes.

This paper discusses software and hardware implementations of the MD5 hashing algorithm, and proposes possible optimisations.

A. Structure of Paper

Section II discusses our software implementation. It includes information regarding optimisation choices and benchmarked results. It also details software choices and attempted optimisations. Section III outlines the hardware implementation on the Digilent Nexys 2 development board. It details the design decisions, pitfalls, and results.

II. SOFTWARE IMPLEMENTATION

Our basic implementation of MD5 was written as a simple C program. A function was written to output the MD5 hash corresponding to a given character string. A structure was built around this function to supply it with every combination of lowercase and uppercase alphabetic characters and to compare each resulting hash to the target MD5 hash. Some basic optimisations were performed to create a baseline version of the code from which to optimise further, namely loop unrolling and SIMD instructions via AVX extensions. Loop unrolling removed the loop-control overhead from the many iterations in MD5 where loops would usually be used. AVX details can be seen in Section II-C.

A. Optimisation I

We saw that many unnecessary calculations were being performed while comparing MD5 hashes against the target hash. Comparisons were performed on four separate 32-bit blocks of each hash. We performed only the computations necessary for the first block before comparing it. If the first block did not match the target, we moved on to the next hash, skipping the computations that would otherwise have been performed for the remaining blocks. We saw a twenty percent increase over baseline from this optimisation.

B. Optimisation II

With the implementation of AVX intrinsics our program was able to calculate four hashes simultaneously. Unfortunately, the candidate space is based on twenty-six characters, which is not cleanly divisible by four. In our baseline implementation, at the end of each iteration through the character space there were two unused slots in our set of four candidates to be hashed. We created a circular buffer to ensure that all hashes were computed on data that would be used. From this optimisation we saw a nine percent increase over baseline.

C. AVX Limitation

AVX takes advantage of modern CPUs. Its advantages over SSE are a three-operand form, C = A + B, rather than only A = A + B, and an extension to 256-bit registers. Using these registers, one would ideally be able to compute the 32-bit chunks of eight hashes simultaneously. As this would allow for tremendous speedup, our original goal was to implement an eight-hash version of the software. Although we came very close to accomplishing eight-hash computation, we found that our calculations were limited by the carry bits of the 256-bit registers. As shown in Figure 3, when an addition overflows, a carry spills from the odd-index 32-bit integers into the even-index integers. The units carried into the even-index integers corrupt the data held in them. Modulo 2^32 addition is needed for data integrity. We implemented modulo 2^32 addition, only to find that the overhead needed to perform the operation slowed the program down compared to adding four 32-bit integers two times. For this reason we calculated four hashes simultaneously. However, our code is set up so that eight simultaneous hashes can easily be implemented once AVX2 extensions are available.
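The carry problem described in Section II-C can be illustrated with a minimal sketch in plain C. This is a scalar model, not the authors' AVX code: packing two 32-bit lanes into one uint64_t is an assumption made for illustration, and the names pack2, add_wide, and add_lanes are hypothetical.

```c
#include <stdint.h>

/* Two 32-bit lanes packed into one 64-bit word (illustrative layout):
 * lane 0 sits in the low half, lane 1 in the high half. */
static uint64_t pack2(uint32_t lane0, uint32_t lane1) {
    return ((uint64_t)lane1 << 32) | lane0;
}

/* Wide add: both lanes are treated as a single 64-bit integer,
 * so a carry out of lane 0 leaks into lane 1 and corrupts it. */
static uint64_t add_wide(uint64_t a, uint64_t b) {
    return a + b;
}

/* Lane-wise add: each lane is summed separately and wraps
 * modulo 2^32, so no carry crosses the lane boundary. */
static uint64_t add_lanes(uint64_t a, uint64_t b) {
    uint32_t lo = (uint32_t)a + (uint32_t)b;
    uint32_t hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32);
    return pack2(lo, hi);
}
```

Adding pack2(0xFFFFFFFF, 1) and pack2(1, 2) with add_wide leaves 4 in the upper lane because of the leaked carry, whereas add_lanes gives the intended 3. The AVX2 intrinsic _mm256_add_epi32 performs this kind of lane-wise addition across eight 32-bit lanes in hardware.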

            Hash Rate (MH/sec)    Improvement
            Baseline              Over Baseline
md5 base    1.79                  -
md5 1       2.16                  20.26%
md5 2       1.96                  9.49%
md5         2.41                  34.6%
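The lane-packing idea behind Optimisation II can be illustrated by counting batches. The sketch below is not the authors' circular buffer. It only compares how many four-wide batches are needed when each pass over the alphabet is padded to a multiple of the lane count, versus when candidates are packed continuously across pass boundaries. The function names and parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Batches needed when every pass over the alphabet is padded up to a
 * multiple of the lane count (some lanes in the last batch are wasted). */
static uint64_t batches_padded(uint64_t passes, uint64_t alphabet, uint64_t lanes) {
    assert(lanes > 0);
    return passes * ((alphabet + lanes - 1) / lanes);
}

/* Batches needed when candidates are packed back to back, so a batch
 * may straddle the end of one pass and the start of the next. */
static uint64_t batches_packed(uint64_t passes, uint64_t alphabet, uint64_t lanes) {
    assert(lanes > 0);
    return (passes * alphabet + lanes - 1) / lanes;
}
```

For 26 passes over a 26-character alphabet with four lanes, padding needs 182 batches while continuous packing needs 169, since 26 leaves a remainder of two when divided by four.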

Fig. 1. Simulation Output
            Hash Rate (MH/sec)
            Optimised (-O3)
md5 base    4.57
md5 1       4.56
md5 2       4.93
md5         5.02
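To put these hash rates in context, the following sketch enumerates the candidate space and estimates the time for a full sweep. It assumes an exhaustive search over all six-character strings of lowercase and uppercase letters (52^6 candidates). The alphabet ordering and the names candidate_count, index_to_candidate, and sweep_seconds are hypothetical and not taken from the authors' program.

```c
#include <assert.h>
#include <stdint.h>

#define PW_LEN 6
#define ALPHABET_LEN 52
static const char ALPHABET[] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

/* Total number of candidates: ALPHABET_LEN raised to the power PW_LEN. */
static uint64_t candidate_count(void) {
    uint64_t n = 1;
    for (int i = 0; i < PW_LEN; i++) {
        n *= ALPHABET_LEN;
    }
    return n;
}

/* Map an index in [0, candidate_count()) to a candidate string by
 * writing the index as base-52 digits, least significant digit last. */
static void index_to_candidate(uint64_t idx, char out[PW_LEN + 1]) {
    assert(idx < candidate_count());
    for (int i = PW_LEN - 1; i >= 0; i--) {
        out[i] = ALPHABET[idx % ALPHABET_LEN];
        idx /= ALPHABET_LEN;
    }
    out[PW_LEN] = '\0';
}

/* Seconds needed to hash every candidate at a given rate in MH/sec. */
static double sweep_seconds(double mega_hashes_per_sec) {
    assert(mega_hashes_per_sec > 0.0);
    return (double)candidate_count() / (mega_hashes_per_sec * 1e6);
}
```

Under these assumptions the space holds 19,770,609,664 candidates, and sweep_seconds(5.02) comes to roughly 3,900 seconds, a little over an hour, for a complete sweep at the fastest measured software rate.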

III. HARDWARE IMPLEMENTATION

In order to increase the MD5 hash rate, we implemented the MD5 algorithm on a Field Programmable Gate Array (FPGA). A Digilent Nexys 2 development board was used to implement our design. We decided to employ a loop-unrolled strategy to eliminate the overhead of iteration control. The first step of our design was to develop a 32-bit adder. This was accomplished by designing a standard single-bit full adder with carry-in and carry-out bits. Eight full adders were combined to create an 8-bit adder, and four of those were combined to create a 32-bit adder. This module was critical to the MD5 algorithm, and designing it provided a great Verilog review. The next module we created was a 32-bit circular shifter. In order to optimise the shifter we used a bit-mapping technique in which the bits were reassigned based on the shift value. This approach performs the shift faster than a standard shifting operator.

The basic design of our hardware implementation was to create a module for each round. Each iteration would call the appropriate round, taking as input the 32-bit message piece, the shift value, the 32-bit sine result (T-value), and the current 32-bit A, B, C, and D values. The sine values and shift values were pre-calculated and hard-coded into the module call. This was chosen to eliminate the need for look-up tables and iteration counters. A tradeoff of this design decision is that each iteration requires a new wire for each of the 32-bit outputs, resulting in 256 output wires for round control.

When the design was completed, it was also discovered that this type of loop unrolling created a clock independence that presented problems for the throughput calculations. Since the design could work independently of the onboard clock, the ISE Design Suite software did not generate a timing report that would contain the timing constraints. An alternative approach was attempted by writing the shift value to a wire on each clock cycle, and then using that wire as the input to the round. Ultimately, this design did not generate the timing report on implementation, but it did increase the simulated hash rate.

The throughput was calculated by multiplying the block size by the clock frequency and dividing by the cycles per block. The size of the hash block was 512 bits and the clock frequency was 50 MHz. If the timing report had been generated, we could have used the maximum possible clock frequency for this value. The cycles per block were determined by dividing the total simulated time (630 ns) by the clock period (20 ns). Using this calculation, the throughput of the FPGA implementation was found to be about 820 MH/s (megahashes per second).

Figure 1 shows the wires relating to the desired hash and the resulting hash. The wires Atar, Btar, Ctar, and Dtar are the 32-bit portions of the target hash, and Afin, Bfin, Cfin, and Dfin are the hash results of the MD5 hashing algorithm. Figure 2 shows the top-level output of the MD5 algorithm. The top-level Verilog file initiated a 50 MHz clock and configured two outputs, port and start. The start register was set high when the MD5 module was instantiated. This happened at 5 ns. The port register was set high when the MD5 algorithm had generated the target hash. The cursor shows that this occurred at 635 ns.

Fig. 2. Simulation Top Level Traces

Fig. 3. AVX 32-bit integer addition overflow

Our greatest future software optimisation would be to use AVX2 intrinsics, specifically the 32-bit integer add and shift instructions. This would allow us to increase from four to eight simultaneous hashes. Another optimisation that would greatly improve our design would be the implementation of threading to take full advantage of the multiple-core architecture of modern CPUs. The next step in our hardware development would be to implement pipelining and to further refine our implementation to generate timing reports and timing constraints. This project was a great opportunity to understand the use of hashing algorithms in computer security, and to explore a possible attack platform.
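The throughput calculation described in Section III can be checked with a short sketch. The function names are hypothetical; the inputs are the values stated in the text (512-bit block, 50 MHz clock, 630 ns simulated time, 20 ns clock period). The formula as described multiplies a size in bits by a frequency, so the sketch reports its result in bits per second.

```c
#include <assert.h>

/* Cycles per block: total simulated time divided by the clock period. */
static double cycles_per_block(double sim_time_ns, double clock_period_ns) {
    assert(clock_period_ns > 0.0);
    return sim_time_ns / clock_period_ns;
}

/* Throughput in bits per second: block size times clock frequency,
 * divided by the number of cycles needed per block. */
static double throughput_bits_per_sec(double block_bits, double clock_hz,
                                      double cycles) {
    assert(cycles > 0.0);
    return block_bits * clock_hz / cycles;
}
```

With the stated inputs, cycles_per_block(630, 20) is 31.5 and throughput_bits_per_sec(512, 50e6, 31.5) is roughly 813 million bits per second, in the neighbourhood of the reported figure of about 820.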