Cryptanalysis of Hash Functions using an approach based on SAT-solvers.

Jair Cazarin Villanueva 125535

Dr. Mauricio Osorio

February 2008

1. Introduction.
Cryptography is the study of mathematical techniques related to aspects of information security such as confidentiality, data integrity, entity authentication, and data origin authentication [1]. In other words, cryptography is about the prevention and detection of cheating and other malicious activities. Cryptographic hash functions produce hash values, which concisely represent the longer messages or documents from which they were computed [2]; they serve a variety of purposes, chiefly in cryptography. Computer security therefore depends heavily on the strength of hash functions. Examples of hash functions are MD4 [7], MD5 [8] and the SHA family [9]. The main role of cryptographic hash functions is in the provision of message integrity checks and digital signatures [3]. The formal analysis of their robustness is therefore of utmost importance; unfortunately, several standard cryptographic hash functions were broken in 2005 [4]. Breaking, in this case, means finding a way to efficiently produce different messages that are mapped to the same hash value by some hash function, which would compromise the security of the applications in which these functions are used. Exploring other kinds of approaches therefore seems to be the next step toward greater assurance of security. On the other hand, we have Boolean satisfiability (SAT) solvers, which attack the problem of determining whether the variables of a given Boolean formula can be assigned in such a way as to make the formula evaluate to true. The use of SAT-solvers in various applications is increasing, and a growing number of problems that have been efficiently encoded into SAT are being tackled successfully by these programs [5]; furthermore, the performance of the algorithms behind these programs has improved [6].
Most of these applications still belong to the traditional domains of formal verification and artificial intelligence. Although several applications of SAT-solvers to cryptanalysis have been described in the literature, they failed to produce any attacks of interest to cryptologists [4] until the research of Ilya Mironov and Lintao Zhang in [4]. In that paper, they showed that some of the attacks on these hash functions can be automated by encoding them as CNF formulas that are within reach of modern SAT-solvers, thereby delegating the more laborious part of the attack to the solver and creating the first example of a SAT-solver-aided cryptanalysis of a non-trivial cryptographic primitive. The strategy was based on the fact that the original attacks consisted of several steps, each of which involves a lot of bit-tweaking and manual work, to the point of having to keep track of as many as 122 Boolean conditions for the simplest function. They found a way to transform these conditions into CNF formulas, which are then handed to a SAT-solver in order to automate certain parts of the attack, obviating the need for compiling tables of sufficient conditions and designing clever message modification techniques. So far, research in this field has found ways to automate only certain parts of these attacks; these approaches still need to be verified and tested in order to reach a complete automation and/or improvements in both areas. This can be accomplished in two ways: one is creating a new toolkit for cryptanalysts, and the other is improving SAT-solvers for specific cryptographic problems.

2. Objectives.
2.1 General Objective.
To design and test this approach with different kinds of SAT-solvers in order to find the most suitable one; then to implement a semiautomatic testing tool that helps cryptanalysts find weaknesses in hash functions; and, through this tool, to analyze the approach in more depth in order to detect possible improvements to SAT-solver algorithms for this specific kind of problem.

2.2 Specific Objectives.
• Understand the theory and construction of hash functions, including their principles.
• Study the MD4 and MD5 family of hash functions and how it works.
• Interpret the problem of Boolean satisfiability, including its complexity and variations.
• Install and use the MiniSAT SAT-solver.
• Understand how the attacks on hash functions run.
• Study the discovery process of the differential path for MD4 and MD5.
• Work out how to automate the differential path via SAT-solvers.
• Find one or two more SAT-solvers in order to compare results with MiniSAT.
• Generate full collisions for the MD4 and MD5 hash functions.

3. Research Scope.
This thesis focuses on the MDx family of hash functions designed by Ron Rivest. There are other state-of-the-art hash functions, such as SHA-0 and SHA-1, and they share similar design principles; however, it is well known that the attack on SHA-1 is so far only theoretical [10], and for SHA-0 generating a full collision would require 3 million CPU hours using common SAT-solvers [4]. Moreover, the first two stages of the attacks are usually carried out by hand or by applying heuristics that demand a great deal of creativity. A fully automatic attack is therefore not yet possible, and we are going to replicate only the attacks already known in the literature. As for SAT-solvers, we will use MiniSAT [11] as the main solver and, if time permits, we will test two or more additional SAT-solvers, to be chosen as the research advances.

4. Hardware and Software.
A simple personal computer with modest processing and storage capabilities is sufficient to run the approaches and algorithms in this research. For this reason, a laptop with the following features will be used:
• Dell XPS M1210, Intel Core 2 Duo, 2 GHz.
• 3 GB of RAM.

For the software, so far we have only considered MiniSAT [11], a minimalistic, open-source SAT-solver intended to help researchers and developers with SAT-related projects. We will probably also use SatELite, a CNF minimizer and preprocessor for MiniSAT. The reasons for choosing MiniSAT include key features such as efficiency, ease of integration and ease of modification and, more importantly, its performance in SAT competitions [6].

5. Problem Statement.
5.1 Theory of Hash Functions.
Cryptographic hash functions, also known as one-way hash functions, are a major tool in modern cryptography. A hash function is defined as a computationally efficient function mapping binary strings of arbitrary length to binary strings of some fixed length, called hash-values [1]. The basic idea is that a hash-value serves as a compact representative of an input string. In the cryptographic field, a hash function h is typically chosen such that it is computationally infeasible to find two distinct inputs which hash to a common value, that is, inputs x ≠ y such that h(x) = h(y) (this is called the collision-resistance property and is recognized as the “gold standard” of security for hash functions [13]), and also such that, given a specific hash-value y, it is computationally infeasible to find an input x such that h(x) = y (this is known as preimage resistance). The formal definition of a cryptographic hash function is a mapping

$h: \{0,1\}^* \rightarrow \{0,1\}^n$

where $\{0,1\}^*$ denotes the set of bit strings of arbitrary length and n is the fixed length of the hash value. The image h(X) of some message $X \in \{0,1\}^*$ is called the hash value of X [16].
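As a small illustration of the mapping from strings of arbitrary length to strings of fixed length, the following sketch uses Python's standard hashlib module to hash two inputs of very different sizes with MD5. The helper name md5_hex is hypothetical and only serves this example.

```python
import hashlib


def md5_hex(message: bytes) -> str:
    """Return the MD5 hash value of `message` as hexadecimal digits."""
    return hashlib.md5(message).hexdigest()


# Inputs of very different lengths map to hash values of the same length.
short_digest = md5_hex(b"")
long_digest = md5_hex(b"a" * 1_000_000)

# Each hexadecimal digit encodes 4 bits, so both values are 128 bits long.
print(len(short_digest) * 4, len(long_digest) * 4)
```

Whatever the input length, the output has n = 128 bits, which is the fixed length used by MD4 and MD5.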

The most common cryptographic uses of hash functions are with digital signatures and for data integrity. With digital signatures, a long message is usually hashed and only the hash-value is signed (this is known as a signature scheme) [14]. The party receiving the message then hashes the received message and verifies that the received signature is correct for this hash-value. Note here that the inability to find two messages with the same hash-value is a security requirement, since otherwise the signature on one message hash-value would be the same as that on another, allowing a signer to sign one message and at a later point in time claim to have signed another. In the case of data integrity, the hash-value corresponding to a particular input is computed at some point in time. The integrity of this hash-value is protected in some manner. At a subsequent point in time, to verify that the input data has not been altered, the hash-value is recomputed using the input at hand and compared for equality with the original hash-value. Specific applications include virus protection and software distribution. It is very important to differentiate between a weak one-way hash function and a strong one-way hash function, so let us define each one. A weak one-way hash function is a function H such that [14]:
• H can be applied to an argument of any size.
• H produces a fixed-size output.
• Given H and y, it is easy to compute H(y); that is, H is computable in polynomial time.
• Given H and a “suitably chosen” y, it is computationally infeasible to find y’ ≠ y such that H(y) = H(y’).
A strong one-way hash function is a function H such that [14]:
• H can be applied to an argument of any size.
• H produces a fixed-size output (larger than that of a weak hash function).
• Given H and y, it is easy to compute H(y).
• Given H, it is computationally infeasible to find any pair y, y’ such that y ≠ y’ and H(y) = H(y’).

The main advantages of strong hash functions over weak ones are that they are easier to use in systems design, because there are no preconditions on the selection of y, and that they provide the full claimed level of security even when used repeatedly. In this thesis research, we will use only strong one-way hash functions. The literature also offers a different taxonomy for hash functions [17]:
• Preimage resistant: given a hash value y, it is computationally infeasible to find a message x with h(x) = y.
• Second preimage resistant: given a message y, it is computationally infeasible to find a message x ≠ y with h(x) = h(y).
• Collision resistant: it is computationally infeasible to find a collision, that is, a pair of two different messages y and y’ with h(y) = h(y’).
We can infer that one-way is equivalent to preimage resistant and that a weak hash function is second preimage resistant. These properties will be used later, when we discuss hash function construction. The first standard hash function was MD4 (designed by Ron Rivest [7]), which was followed by an improved version called MD5. Then came the first NIST-approved hash function, SHA-0 (Secure Hash Algorithm) [9], which adopted the general structure of MD4 and was replaced two years later by a new version, SHA-1 [4]. Currently, just two hash functions are in widespread use: MD5 and SHA-1. One of the first researchers to study the construction of hash functions was Damgård, from Aarhus University in Denmark. One idea in his theory of collision-free hash function construction was to consider families of hash functions instead of a single hash function, in order to make a complexity-theoretic treatment possible [14]. In this sense, we can say that MD4, MD5, SHA-0 and SHA-1 belong to one family.

Most hash functions have a similar iterative structure, which is based around what is termed a compression function [12]. In summary, the computation of the hash value for a message depends on what is called a chaining variable. At the start of hashing, this chaining variable has a fixed initial value, which is specified as part of the algorithm. The compression function is then used to update the value of this chaining variable in a suitably complex way under the action and influence of the part of the message being hashed. This process continues recursively, with the chaining variable being updated under the action of different parts of the message, until the entire message has been used. The final value of the chaining variable is then output as the hash value corresponding to that message. Later we will study some of the approaches to the basic building block, a collision-resistant compression function.

Figure 1. The use of a compression function in an iterative hash function. MDx and SHAx follow the same design.
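The iterative structure can be sketched in a few lines. The compression function below is a deliberately weak toy, and the block size, initial value and zero-byte padding are all assumptions made for illustration; none of them correspond to MD4, MD5 or SHA.

```python
BLOCK_SIZE = 8              # bytes per message block (toy value)
INITIAL_VALUE = 0x67452301  # fixed initial chaining value (arbitrary here)


def toy_compress(chaining: int, block: bytes) -> int:
    """Toy compression function: mixes one block into a 32-bit chaining value.

    This is NOT secure; it only shows where a real compression function plugs in.
    """
    value = chaining
    for byte in block:
        value = (value * 31 + byte) & 0xFFFFFFFF
    return value


def iterated_hash(message: bytes) -> int:
    """Hash `message` by updating the chaining variable block by block."""
    # Simplistic padding: zero bytes up to a multiple of the block size.
    padded = message + bytes(-len(message) % BLOCK_SIZE)
    chaining = INITIAL_VALUE
    for start in range(0, len(padded), BLOCK_SIZE):
        chaining = toy_compress(chaining, padded[start:start + BLOCK_SIZE])
    return chaining  # the final chaining value is the hash value


print(hex(iterated_hash(b"hello world")))
```

With this naive zero padding, b"abc" and b"abc\x00" hash to the same value, which is one reason real designs append a 1 bit and the message length before iterating.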

5.2 MD4.
MD4 is a message-digest algorithm developed by Rivest in 1990 that operates on 32-bit words. The message is padded to ensure that its length in bits plus 64 is divisible by 512. A 64-bit binary representation of the original length of the message is then concatenated to the message [17]. The message is processed in 512-bit blocks, and each block is processed in three distinct rounds.
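The padding rule can be sketched as follows, assuming the message consists of whole bytes and following the details given in Appendix 1 (a single 1 bit, then zero bits, then the length with the low-order word first). The function name md_pad is hypothetical.

```python
import struct


def md_pad(message: bytes) -> bytes:
    """Pad `message` so that its total length is a multiple of 512 bits."""
    bit_length = 8 * len(message)
    # A single 1 bit (the byte 0x80) followed by zero bits, until the length
    # is congruent to 448 modulo 512 bits (56 modulo 64 bytes).
    padded = message + b"\x80" + bytes((55 - len(message)) % 64)
    # The original length in bits as a 64-bit value, low-order bytes first.
    return padded + struct.pack("<Q", bit_length % 2**64)


for size in (0, 55, 56, 64):
    print(size, len(md_pad(b"a" * size)))
```

A 55-byte message still fits in one 64-byte block, while a 56-byte message needs a second block because padding is always performed.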

Attacks on versions of MD4 with either the first or the last round missing were developed very quickly by den Boer, Bosselaers et al. [18]. In addition, [19] has shown how collisions for the full version of MD4 can be found in under a minute on a typical PC.

5.3 MD5.
Some weaknesses that might lead to a compromise were discovered in MD4, so RSA had to improve it, and MD5 was born in 1991 (also designed by Rivest). It is basically MD4 with “safety-belts”, and while it is slightly slower than MD4, it is more secure [17]. The algorithm consists of four distinct rounds, which have a slightly different design from those of MD4. The message-digest size, as well as the padding requirements, remains the same. Attacking MD5 is a much more involved proposition than attacking MD4, since it is a far more complicated algorithm to analyze.

5.4 Overview of the attacks on hash functions.
The attacks on the MDx family of hash algorithms are very similar. They can be summarized as finding a 512-bit message M such that $H(IV, M) = H(IV, M \circ \delta)$, where H is the compression function and δ is fixed [4]. As [2] notes, a generic attack has complexity $2^{n/2}$, where n = 128 or 160. The trick lies in choosing a good δ and using techniques for finding an M that take advantage of the weaknesses of the compression function, which bring the complexity of the attack to fewer than $2^{42}$ evaluations of the hash function. In more detail, these attacks are divided into four stages [2]:
1. Choose the message differences $\Delta^\circ m_0, \ldots, \Delta^\circ m_{15}$. Here ° stands for both the XOR difference and the difference mod $2^{32}$.
2. Choose a differential path $\Delta^\circ Q_1, \ldots, \Delta^\circ Q_r$, where r is the number of rounds (r = 48, 64 or 80).
3. Find a set of sufficient conditions on the message $M = (m_0, \ldots, m_{15})$ and the intermediate variables $Q_1, \ldots, Q_r$ that guarantee that the message pair $M$, $M' = (m_0 \circ \Delta^\circ m_0, \ldots, m_{15} \circ \Delta^\circ m_{15})$ follows the differential path $\Delta^\circ Q_1, \ldots, \Delta^\circ Q_r$.
4. Choose a message M such that all sufficient conditions hold.

In other words, choosing the differential path allows us to restrict the space of possible solutions, which is a clever approach because we do not start by blindly searching for two messages that collide under the hash function. In addition, the differential path must be chosen carefully in order to maximize the probability of a collision. SAT-solvers are expected to be very helpful in automating the third and fourth steps.
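To make the two kinds of difference concrete, the following sketch computes both for a pair of 32-bit words. The function names and the sign convention (second word minus first) are assumptions for this illustration, not notation taken from the attacks themselves.

```python
MASK32 = 0xFFFFFFFF


def xor_difference(word: int, word_prime: int) -> int:
    """Bitwise (XOR) difference between two 32-bit words."""
    return (word ^ word_prime) & MASK32


def modular_difference(word: int, word_prime: int) -> int:
    """Difference modulo 2^32, taken here as word_prime - word."""
    return (word_prime - word) & MASK32


# Two pairs with the same modular difference (+1) ...
pairs = [(0x00000000, 0x00000001), (0x0000FFFF, 0x00010000)]
for m, m_prime in pairs:
    # ... but different XOR differences, because of carry propagation.
    print(hex(modular_difference(m, m_prime)), hex(xor_difference(m, m_prime)))
```

The same modular difference can correspond to many XOR differences depending on how carries propagate, which is why the sufficient conditions are stated on individual bits of the intermediate variables.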

5.5 Boolean Satisfiability problem.
SAT is the problem of deciding whether there is an assignment to the variables of a propositional formula that makes the formula true. Some problems can be encoded naturally into SAT, for example planning, constraint satisfaction, vision interpretation and diagnosis. In addition, satisfiability is closely related to theorem proving, for example through refutation procedures [21]. There are at least three reasons for the explosion of research into SAT:
• Improvement in search procedures for SAT.
• Identification of hard SAT problems.
• The demonstration that we can often solve real-world problems by encoding them into SAT.
It is very important to note that SAT belongs to the class of NP-complete problems, whose algorithmic solutions are currently believed to have exponential worst-case complexity [24]. Much research into SAT considers problems in conjunctive normal form (CNF). A formula is in CNF iff it is a conjunction of clauses; a clause is a disjunction of literals, where a literal is a negated or un-negated Boolean variable. A clause containing just one literal is called a unit clause. A clause containing no literals is called the empty clause and is interpreted as false. We can place restrictions on the size of clauses in the problem [22].

Equivalently, a CNF formula is satisfied by an assignment when each clause has at least one literal that is true under the assignment. Such a clause is then said to be satisfied. If there is no assignment satisfying all clauses, the CNF is said to be unsatisfiable. An example of what an instance of SAT looks like:

$(x \lor y \lor z)(x \lor \bar{y})(y \lor \bar{z})(z \lor \bar{x})(\bar{x} \lor \bar{y} \lor \bar{z})$

SAT is a typical search problem. We are given an instance I (that is, some input data specifying the problem at hand, in this case a Boolean formula in a conjunctive normal form), and we are asked to find a solution S (an object that meets a particular specification, in this case an assignment that satisfies each clause). If no such solution exists, we must say so.
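A CNF instance and the search for a solution can be sketched directly. The encoding below (a clause as a list of nonzero integers, where k stands for variable k and -k for its negation) and the function names are assumptions chosen for this illustration; the exhaustive search is only practical for a handful of variables.

```python
from itertools import product


def is_satisfied(clauses, assignment):
    """True if every clause has at least one literal that is true."""
    return all(
        any(assignment[abs(literal)] == (literal > 0) for literal in clause)
        for clause in clauses
    )


def brute_force_sat(clauses, num_vars):
    """Try all 2^num_vars assignments; return a solution S or None."""
    for values in product([False, True], repeat=num_vars):
        assignment = {var + 1: values[var] for var in range(num_vars)}
        if is_satisfied(clauses, assignment):
            return assignment
    return None  # no solution exists: the instance is unsatisfiable


# Hypothetical instance: (x1 or x2) and (not x1 or x2) and (not x2 or x3)
instance = [[1, 2], [-1, 2], [-2, 3]]
print(brute_force_sat(instance, 3))
```

An empty clause makes `any(...)` false for every assignment, which matches its interpretation as false.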

5.6 SAT-Solver Algorithms.
A wide variety of techniques have been developed for solving SAT instances; all of them can be classified as either complete or approximate. Complete methods systematically examine the entire search space and, in bounded time, either return a solution (if one exists) or report that the formula is unsatisfiable. In this thesis we are going to focus on software that implements a complete method, although later we may analyze the behavior of some approximate methods.
5.6.1 DPLL Algorithm.

The original Davis-Putnam procedure was based on a resolution rule that eliminated the variables one by one and added all possible resolvents to the set of clauses; this was known as the DP method. Unfortunately, this procedure requires exponential space, so the resolution rule was quickly replaced with a splitting rule, which divides the problem into two smaller subproblems. The resulting procedure, published in 1962 by Martin Davis, George Logemann and Donald Loveland [25], is known as DPLL after its authors. Solvers based on it are among the fastest known procedures for satisfiability testing that are not just sound but also complete. In summary, DPLL is depth-first search with backtracking and unit propagation.

The algorithm starts with a set of clauses, gradually instantiates variables, and simplifies the set of clauses as it goes. The stopping cases are:
• An empty set of clauses is satisfiable; in this case, the entire algorithm terminates with success.
• A set containing an empty clause is not satisfiable; in this case, the algorithm backtracks and tries a different value for an instantiated variable.
Algorithm (the input is a set of clauses):
1. If the clause set is empty, return “success”.
2. If the clause set contains an empty clause, return “failure”.
3. If the clause set contains a unit clause (i.e. a single literal):
   a. Do unit propagation starting with this literal; call DPLL recursively on the simplified clause set.
4. Otherwise, choose a variable u heuristically.
   a. If DPLL(clauses[u=T]) succeeds, return “success”.
   b. Otherwise, return the result of DPLL(clauses[u=F]).
DPLL(clauses[u=T]) means calling DPLL recursively with the variable u instantiated to the value true. Note that this is a recursive implementation of a depth-first search, with unit propagation as forward checking. Unit propagation is a procedure that helps to simplify a set of clauses: when we have a unit clause (one that consists of a single literal), the correct value of the variable in it is obvious. Other clauses are simplified by applying the following two rules:
1. If a clause consists of a variable alone, say x, then any other clause containing not x can be simplified to consist of just its other disjuncts.
2. If a clause consists of a negated variable alone, say not x, then any other clause containing x can be simplified to consist of just its other disjuncts.
In both cases, any clause that contains the unit literal itself is already satisfied and can be removed from the set.

With good data structures, we can implement unit propagation to take linear time in the size of the input set of clauses.
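The steps above can be sketched as a short recursive function. This is a minimal illustration, not the data structures of a real solver: clauses are lists of nonzero integers (k for a variable, -k for its negation), the variable choice in step 4 is deliberately naive, and the simplification copies the clause set instead of running in linear time. The names assign and dpll are assumptions of this sketch.

```python
def assign(clauses, literal):
    """Simplify `clauses` under the assumption that `literal` is true."""
    simplified = []
    for clause in clauses:
        if literal in clause:
            continue  # the clause is satisfied, so it is dropped
        # Remove the falsified literal; the other disjuncts remain.
        simplified.append([l for l in clause if l != -literal])
    return simplified


def dpll(clauses, assignment=None):
    """Return a satisfying assignment (dict: variable -> bool) or None."""
    assignment = dict(assignment or {})
    if not clauses:
        return assignment  # step 1: empty clause set, success
    if any(len(clause) == 0 for clause in clauses):
        return None  # step 2: empty clause, failure
    for clause in clauses:
        if len(clause) == 1:  # step 3: unit propagation
            literal = clause[0]
            assignment[abs(literal)] = literal > 0
            return dpll(assign(clauses, literal), assignment)
    variable = abs(clauses[0][0])  # step 4: naive choice of u
    for literal in (variable, -variable):  # try u=T, then u=F
        extended = {**assignment, variable: literal > 0}
        result = dpll(assign(clauses, literal), extended)
        if result is not None:
            return result
    return None


print(dpll([[1, 2], [-1, 2], [-2, 3]]))
print(dpll([[1, 2, 3], [1, -2], [2, -3], [3, -1], [-1, -2, -3]]))
```

The second call exercises backtracking: both branches on the first variable end in an empty clause, so the function reports that no assignment exists. Variables that never get constrained are simply absent from the returned dictionary.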
5.6.2 Stochastic Local Search Algorithms.

Approximate SAT algorithms have gained widespread attention because they offer a computationally feasible approach to finding high-quality solutions to NP-hard problems in a scalable and efficient manner [26]. SLS (stochastic local search) algorithms generally involve taking a candidate solution and performing some sort of perturbation, which results in one or more new candidate solutions. An evaluation function is then used to determine which of the candidate solutions should be accepted. This kind of algorithm also includes two operations called intensification and diversification [27]. Intensification is a means of greedily improving solution quality within a small area of the search space, in search of a local optimum, while diversification helps to prevent stagnation by ensuring that the search does not remain confined to regions that contain only suboptimal solutions. Incorporating some form of randomness has proved to be an efficient diversification mechanism, while intensification can be achieved through a variety of techniques, such as iterative improvement or the selection step in a genetic algorithm.
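The interplay of intensification and diversification can be sketched with a simple flip-based search in the spirit of WalkSAT. This is an illustrative assumption, not an algorithm taken from [26] or [27]; the parameter names (max_flips, noise, seed) are hypothetical, and the input is assumed to contain no empty clause.

```python
import random


def unsatisfied_clauses(clauses, assignment):
    """Clauses with no true literal under `assignment`."""
    return [
        clause for clause in clauses
        if not any(assignment[abs(l)] == (l > 0) for l in clause)
    ]


def local_search(clauses, num_vars, max_flips=1000, noise=0.5, seed=0):
    """Return a satisfying assignment, or None if none was found in time."""
    rng = random.Random(seed)
    # Candidate solution: a random truth value for every variable.
    assignment = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
    for _ in range(max_flips):
        unsatisfied = unsatisfied_clauses(clauses, assignment)
        if not unsatisfied:
            return assignment
        clause = rng.choice(unsatisfied)
        if rng.random() < noise:
            # Diversification: flip a random variable of the clause.
            variable = abs(rng.choice(clause))
        else:
            # Intensification: flip the variable that leaves the fewest
            # unsatisfied clauses (the evaluation function).
            def cost(var):
                flipped = {**assignment, var: not assignment[var]}
                return len(unsatisfied_clauses(clauses, flipped))
            variable = min((abs(l) for l in clause), key=cost)
        assignment[variable] = not assignment[variable]
    return None  # gave up: this does NOT prove unsatisfiability


print(local_search([[1, 2], [-1, 2], [-2, 3]], 3))
```

Unlike a complete method, a None result here only means that the flip budget was exhausted, so the search cannot certify that a formula is unsatisfiable.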

5.7 MiniSAT.
MiniSAT was described in the paper An Extensible SAT-solver by Niklas Eén and Niklas Sörensson, from Chalmers University of Technology in Sweden [5]. With the growing number of problems being encoded into SAT, they found that modifying an existing solver, which requires an understanding of both the problem domain and modern SAT techniques, was very difficult. For this reason, they developed a small, complete and efficient SAT-solver, giving sufficient implementation details to enable researchers around the world to construct their own solver in a very short time, in order to meet the needs of a particular application area. The ideas behind MiniSAT are based on conflict-driven backtracking, watched literals and dynamic variable ordering. MiniSAT is implemented in C++. Later, we will analyze the internal algorithms of MiniSAT in more depth.
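MiniSAT reads its input in the DIMACS CNF text format: a header line "p cnf <variables> <clauses>" followed by one clause per line, each terminated by 0. As a minimal sketch, assuming clauses are held as lists of nonzero integers, an instance could be serialized like this (the function name to_dimacs is hypothetical):

```python
def to_dimacs(clauses, num_vars):
    """Serialize a list of integer clauses as DIMACS CNF text."""
    lines = [f"p cnf {num_vars} {len(clauses)}"]
    for clause in clauses:
        # Literals separated by spaces; every clause ends with a 0.
        lines.append(" ".join(str(literal) for literal in clause) + " 0")
    return "\n".join(lines) + "\n"


# (x1 or not x2) and (x2 or x3)
print(to_dimacs([[1, -2], [2, 3]], 3), end="")
```

The resulting text can be saved to a file and passed to the solver on the command line; the exact invocation depends on the installed version.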

6. Bibliography already reviewed.
[2] Dejan Jovanović and Predrag Janičić. Logical analysis of hash functions. Pages 200–215. Springer Verlag, 2005.
[3] “RSA Laboratories - 2.1.6 What is a hash function?” http://www.rsa.com/rsalabs/node.asp?id=2176.
[4] Ilya Mironov and Lintao Zhang. Applications of SAT solvers to cryptanalysis of hash functions.
[6] “SAT Competitions.” http://www.satcompetition.org/.
[7] “RFC 1186 (rfc1186) - MD4 Message Digest Algorithm.” http://www.faqs.org/rfcs/rfc1186.html.
[8] “RFC 1321 (rfc1321) - The MD5 Message-Digest Algorithm.” http://www.faqs.org/rfcs/rfc1321.html.
[9] “RFC 3174 (rfc3174) - US Secure Hash Algorithm 1 (SHA1).” http://www.faqs.org/rfcs/rfc3174.html.
[11] “MiniSat Page.” http://minisat.se/.
[12] Ivan Damgård. Collision free hash functions and public key signature schemes. In David Chaum and Wyn L. Price, editors, Advances in Cryptology. Springer, 1988.
[14] Ivan Damgård. A design principle for hash functions. In Advances in Cryptology. Springer, 1990.
[15] Gilles Brassard. One way hash functions and DES. Advances in Cryptology. Berlin: Springer-Verlag, 1990.
[16] The one for MD4.
[18] Crypto FAQ, RSA. http://www.rsa.com/rsalabs/node.asp?id=2253

6. Bibliography partially reviewed.
[1] Menezes, A. et al. Handbook of Applied Cryptography. Boca Raton: CRC Press, 1997.
[5] Niklas Eén and Niklas Sörensson. An extensible SAT-solver. SAT 2003.
[10] Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu. Finding collisions in the full SHA-1.
[17] Ilya Mironov. Hash functions: theory, attacks and applications.
[19] B. den Boer and A. Bosselaers. An attack on the last two rounds of MD4. Advances in Cryptology - Crypto '91, Springer-Verlag (1992), 194-203.
[20] H. Dobbertin. Alf Swindles Ann. CryptoBytes (3) 1 (Autumn 1995).
[21] Gabbay, Dov and Christopher Hogger. Handbook of Logic in Artificial Intelligence and Logic Programming. Oxford: Clarendon Press, 1993.
[22] I.P. Gent and T. Walsh. "The search for Satisfaction". Internal Report, Dept. of Computer Science, University of Strathclyde, 1999.
[23] J. Gu, P. W. Purdom, J. Franco, and B. W. Wah. Algorithms for the Satisfiability (SAT) Problem: A survey. In "Satisfiability Problem: Theory and Applications", DIMACS Series in Discrete Mathematics and Theoretical Computer Science, American Mathematical Society, 1997, pp. 19-152.
[24] Sanjoy Dasgupta, Christos Papadimitriou, Umesh Vazirani. Algorithms. McGraw-Hill.
[25] Davis, M., G. Logemann, and D. Loveland (1962, July). A machine program for theorem-proving. Commun. ACM 5 (7), 394-397.
[26] Holger H. Hoos and Thomas Stützle. Stochastic local search: foundations and applications (2005).
[27] Li, C.M., and Anbulagan. Heuristics based on unit propagation for satisfiability problems. In Proc. 15th IJCAI. 1997.

7. Bibliography to Review.
* Philip Hawkes, Michael Paddon, Gregory G. Rose. Musings on the Wang et al. MD5 Collision. Cryptology ePrint Archive, Report 2004/264, 13 October 2004.
* M.J.B. Robshaw. On Recent Results for MD2, MD4 and MD5. RSA Laboratories Bulletin, News and advice from RSA Laboratories. Number 4. November 12, 1996.
* Hans Dobbertin. The Status of MD5 after a Recent Attack. RSA Laboratories, CryptoBytes, the technical newsletter of RSA Laboratories, a division of RSA Data Security, Inc. Number 2, Summer 1996.
* Ilya Mironov. Hash functions: Theory, attacks and applications. November 14, 2005.
* Ilya Mironov. Hash Functions: From Merkle-Damgård to Shoup. Computer Science Department, Stanford University.
* Bart Preneel. Analysis and Design of Cryptographic Hash Functions. February 2003.
* Vlastimil Klima. Finding MD5 Collisions on a Notebook PC Using Multi-message Modifications. March 31, 2005.
* Propositional Logic, Class Notes for CS264A, UCLA.
* Evgueni Goldberg and Yakov Novikov. BerkMin: a Fast and Robust Sat-Solver.
* Niklas Eén and Armin Biere. Effective Preprocessing in SAT through Variable and Clause Elimination.
* Joao Marques-Silva et al. GRASP: A Search Algorithm for Propositional Satisfiability.
* Irina Rish and Rina Dechter. Resolution versus Search: Two Strategies for SAT.
* Fabio Massacci. Using Walk-SAT and Rel-SAT for Cryptographic Key Search.

Appendix 1.
MD4 Algorithm Description
We begin by supposing that we have a b-bit message as input, and that we wish to find its message digest. Here b is an arbitrary nonnegative integer; b may be zero, it need not be a multiple of 8, and it may be arbitrarily large. We imagine the bits of the message written down as follows:

    m_0 m_1 ... m_{b-1}

The following five steps are performed to compute the message digest of the message.

Step 1. Append padding bits
The message is "padded" (extended) so that its length (in bits) is congruent to 448, modulo 512. That is, the message is extended so that it is just 64 bits shy of being a multiple of 512 bits long. Padding is always performed, even if the length of the message is already congruent to 448, modulo 512 (in which case 512 bits of padding are added). Padding is performed as follows: a single "1" bit is appended to the message, and then enough zero bits are appended so that the length in bits of the padded message becomes congruent to 448, modulo 512.

Step 2. Append length
A 64-bit representation of b (the length of the message before the padding bits were added) is appended to the result of the previous step. In the unlikely event that b is greater than 2^64, then only the low-order 64 bits of b are used. (These bits are appended as two 32-bit words and appended low-order word first in accordance with the previous conventions.) At this point the resulting message (after padding with bits and with b) has a length that is an exact multiple of 512 bits. Equivalently, this message has a length that is an exact multiple of 16 (32-bit) words. Let M[0 ... N-1] denote the words of the resulting message, where N is a multiple of 16.

Step 3. Initialize MD buffer
A 4-word buffer (A,B,C,D) is used to compute the message digest. Here each of A,B,C,D is a 32-bit register. These registers are initialized to the following values (in hexadecimal, low-order bytes first):

    word A: 01 23 45 67
    word B: 89 ab cd ef
    word C: fe dc ba 98
    word D: 76 54 32 10

Step 4. Process message in 16-word blocks
We first define three auxiliary functions that each take as input three 32-bit words and produce as output one 32-bit word.

    f(X,Y,Z) = XY v not(X)Z
    g(X,Y,Z) = XY v XZ v YZ
    h(X,Y,Z) = X xor Y xor Z

In each bit position f acts as a conditional: if x then y else z. (The function f could have been defined using + instead of v since XY and not(X)Z will never have 1's in the same bit position.) In each bit position g acts as a majority function: if at least two of x, y, z are on, then g has a one in that bit position, else g has a zero. It is interesting to note that if the bits of X, Y, and Z are independent and unbiased, then each bit of f(X,Y,Z) will be independent and unbiased, and similarly each bit of g(X,Y,Z) will be independent and unbiased. The function h is the bit-wise "xor" or "parity" function; it has properties similar to those of f and g.

Do the following:

    For i = 0 to N/16-1 do /* process each 16-word block */
        For j = 0 to 15 do: /* copy block i into X */
            Set X[j] to M[i*16+j].
        end /* of loop on j */
        Save A as AA, B as BB, C as CC, and D as DD.

        [Round 1]
        Let [A B C D i s] denote the operation
            A = (A + f(B,C,D) + X[i]) <<< s .
        Do the following 16 operations:
            [A B C D 0 3]   [D A B C 1 7]   [C D A B 2 11]   [B C D A 3 19]
            [A B C D 4 3]   [D A B C 5 7]   [C D A B 6 11]   [B C D A 7 19]
            [A B C D 8 3]   [D A B C 9 7]   [C D A B 10 11]  [B C D A 11 19]
            [A B C D 12 3]  [D A B C 13 7]  [C D A B 14 11]  [B C D A 15 19]

        [Round 2]
        Let [A B C D i s] denote the operation
            A = (A + g(B,C,D) + X[i] + 5A827999) <<< s .
        (The value 5A..99 is a hexadecimal 32-bit constant, written with the high-order digit first. This constant represents the square root of 2. The octal value of this constant is 013240474631. See Knuth, The Art of Computer Programming, Volume 2 (Seminumerical Algorithms), Second Edition (1981), Addison-Wesley. Table 2, page 660.)
        Do the following 16 operations:
            [A B C D 0 3]   [D A B C 4 5]   [C D A B 8 9]    [B C D A 12 13]
            [A B C D 1 3]   [D A B C 5 5]   [C D A B 9 9]    [B C D A 13 13]
            [A B C D 2 3]   [D A B C 6 5]   [C D A B 10 9]   [B C D A 14 13]
            [A B C D 3 3]   [D A B C 7 5]   [C D A B 11 9]   [B C D A 15 13]

        [Round 3]
        Let [A B C D i s] denote the operation
            A = (A + h(B,C,D) + X[i] + 6ED9EBA1) <<< s .
        (The value 6E..A1 is a hexadecimal 32-bit constant, written with the high-order digit first. This constant represents the square root of 3. The octal value of this constant is 015666365641. See Knuth, The Art of Computer Programming, Volume 2 (Seminumerical Algorithms), Second Edition (1981), Addison-Wesley. Table 2, page 660.)
        Do the following 16 operations:
            [A B C D 0 3]   [D A B C 8 9]   [C D A B 4 11]   [B C D A 12 15]
            [A B C D 2 3]   [D A B C 10 9]  [C D A B 6 11]   [B C D A 14 15]
            [A B C D 1 3]   [D A B C 9 9]   [C D A B 5 11]   [B C D A 13 15]
            [A B C D 3 3]   [D A B C 11 9]  [C D A B 7 11]   [B C D A 15 15]

        Then perform the following additions:
            A = A + AA
            B = B + BB
            C = C + CC
            D = D + DD
        (That is, each of the four registers is incremented by the value it had before this block was started.)
    end /* of loop on i */

Step 5. Output
The message digest produced as output is A,B,C,D. That is, we begin with the low-order byte of A, and end with the high-order byte of D.
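The five steps can be put together in a compact sketch. It assumes byte-aligned input, and the table-driven layout (the ROUNDS list and the rotating register index) is a choice made for this illustration rather than the structure of any reference implementation.

```python
import struct

MASK = 0xFFFFFFFF


def rotl(x, s):
    """Rotate the 32-bit value x left by s bits (the <<< operation)."""
    return ((x << s) | (x >> (32 - s))) & MASK


def f(x, y, z): return (x & y) | (~x & z)           # conditional
def g(x, y, z): return (x & y) | (x & z) | (y & z)  # majority
def h(x, y, z): return x ^ y ^ z                    # parity


# (function, additive constant, order of the X[i] indices, shift amounts)
ROUNDS = [
    (f, 0x00000000, list(range(16)), (3, 7, 11, 19)),
    (g, 0x5A827999, [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15],
     (3, 5, 9, 13)),
    (h, 0x6ED9EBA1, [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15],
     (3, 9, 11, 15)),
]


def md4(message: bytes) -> bytes:
    # Steps 1 and 2: padding bits, then the 64-bit length, low-order first.
    padded = message + b"\x80" + bytes((55 - len(message)) % 64)
    padded += struct.pack("<Q", (8 * len(message)) % 2**64)
    # Step 3: initialize the buffer (A, B, C, D).
    state = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476]
    # Step 4: process each 16-word block.
    for offset in range(0, len(padded), 64):
        X = struct.unpack("<16I", padded[offset:offset + 64])
        r = list(state)  # working copy; `state` keeps AA, BB, CC, DD
        for func, const, order, shifts in ROUNDS:
            for t, k in enumerate(order):
                i = (-t) % 4  # register updated: A, D, C, B, A, ...
                total = r[i] + func(r[(i + 1) % 4], r[(i + 2) % 4],
                                    r[(i + 3) % 4]) + X[k] + const
                r[i] = rotl(total & MASK, shifts[t % 4])
        state = [(old + new) & MASK for old, new in zip(state, r)]
    # Step 5: output A, B, C, D, low-order byte of A first.
    return struct.pack("<4I", *state)


print(md4(b"abc").hex())
```

The register index i cycles through A, D, C, B, which reproduces the operand rotation in the operation lists ([A B C D ...], [D A B C ...], [C D A B ...], [B C D A ...]).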

Md5 Algorithm Description
We begin by supposing that we have a b-bit message as input, and that we wish to find its message digest. Here b is an arbitrary nonnegative integer; b may be zero, it need not be a multiple of eight, and it may be arbitrarily large. We imagine the bits of the message written down as follows:

m_0 m_1 ... m_{b-1}

The following five steps are performed to compute the message digest of the message.

3.1 Step 1. Append Padding Bits
The message is "padded" (extended) so that its length (in bits) is congruent to 448, modulo 512. That is, the message is extended so that it is just 64 bits shy of being a multiple of 512 bits long. Padding is always performed, even if the length of the message is already congruent to 448, modulo 512. Padding is performed as follows: a single "1" bit is appended to the message, and then "0" bits are appended so that the length in bits of the padded message becomes congruent to 448, modulo 512. In all, at least one bit and at most 512 bits are appended.

3.2 Step 2. Append Length
A 64-bit representation of b (the length of the message before the padding bits were added) is appended to the result of the previous step. In the unlikely event that b is greater than 2^64, then only the low-order 64 bits of b are used. (These bits are appended as two 32-bit words and appended low-order word first in accordance with the previous conventions.) At this point the resulting message (after padding with bits and with b) has a length that is an exact multiple of 512 bits. Equivalently, this message has a length that is an exact multiple of 16 (32-bit) words. Let M[0 ... N-1] denote the words of the resulting message, where N is a multiple of 16.

3.3 Step 3. Initialize MD Buffer
A four-word buffer (A,B,C,D) is used to compute the message digest. Here each of A, B, C, D is a 32-bit register. These registers are initialized to the following values (in hexadecimal, low-order bytes first):

word A: 01 23 45 67
word B: 89 ab cd ef
word C: fe dc ba 98
word D: 76 54 32 10
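Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration under one simplifying assumption: the message consists of whole bytes, so b is a multiple of eight, although the description above allows any bit length. The names md5_pad and INITIAL_STATE are hypothetical.

```python
import struct


def md5_pad(message: bytes) -> bytes:
    """Steps 1 and 2: append padding bits, then the 64-bit message length."""
    # Keep only the low-order 64 bits of the length in bits.
    bit_length = (8 * len(message)) & 0xFFFFFFFFFFFFFFFF
    # A single "1" bit followed by seven "0" bits is the byte 0x80.
    padded = message + b"\x80"
    # Append "0" bytes until the length is congruent to 56 modulo 64 bytes,
    # that is, 448 modulo 512 bits.
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    # Low-order word first, low-order byte first: a little-endian 64-bit value.
    padded += struct.pack("<Q", bit_length)
    return padded


# Step 3: the register bytes listed above, read low-order byte first,
# give the 32-bit words A, B, C, D.
INITIAL_STATE = struct.unpack(
    "<4I", bytes.fromhex("0123456789abcdeffedcba9876543210")
)
```

For example, a 55-byte message is padded to exactly one 64-byte block, while a 56-byte message needs two blocks, because padding is always performed. Reading the register bytes low-order first yields A = 0x67452301, B = 0xefcdab89, C = 0x98badcfe and D = 0x10325476.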

3.4 Step 4. Process Message in 16-Word Blocks
We first define four auxiliary functions that each take as input three 32-bit words and produce as output one 32-bit word.

F(X,Y,Z) = XY v not(X) Z
G(X,Y,Z) = XZ v Y not(Z)
H(X,Y,Z) = X xor Y xor Z
I(X,Y,Z) = Y xor (X v not(Z))
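As a minimal sketch, the four definitions above can be written as Python functions on 32-bit words. Python integers are unbounded, so the complement is masked to 32 bits; the constant MASK32 is an assumption of this illustration and not part of the original description.

```python
MASK32 = 0xFFFFFFFF


def F(x, y, z):
    """XY v not(X) Z: in each bit position, if x then y else z."""
    return (x & y) | (~x & MASK32 & z)


def G(x, y, z):
    """XZ v Y not(Z): in each bit position, if z then x else y."""
    return (x & z) | (y & ~z & MASK32)


def H(x, y, z):
    """X xor Y xor Z: bitwise parity of the three inputs."""
    return x ^ y ^ z


def I(x, y, z):
    """Y xor (X v not(Z))."""
    return y ^ (x | (~z & MASK32))
```

A quick way to see the selector behaviour of F is to fix x: F(0xFFFFFFFF, y, z) returns y and F(0, y, z) returns z for any 32-bit y and z.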

In each bit position F acts as a conditional: if X then Y else Z. (The function F could have been defined using + instead of v since XY and not(X)Z will never have 1's in the same bit position.) It is interesting to note that if the bits of X, Y, and Z are independent and unbiased, then each bit of F(X,Y,Z) will be independent and unbiased.

The functions G, H, and I are similar to the function F, in that they act in "bitwise parallel" to produce their output from the bits of X, Y, and Z, in such a manner that if the corresponding bits of X, Y, and Z are independent and unbiased, then each bit of G(X,Y,Z), H(X,Y,Z), and I(X,Y,Z) will be independent and unbiased. Note that the function H is the bit-wise "xor" or "parity" function of its inputs.

This step uses a 64-element table T[1 ... 64] constructed from the sine function. Let T[i] denote the i-th element of the table, which is equal to the integer part of 4294967296 times abs(sin(i)), where i is in radians. The elements of the table are given in the appendix.

Do the following:

/* Process each 16-word block. */

For i = 0 to N/16-1 do

  /* Copy block i into X. */
  For j = 0 to 15 do
    Set X[j] to M[i*16+j].
  end /* of loop on j */

  /* Save A as AA, B as BB, C as CC, and D as DD. */
  AA = A
  BB = B
  CC = C
  DD = D

  /* Round 1. */
  /* Let [abcd k s i] denote the operation
       a = b + ((a + F(b,c,d) + X[k] + T[i]) <<< s). */
  /* Do the following 16 operations. */
  [ABCD  0  7  1]  [DABC  1 12  2]  [CDAB  2 17  3]  [BCDA  3 22  4]
  [ABCD  4  7  5]  [DABC  5 12  6]  [CDAB  6 17  7]  [BCDA  7 22  8]
  [ABCD  8  7  9]  [DABC  9 12 10]  [CDAB 10 17 11]  [BCDA 11 22 12]
  [ABCD 12  7 13]  [DABC 13 12 14]  [CDAB 14 17 15]  [BCDA 15 22 16]

  /* Round 2. */
  /* Let [abcd k s i] denote the operation
       a = b + ((a + G(b,c,d) + X[k] + T[i]) <<< s). */
  /* Do the following 16 operations. */
  [ABCD  1  5 17]  [DABC  6  9 18]  [CDAB 11 14 19]  [BCDA  0 20 20]
  [ABCD  5  5 21]  [DABC 10  9 22]  [CDAB 15 14 23]  [BCDA  4 20 24]
  [ABCD  9  5 25]  [DABC 14  9 26]  [CDAB  3 14 27]  [BCDA  8 20 28]
  [ABCD 13  5 29]  [DABC  2  9 30]  [CDAB  7 14 31]  [BCDA 12 20 32]

  /* Round 3. */
  /* Let [abcd k s i] denote the operation
       a = b + ((a + H(b,c,d) + X[k] + T[i]) <<< s). */
  /* Do the following 16 operations. */
  [ABCD  5  4 33]  [DABC  8 11 34]  [CDAB 11 16 35]  [BCDA 14 23 36]
  [ABCD  1  4 37]  [DABC  4 11 38]  [CDAB  7 16 39]  [BCDA 10 23 40]
  [ABCD 13  4 41]  [DABC  0 11 42]  [CDAB  3 16 43]  [BCDA  6 23 44]
  [ABCD  9  4 45]  [DABC 12 11 46]  [CDAB 15 16 47]  [BCDA  2 23 48]

  /* Round 4. */
  /* Let [abcd k s i] denote the operation
       a = b + ((a + I(b,c,d) + X[k] + T[i]) <<< s). */
  /* Do the following 16 operations. */
  [ABCD  0  6 49]  [DABC  7 10 50]  [CDAB 14 15 51]  [BCDA  5 21 52]
  [ABCD 12  6 53]  [DABC  3 10 54]  [CDAB 10 15 55]  [BCDA  1 21 56]
  [ABCD  8  6 57]  [DABC 15 10 58]  [CDAB  6 15 59]  [BCDA 13 21 60]
  [ABCD  4  6 61]  [DABC 11 10 62]  [CDAB  2 15 63]  [BCDA  9 21 64]

  /* Then perform the following additions. (That is increment each
     of the four registers by the value it had before this block
     was started.) */
  A = A + AA
  B = B + BB
  C = C + CC
  D = D + DD

end /* of loop on i */

3.5 Step 5. Output
The message digest produced as output is A, B, C, D. That is, we begin with the low-order byte of A, and end with the high-order byte of D.

This completes the description of MD5. A reference implementation in C is given in the appendix.

4. Summary
The MD5 message-digest algorithm is simple to implement, and provides a "fingerprint" or message digest of a message of arbitrary length. It is conjectured that the difficulty of coming up with two messages having the same message digest is on the order of 2^64 operations, and that the difficulty of coming up with any message having a given message digest is on the order of 2^128 operations. The MD5 algorithm has been carefully scrutinized for weaknesses. It is, however, a relatively new algorithm and further security analysis is of course justified, as is the case with any new proposal of this sort.
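The five steps of the MD5 description can be combined into one compact Python sketch. It is an illustration under stated assumptions, not the reference implementation mentioned above: the input is taken to be whole bytes, the table T is computed from the sine function with floating-point arithmetic instead of being read from the appendix, and the word index k of each operation is derived from small modular formulas that reproduce the k columns of the four round tables. The names md5_digest, SHIFTS and rotl32 are hypothetical.

```python
import math
import struct

MASK32 = 0xFFFFFFFF

# T[i] = integer part of 4294967296 * abs(sin(i)), i in radians, i = 1..64.
# The list is 0-based, so T[step] here corresponds to T[step + 1] in the text.
T = [int(4294967296 * abs(math.sin(i))) for i in range(1, 65)]

# Rotation amounts s; each round repeats its four values four times.
SHIFTS = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 +
          [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)


def rotl32(value, shift):
    """Rotate a 32-bit word left by `shift` bit positions (the <<< operator)."""
    value &= MASK32
    return ((value << shift) | (value >> (32 - shift))) & MASK32


def md5_digest(message: bytes) -> bytes:
    # Steps 1 and 2: append padding bits and the 64-bit length.
    data = message + b"\x80"
    data += b"\x00" * ((56 - len(data) % 64) % 64)
    data += struct.pack("<Q", (8 * len(message)) & 0xFFFFFFFFFFFFFFFF)

    # Step 3: initialize the MD buffer.
    a, b, c, d = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476

    # Step 4: process the message in 16-word blocks.
    for offset in range(0, len(data), 64):
        x = struct.unpack("<16I", data[offset:offset + 64])
        aa, bb, cc, dd = a, b, c, d
        for step in range(64):
            if step < 16:      # Round 1 uses F
                f, k = (b & c) | (~b & d), step
            elif step < 32:    # Round 2 uses G
                f, k = (b & d) | (c & ~d), (5 * step + 1) % 16
            elif step < 48:    # Round 3 uses H
                f, k = b ^ c ^ d, (3 * step + 5) % 16
            else:              # Round 4 uses I
                f, k = c ^ (b | (~d & MASK32)), (7 * step) % 16
            a = (b + rotl32(a + f + x[k] + T[step], SHIFTS[step])) & MASK32
            # Rotate register roles: [ABCD] -> [DABC] -> [CDAB] -> [BCDA].
            a, b, c, d = d, a, b, c
        a, b, c, d = ((a + aa) & MASK32, (b + bb) & MASK32,
                      (c + cc) & MASK32, (d + dd) & MASK32)

    # Step 5: output A, B, C, D, low-order byte of A first.
    return struct.pack("<4I", a, b, c, d)
```

Calling md5_digest(b"").hex() should give d41d8cd98f00b204e9800998ecf8427e, the published MD5 digest of the empty message. A loop over 64 steps was chosen here for brevity; writing out the 64 operations exactly as tabulated would be closer to the text but much longer.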