

Generic Parallel Cryptography for Hashing Schemes


Kévin Atighehchi, Traian Muntean
ERISCS Research Group; Aix-Marseille University, France

ABSTRACT

We emphasize that future secure communicating systems, secured mass storage and access policies will require efficient and scalable security algorithms and protocols. Moreover, parallelism will be used at quite low-level basic implementations of software or hardware mechanisms for efficient support of cryptographic algorithms.

In this paper we concentrate on a family of generic schemes for efficient implementation of tree-based hash functions. The main reason for designing a parallel algorithm based on a hash tree scheme is to obtain optimal performance when dealing with critical applications which can require tuned implementations for security aspects on multi-core target processors. In a previous paper [2] we proposed an optimal parallel implementation for SKEIN, one of the SHA-3 finalists [1]. In this paper, we generalize to a case study of different hash tree schemes from the standpoint of efficiency, but also with the goal of optimizing the number of processors or cores required. Then we study scheduling strategies for a fixed number of threads, again using the example of the hash scheme proposed by the SKEIN authors. Namely, we are interested in multithreaded and scalable parallel implementations, with particular attention to memory usage issues.

KEY WORDS
network security, parallel cryptographic algorithms, Merkle hash trees.

I. INTRODUCTION

New families of cryptographic hash functions are required for securing future networking applications. As presented in a previous initial work [[ ]], SKEIN [6], for instance, is an example of such a family, and has been one of the finalist candidates of the SHA-3 competition [12]. The SHA-3 project, announced in November 2007, was motivated by the collision attacks on commonly used hash algorithms (MD5 and SHA-1). Another proposal than SKEIN, namely KECCAK [1], has lately been designated as the winner of the SHA-3 competition; nevertheless, we again use the SKEIN family of hashing functions in this paper, as it provides a more generic example for parallelizing hash trees.

This paper describes various tree hashing schemes, in particular a new one which allows the number of processors to be reduced to 2^(⌊log2 n⌋−1) at the cost of a few additional steps. We also recall a well-known method, independent of the tree scheme used, to reduce this number significantly. Then, as an example, we address the parallelism of the hash tree mode of SKEIN, initially discussed in [2]. All these implementations are scalable and we evaluate them here by proposing different scheduling strategies. We also insist on the size of the buffers used, and we can thus observe the impact of the overhead introduced by the synchronisation delays. The motivation for designing efficient parallel algorithms based on a hash tree scheme is to obtain optimal performance when dealing with critical applications which require tuned implementations for security aspects on multi-core processors.

The paper is structured as follows. Section 2 presents tree hashing schemes and a method for reducing the number of processing units required. Section 3 gives a brief description of SKEIN and a detailed description of the associated Tree Mode, then Section 4 presents the potential approaches for parallelism. Finally, we present a Java implementation, testing and first elements of performance evaluation for basic multi-core platforms. Conclusions and future work towards a methodology for parallel cryptographic schemes are also given.

II. GENERIC PARALLEL HASHING SCHEMES

Additional properties and features have been required of any SHA-3 proposed method: it must be parallelisable, more suitable for certain classes of applications, more efficient to implement on current platforms, and avoid some of the incidental generic properties (such as length extension) of the Merkle-Damgård construct ([5], [4]) that often result in insecure applications.

When designing a tree hashing scheme, an objective may be to reduce the number of processors needed to reach, or remain close to, the optimal number of steps (⌈log_a n⌉ for a tree of arity a, where n is the number of blocks of the message). The first subsection begins by briefly recalling the Damgård tree hashing scheme ([5]) and the minimal number of processors needed to reach the potential speedup. Then we present some other schemes which are better able to achieve this goal. The last subsection shows a method to reduce this number drastically, whatever the scheme.

Below we denote by f : {0,1}^(2N) → {0,1}^N the compression function. For simplicity, we consider a tree of arity 2, even if what follows can be transcribed for a tree of any arity. For clarity, the tree hashing schemes presented hereafter are free of the finalisation stage, which typically relies on a last application of the compression function to process the root node and the length of the message, in order to avoid possible length extension attacks.

A. Suitable Optimal Tree Hashing Schemes

Let us observe that existing schemes, such as the Damgård construct, are not optimal. Indeed, let M be the message to be hashed. Let us append to it a number of 0s such that its length is a power of two blocks of size N. The resulting message M'_0 = M'_{0,0} || M'_{0,1} || ... || M'_{0,n−1}, with n = 2^k for some k, is processed iteratively by using the following rules:
1) M_{i+1} = ||_{j=0}^{n/2−1} f(M_{i,2j} || M_{i,2j+1}).
2) If M_{i+1} is longer than N then split it into blocks of size N, set n to the corresponding number of blocks and return to the first rule, else return M_{i+1}.

The minimal number ⌈log2 n⌉ of steps seems to be obtained with 2^(⌈log2 n⌉−1) processors, although there exists a way to reduce this number by processing the largest full subtree that spans zeroed blocks with only one processor. Nevertheless, if a counter is used as input of the compression function in order to distinguish the tree nodes, this optimization does not work anymore.

We present in this section a first scheme (see [3], [7], [11], [10]) which preserves an optimal number ⌈log2 n⌉ of steps and an optimal number of instances of the compression function while decreasing the number of processors to 2^(⌊log2 n⌋−1) in certain cases. Finally, we present a new scheme which allows the same number of processors to be kept whatever the length of the message, for a number of steps which will not exceed ⌊log2 n⌋ + ⌈log2⌈log2 n⌉⌉.

Scheme 1: Let M_0 = M_{0,0} || M_{0,1} || ... || M_{0,n−1} be the message to be hashed, where each block M_{0,j} is of size N, except the last which may be smaller. The tree is defined iteratively. For any level i = 1, 2, ... we use the following rules until obtaining the root M_r of size N:

1) If n is even then
   M_{i+1} = ||_{j=0}^{n/2−1} f(M_{i,2j} || M_{i,2j+1}),
   else
   M_{i+1} = ( ||_{j=0}^{(n−1)/2−1} f(M_{i,2j} || M_{i,2j+1}) ) || M_{i,n−1}.
2) If M_{i+1} is longer than N then split it into blocks of size N, set n to the corresponding number of blocks and return to the first rule, else return M_{i+1}.

Scheme 2: Let L = {} be an empty message and M_0 = M_{0,0} || M_{0,1} || ... || M_{0,n−1} be the message to be hashed, where each block M_{0,j} is of size N, except the last which may be smaller. This method runs in two phases:

1) The intermediate message L is defined iteratively. We use the following rules:
   a) If n is even then
      M_{i+1} = ||_{j=0}^{n/2−1} f(M_{i,2j} || M_{i,2j+1}),
      else
      M_{i+1} = ||_{j=0}^{(n−1)/2−1} f(M_{i,2j} || M_{i,2j+1}),   L = M_{i,n−1} || L.
   b) If M_{i+1} is longer than N then split it into blocks of size N, set n to the corresponding number of blocks and return to rule (a), else set L = M_{i+1} || L and run the second phase.
2) Apply Scheme 1 to the produced message L.

Scheme 1 and Scheme 2 differ only in how the roots of the successive full subtrees are combined (see Figure 1). The number of steps of the hashing Scheme 2 can be estimated in the following way: the height of the largest full tree produced in phase 1 is ⌊log2 n⌋, whereas the height of the tree produced over the message L in phase 2 is at most ⌈log2⌈log2 n⌉⌉.

Fig. 1: Differences between scheme 1 and scheme 2
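As a concrete reading of Scheme 1, the following minimal Java sketch iterates level by level for a tree of arity 2. The Compression interface and the class name are ours, and the pairwise loop is written sequentially here, although within a level each call of f is independent and can be run on its own processor.

import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of Scheme 1 for a tree of arity 2. The compression
 * function f (2N bytes -> N bytes) is abstract; any concrete function
 * can be plugged in. Names are illustrative only.
 */
public final class Scheme1 {

    public interface Compression {
        byte[] f(byte[] left, byte[] right); // maps 2N bytes to N bytes
    }

    /** blocks = M_{0,0} .. M_{0,n-1}; returns the root M_r. */
    public static byte[] hashTree(List<byte[]> blocks, Compression c) {
        List<byte[]> level = blocks;
        while (level.size() > 1) {                  // one iteration = one tree level
            int n = level.size();
            List<byte[]> next = new ArrayList<>();
            for (int j = 0; j + 1 < n; j += 2) {    // pair (M_{i,2j}, M_{i,2j+1})
                next.add(c.f(level.get(j), level.get(j + 1)));
            }
            if (n % 2 == 1) {                       // odd n: the last block is carried up unchanged
                next.add(level.get(n - 1));
            }
            level = next;                           // the concatenation of results forms M_{i+1}
        }
        return level.get(0);
    }
}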

B. A New Hash Tree Mode

The structure of the SKEIN algorithm (Unique Block Iteration, UBI chaining) is a variant of the Matyas-Meyer-Oseas mode. SKEIN takes a variable-length string of characters as its input and produces an output string of arbitrary length (it can generate long outputs by using the Threefish block cipher in counter mode). SKEIN also has a configuration block that is processed before any other block. The hash functions in the SKEIN family use three different sizes for their internal state: 256, 512 and 1024 bits:
• SKEIN-512: the primary proposal; it should remain secure for the foreseeable future.
• SKEIN-1024: the ultra-conservative variant. If some future attack managed to break Skein-512, it should remain secure. Also, with dedicated hardware it can run twice as fast as Skein-512.
• SKEIN-256: the low-memory variant. It can be implemented using about 100 bytes of memory.

SKEIN uses Threefish as a tweakable block cipher in a UBI chaining mode to build a compression function that maps an arbitrary input size to a fixed output size.

The core of Threefish is a non-linear mixing function called MIX that operates on two 64-bit words. The cipher repeats operations on a block for a certain number of rounds (72 for Threefish-256 and Threefish-512, 80 for Threefish-1024), each of these rounds being composed of a certain number of MIX functions (2 for Threefish-256, 4 for Threefish-512 and 8
for Threefish-1024) followed by a permutation. A subkey is injected every four rounds. For a parallel implementation, each MIX operation could be assigned to one thread since, for a given round, they operate on different 128-bit blocks. In theory, we could achieve a maximum speedup of 2 with Threefish-256, 4 with Threefish-512 and 8 with Threefish-1024 provided that, at each round, waiting times (for instance for scheduling) between threads are negligible, a permutation being performed at the end of each round. Examples of hardware implementations are presented in [9], [13].

SKEIN is built with three basic elements: the block cipher (Threefish), the UBI, and an argument (containing a configuration block and optional arguments). The configuration block is mainly used for tree hashing, while optional arguments make it possible to create different hash functions for different purposes, all based on SKEIN.

SKEIN can work in two modes of operation, which are built on chaining the UBI operations:
• Simple hash: takes a variable-sized input and returns the corresponding hash. It is a simple and reduced version of the full SKEIN mode. For instance, with a hash process where the desired output size is equal to the internal state size, it consists of three chained UBI functions: the first processes the configuration string, the second the message and the last is used to supply the output.
• Full SKEIN: the general form of SKEIN admits key processing, tree hashing and optional arguments (for example, personalization string, public key, key identifier, nonce, and so on). The tree mode replaces the single UBI call which processes the message by a tree of UBI calls.

The result of the last UBI call (the root UBI call in the case of tree processing) is an input to the Output function which generates a hash of the desired size.

The following sections recall the two modes of SKEIN intended to be widely used, the Simple Hash mode and the Hash Tree mode, the latter being specifically designed for parallel implementations (see [6] for more information).

1) Hash Tree Mode: Notations

A simple SKEIN hash computation has the following inputs:
Nb  The internal state size, in bytes (32, 64 or 128).
No  The output size, in bits.
M   The message to be hashed, a string of up to 2^99 − 8 bits (2^96 − 1 bytes).

Tree processing varies according to the following input parameters:
Yl  The leaf size encoding. The size of each leaf of the tree is Nl = Nb·2^Yl bytes, with Yl ≥ 1 (where Nb is the size of the internal state of Skein).
Yf  The fan-out encoding. The fan-out of a tree node is 2^Yf, with Yf ≥ 1. The size of each node is Nn = Nb·2^Yf.
Ym  The maximum tree height; Ym ≥ 2. If the height of the tree is not limited, this parameter is set to 255.
G0  The input chaining value, i.e. the output of the previous UBI function.

Fig. 2: Tree hashing with Yl = Yf = 1

We define the leaf size Nl = Nb·2^Yl and the node size Nn = Nb·2^Yf.

Remarks: UBI is a chaining mode for the Threefish cipher, so there is no underlying parallelism other than what can be obtained within the Threefish block encryption, as explained above.

We first split the message M into one or more message blocks M_{0,0}, M_{0,1}, ..., M_{0,k−1}, each of size Nl bytes except the last, which may be smaller. We now define the first level of tree hashing by:

M_1 = ||_{i=0}^{k−1} UBI(G_0, M_{0,i}, i·Nl + 1·2^112 + Tmsg·2^120)

The rest of the tree is defined iteratively. For any level l = 1, 2, ... we use the following rules:
1) If M_l has length Nb, then the result G_1 is defined by G_1 = M_l.
2) If M_l is longer than Nb bytes and l = Ym − 1, then we have almost reached the maximum tree height. The result is then defined by:

   G_1 = UBI(G_0, M_l, Ym·2^112 + Tmsg·2^120)

3) If neither of these conditions holds, we create the next tree level. We split M_l into blocks M_{l,0}, M_{l,1}, ..., M_{l,k−1}, where all blocks are of size Nn, except the last which may be smaller. We then define:

   M_{l+1} = ||_{i=0}^{k−1} UBI(G_0, M_{l,i}, i·Nn + (l+1)·2^112 + Tmsg·2^120)

   and apply the above rules to M_{l+1} again.

The result G_1 is then the chaining input to the output transformation H := Output(G_1, No), where H is the result of the hash.
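The sketch below illustrates how one tree level could be computed from the previous one, following the formulas above. The Ubi interface, the class name and the way the 128-bit starting tweak is assembled are our own assumptions (a real implementation would call the UBI routine of a Skein class); only the tweak construction and the level iteration are shown. For the first level the caller passes block size Nl and level field 1, for the following levels Nn and l + 1.

import java.math.BigInteger;
import java.util.Arrays;

/**
 * Sketch of one level pass of the tree mode described above.
 * ubi() is a placeholder for a real UBI implementation.
 */
public final class TreeModeSketch {

    public interface Ubi {
        /** UBI(G, M, Ts): chaining value G, message block M, 128-bit starting tweak Ts. */
        byte[] ubi(byte[] g, byte[] m, BigInteger tweak);
    }

    private static BigInteger tweak(long position, int levelField, int typeMsg) {
        // Ts = position + level * 2^112 + Tmsg * 2^120, as in the formulas above
        return BigInteger.valueOf(position)
                .add(BigInteger.valueOf(levelField).shiftLeft(112))
                .add(BigInteger.valueOf(typeMsg).shiftLeft(120));
    }

    /** Computes M_{l+1} (or M_1) from the byte string ml of the level below. */
    static byte[] nextLevel(Ubi ubi, byte[] g0, byte[] ml,
                            int blockSize, int levelField, int typeMsg, int nb) {
        int k = (ml.length + blockSize - 1) / blockSize;   // number of blocks, last may be shorter
        byte[] out = new byte[k * nb];                     // each UBI output is Nb bytes
        for (int i = 0; i < k; i++) {
            int from = i * blockSize;
            int to = Math.min(from + blockSize, ml.length);
            byte[] block = Arrays.copyOfRange(ml, from, to);
            byte[] node = ubi.ubi(g0, block, tweak((long) i * blockSize, levelField, typeMsg));
            System.arraycopy(node, 0, out, i * nb, nb);    // concatenation of the k node values
        }
        return out;
    }
}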
2) Approaches for Parallelism: Notations used

In the following sections, we denote by n the number of blocks^1 of the message and by Nt the number of threads. These threads are then indexed 0, 1, ..., Nt − 1.

^1 When it is not specified, the blocks are of size Nb bytes and we include the last block, which can be of size less than or equal to Nb.
We assume that k_0 = n and that k_1 = k = ⌈n/2^Yl⌉ is the number of Nb-sized blocks of level 1. We define a recursive sequence starting at an initial value k_2 by

k_2 = ⌈k/2^Yf⌉   and   k_i = ⌈k_{i−1}/2^Yf⌉.

For i > 0 we can rewrite the sequence of k_i as

k_i = ⌈n / 2^(Yl+(i−1)·Yf)⌉.

There exists a smallest index v for which k_v = 1, namely v = ⌈log_{2^Yf}(k)⌉ + 1. The tree height is then p = min(v, Ym). The byte string produced at level i of the tree (except the base level i = 0) can be split into k_i blocks M_{i,0}, M_{i,1}, M_{i,2}, ..., M_{i,k_i−1} of size Nb.
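As a worked example, take the same parameters as the example of Figure 7 later on (n = 17, Yl = Yf = 1, arity 2) and the default Ym = 255:

k_1 = ⌈17/2⌉ = 9,   k_2 = ⌈17/4⌉ = 5,   k_3 = ⌈17/8⌉ = 3,   k_4 = ⌈17/16⌉ = 2,   k_5 = ⌈17/32⌉ = 1,

so v = ⌈log2(9)⌉ + 1 = 5 and the tree height is p = min(5, 255) = 5.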
Sequential implementation

The straightforward method would consist of implementing this algorithm as it is described in its specifications. This implementation constitutes a scheduling method for the node processing that we call Lower level and leftmost node first (or Lower level node first for short). Such an implementation has the drawback of consuming quite a lot of memory. For instance, if we take Yl = 1, we need an amount of available memory space of up to half of the message size, which may be impossible for long messages. There is an effective algorithm (see [8]) which computes the value of a node of height h while storing only up to h + 1 hash values. The idea is to compute a new parent hash value as soon as possible, before continuing to compute the lower-level node hash values; we call this method higher level node first. The interest of this method, which maintains a stack storing the intermediate values, is to rapidly discard those that are no longer needed. This stack, which is initially empty, is used as follows: we push leaf values one by one from left to right and we check at each step whether or not the last two values on the stack are of the same height. If such is the case, these last two values are popped out and the parent hash value is computed and pushed onto the stack; otherwise we continue to push a leaf value, and so on. Note that we could use a two-hash-sized buffer at each level (from 1 to h) instead of a unique stack, even though it is useless in such a sequential implementation. This algorithm can be applied to SKEIN trees with the condition that we include a special termination round, since they are not necessarily full trees^2 (as we can see in Figure 2). In the case of a not constrained SKEIN tree (with a depth lower than Ym), the memory consumption does not exceed (h − 1)(2^Yf − 1) + 2^Yl blocks of size Nb for the computation of a node of height h. For a constrained SKEIN tree, this consumption stays the same, except for the computation of a node of height Ym − 1, where it can reach k_{Ym−1} + (Ym − 2)(2^Yf − 1) blocks.

^2 A full tree, in our context, is a tree whose nodes of the first level all have 2^Yl children and whose other nodes all have 2^Yf children.
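A minimal Java sketch of this stack-based traversal (from [8]), written for a full binary tree as in Section II-A, is given below. The Compression interface and class name are ours; a SKEIN tree additionally needs the special termination round mentioned above to handle non-full trees.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

/**
 * Stack-based ("higher level node first") traversal: at most h + 1
 * intermediate values are stored for a full binary tree of height h.
 */
public final class StackTreeHash {

    public interface Compression {
        byte[] f(byte[] left, byte[] right);
    }

    private static final class Entry {
        final byte[] value;
        final int height;
        Entry(byte[] value, int height) { this.value = value; this.height = height; }
    }

    /** Leaves are consumed from left to right; returns the root value of a full binary tree. */
    public static byte[] hash(Iterator<byte[]> leaves, Compression c) {
        Deque<Entry> stack = new ArrayDeque<>();
        while (leaves.hasNext()) {
            stack.push(new Entry(leaves.next(), 0));       // push the next leaf
            // while the two topmost values have the same height, replace them by their parent
            while (stack.size() >= 2) {
                Entry right = stack.pop();
                Entry left = stack.pop();
                if (left.height == right.height) {
                    stack.push(new Entry(c.f(left.value, right.value), left.height + 1));
                } else {
                    stack.push(left);
                    stack.push(right);
                    break;
                }
            }
        }
        return stack.pop().value;                          // a single entry remains for a full tree
    }
}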
Approaches for parallelism

Two possible ways to address parallelism can be applied: (i) a deterministic way, in which a thread with index j must take into account, at the current step, the predictable behaviour of the threads 0, 1, 2, ..., j − 1; and (ii) a non-deterministic way, in which the first node whose child values are available is assigned to the first ready thread. The meaning of the term first depends on the strategy adopted to parallelise this algorithm, as described below. The following sections illustrate several methods which could be applied to tree modes of other hash function families with very few changes.

Fig. 3: Lower level node first (Yf = Yl = 1, Ym = 255, Nt = 4)

a) Lower level node priority: This method consists in processing the tree levels successively. It should, in theory, offer the best performance due to the (almost) absence of synchronisations between threads, apart from synchronisations due to dependencies between the worker threads and the main thread which provides the input data. An example is shown in Figure 3, in which a job is indexed as i_j, where i denotes the iteration step and j the index of the assigned thread. If one counts the jobs on each level from left to right, then we can assign a job j to the thread indexed j mod Nt (a minimal sketch is given below). We can also apply this method in a non-deterministic way, in which case the threads process left-to-right nodes on a level in a FCFS (First Come First Served) mode. This method, although intended to get the best performance, also has the drawback of requiring a huge amount of memory, as explained above.

For an implementation, worker threads could wait for each other when the end of a tree level is reached. This does not minimise the number of steps. Indeed, at the same iteration it is possible to assign the last nodes of level i to the first threads and the first nodes of level i + 1 to the last threads (i.e. we are dealing with a step in between two successive levels) if the overall number of nodes from level 1 to level i is not a multiple of Nt. Note that the other methods described hereafter do not seem to offer the opportunity to gain a few steps in order to optimise speed.
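The sketch below processes the tree level by level with a fixed pool of Nt workers; the pool picks up the jobs of a level in FCFS order, which corresponds to the non-deterministic variant above, while the strict j mod Nt assignment would require pinning each job to its own worker queue. The NodeFunction interface and class name are ours, and the node computation itself is abstracted away.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Sketch of "lower level node first" scheduling: all jobs of a level
 * are submitted to a fixed pool of Nt threads and must complete before
 * the next level starts.
 */
public final class LowerLevelFirst {

    public interface NodeFunction {
        byte[] compute(List<byte[]> children);   // e.g. a UBI call over the concatenated children
    }

    public static byte[] hash(List<byte[]> leaves, int fanOut, int nt, NodeFunction node)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nt);
        try {
            List<byte[]> level = leaves;
            while (level.size() > 1) {
                int jobs = (level.size() + fanOut - 1) / fanOut;
                List<Callable<byte[]>> tasks = new ArrayList<>(jobs);
                for (int j = 0; j < jobs; j++) {
                    final int from = j * fanOut;
                    final int to = Math.min(from + fanOut, level.size());
                    final List<byte[]> children = level.subList(from, to);
                    tasks.add(() -> node.compute(children));   // pool workers take jobs in FCFS order
                }
                List<byte[]> next = new ArrayList<>(jobs);
                for (Future<byte[]> f : pool.invokeAll(tasks)) {
                    try {
                        next.add(f.get());
                    } catch (ExecutionException e) {
                        throw new RuntimeException(e.getCause());
                    }
                }
                level = next;                                   // barrier: move to the next level
            }
            return level.get(0);
        } finally {
            pool.shutdown();
        }
    }
}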
Fig. 4: Higher level node first (Yf = Yl = 1, Ym = 255, Nt = 4)

b) Higher level node priority: This method consists in assigning^3 to a thread the highest-level node among all those that may be assigned to it. Applying this method in a deterministic way requires that each thread maintains a vector describing, at each step, the number of nodes that can be processed on each level, taking into account the tasks performed by all threads. Thus, knowing the state of the tree, a thread indexed 0 will choose the highest-level node, a thread indexed 1 will choose the second highest-level node, and so on (an example is shown in Figure 4). Applying this method in a non-deterministic way means that the highest-level node whose child values are available is assigned to the first ready thread. In such a method, a special termination phase for the end of the message is needed in order to allow threads to process nodes of arity lower than expected.

^3 An assignment of a node to a thread means that the thread is responsible for producing the hash value of this node using the hash values of its children.

The advantage of this approach is that it conserves memory at best during the hash process, since nodes are processed as soon as possible. Compared to the strategy presented before, particular attention must be paid to the synchronisation overhead. From Figure 4, we must be careful that thread 3_2 does not produce a digest before thread 3_1 has finished consuming the digest produced by 2_1, forcing these threads to perform a data recopy in order not to lose too much parallelism.

This scheduling method, which must be further studied, seems not easy to implement and it is not clear whether it offers good performance, because of the large number of synchronisation mechanisms required. Therefore a deterministic implementation should be avoided, since the threads might wait for each other uselessly. Finally, note that the total number of steps increases compared to the first scheduling method because of the number of purely sequential steps, which can approach the height of the tree (see, for instance, the right side of the tree in Figure 4). This number of additional steps depends on the configuration of the tree and the number of threads generated. Thus the inherent unbalanced loading between threads of this scheduling approach can induce a performance penalty, a priori negligible.

Fig. 5: Fixed number and same level nodes first (Yf = Yl = 1, Ym = 255, Nt = 4)

c) Priority to a fixed number of nodes of higher level and same level: A third method takes up again the idea of using a stack (see Section II-B2), but applied to an arbitrary number of threads. For Nt threads, at each level we use buffers which can receive Nt·2^Yf blocks of size Nb, except at the base level, where the leaves are the input data buffer blocks. On the same level, these threads have to compute Nt·2^Yf node values in order to move up and compute Nt node values at the next level. Once these Nt node values are computed, the Nt·2^Yf child values are removed. If the current level occupied by the threads is greater than 1 and the lack of resources on the level below prevents them from finishing the computation of the Nt·2^Yf blocks, then they return down to level 1, otherwise they continue, and so on (see Figure 6). In the termination phase for the end of the message, buffer contents of fewer than Nt·2^Yf blocks have to be processed. Furthermore, top levels may need narrower buffers (see Figure 5) and, when the Ym = p parameter constrains the tree, the penultimate level buffer must always have a capacity of k_{p−1} blocks. For a not constrained tree, when a level l is reached, the buffers are not all filled, except one, and the effective consumption does not exceed (l·(2^Yf − 1) + 1)·Nt blocks of size Nb. In fact, this is slightly overvalued because we have to consider the buffers' filling both during the first phase and the termination one. Thus, we have to distinguish the levels such that:
• l ≤ ⌊log_{2^Yf}(⌊k/Nt⌋)⌋, for which a buffer of Nt·2^Yf blocks is needed. This upper bound corresponds to the depth of the largest full subtree that we can extract from the tree of dependencies between steps. At such a level, the reachable useful part of the memory consumption stated before is accurate.
• ⌊log_{2^Yf}(⌊k/Nt⌋)⌋ < l < ⌈log_{2^Yf}(k)⌉ + 1, for which a buffer of k_l blocks is needed. In particular:
  ◦ if l = ⌊log_{2^Yf}(⌊k/Nt⌋)⌋ + 1, then k_l ≤ Nt·2^Yf;
  ◦ if l > ⌊log_{2^Yf}(⌊k/Nt⌋)⌋ + 1, then k_l < Nt·2^Yf. If such a level is reached, then all buffers of lower levels are empty.
For a constrained tree, a maximum number k_{Ym−1} + (Ym − 2)(2^Yf − 1)·Nt of useful blocks is reached for the completion of the level l = Ym − 1. If Ym < ⌊log_{2^Yf}(⌊k/Nt⌋)⌋, this consumption may be much greater than the previous one.

The Ym parameter is set to 255 by default. If the user does not touch this parameter, the tree has a negligible chance of being constrained. In this case, such an algorithm requires about Nt times more memory for storing internal node values than the sequential one using level buffers instead of a stack.

We may think that this fixed number and same level nodes first scheduling is as efficient as the lower level node first version described above in Section II-B2a. Just like the higher level node first scheduling, any recurrent waiting between threads should reduce performance, though it has the advantage of being simpler to implement.
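To make the buffer dimensioning concrete, the sketch below transcribes the bounds just listed into a small routine that returns, for each level, the buffer capacity in Nb-sized blocks; the class, the helper functions and the handling of the constrained case are our own reading of the bounds above, not part of the described implementation.

/**
 * Buffer capacities (in Nb-sized blocks) per tree level for the
 * "fixed number and same level nodes first" scheduling.
 * k = number of Nb-sized blocks at level 1, nt = number of threads,
 * yf = fan-out encoding (fan-out = 2^yf), ym = maximum tree height.
 * Index l of the returned array is the level (index 0 is unused,
 * since the base level is the input data buffer itself).
 */
public final class LevelBuffers {

    public static long[] capacities(long k, int nt, int yf, int ym) {
        int fanOut = 1 << yf;
        int v = ceilLog(fanOut, k) + 1;                    // smallest level producing a single block
        int p = Math.min(v, ym);                           // tree height
        int full = floorLog(fanOut, Math.max(1, k / nt));  // depth of the largest full subtree
        long[] cap = new long[p + 1];
        for (int l = 1; l <= p; l++) {
            long kl = ceilDiv(k, pow(fanOut, l - 1));      // k_l blocks produced at level l
            if (v > ym && l == p - 1) {
                cap[l] = kl;                               // constrained tree: penultimate buffer holds k_{p-1}
            } else if (l <= full) {
                cap[l] = (long) nt * fanOut;               // wide buffers of Nt * 2^Yf blocks
            } else {
                cap[l] = kl;                               // narrower buffers near the top
            }
        }
        return cap;
    }

    private static long pow(int b, int e) { long r = 1; for (int i = 0; i < e; i++) r *= b; return r; }
    private static long ceilDiv(long a, long b) { return (a + b - 1) / b; }
    private static int floorLog(int b, long x) { int e = 0; long p = b; while (p <= x) { e++; p *= b; } return e; }
    private static int ceilLog(int b, long x) { int e = 0; long p = 1; while (p < x) { e++; p *= b; } return e; }
}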
d) Assigning subtrees: If we consider the case where a thread processes a subtree, the user could control an additional parameter, the height hs of the subtree. Threads should be able to process full subtrees, not necessarily full subtrees on the right side of the original tree, and finally a last top subtree of height less than or equal to hs.

Although the use of subtrees may slightly unbalance the load between threads, it would have the advantage of reducing the total number of dependencies during the execution and thus improve performance. Note that the effect of this parameter
could be similar to the Yf and Yl effect, but the user might have an interest in treating a tree of a particular configuration, for example to check a hash issued from a tree of a specific configuration. The scheduling policies outlined above can still be applied.

Fig. 6: Level buffer usage for a message of 32·Nb bytes (Yf = Yl = 1, Ym = 255, Nt = 4)

e) Remark: From the parallelism standpoint, processing files and processing data streams are different problems. If, for streams, fine-grained parallelism seems to be the best solution in theory, too much synchronisation should reduce performance. In the case of files, we know their length in bytes; consequently we could assign subtrees of appropriate height in order to reduce the synchronisations as much as possible and to have a speedup close to Nt. For instance, assuming Yf = Yl, we could choose a height hs = ⌈log_{2^Yf}(n/Nt)⌉.

III. REACHING THE POTENTIAL SPEEDUP

This section deals with the minimal number of processors required to reach the potential speedup. The time complexity of a function UBI for the evaluation UBI(G, M, Ts) can be described by

T(l) = a · (⌈l/(8·Nb)⌉ · 1_{l>0} + 1_{l=0}) + b

where l is the message length in bits, the constant a is the time complexity of a block ciphering operation and the constant b corresponds to the time complexity of the initialisation operations, such as the padding operation and argument evaluation.

We can assume that b is much lower than a, so we parametrize b by αa with α ∈ [0, 1] and define this time complexity by

T(n, α) = a · (n + α)     (1)

where n = ⌈l/(8·Nb)⌉ is not zero.

The block ciphering operation by Threefish is then considered as an elementary operation; it constitutes one unique iteration of the UBI chaining mode.

We define an iterative sequence starting at an initial value r_0 and terminating at a value r_{v−1} by

r_0 = (k_0 mod 2^Yl) · 1_{k_0 mod 2^Yl > 0} + 2^Yl · 1_{k_0 mod 2^Yl = 0}

and

r_j = (k_j mod 2^Yf) · 1_{k_j mod 2^Yf > 0} + 2^Yf · 1_{k_j mod 2^Yf = 0}   if j > 0.

If k_1 is of the form 2^(Yf·h), then k_1 processing units are needed to reach the maximum performance improvement. If k_1 is not of this form, it is possible to reduce this amount of resources significantly. Let us take for example the tree of Figure 7 (with parameters Yf = Yl = 1 and n = 17), in which we use nine processors whereas certain nodes on the right side process only one block instead of two. Indeed, the thread 4_2 could additionally process the node output by 2_5, so the work done by 2_5 could be done by 3_3, and finally the one done by 1_9 could be done by 2_5 (see Figure 8), so that 8 processors are sufficient.

Fig. 7: Basic assignment

This reassignment is without consequence for the performance if α has an appropriate value (for instance 0). Note that if we take an example where k_1 = 144, we can save 16 processors. Reassigning in this way will concern only the nodes on the rightmost side of the tree. Processing such a node at level j > 1 can be assigned to a thread also used by the parent node if r_{j+1} + r_j + α ≤ 2^Yf.

Fig. 8: One level of a subtree is lifted up
where n = 8Nl b is not zero. sors in the case where k1 is not of the form 2Yf h . It takes as
input the representation of the number of blocks of the string
The block ciphering operation by Threefish is then con- k1 in base 2Yf , that we denote (dn dn−1 ...d1 d0 )2Yf where d0
sidered as an elementary operation; it constitutes one unique is the lower significant digit. Consecutive zero digits (or their
iteration for the UBI chaining mode. absence) give us indications to determine the sufficient number
We define an iterative sequence starting at an initial value of processors.
r0 and terminating at a value rv−1 by Instruction at the line 1 should not be included in the loop
starting at the line 2 since the time complexity of the Skein
r0 = (k0 mod 2Yl ) · 1k0 mod 2Yl >0 + 2Yl · 1k0 mod 2Yl =0 hash tree is determined by the time complexity of the largest
Algorithm 1 An algorithm which returns the number of processing units needed to reach the potential speedup
 1: Set P = 2^(Yf·n) · d_n.
 2: for j = n − 1 to 1 do
 3:   if d_j > 0 then
 4:     Compute P = P + 2^(Yf·j) · d_j
 5:   end if
 6:   if r_{j+1} + r_j + α ≤ 2^Yf then
 7:     Return P
 8:   end if
 9: end for
10: if d_0 > 0 then
11:   Compute P = P + d_0
12: end if
13: if r_1 + r_0 + α ≤ 2^Yf then
14:   Return P − 1
15: else
16:   Return P
17: end if

The instruction at line 1 should not be included in the loop starting at line 2, since the time complexity of the Skein hash tree is determined by the time complexity of the largest complete subtree on the left side^4, added to the time complexity of the root node processing, the latter being assigned to only one thread. Including this instruction would mean that we are ready to increase the workload of this thread in favour of a smaller number of processors to use.

^4 The leftmost child of the root node is always the root node of a complete tree.
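A direct Java transcription of Algorithm 1 is sketched below; the class and method names are ours, and the digits of k_1 and the sequence r_j (computed as defined above) are supplied by the caller. For the Figure 7 example, digits = {1, 0, 0, 1}, r = {1, 1, 1, 1, 2}, yf = 1 and α = 0 give 8.

/**
 * Transcription of Algorithm 1: number of processing units needed to
 * reach the potential speedup. digits[0..n] are the digits of k1 in
 * base 2^Yf (digits[0] least significant); r[0..] is the sequence r_j.
 */
public final class ProcessorCount {

    public static long count(int[] digits, int[] r, int yf, double alpha) {
        int n = digits.length - 1;
        int base = 1 << yf;
        long p = pow(base, n) * digits[n];                   // line 1: P = 2^(Yf*n) * d_n
        for (int j = n - 1; j >= 1; j--) {                   // lines 2-9
            if (digits[j] > 0) {
                p += pow(base, j) * digits[j];
            }
            if (r[j + 1] + r[j] + alpha <= base) {
                return p;
            }
        }
        if (digits[0] > 0) {                                 // lines 10-12
            p += digits[0];
        }
        return (r[1] + r[0] + alpha <= base) ? p - 1 : p;    // lines 13-17
    }

    private static long pow(int base, int e) {
        long result = 1;
        for (int i = 0; i < e; i++) result *= base;
        return result;
    }
}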
Finally, one could want to minimise the number of busy processing units at each step. Such an assignment strategy is shown in Figure 9.

Fig. 9: Two levels of a subtree are lifted up

In terms of the number of processors involved, we could easily improve the tree hashing scheme by simply not applying the UBI function to a node of arity lower than 2^Yl (level 1) or 2^Yf (higher levels). This amounts to choosing Scheme 1 of Section II-A. The way of using the tweak in such a scheme and its security aspects are not presented in this paper.

IV. CONCLUSION AND FURTHER WORK

We emphasise that efficiency for securing complex data exchange protocols will be mandatory for deploying future critical networking applications.

This paper proposes a methodology for parallelising hash trees, which are among the most commonly used cryptographic primitives. These functions can be found in almost any application and they are fundamental elements of our secure communicating infrastructures. Currently the SHA family of functions is the most popular but, because existing versions were found breakable, new families are required for such algorithms.

One of the candidates of the SHA-3 competition, Skein, was studied here; among others (cf. the Keccak sponge functions family), it is appropriate for hardware implementation, both for devices with little memory and for high speed requirements. Furthermore, software implementations of this family of hash functions in C or Java can be used immediately, increasing its accessibility. The C version is the fastest, but the availability of a pure Java implementation with acceptable performance is interesting for a large class of applications.

Further work is in progress for testing parallel implementations on a highly multi-core/multi-processor system. Moreover, further research should be done to implement a more specific thread scheduling policy that would increase performance by minimising the scheduling overhead.

Such a methodology will also be studied for future optimal hardware implementations of Keccak, the winner of the SHA-3 contest.

REFERENCES

[1] SHA-3 competition. http://www.nist.gov/itl/csd/ct/hash_competition.cfm.
[2] K. Atighehchi, A. Enache, T. Muntean, and G. Riserucci. An efficient parallel algorithm for Skein hash functions. PDCS, 2010.
[3] Dave Bayer, Stuart Haber, and W. Scott Stornetta. Improving the efficiency and reliability of digital time-stamping. In Sequences II: Methods in Communication, Security and Computer Science, pages 329–334. Springer-Verlag, 1993.
[4] Jean-Sébastien Coron, Yevgeniy Dodis, Cécile Malinaud, and Prashant Puniya. Merkle–Damgård revisited: How to construct a hash function. Pages 430–448. Springer-Verlag, 2005.
[5] Ivan Damgård. A design principle for hash functions. In CRYPTO '89: Proceedings of the 9th Annual International Cryptology Conference on Advances in Cryptology, pages 416–427, London, UK, 1990. Springer-Verlag.
[6] Niels Ferguson, Stefan Lucks, Bruce Schneier, Doug Whiting, Mihir Bellare, Tadayoshi Kohno, Jon Callas, and Jesse Walker. The Skein hash function family (version 1.2), 2009.
[7] Stuart Haber and W. Scott Stornetta. Secure names for bit-strings. In ACM Conference on Computer and Communications Security, pages 28–35. ACM Press, 1997.
[8] Markus Jakobsson, Tom Leighton, Silvio Micali, and Michael Szydlo. Fractal Merkle tree representation and traversal, 2003.
[9] Men Long and Intel Corporation. Implementing Skein hash function on Xilinx Virtex-5 FPGA platform.
[10] Toshihiko Matsuo and Kaoru Kurosawa. On parallel hash functions based on block-ciphers. In IEICE Transactions, volume 87-A, pages 67–74, 2004.
[11] B. Preneel, B. Van Rompay, J.J. Quisquater, H. Massias, and J. Serret Avila. Design of a timestamping system, 1999.
[12] Andrew Regenscheid, Ray Perlner, Shu-jen Chang, John Kelsey, Mridul Nandi, and Souradyuti Paul. The SHA-3 cryptographic hash algorithm competition, 2009.
[13] Stefan Tillich. Hardware implementation of the SHA-3 candidate Skein. Cryptology ePrint Archive, Report 2009/159, 2009. http://eprint.iacr.org/.
ANNEXE: JAVA IMPLEMENTATION

This section provides some details on a Java implementation of Skein based on an approach like the first one in Section II-B2, which offers maximum parallelism in theory and is independent of the algorithm parameters (in particular the parameters influencing the node sizes and tree structure in the Skein Hash Tree mode). The performance details given here are for illustrative purposes only. We have not attempted here to optimise the method for practical use; we are aiming instead to demonstrate the performance improvements that can be obtained on any system configuration.

Thread scheduling in Java. There are two kinds of schedulers: green and native. A green scheduler is provided by the Java Virtual Machine (JVM), and a native scheduler is provided by the underlying OS. In this work, tests were performed on a Linux operating system with a JVM using the native thread scheduler. This provides a standard round-robin strategy.

Threads can have different states: the initial state (when not started), the runnable state (when the thread can be executed), the blocked state and the terminating state. The main issue is when a thread is in the blocked state, i.e. waiting for some event (for example, a specific I/O operation or a signal notification), in which case the thread is not consuming CPU resources at all, meaning that having a large number of blocked threads does not impact much on the efficiency of the system.

A. Class organisation

Our implementation of Skein is composed of several classes, splitting the core functionality and the special code of the algorithm:
• Main algorithm: the Skein core is implemented as three classes, Skein256, Skein512 and Skein1024. They all provide the same interface, and support Simple Hash as well as Full Skein. Tree and thread management is done in other support classes.
• Tree and thread support: different classes were implemented, each representing the different kinds of nodes we can have in the hash tree. All of these classes have a similar interface:
  - TreeNode: used when hashing a file with a hash tree;
  - NodeThread: used in a hash tree with one thread per node;
  - NodeJob: used in a hash tree with one job per node, and processes those jobs with a thread pool;
  - ThreadPool: manages the pool of threads used with NodeJob instances;
  - other classes are needed for the pipeline implementation of Skein: SimplePipeFile and TreePipeFile are used, the first one for Simple Hash and the second one for Full Skein with a tree.
In addition to these classes, two main classes were written. The Speed class is used to test the speed-up of the algorithm, and the Test class implements direct calls to the different hash methods on different inputs, as well as running the tests provided in the Skein reference paper.

B. Sequential implementation

To do a Simple Hash, one simply calls the update() and digest() methods on a Skein class. There are also methods available to perform the Tree hash computation sequentially.
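A hypothetical usage of this sequential interface is sketched below; only the class names and the update()/digest() method names are documented above, so the constructor arguments and exact signatures are assumptions.

import java.nio.charset.StandardCharsets;

/** Illustrative use of the sequential Simple Hash interface (signatures assumed). */
public class SimpleHashExample {
    public static void main(String[] args) {
        byte[] message = "message to be hashed".getBytes(StandardCharsets.UTF_8);

        Skein512 skein = new Skein512();      // one of Skein256 / Skein512 / Skein1024
        skein.update(message);                // absorb the input (may be called repeatedly)
        byte[] hash = skein.digest();         // finalise and return the hash value

        System.out.println(hash.length + " bytes");
    }
}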
C. Parallel implementation

1. One thread per node: we create one thread per node, and let the scheduler handle how they are executed.
2. One thread per node with a thread pool: to optimize the first implementation we create a thread pool that has a fixed number of threads. These threads accept jobs in a FIFO manner and then execute them (Figure 10; a minimal sketch is given after this list).
3. Pipe input file: because most of the time people hash many files at the same time, we have decided to implement a pipe that applies the hash function in parallel to each input file. This implementation uses the thread pool with a thread count equal to the number of files to be hashed. It is implemented using both the Simple hash and the Tree hash methods.

Fig. 10: Parallel Skein using one thread per node and a thread pool
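The sketch below illustrates the "one job per node with a thread pool" idea using the standard java.util.concurrent API rather than the ThreadPool/NodeJob classes described above: a node job is queued once and runs on a pool worker as soon as the results of its children are available. Names are illustrative only.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

/**
 * Sketch of the "one job per node + thread pool" approach with a FIFO
 * job queue provided by a fixed-size executor.
 */
public final class NodeJobSketch {

    public interface NodeFunction {
        byte[] compute(List<byte[]> children);   // e.g. the UBI call for one tree node
    }

    /** Submits the job of a node whose children are the given futures. */
    public static CompletableFuture<byte[]> nodeJob(List<CompletableFuture<byte[]>> children,
                                                    NodeFunction node,
                                                    ExecutorService pool) {
        return CompletableFuture.allOf(children.toArray(new CompletableFuture[0]))
                .thenApplyAsync(ignored -> {
                    List<byte[]> values = children.stream()
                            .map(CompletableFuture::join)
                            .collect(Collectors.toList());
                    return node.compute(values);   // runs on a pool worker once all children are done
                }, pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        // Leaf jobs would be created with CompletableFuture.supplyAsync(..., pool),
        // then combined level by level with nodeJob(...) up to the root.
        pool.shutdown();
    }
}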
D. Testing and performances

The tests were done using a basic platform for minimal illustrative purposes only: a Dell Latitude D830, Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz, 2GB RAM, 4MB L2 cache, with an Ubuntu 9.10 operating system. For evaluating the performance of the various implementations we wrote the Speed class, used in conjunction with the YourKit profiling tool, which allows monitoring of the CPU and memory usage. In order to determine the efficiency of the implementation, we have performed tests using a fixed file of 700MB. The performance results are illustrated below.

Fig. 11: Processing speed (in MB/s) comparison between the Skein versions

Although the fastest version should theoretically be Skein-1024, from this chart we can see that the version using a block size of 512 bits is faster. That is because the computer used for these tests has a 64-bit processor. Furthermore, the slowest in our test is Skein-256, but this one would be the fastest on a 32-bit CPU.

The tests using the YourKit profiler showed that the parallel versions use more heap memory, but the CPU load stays close
to 100%, meaning both processors available on the platform are used at their full capacity.

In terms of execution time, the One Thread per Node implementation is the slowest. This is mainly due to the overhead of the thread scheduler and to poor memory management. Creating a lot of threads, although highly scalable, for single use is quite costly: it triples the heap memory usage in comparison to the sequential version, but is not very effective. It is also slower when compared to the sequential version, due to the excessive synchronisation required between threads.

To optimise this parallel implementation, we created a thread pool class. With a limited number of threads, we use less memory, although it is still high when compared to the sequential version. On the other hand, the execution time for this implementation is less than half of the sequential version.

The last parallel implementation uses the thread pool class with as many threads as input files. This is an efficient implementation, as the execution time is half of the sequential version, and the difference in used heap memory is quite small: for the simple hash mode it is less than 0.2MB. Also an important factor is that, when running the tests on a computer with two CPUs, having two input files means that both threads stay in the runnable state, which allows us to maximize CPU utilisation.

Comparing the three versions of Skein implementations, we noticed that for the simple sequential implementation the amount of heap memory used is almost the same. A small difference was noted for Skein-1024, which uses 0.1MB more heap memory and 0.1MB more non-heap memory. This is because this last version uses blocks of 1024 bits. In terms of execution time the results reflected the ones in the chart above.

For the tree implementations, we used the same parameters for all three versions of Skein. The results showed that for the sequential version Skein-256 uses less memory, and for the parallel implementations Skein-1024 uses less heap memory. The reason is that the number of nodes is smaller for Skein versions with bigger block sizes, and the size of each node does not vary much between the three versions.

E. Comparing with other implementations

Our implementations were also tested and compared to other Skein implementations, one in Java, from sphlib-2.0, and the second in C, from the NIST submission of Skein.

In Java, using our Speed class to test both our implementation and the sphlib-2.0 implementation, we obtained the results in Table I. As we can see, the Skein implementation from sphlib-2.0 is slower. It is also important to notice that this Skein implementation is much faster when compared to the SHA-512 one (Skein is therefore a good candidate for replacing the current SHA-2).

TABLE I: Speed results - sphlib-2.0
Implementation                   | Processing speed
Skein-512 - our implementation   | 36 MB/s
Skein-512 - sphlib-2.0           | 34 MB/s
SHA-512 - sphlib-2.0             | 27 MB/s

For the second comparison, in terms of execution time, we used a 700MB file and hashed it using both our implementation in Java and the C reference implementation of Skein. The execution times are the following: 27 seconds with the Java sequential version, 20 seconds with the Java parallel version and 24 seconds with the C reference (sequential) version. The Java implementation of the tree mode was of course slower than the C version, but not significantly; therefore some Java applications can use the Java implementation of Skein with no very significant loss of performance. On the other hand, the parallel implementation using the thread pool is faster than the C implementation.
