   1) M_{i+1} = ||_{j=0}^{n/2−1} f(M_{i,2j} || M_{i,2j+1}).
   2) If M_{i+1} is longer than N, then split it into blocks of size N, set n to the corresponding number of blocks and return to the first rule; otherwise return M_{i+1}.

The minimal number ⌈log2 n⌉ of steps seems to be obtained with 2^{⌈log2 n⌉−1} processors, although there exists a way to reduce this number by processing the largest full subtree that spans zeroed blocks with only one processor. Nevertheless, if a counter is used as input of the compression function in order to distinguish the tree nodes, this optimization no longer works.

We present in this section a first scheme (see [3], [7], [11], [10]) which conserves an optimal number ⌈log2 n⌉ of steps and an optimal number of instances of the compression function while decreasing the number of processors to 2^{⌊log2 n⌋−1} in certain cases. Finally, we present a new scheme which keeps the same number of processors whatever the length of the message, for a number of steps which does not exceed ⌊log2 n⌋ + ⌈log2 ⌈log2 n⌉⌉.

Scheme 1: Let M_0 = M_{0,0} || M_{0,1} || ... || M_{0,n−1} be the message to be hashed, where each block M_{0,j} is of size N, except the last which may be smaller. The tree is defined iteratively. For any level i = 1, 2, ... we use the following rules until obtaining the root M_r of size N:
   1) If n is even then
      M_{i+1} = ||_{j=0}^{n/2−1} f(M_{i,2j} || M_{i,2j+1}).

Scheme 2:
   b) If M_{i+1} is longer than N, then split it into blocks of size N, set n to the corresponding number of blocks and return to rule (a); otherwise set L = M_{i+1} || L and run the second phase.
   2) Apply Scheme 1 to the produced message L.

Scheme 1 and Scheme 2 differ only in how the roots of the successive full subtrees are combined (see Figure 1). The number of steps of hashing Scheme 2 can be estimated in the following way: the height of the largest full tree produced in phase 1 is ⌊log2 n⌋, whereas the height of the tree produced over the message L in phase 2 is at most ⌈log2 ⌈log2 n⌉⌉.

Fig. 1: Differences between Scheme 1 and Scheme 2
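To make the level-by-level construction above concrete, the following is a minimal Java sketch of a sequential binary tree hash in the spirit of these rules. It is not the paper's Skein code: SHA-256 from the JDK stands in for the compression function f, the block size N is taken to be the digest size, the handling of an odd number of blocks (carrying the last block upward unchanged) is only one possible convention chosen to keep the sketch total, and a single-block message is returned as-is.

import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a level-by-level tree hash in the spirit of Scheme 1.
// SHA-256 stands in for the compression function f; N is its output size.
public class TreeHashSketch {
    static final int N = 32; // block/digest size in bytes (assumption for the sketch)

    static byte[] f(byte[] left, byte[] right) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(left);
        md.update(right);
        return md.digest();
    }

    // Splits the message into N-byte blocks and reduces them pairwise,
    // level by level, until a single root value remains.
    static byte[] treeHash(byte[] message) throws Exception {
        List<byte[]> level = new ArrayList<>();
        for (int off = 0; off < message.length; off += N) {
            int len = Math.min(N, message.length - off);
            byte[] block = new byte[len];
            System.arraycopy(message, off, block, 0, len);
            level.add(block);
        }
        if (level.isEmpty()) level.add(new byte[0]);

        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            int pairs = level.size() / 2;               // j = 0 .. n/2 - 1, as in rule 1
            for (int j = 0; j < pairs; j++) {
                next.add(f(level.get(2 * j), level.get(2 * j + 1)));
            }
            if (level.size() % 2 == 1) {                // odd case: carry the last block up
                next.add(level.get(level.size() - 1));  // (one possible convention, see above)
            }
            level = next;                               // split and iterate, as in rule 2
        }
        return level.get(0);
    }
}

Each iteration of the while loop corresponds to one tree level, i.e. to one of the roughly ⌈log2 n⌉ steps discussed above once every level is processed in parallel.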
The following sections recall the two modes of SKEIN intended to be widely used, the Simple Hash mode and the Hash Tree mode, a mode specifically designed for parallel implementations (see [6] for more information).

1) Hash Tree Mode: Notations
A simple SKEIN hash computation has the following inputs:
Nb  The internal state size, in bytes (32, 64 or 128).
No  The output size, in bits.
M   The message to be hashed, a string of up to 2^99 − 8 bits (2^96 − 1 bytes).
Tree processing varies according to the following input parameters:
Yl  The leaf size encoding. The size of each leaf of the tree is Nl = Nb · 2^Yl bytes, with Yl ≥ 1 (where Nb is the size of the internal state of Skein).
Yf  The fan-out encoding. The fan-out of a tree node is 2^Yf, with Yf ≥ 1. The size of each node is Nn = Nb · 2^Yf.
Ym  The maximum tree height; Ym ≥ 2. If the height of the tree is not limited, this parameter is set to 255.
G0  The input chaining value and the output of the previous UBI function.

      G1 = UBI(G0, M_l, Ym · 2^112 + Tmsg · 2^120)

3) If neither of these conditions holds, we create the next tree level. We split M_l into blocks M_{l,0}, M_{l,1}, ..., M_{l,k−1}, where all blocks are of size Nn, except the last which may be smaller. We then define

      M_{l+1} = ||_{i=0}^{k−1} UBI(G0, M_{l,i}, i·Nn + (l+1)·2^112 + Tmsg·2^120)

and apply the above rules to M_{l+1} again.
The result G1 is then the chaining input to the output transformation H := Output(G1, No), where H is the result of the hash.

2) Approaches for Parallelism: Notations used
In the following sections, we denote by n the number of blocks¹ of the message and by Nt the number of threads. These threads are then indexed 0, 1, ..., Nt − 1.

¹ When it is not specified, the blocks are of size Nb bytes and we include the last block, which can be of size less than or equal to Nb.
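As a small illustration of how these parameters interact, the sketch below derives the leaf and node sizes Nl = Nb·2^Yl and Nn = Nb·2^Yf and assembles the starting tweak value i·Nn + l·2^112 + Tmsg·2^120 that appears in the node equation above (with l the level of the node being produced). This is not the reference implementation: the class name is made up, BigInteger merely stands in for the 128-bit tweak word, Tmsg is left as a parameter, the UBI call itself is omitted, and the leaf level, whose rule is not reproduced in this copy, would use Nl analogously.

import java.math.BigInteger;

// Sketch: derive the Skein tree-mode sizes and the starting tweak value
// i*Nn + level*2^112 + Tmsg*2^120 for the i-th node of a tree level, as in
// the node equation above. The UBI call itself is not reproduced here.
public class SkeinTreeParams {
    final int Nb;       // internal state size in bytes (32, 64 or 128)
    final int Yl, Yf;   // leaf size and fan-out encodings, both >= 1
    final int Nl, Nn;   // leaf size and node size in bytes

    SkeinTreeParams(int Nb, int Yl, int Yf) {
        this.Nb = Nb;
        this.Yl = Yl;
        this.Yf = Yf;
        this.Nl = Nb << Yl;   // Nl = Nb * 2^Yl
        this.Nn = Nb << Yf;   // Nn = Nb * 2^Yf
    }

    // Tweak offset for the i-th node of tree level `level` (internal levels use Nn;
    // the leaf level is analogous with Nl -- an assumption, its rule is not shown here).
    BigInteger nodeTweak(long i, int level, int Tmsg) {
        return BigInteger.valueOf(i).multiply(BigInteger.valueOf(Nn))
                .add(BigInteger.valueOf(level).shiftLeft(112))
                .add(BigInteger.valueOf(Tmsg).shiftLeft(120));
    }
}

For instance, with Nb = 64 and Yf = Yl = 1, each leaf and each node covers 128 bytes, i.e. two children per node, which is the binary configuration used in the figures.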
² A full tree, in our context, is a tree in which all nodes of the first level have 2^Yl children and all other nodes have 2^Yf children.

Fig. 4: Higher level node first (Yf = Yl = 1, Ym = 255, Nt = 4)
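As a direct consequence of this definition, a full subtree of height h has (2^Yf)^{h−1} first-level nodes and therefore spans 2^Yl · 2^{Yf·(h−1)} message blocks of Nb bytes; in the configuration Yf = Yl = 1 used in the figures, a full subtree of height h thus covers exactly 2^h blocks.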
b) Higher level node priority: This method consists in assigning³ to a thread the highest-level node among all those that may be assigned to it. Applying this method in a deterministic way requires that each thread maintains a vector describing, at each step, the number of nodes that can be processed on each level, taking into account the tasks performed by all threads. Thus, knowing the state of the tree, the thread indexed 0 will choose the highest-level node, the thread indexed 1 will choose the second highest-level node, and so on (an example is shown in Figure 4). Applying this method in a non-deterministic way means that the highest-level node whose child values are available is assigned to the first ready thread. In such a method, a special termination phase for the end of the message is needed, so as to allow threads to process nodes of arity lower than expected.

The advantage of this approach is that it keeps memory usage as low as possible during the hash process, since nodes are processed as soon as possible. Compared to the strategy presented before, particular attention must be paid to the synchronisation overhead. From Figure 4, we must be careful that thread 32 does not produce a digest before thread 31 has finished consuming the digest produced by 21, forcing these threads to perform a data recopy in order not to lose too much parallelism.

This scheduling method, which must be further studied, seems not easy to implement, and it is not clear whether it offers good performance because of the large number of synchronisation mechanisms required. Therefore a deterministic implementation should be avoided, since the threads might wait for one another uselessly. Finally, note that the total number of steps increases compared to the first scheduling method because of the number of purely sequential steps, which can approach the height of the tree (see, for instance, the right side of the tree in Figure 4). This number of additional steps depends on the configuration of the tree and the number of threads generated. Thus the inherent unbalanced loading between threads of this scheduling approach can induce a performance penalty, a priori negligible.

Fig. 5: Fixed number and same level nodes first (Yf = Yl = 1, Ym = 255, Nt = 4)

c) Priority to a fixed number of nodes of higher level and same level: A third method takes up again the idea of using a stack (see Section II-B2), but applied to an arbitrary number of threads. For Nt threads, at each level we use buffers which can receive Nt·2^Yf blocks of size Nb, except at the base level, where the leaves are the input data buffer blocks. For a given level, these threads have to compute Nt·2^Yf node values in order to move up and compute Nt node values at the next level. Once these Nt node values are computed, the Nt·2^Yf child values are removed. If the current level occupied by the threads is greater than 1 and the lack of resources on the level below prevents them from finishing the computation of the Nt·2^Yf blocks, then they return down to level 1; otherwise they continue, and so on (see Figure 6). In the termination phase for the end of the message, buffers' contents of fewer than Nt·2^Yf blocks have to be processed. Furthermore, top levels may need narrower buffers (see Figure 5) and, when the Ym = p parameter constrains the tree, the penultimate level buffer must always have a capacity of k_{p−1} blocks. For an unconstrained tree, when a level l is reached, the buffers are not all filled, except one, and the effective consumption does not exceed (l(2^Yf − 1) + 1)·Nt blocks of size Nb. In fact, this is slightly overvalued, because we have to consider the buffers' filling both during the first phase and during the termination one. Thus, we have to distinguish the levels such that:
• l ≤ ⌊log_{2^Yf} ⌊k/Nt⌋⌋, for which a buffer of Nt·2^Yf blocks is needed. This upper bound corresponds to the depth of the largest full subtree that we can extract from the tree of dependencies between steps. At such a level, the reachable useful part of the memory consumption stated above is accurate.
• ⌊log_{2^Yf} ⌊k/Nt⌋⌋ < l < ⌊log_{2^Yf} k⌋ + 1, for which a buffer of k_l blocks is needed. In particular:
  ◦ if l = ⌊log_{2^Yf} ⌊k/Nt⌋⌋ + 1 then k_l ≤ Nt·2^Yf;
  ◦ if l > ⌊log_{2^Yf} ⌊k/Nt⌋⌋ + 1 then k_l < Nt·2^Yf. If such a level is reached, then all buffers of lower levels are empty.
For a constrained tree, a maximum number k_{Ym−1} + (Ym − 2)(2^Yf − 1)·Nt of useful blocks is reached for the completion of the level l = Ym − 1. If Ym < ⌊log_{2^Yf} ⌊k/Nt⌋⌋, this consumption may be much greater than the previous one.

The Ym parameter is set to 255 by default. If the user does not touch this parameter, the tree has a negligible chance of being constrained. In this case, such an algorithm requires about Nt times more memory for storing internal node values than the sequential one using level buffers instead of a stack.

We may think that this fixed number and same level nodes first scheduling is as efficient as the lower level node first version described above in Section II-B2a. Just like the higher level node first scheduling, any recurrent waiting between threads should reduce performance, though it has the advantage of being simpler to implement.

d) Assigning subtrees: If we consider the case where a thread is processing a subtree, the user could control an additional parameter, the height hs of the subtree. Threads should be able to process full subtrees, not necessarily full subtrees on the right side of the original tree, and finally a last top subtree of height less than or equal to hs.

Although the use of subtrees may slightly unbalance the load between threads, it would have the advantage of reducing the total number of dependencies during the execution and thus improving performance.

³ An assignment of a node to a thread means that the thread is responsible for producing the hash value of this node using the hash values of its children.
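As an illustration of the buffer bookkeeping described for method c), here is a rough Java sketch. It only models the data movement: each level keeps a buffer of at most Nt·2^Yf pending node values, a full buffer is reduced to Nt values of the level above, and a termination phase drains the partially filled buffers, allowing nodes of smaller arity. The class name is invented, SHA-256 stands in for the node computation (it is not the UBI call), and the Nt threads are simulated by a plain loop, so none of the synchronisation issues discussed above appear here.

import java.security.MessageDigest;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of the level-buffer bookkeeping of method c): every level keeps a buffer
// of up to Nt * 2^Yf pending node values; when a buffer is full, groups of 2^Yf
// values are combined into Nt values of the level above and then discarded.
public class LevelBufferSketch {
    final int fanOut;                                  // 2^Yf children per node
    final int Nt;                                      // number of threads (buffer sizing only)
    final List<Deque<byte[]>> levels = new ArrayList<>();

    LevelBufferSketch(int Yf, int Nt) { this.fanOut = 1 << Yf; this.Nt = Nt; }

    void addLeaf(byte[] value) throws Exception { push(0, value); }

    // Termination phase: drain the partially filled buffers bottom-up, allowing
    // nodes of smaller arity for the rightmost groups. Assumes at least one leaf.
    byte[] finish() throws Exception {
        if (levels.isEmpty()) return null;
        for (int l = 0; l < levels.size() - 1 || levels.get(l).size() > 1; l++) {
            Deque<byte[]> buf = levels.get(l);
            while (!buf.isEmpty()) push(l + 1, combine(buf, Math.min(fanOut, buf.size())));
        }
        return levels.get(levels.size() - 1).peekFirst();  // the root value
    }

    private void push(int level, byte[] value) throws Exception {
        while (levels.size() <= level) levels.add(new ArrayDeque<>());
        Deque<byte[]> buf = levels.get(level);
        buf.addLast(value);
        if (buf.size() == Nt * fanOut) {                   // buffer full: move one level up
            for (int t = 0; t < Nt; t++) push(level + 1, combine(buf, fanOut));
        }
    }

    private byte[] combine(Deque<byte[]> buf, int arity) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (int c = 0; c < arity; c++) md.update(buf.pollFirst());
        return md.digest();
    }
}

Note that the fixed per-level capacity Nt·2^Yf only approximates the unconstrained-tree bound of (l(2^Yf − 1) + 1)·Nt blocks discussed above; the sketch makes no attempt to narrow the top-level buffers.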
Note that the effect of this parameter could be similar to the Yf and Yl effect, but the user might have an interest in treating a tree of a particular configuration, for example to check a hash issued from a tree of a specific configuration. The scheduling policies outlined above can still be applied.

Fig. 6: Level buffer usage for a message of 32Nb bytes (Yf = Yl = 1, Ym = 255, Nt = 4). [Chart of the occupied buffer level (lvl) against the step number; the termination phase starts at the marked step.]

e) Remark: From the parallelism standpoint, processing files and processing data streams are different problems. While for streams fine-grained parallelism seems to be the best solution in theory, too much synchronisation should reduce performance. In the case of files, we know their length in bytes; consequently, we could assign subtrees of appropriate height in order to reduce the synchronisations as much as possible and to obtain a speedup close to Nt. For instance, assuming Yf = Yl, we could choose a height hs = ⌈log_{2^Yf}(n/Nt)⌉.

III. REACHING THE POTENTIAL SPEEDUP

This section deals with the minimal number of processors required to reach the potential speedup. The time complexity of a function UBI for the evaluation UBI(G, M, Ts) can be described by

      T(l) = a · (⌈l/(8Nb)⌉ · 1_{l>0} + 1_{l=0}) + b

where l is the message length in bits, the constant a is the time complexity of a block ciphering operation and the constant b corresponds to the time complexity of the initialisation operations, such as the padding operation and argument evaluation. We can assume that b is much lower than a, so we parametrise b by αa with α ∈ [0, 1] and define this time complexity by

      T(n, α) = a · (n + α)                                        (1)

where n = ⌈l/(8Nb)⌉ is not zero.

The block ciphering operation by Threefish is then considered as an elementary operation; it constitutes one unique iteration of the UBI chaining mode.

We define an iterative sequence starting at an initial value r_0 and terminating at a value r_{v−1} by

      r_0 = (k_0 mod 2^Yl) · 1_{k_0 mod 2^Yl > 0} + 2^Yl · 1_{k_0 mod 2^Yl = 0}

and

      r_j = (k_j mod 2^Yf) · 1_{k_j mod 2^Yf > 0} + 2^Yf · 1_{k_j mod 2^Yf = 0}   if j > 0.

If k_1 is of the form 2^{Yf·h}, k_1 processing units are needed to reach the maximum performance improvement. If k_1 is not of this form, it is possible to reduce this amount of resources significantly. Let us take for example the tree of Figure 7 (with parameters Yf = Yl = 1 and n = 17), in which we use nine processors, whereas certain nodes on the right side process only one block instead of two. Indeed, thread 42 could additionally process the node output by 25, so the work done by 25 could be done by 33, and finally the one done by 19 could be done by 25 (see Figure 8), so that 8 processors are sufficient.

Fig. 7: Basic assignment

This reassignment is without consequence for the performance if α has an appropriate value (for instance 0). Note that if we take an example where k_1 = 144, we can save 16 processors. Reassigning this way will concern only the nodes on the rightmost side of the tree. Processing such a node at level j > 1 can be assigned to a thread also used by the parent node if r_{j+1} + r_j + α ≤ 2^Yf.

Fig. 8: One level of a subtree is lifted up

Algorithm 1 allows us to determine this number of processors in the case where k_1 is not of the form 2^{Yf·h}. It takes as input the representation, in base 2^Yf, of the number of blocks k_1 of the string, which we denote (d_n d_{n−1} ... d_1 d_0)_{2^Yf}, where d_0 is the least significant digit. Consecutive zero digits (or their absence) give us indications to determine the sufficient number of processors.
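To make the counting concrete, the following Java sketch computes the r_j sequence just defined and then transcribes Algorithm 1, given below, essentially line for line. Two assumptions are made explicit because they are not spelled out in this copy of the text: k_j is taken to be the number of nodes at level j (k_0 being the number of message blocks), which is how the r_j values behave as rightmost-node arities, and the digit array d holds the base-2^Yf representation of k_1 with d[0] the least significant digit; α is the initialisation-cost ratio of Eq. (1). The class and method names are illustrative, not from the reference code.

// Sketch accompanying Section III; names are illustrative, not from the reference code.
public class SpeedupUnitsSketch {

    // r_0 and r_j as defined above: the arity of the rightmost node of each level,
    // assuming k[j] is the number of nodes at level j (k[0] = number of message blocks).
    static int[] raritySequence(long[] k, int Yl, int Yf) {
        int[] r = new int[k.length];
        for (int j = 0; j < k.length; j++) {
            int base = (j == 0) ? (1 << Yl) : (1 << Yf);
            long rem = k[j] % base;
            r[j] = (rem > 0) ? (int) rem : base;
        }
        return r;
    }

    // Direct transcription of Algorithm 1 (below). d holds the digits of k_1 in base
    // 2^Yf, least significant first; r must cover indices 0 .. d.length - 1; at least
    // two digits are assumed (k_1 not of the form 2^(Yf*h) and large enough).
    static long processingUnits(int[] d, int[] r, int Yf, double alpha) {
        int m = d.length - 1;                                    // most significant digit index
        long P = (long) d[m] << (Yf * m);                        // line 1: P = 2^(Yf*m) * d_m
        for (int j = m - 1; j >= 1; j--) {                       // line 2
            if (d[j] > 0) P += (long) d[j] << (Yf * j);          // lines 3-5
            if (r[j + 1] + r[j] + alpha <= (1 << Yf)) return P;  // lines 6-8
        }
        if (d[0] > 0) P += d[0];                                 // lines 10-12
        if (r[1] + r[0] + alpha <= (1 << Yf)) return P - 1;      // lines 13-14
        return P;                                                // line 16
    }
}

With the example of Figure 7 (Yf = Yl = 1, n = 17, hence k_1 = 9 = (1001)_2) and α = 0, the function returns 8, which matches the count of eight processors found above by hand.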
Algorithm 1 An algorithm which returns the number of processing units needed to reach the potential speedup
 1: Set P = 2^{Yf·n} · d_n.
 2: for j = n − 1 to 1 do
 3:   if d_j > 0 then
 4:     Compute P = P + 2^{Yf·j} · d_j
 5:   end if
 6:   if r_{j+1} + r_j + α ≤ 2^Yf then
 7:     Return P
 8:   end if
 9: end for
10: if d_0 > 0 then
11:   Compute P = P + d_0
12: end if
13: if r_1 + r_0 + α ≤ 2^Yf then
14:   Return P − 1
15: else
16:   Return P
17: end if

The instruction at line 1 should not be included in the loop starting at line 2, since the time complexity of the Skein hash tree is determined by the time complexity of the largest complete subtree on the left side⁴, added to the time complexity of the root node processing, this latter being assigned to only one thread. Including this instruction would mean that we are ready to increase the workload of this thread in favour of a smaller number of processors to use.

Finally, one could want to minimise the number of busy processing units at each step. Such an assignment strategy is shown in Figure 9.

Fig. 9: Two levels of a subtree are lifted up

In terms of the number of processors involved, we could easily improve the tree hashing scheme by simply not applying the UBI function to a node of arity lower than 2^Yl (level 1) or 2^Yf (higher levels), that is to say, by choosing Scheme 1 of Section II-A. The way of using the tweak in such a scheme and its security aspects are not presented in this paper.

⁴ The leftmost child of the root node is always the root node of a complete tree.

IV. CONCLUSION AND FURTHER WORK

We emphasise that efficiency for securing complex data exchange protocols will be mandatory for deploying future critical networking applications.

This paper proposes a methodology for parallelising hash trees, which are among the most commonly used cryptographic primitives. These functions can be found in almost any application, and they are fundamental levels of our secure communicating infrastructures. Currently the SHA family of functions is the most popular, but because the existing version was found breakable, new families are required for such algorithms.

One of the candidates to the SHA-3 competition, Skein, was studied here; among others (cf. the Keccak sponge function family), it is appropriate for hardware implementation, both for devices with little memory and for high-speed requirements. Furthermore, software implementations of this family of hash functions in C or Java can be used immediately, increasing its accessibility. The C version is the fastest, but the availability of a pure Java implementation with acceptable performance is interesting for a large class of applications.

Further work is in progress for testing parallel implementations on a highly multi-core/multi-processor system. Moreover, further research should be done to implement a more specific thread scheduling policy that would increase performance by minimising the scheduling overhead.

Such a methodology will also be studied for future optimal hardware implementations of Keccak, the winner of the SHA-3 contest.

REFERENCES

[1] SHA-3 competition. http://www.nist.gov/itl/csd/ct/hash_competition.cfm.
[2] K. Atighehchi, A. Enache, T. Muntean, and G. Risterucci. An efficient parallel algorithm for Skein hash functions. PDCS, 2010.
[3] Dave Bayer, Stuart Haber, and W. Scott Stornetta. Improving the efficiency and reliability of digital time-stamping. In Sequences II: Methods in Communication, Security and Computer Science, pages 329–334. Springer-Verlag, 1993.
[4] Jean-Sébastien Coron, Yevgeniy Dodis, Cécile Malinaud, and Prashant Puniya. Merkle–Damgård revisited: How to construct a hash function. Pages 430–448. Springer-Verlag, 2005.
[5] Ivan Damgård. A design principle for hash functions. In CRYPTO '89: Proceedings of the 9th Annual International Cryptology Conference on Advances in Cryptology, pages 416–427, London, UK, 1990. Springer-Verlag.
[6] Niels Ferguson, Stefan Lucks, Bruce Schneier, Doug Whiting, Mihir Bellare, Tadayoshi Kohno, Jon Callas, and Jesse Walker. The Skein hash function family (version 1.2), 2009.
[7] Stuart Haber and W. Scott Stornetta. Secure names for bit-strings. In ACM Conference on Computer and Communications Security, pages 28–35. ACM Press, 1997.
[8] Markus Jakobsson, Tom Leighton, Silvio Micali, and Michael Szydlo. Fractal Merkle tree representation and traversal, 2003.
[9] Men Long and Intel Corporation. Implementing Skein hash function on Xilinx Virtex-5 FPGA platform.
[10] Toshihiko Matsuo and Kaoru Kurosawa. On parallel hash functions based on block-ciphers. IEICE Transactions, volume 87-A, pages 67–74, 2004.
[11] B. Preneel, B. Van Rompay, J. J. Quisquater, H. Massias, and J. Serret Avila. Design of a timestamping system, 1999.
[12] Andrew Regenscheid, Ray Perlner, Shu-jen Chang, John Kelsey, Mridul Nandi, and Souradyuti Paul. The SHA-3 cryptographic hash algorithm competition. NISTIR, 2009.
[13] Stefan Tillich. Hardware implementation of the SHA-3 candidate Skein. Cryptology ePrint Archive, Report 2009/159, 2009. http://eprint.iacr.org/.
A. Class organisation

Our implementation of Skein is composed of several classes, splitting the core functionality and the special code of the algorithm:
• Main algorithm: The Skein core is implemented as three classes, Skein256, Skein512 and Skein1024. They all provide the same interface, and support Simple Hash as well as Full Skein. Tree and thread management is done in other support classes.
• Tree and thread support: Different classes were implemented, each representing a different kind of node we can have in the hash tree. All of these classes have a similar interface:
  - TreeNode: used when hashing a file with a hash tree;
  - NodeThread: used in a hash tree with one thread per node;
  - NodeJob: used in a hash tree with one job per node, those jobs being processed with a thread pool;
  - ThreadPool: manages the pool of threads used with NodeJob instances;
  - Other classes are needed for the pipeline implementation of Skein: SimplePipeFile and TreePipeFile are used, the first one for Simple Hash and the second one for Full Skein with tree.
In addition to these classes, two main classes were written. The Speed class is used to test the speed-up of the algorithm, and the Test class implements direct calls to the different hash methods on different inputs, as well as running the tests provided in the Skein reference paper.

Fig. 10: Parallel Skein using one thread per node and a thread pool. [Diagram showing jobs, a job scheduler, a job queue, threads and the CPU.]

D. Testing and performances

The tests were done using a basic platform, for minimal illustrative purpose only: a Dell Latitude D830, Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20 GHz, 2 GB RAM, L2 cache size 4 MB, with an Ubuntu 9.10 operating system. For evaluating the performance of the various implementations we wrote the Speed class, used in conjunction with the YourKit profiling tool, which allows monitoring the CPU and memory usage. In order to determine the efficiency of the implementation, we have performed tests using a fixed file of 700 MB. The performance results are illustrated below.

Although the fastest version should theoretically be Skein-1024, from this chart we can see that the version using a block size of 512 bits is faster. That is because the computer used for these tests has a 64-bit processor. Furthermore, the slowest for our test is Skein-256, but this one would be the fastest on a 32-bit CPU.

The tests using the YourKit profiler showed that the parallel versions use more heap memory, but the CPU load stays close