Buckets Strike Back - Improved Parallel Shortest-Paths

Buckets strike back: Improved Parallel Shortest-Paths
Ulrich Meyer
Max-Planck-Institut für Informatik (MPII),
Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany,
www.uli-meyer.de

Abstract has finite, nonnegative derivative at . This is also true for
our algorithm.
We study the average-case complexity of the parallel The parallel random access machine (PRAM) [19] is one
single-source shortest-path (SSSP) problem, assuming arbi- of the most widely studied abstract models of a parallel com-
trary directed graphs with nodes, edges, and indepen-

puter. A PRAM consists of J independent processors (pro-
dent random edge weights uniformly distributed in . We cessing units, PUs) and a shared memory, which these pro-
provide a new bucket-based parallel SSSP algorithm that cessors can synchronously access in unit time. The strongest
runs in !#"%$ '&)(+* ,-.*0/21 average-case model (CRCW) supports concurrent read- and write-access to
time using 3( ( 1 work on a PRAM where & denotes single memory cells. The performance of PRAM algorithms
the maximum shortest-path weight and * ,-.* is the number of is usually described by the two parameters time (assuming
graph vertices with in-degree at least $ . All previous algo- an unlimited number of available PUs) and work (the total
rithms either required more time or more work. The minimum number of operations needed). Even though the strict PRAM
performance gain is a logarithmic factor improvement; on model is only implemented on a number of experimental par-
certain graph classes, accelerations by factors of more than allel machines like the SB-PRAM [13], it is valuable to high-
5476 8 can be achieved. The algorithm allows adaptation to light the main ideas of a parallel algorithm without tedious
distributed memory machines, too. details caused by a particular architecture. Other models like
BSP [30] view a parallel computer as a collection of sequen-
tial processors, each one having its own local memory, so
1 Introduction called distributed memory machines (DMMs). The PUs are
The single-source shortest-path problem (SSSP) is a funda- interconnected by a network that allows them to communi-
mental and well-studied combinatorial optimization problem cate by sending and receiving messages. In order to facili-
with
<; many practical and theoretical applications [1]. Let 9: tate easy exposition we focus on the PRAM model and only
, 1 be a directed graph with * ,* nodes and * ; * = sketch how our SSSP algorithm can be converted to DMMs.
edges, let > be a distinguished vertex of the graph, and ? be a
function assigning a nonnegative real-valued weight to each 1.1 Previous Work.
edge of 9 . The objective of the SSSP is to compute, for each The classical sequential SSSP result is Dijkstra’s algorithm
vertex @ reachable from > , the weight of a minimum-weight [11]; implemented with Fibonacci heaps it solves SSSP on
(“shortest distance”) path from > to @ , denoted by A CBED F> @ 1 , arbitrary directed graphs with nonnegative edge weights in
abbreviated A
E
B D 1
G@ ; the weight of a path is the sum of the K L M' N( 1 time. A number of faster algorithms have
weights of its edges. been developed on the more powerful RAM (random access
Assuming independent random edge weights is a standard machine) model, see [29] for an overview. In particular,
setting for the average-case analysis of graph algorithms; see Thorup [29] has given the first ( 1 worst-case time
[14] for many examples. The uniform edge weight distribu- RAM algorithm for undirected graphs with integer or float
tion is mostly chosen in order to keep the proofs simple. Fre- edge weights. The average-case analysis of shortest-path
quently, the obtained results also hold asymptotically in the algorithms mainly focused on the All–Pairs Shortest Paths
more general situation of random edge weights that are inde- (APSP) problem for the complete graph with random edge
pendent, bounded, and their common distribution function H weights. Recently, the first linear ( 1 average-case
time algorithms for arbitrary directed graphs with random
I edge weights have been given [16, 24].
Partially supported by the Future and Emerging Technologies pro-
gramme of the EU under contract number IST-1999-14186 (ALCOM-FT) So far there is no parallel O N( 1 work PRAM
and the Center of Excellence programme of the EU under contract number
ICAI-CT-2000-70025. Parts of this work were done while the author was
SSSP algorithm with worst-case sublinear running time for
arbitrary digraphs with nonnegative edge weights. The
K M' O( 1 work solution by Driscoll et. al. [12] has
visiting the Computer and Automation Research Institute of the Hungarian
Academy of Sciences, Center of Excellence, MTA SZTAKI, Budapest.
1
Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)
1530-2075/02 $17.00 © 2002 IEEE
running time P Q1 . An K Q1 time algorithm re- 2 Preliminaries
quiring KG Q1 work was presented by Brodal et. al. The classical sequential SSSP approach is Dijkstra’s algo-
[7]. All faster known algorithms require more work, e.g., rithm [11]. It maintains a partition of the node set , into
the approach by Han et. al. [18] needs Q1 time and settled, queued, and unreached nodes. For each node @ it
K SRT UM' WV Q1<X<Y.RZ1 work. The algorithm of Klein keeps a tentative distance D<7vD @ 1 . If @ is unreached, then
and Subramanian [20] takes F[ \ & ] M'^ Q1 time D<7vD @ 1 ; otherwise, D<7vD @ 1 refers to the weight of the
and K_[ ` & M' Q1 work where & is the maximum lightest path from > to @ found so far. Hence, D<7vD @ 1
shortest-path weight, i.e. & à%bcZdegf hi jlknmMcpo_q5r A BED G@ 1 . A BED G@ 1 . Settled nodes satisfy D.7vD G@ 1 A CB#D @ 1 . Initially,
Similar results have been obtained by Cohen [9] and Shi > is queued ( > ), D<7vD y> 1 , and all other nodes
and Spencer [28]. Most of these algorithms can be modi- are unreached. In each iteration, the queued node @ with
fied to run on the weakest PRAM model without concurrent smallest tentative distance is< scanned: @ is removed from
1 D<7vD 1
1 D<'GD @ 1( are relaxed,
read/write capability. the queue, and all edges i.e.,

S

" <
D 7
v
D E 1p/
Parallel shortest path problems on random graphs [6] is set to G@ ?@ . If was un-
where each of the possible edges is present with a certain reached ( ), it is now queued. It is well known that
probability have been studied intensively [8, 10, 15, 17, 25, D<7vD @ 1 A CB#D @ 1 , when @ is selected from the queue as
26, 27]. Under the assumption of independent random

edge the node with smallest tentative distance. Hence, @ is set-
weights uniformly distributed in the interval the fastest tled and will never re-enter the queue. Therefore, Dijkstra’s
work-efficient parallel SSSP algorithm for random graphs approach is a so called label-setting method. Alternatively,
[25, 26] requires Q1 time 1u and Llinear work on aver- label-correcting variants may scan nodes from the queue for
age; additionally, Gs &t( ( v Q1 time and which D<vD @ 1| A BED G@ 1 and hence have to rescan those
K ( ( s &t( 1\ Q1 work on average is suffi- nodes until they are finally settled.
cient for arbitrary directed graphs with random edge weights Label-correcting SSSP algorithms are natural candidates
where s denotes the maximum node degree in the graph and for parallelization: several queued nodes may be scanned
& is defined as above.
concurrently in one round. However, finding both provably
For arbitrary graphs with large maximum degree s , the al- good criteria to select nodes for scanning and data struc-
gorithms of [25, 26] perform poorly: wxys 1 time is needed. If tures that efficiently support these strategies remains a dif-
the number of high-degree nodes is rather small, then the run- ficult task.
ning time can be considerably improved [23]: let * ,-z* denote
the number of graph vertices with in-degree at least $ then
R
SSSP can be solved in 5{PPKGM' `z "%$ <&L(|* , *}/%1
2.1 Buckets of Fixed Width.
time on average. However, the algorithm needs non-linear The sequential SSSP algorithm of [25], called -stepping,
K ~ M' K( ( { 1 operations. and its parallelizations [25, 26] are label-correcting ap-
proaches that work in phases: if denotes the smallest ten-
1.2 New Result. tative distance in the queue data structure at the beginning
of a phase, then they scan queued nodes @ with tentative dis-
We provide an improved parallel SSSP algorithm that applies tance D<'D @ 1u ( in parallel. The parameter is called
different step-widths on disjoint node subsets at the same time the step-width. The queue is implemented by a linear ar-
y
and utilizes a new split-free bucket data structure. It achieves ray of buckets such that a queued ( @ is 1kept in
average-case running time = U-E"2$ 0&](N* ,-p*}/%1 node
F 7 7 1 . Let

using ( ( 1 operations on a CRCW PRAM, where
for tentative distances in the range
& and * ,-.* are defined as in Section 1.1. denote the largest bucket index such that ¡
are empty. Then a phase scans all nodes from the current
For sparse graphs, this means a logarithmic factor im- bucket cur ¢ in parallel but first of all only relaxes
provement on both running time and work bound compared the light edges (having weight at most ) emanating from
to the superlinear work algorithm from [23]. Furthermore, these nodes. Any node @ that is scanned from cur while
the node degrees of many huge but sparse graphs with small D<7vD @ 1O A BED G@ 1 is eventually reinserted into
cur . Non-
diameter (e.g., WWW, telephone call graphs) follow a power light edges out of @ are only relaxed after cur finally remains
law, i.e., the number of nodes with in-degree is proportional empty. By then, @ is surely settled; its non-light edges are re-
$
Q! for some constant typically ranging between and
to laxed using the final distance value for @ . When cur becomes
. For thesem graphs,
o the new approach is faster by a factor of empty after a phase, then the algorithm sequentially searches
nearly X<Y '
! as compared to the best previous algorithm for next nonempty current bucket. Testing buckets takes
£ the
with linear average-case work ([25, 26]). y ( M' Q1 parallel time; thus, for maximum shortest-path
The rest of the paper is organized as follows: in Section 2, weight & , at least ¤n &( 1.V 3¥ time is required. In order
we shortly review basic facts and techniques for average- to obtain a reasonable parallel time bound, should not be
case efficient parallel and sequential shortest-path algorithms. chosen too small. On the other hand, must not be taken too
Then, in Section 3, we present our new parallel SSSP algo- large in order to avoid a high work bound due to numerous
rithm. Finally, in Section 4, we give some concluding re- node rescans.
marks.

1530-2075/02 $17.00 © 2002 IEEE
Lemma 1 ([25]) For arbitrary directed graphs and indepen-
77 Parallel Degree Heaps are obtained by having a sequen-
dent random edge weights uniformly drawn from , the tial Degree Heap °· for each processor J3¸ . A random
·
average-case number of rescans for an arbitrary
node in mapping ¹ is used to distribute the nodes over the sequen-
the -stepping algorithm is bounded by 1 provided that tial Degree Heaps. The global step-width for a phase can be
§¦ V s for maximum node in-degree s . found in poly-logarithmic time; no restructuring of the pri-
ority queues is required. Unfortunately, PDH-SP requires a
2.2 Adaptive Bucket-Splitting. superlinear number of operations due to the usage of heaps
Based on the -stepping algorithm sketched above, a sequen- in order to decide which nodes to scan under changing global
tial SSSP algorithm, ADAP-SP, with average-case running step-widths.
time K ¨( 1 was developed [24]. Starting with buckets
of width § , it builds a bucket hierarchy in the following 3 Parallel Individual Step-Widths
way: before the algorithm scans all nodes from the current In this section we introduce the new parallel SSSP algo-
bucket cur of width cur , it checks the maximum node de- rithm called PIS-SP (for Parallel Individual Step-Widths).
gree s ^ in cur . If cur
®#¯ V s ^ then it splits cur into smaller Whereas the old PDH-SP approach always uses a common
buckets of size $ U©0ª «.¬ each, and continues with the left- step-width, PIS-SP applies different step-widths on disjoint
most nonempty bucket among those buckets that were just node sets at the same time using a new split-free bucket
generated. data structure. Thus, the new approach is not just a re-
Thus, on the one hand, the average-case number of rein- implementation of PDH-SP with a specialized bucket struc-
sertions and re-relaxations can be bounded by K ( 1 since ture but also utilizes a conceptually different node selection.
nodes with high degree are exclusively scanned from buckets
with sufficiently small widths; on the other hand, the number 3.1 The Algorithm.
of buckets is bounded by K ~( 1 as well, independent of PIS-SP applies a queue data structure that consists Q1
the maximum shortest-path weight & . Q
1 of
buckets in total, organized into arrays : each ar-
ray covers a total tentative distance range of width two. The
2.3 Parallelization: Buckets versus Heaps. buckets of each array are used in a cyclical fashion in order
A simple parallelization of ADAP-SP performs the opera- to subsequently store nodes with larger and larger tentative
tions for all nodes of the current bucket in parallel. Unfortu- º
distances. The array , ¦ ¦§¤l ¥ ³ ± consists of
nately, once a node with large degree s forces a reduction of $ buckets of width $ X each.
³ 7 7 $ / exclusively
the step-width in order to limit the risk of rescanning nodes, stores nodes with in-degree in 2
" $ X ¡ . Hence, for
this step-width is kept for a certain distance range – even if example a node @ with in-degree »
s =
¼ and tentative distance
no high-degree nodes remain in the new buckets of this range. D<7vD @ 1 ½ ¾ is stored in the array , ¿l sÀ (
¾ ¾
As a consequence, wxGs 1 time is spent to traverse these buck- , and
¾ more
V ! concretely
in %bucket
R ¿#F½ A $' 1<V R À
ets. The drawback of excessive bucket traversals was partially R ¿ ¼ $ ½2À § R $ . Note that there is no array for
removed in the Parallel Degree Heap SSSP algorithm [23], vertices with in-degree zero, as they cannot be reached any-
PDH-SP for short. PDH-SP uses a number of sequential pri- way.
ority queues and a simple method to compute an appropri- PIS-SP works as follows: initially, all nodes @ reachable
ate current step-width. As opposed to ADAP-SP, changing from the source node > via an edge F> @ 1 are put in paral-
the step-width does not require restructuring of the priority lel into their respective bucket arrays using the edge weights
queues themselves. However, this is paid for by a superlinear ?y> @ 1 as tentative distances. After that, PIS-SP operates in
work bound. phases: At the beginning of a phase it first determines the
cZdÁ D.7vD G@ 1
A sequential Degree Heap 7 ° 7 is a collection of ± globally smallest tentative distance ³
¤l ¥ relaxed heaps ° X °`² such that ° is in among all currently queued nodes. This step is quite intricate
charge of tentative distances for nodes having in-degree in since the bucket structure only implements an approximate
"2$ X 7 7
$ ¡ / . A relaxed heap allows insertions and de- priority queue. Hence, is not readily available; not too
crease key operations in worst-case constant time, deletions many operations must be used to find it. In Section 3.3 we
of the minimum in worst-case logarithmic time [12]. Let will show how can be obtained efficiently. Knowing ,
be the smallest tentative distance in ° ( for empty PIS-SP scans all nodes from #
¿ y
ÂA $'1<V À in par-

° ), and let ! . Then the algorithm computes allel for each , ¦ ¦+± . This may insert new nodes into
max ³ "à%b´"2$ X ¡t // . Subsequently, for or reinsert previously scanned nodes with improved tenta-
each ° , the SSSP algorithm scans all nodes @|=° satis- tive distances. However, each phase settles at least on node.
fying D.7vD G@ 1T ( max . Hence, nodes with .¶ in-degree s The algorithm stops if there are no queued nodes left after a
are scanned using step-width at most $ Uµ0ª «.¬ in order phase. Hence, it requires at most phases.
to limit the number of reinsertions. But whenever there are Each bucket is implemented by a separate array with dy-
either no high-degree nodes left in the queue or they have namic space adjustment; additional arrays are used in or-
sufficiently large tentative distance as compared to , then der to keep track of tentative distances and to remember in
the step-width is immediately increased to some larger value. which bucket and which array cell a queued node is currently

1530-2075/02 $17.00 © 2002 IEEE
º º
· X · and D<7vD · X G@ 1 ¦ D<vD · G@ 1 for
PIS-SP
edge weights,

and @| , . Let »@ 1 denote
of rescans for node @ during the phases to ; let þ@ 1
all the total number
ËQÌ ÍÏÎCÐ0ÑÃGÄZÒÅnÓ ÆÇÉÈKÊ dopar
foreach
indegree ÃlÆÇFÔWÕÖ
Ø denote be the -th rescan of he node @ happening in phase .
Insert Æ into ×ÙØ#ÚnÛlÜZÃGÄZÅnÆÇnÝÞß ß´à#á2â
Íæ The proof of the lemma is by induction; we show that all
while ãåä
rescans of the phases to can be injectively mapped onto
do
Ì ÍéèTê0ë'ì7í2î]ï#ðzë%ï
Determine ç ÃlÆÇ /* Phase starts */
Ë Ð0ÑZÒÓÂôõ simple paths. More specifically, a rescan of a node @
foreach Öòñ ñó dopar , in phase is mapped onto a simple dwb path J
èTÑö
Empty bucket ×UØ#ÚnÛ#ÃGç ÞZÇnÝÞ ß Ø ß´à á%â
÷ node set ø @ 7 7 @ X @ 4 @ of »
u¡
¦

edges into @ where @ is
scanned in phase ¡ , and D<7vD ·
X G@ º X 1( ?G@ º X @ 1
·
Build set ùûúnü for all edges out of ø .
D<7vD ·'@ 1 . It follows immediately that rescans of different
· · ·
Group ùýúnü by target nodes. ·
foreach target node Æ of ùûúnü dopar nodes are mapped onto different simple paths. In the remain-
Select best relaxation of an edge into Æ . der we are concerned with different rescans of the same node.
Group selected edges by target buckets. Each node is scanned at most once per phase; no node is
Perform relaxations rescanned in the first phase. If the node @ with degree s c is
(& enlarge buckets if needed). /* Phase ends */ scanned for the first time in phase then it is scanned from
bucket Û y· ÂA $'1<V á where c ¿l s c À ( ,

Figure 1. Pseudo-code for PIS-SP.
and
A CB#D @ 1 ¦ .D 7vD · G@ 1ò Û · V á ò(

stored. Removing nodes from buckets in parallel is clearly
· ¦
¶
congestion-free. However, insertions must be organized more
· (

carefully, see
Figure 1: Having removed
all nodes from the ¦ · ( $ Uµ0ª «.¬
buckets Û F ÂA $1<V$ X á for a phase (resulting in a
( $ (i.e.,
node set þ ), the set ÿ of all edges emanating from nodes Now we consider þ@ 1 where
in þ is built. An immediate parallel relaxation of the set ÿ D<7vD X G@ 1T=D<vD @ 1 and 3@ g¡ 1 å`¡ ). As @ was

might cause conflicts, therefore ÿ is first grouped by target scanned for the first time in some phase ! , all nodes with
nodes (using semi-sorting with small hashed values [5, 26]), final distances less than #" were already settled at that time;
and then the strictest relaxation request for each target node their respective edges into @ 4 (if any) will 1 neither be relaxed
(group) is selected. The selected requests are grouped once in phase ! nor later. Therefore, ;
þ
@ requires that some
more by target buckets and finally each group is appended node @ @ having @ @ 1 is scanned in phase u¡
in parallel after the last used position in the array for the (i.e., A CB#D @ 1Ù X #" ) where
1´( ? @ @ 1ò D<vD X @ 1u #" (é$ Uµ0ª «.¬ ¶
target bucket. If there are not sufficiently many contiguous
free positions left (the free positions may be scattered due to D<7vD X G @

and D<7vD
nodes that have been moved to other buckets during edge re- 1
X g
1 ( <
D
'
D 1
1 ¦ $ Uµ0ª «<¬ ¶ .
laxations), then the whole content of this bucket is compacted G@ ?G@ @ @ . These conditions
and then copied to a new array of twice the size. imply ?@ @
v¡ 1 1 then
Each phase can be performed in Q1 average-case If 3G@ þ@ 1 can be uniquely mapped
time; the respective number of operations excluding array onto the edge G$@ @ , which is relaxed for the first time in
size adjustments is linear in * þ *´( * ÿ * . The total work phase 5¡ . Obviously, @ @ is a simple dwb path
1 ¦ $ ûµ0ª «<¬ ¶ .
into @ of
needed to adjust the array sizes can be amortized over the at most É¡ edges because
?
@ 1 @
total number of edge (re-) relaxations. 1 Otherwise, i.e., if 3@ S¡ P then þ@ 5¡
was inductively mapped onto some simple dwb path J
3.2 Performance for Random Edge Weights. @ " 7 7 7 @ X @ 4 t@ of % ¦ ´¡ $ edges where @ is scanned
¡ , and D<vD · G@ º X 1Ù( ?@ º X @ 1
·
in phase ¡
D<7vD X ·'@ 1 . Hence, D<vD X !·G@ 1 ¦ D<7vD X · G@ 1 ,
In the following we consider the expected numbers of node · ·
rescans and phases for PIS-SP on graphs with random edge 4
D.7vD· X ·@ 1K D.7vD X G@ 1K · D<'D X @ 1 for all
weights: 4
¦ :
whereas

1 . Consequently, @ is not part of J . We map
7 7
Definition 1 Let J ³ @ · @ 4 be a path into an arbi-
þG@ onto the path J & GJ @ that is built from the
trary node @ 4 . J is i
7
called
h

< M
m
c
G o ¶
degree-weight
balanced
.
(dwb) if
concatenation of J and G@ @ 1 . As required, J is a sim-
?@ º X @ _1 ¦ $ ûµ0ª «<¬ ¬ , for all , ¦
ple dwb path of at most \¡ edges where the nodes are
Lemma 2 For each node @ , , the number of rescans dur- scanned in proper order and the equations for the tentative
distances hold. Furthermore, J is different from any other
path J ' ()J ' @ constructed for some þG@ ' ' 1 where
ing the execution of PIS-SP is bounded by the number of sim-
ple dwb paths into @ .
¦ ' \
' ¦ : when
constructing
J ' we either consid-
Proof: Let D.7vD · G@ 1 and denote the value of D<'D @ 1 and ered another edge G@ @ 1 :G@ @ 1 , but then the subpaths J
at the beginning of phase , respectively;
·
the set of nodes in and J ' end in different
cZdÁ D.7vD · G@ 1 . Clearly, for nonnegative nodes; or we
considered
different res-
define · ³ cans þ@ !¡ 1 and þ@ ' ¡ 1 of the same node @ ,

1530-2075/02 $17.00 © 2002 IEEE
but then these rescans were mapped on different paths J and -th phase of a EB -chunk. We will show
J ' by induction. V F2ÀB
¿y ¿y X V F
ÀB with high probability1 (1)
During the -th phase of the B -chunk, PIS-SP scans all
Lemma
!7
3 For random edge weights uniformly
drawn from from the current buckets G Û y· A $'1<V á , ¦
, PIS-SP rescans each node at most K 1 times on the
nodes

¦ . Let H · denote the set of these
holds if Û y X ÂA $ 1.V á å ¿#F ÂA $1<V À
average. nodes. Observe that (1)
c,+ ¦ ¦¢ , i.e., PIS-SP
Let * denote the number of simple dwb paths of
for at least one , _7 7
has advanced
Proof: at least one current bucket from X B . accord-
edges c + into an arbitrary node @ 4 of a graph 9 . We first
¦ $ . The argument is by induction: let s

show - *
ing to the cyclical ordering. F 7
So, let us assume that the cur-
rent buckets from X B in phase are those of
denote the in-degree of node @ . Excluding self-loops there c + is
phase . But then there must be at least one node @ IH
a set . X of at + most s 4 edges into + node @ 4 , hence - * X ¦ with D<vD @ 1t X ( F
s 4 Z$ ûµ0ª «.¬
¶
¦|s 4 Z$ Ùª «<¬ 7 7X V$ .
7B . In particular, there must be
a simple path J J G@ X @ of total weight less than
Now consider the set / " 0 X 02/1 of all simplec paths
+ B $ B X where @ H , and all @ , ¦ ¦K ,
· $ B
with354 d7 69edges
87: 0- is dwb ¦ $ .
into node @ 4 in 9 . By assumption - * have in-degree less
$ 7
B M m Xthan
o
·
. Into any node @
·
LH there

For each 0- G@ 7 7 @ 4 ;/ there are0´ at most
are less than such paths. As shown in [25], the
sum of independent
random edge weights (uniformly dis-
s edges into @ so that the concatenation with results ) is at most ¢¦
V !N . Hence, the probability that such
tributed in with probability at most
in a simple path of ( edges 7 7 into @ 4 . In particu- m ao path exists into
node @ º OH is bounded by $ BPM X B X V #U¡
lar, for each such path @ @ @ 4 , the weight of the 1 any
·F1 û$ X ¦ )<Q 8 . That proves (1). Therefore, after ¤ &uV F¥B
Ný¦
newly 7attached 7 7 edge G
@ @ is independent
7 7 7 of the weights
on @ ¶ @ 4< . Therefore, : G@ @
·
7

7
7
· @ 4< is dwb ¦ B -chunks the current buckets have been advanced so much
$ ûµ0ª «.¬ :
c º + @ @ 4< is dwb . By linearity of ex- that no node remains in the bucket structure with probability
pectation, - * X is bounded from above by at least ¡|¤ &uV ¥B # 9<Q 8 ¡ 9<Q X . This accounts for
= at most another $ B 7& M' Q1 B -phases with probability
Z$ Uµ0ª «<¬ ¶ : 0 >
@
7 7 7
@ 4< is dwb
9<Q X ; as PIS-SP requires at most phases in
4 _7d 6 8 s
at least ¡
B
the worst case, it needs at most $ _&N (~ _ 9<Q X
1 =
= c + $ B 7
& Q
1 additional B -phases on the average.
V$\ : 0 V$]
- * ¦ $ X
K M'
¦ 4 _ 7d 6 8 is dwb ¦ Altogether the algorithm runs in E $ B <&~()* , B * 1 Q1
c,+
phases on the average. Since this analysis holds for any inte-
By Lemma 2, @ 4 is rescanned at most * times, there- 3 ger , the average-case bound for all phases can be restated
·@? X of· @ is at most as K -E"2$ %& (å* ,!z*}/T Q1 . Note once more that the
3 +
-X * ¦ 3 X $ !·\ .
fore the c
average-case number of rescans 4 algorithm itself does not have to find an optimal compromise
·@? · ·@? between B -phases and . B -phases; they are just a theoretical
concept of the analysis.
Lemma 4 For

graphs
with random edge weights uniformly
drawn from , PIS-SP needs AT KGM' #M-n"2$ #&N( 3.3 Fast Node Selection.
* , *0/21 phases on the average where & denotes the maximum
shortest-path weight and * , * is the number of graph vertices
In the following we show how the bucket data structure can be
with in-degree at least $ . order to determine fast and efficiently: each
modified in
array is augmented by a pointer
to the first nonempty
bucket ^ of range ^ ^ (
Proof: Let
according to the cyclic
be some arbitrary integer. For the analysis,
let us distinguish two kinds of phases for the algorithm: B -
ordering.
phases scan only nodes with in-degree less than $ whereas
B
$ B
.)B -phases scan at least one node with in-degree or larger.
Lemma
!7
5 For random edge weights uniformly drawn from
, the total number of operations needed to maintain the
Let >CB denote the number
of .B -phases. By Lemma 3, each pointers to ^ during the execution of PIS-SP is bounded by
node is rescanned K 1 times on average. Hence, D > B K ~( 1 on the average. For each phase, maintaining the
K * , B * 1 . pointers takes at most KGM' Q1 time.
In the following
we are interested in EB -chunks: for some
constant ? , a B -chunk consists of g+G? ( ½ 1' M' u( Proof: ^ may have changed after a phase due to (i) an in-
consecutive B -phases without any intermediate . B -phases. sertion of a node @ into , (ii) a decrease of D<vD G@ 1 where
Observe that less than y<> B ( u
1 of all EB -phases do not
belong to any B -chunk. Let · be the smallest ten-
R
1 For a problem of size , we say that an event occurs with high probabil-
ity (whp) if it occurs with probability at least ß for an arbitrary but SGTUR WV
tative distance among queued nodes at the beginning of the fixed constant . XGYZS

1530-2075/02 $17.00 © 2002 IEEE

@ , (iii) a scan of @ from . Cases (i) and (ii) are
straightforward: if @ moves to a bucket dedicated to smaller
tentative distances then ^ is set to this new bucket otherwise 1/4 1/8 1/16 1/32 1/64 1/128
it stays unchanged. The grouping and selection procedure
taking place in each phase prior to parallel node insertions
can be adapted such that for each array the smallest inserted M(5) M(6)
tentative distance is determined as well. Only this value needs M

to be checked for a possible update of ^ . M(4)
Maintaining ^ for case (iii) is only non-trivial if the
bucket of the scanned node @ remains empty after a phase: M(2) M(3)
all buckets of must be checked in order to find the new

first nonempty bucket according to the cyclic ordering. On
a PRAM this can be accomplished using standard parallel
prefix-sum º and minimum computations [19] in Q1 time
and $ 1
work. These operations can be attributed to the
degree of the scanned node @ , which is at least $ X . Since
all degrees add up to 1 and since each node is scanned
M(1)
B1: 1 B2: 2−3 B3: 4−7 B4: 8−15 B5: 16−31 B6: 32−63
K 1 times on average (Lemma 3), the total amount of oper-

ations needed for all occurrences of case (iii) is bounded by Figure 2. Determination of the smallest tenta-
K K( 1 on the average, as well. tive distance in the bucket structure. The
circles denote the smallest distance in each
Finding the Minimum. We show how the values ^ array . In this example, only the nodes
^ ^ ( y1 in-
of the first nonempty buckets ^ nonempty buckets of X , R ,
in the first
troduced above are used to identify the value of the and 8 are examined to obtain : after
globally smallest tentative distance among all queued nodes. testing the nodes from the first nonempty
At the beginning of a phase PIS-SP computes the suffix-
bucket of X we find X $'1 ; hence,
minima of the values ^ , i.e., for each , it builds 1 ³ the bucket ¿#F $'1É ÂA $1.V À is tested, but
5" ^ Mº^ 7 7
^ / We set ^ ( if is
this bucket is empty. As \ F ¾ X 1 1 ,
¾
X ² ³
empty. Note that 1 ¦ 1W( X . Next, the pro- the search continues in R ¿ny É
1 ÂA
$ .
1 V RÀ
cedure refines safe estimates [ ¦P ¦] \ for the globally where we find a new smallest 1 element ( R ),
smallest tentative distance among all queued nodes: initially, but it is still larger than . We continue
in 8 ¿#F 1É ÂA $1.V 8 À where no smaller el-
we set [ X ³ + 1 and \ X ³ + 1W( X . The minimum
detection takes place in at most ± stages: ement is found. Finally, y½ 1 is larger than the
For the first stage, PIS-SP checks the bucket smallest distance seen so far ( R ). Thus, all
other elements in , ½ will be larger as
X_^` [ X ÂA $1.V X aWb : either, this bucket is empty, well, therefore | R .
or it constitutes the bucket X ^ keeping the node with
smallest tentative distance for the array X . All nodes
from X_^` [ X ÂA $1.V X aWb can be scanned during the
computes the smallest distance value among the nodes
phase according to the scan criterion. Let X be the smallest
tentative distance among all nodes of X ^ ( X ³ if X ^ in ^` [ ÂA $1.V aWb . Observe that this bucket is ei-
is empty). The value of X is determined with a parallel ther empty (then ³ ( ) or its nodes will be scanned
minimum algorithm causing work linear in the number of by PIS-SP. Therefore, the work to identify can be amor-

elements stored in bucket X ^ . If MQ" X \ / ¦ $'1 tized over the node scans. Having computed we may find
then either the whole bucket structure is empty or a node
5" \ / ¦| ( 1 , then M5" \ / definitely equals
@ with globally smallest tentative distance is stored in X ^ , ; the procedure stops. Otherwise, that is if S" \ /K

and therefore 5" X \ X / . The procedure stops. ( 1 , we only know for sure that ( 1 ¦ ¦

5" \ ( 1 ( º X / . The procedure will con-
Otherwise, if MS" X \ /) $1 , the queued node with
1
smallest tentative distance may still
be found in one of the tinue in º X with the new estimates: [ Mº X ³ å (
subsequent arrays , . . . , ² . Therefore, the procedure
and \ º X ³ MQ" \ (
1( Mº X / . After at most

sets [ $'1 , \ ³ 5" X \ X $1g ( / , and it ±¨§ Q
1 stages, is eventually determined. Figure 2
³ $ .
continues with stage on the bucket array provides an example.

In general, if the detection procedure reaches stage Ï$ Scanning the nodes from the examined buckets can be
then we may assume by induction that it has identified es- overlapped with the global minimum detection. However,
timates [ ¦ \ ¦c [ ( . The procedure asymptotically this does not improve the running time. The

1530-2075/02 $17.00 © 2002 IEEE
time required for a phase depends on the applied procedure
for the KGM' Q1 local minimum computations: with the stan- P
0
dard approach based on balanced trees [19] a phase of PIS-
SP needs Q1 time and y ( Q1 work where P
1
denotes the number of nodes scanned in that phase. Using
the constant-time linear-work randomized minimum compu-
P
tation from [15] reduces the time by a logarithmic factor. 2
Combining the results of this section we find:
P
3
Theorem 1 For
graphs with random edge weights uniformly
drawn from , PIS-SP requires PKGM' `.Mń"%$
&=( * , *0/21 average-case time using K é( ( 1 oper- deleted nodes generated
with adjacency lists requests
after inte−
ger sorting
selected
requests
transferred to
target processors
ations on a PRAM where & denotes the maximum shortest- i j request for a relaxation of edge (i,j)
path weight and * ,-.* is the number of graph vertices with in-
degree at least $ . Figure 3. Generating and performing relaxation
requests: requests are denoted by a box for
3.4 Conversion to Distributed Memory Machines. the source node and a circle for the target
node, colors are used to code node indices.
The straightforward way to obtain a DMM algorithm out of
The processors cooperate in building the total
a PRAM algorithm is to use an efficient PRAM simulation
set of requests: large adjacency lists are han-
method: given certain conditions, a step of a CRCW PRAM
dled by groups of PUs. Subsequently, the gen-
with J -processors can be simulated on aºedJ -processors BSP
GJ V J 1 time if J J X
erated requests are grouped by target nodes
f ; seein[30].
machine
d
Hence, if is defined
º
for any constant
h iTheorem
uº
using integer-sorting. Then superfluous re-
distributed algorithm runs in (>g j
as in 1, the
1 time. quests are filtered out, and the remaining re-
quests are sent to the processors which host
Alternatively, for a direct coding on a DMM machine, one
the appropriate bucket structures. Without the
first has to solve the data allocation problem: all arrays and
spreading and grouping, processor J 4 would
long adjacency lists are spread over the local memories of
be overloaded during the generation, and pro-
the processors using random hash functions. If the number
cessor J X would receive to many requests.
of processors, J , is reasonably bounded then the accesses to
the memory modules are sufficiently load-balanced in each
phase with high probability. Non-local memory access is re-
alized via message passing and random routing. Each phase graph, and the same value was estimated for telephone call
of the algorithm can be performed in a number of supersteps graphs [2]. Further studies ¾ mentioned in [4] on social net-
each of which consists of local computations, synchroniza- works resulted in nl $ , and for graphs of the electric
tion, and communication. Important ingredients for the con- power grid in the US, ol .
version are standard DMM implementations for tree based
The observed diameters are usually very small; on ran-
dom graph classes – with K Q1 edges – that are widely con-
reduction and broadcasting schemes [19]. They are needed
for prefix-sum / minimum computations
E n1 and distribution of
sidered to be appropriate models of real massive graphs like
values, e.g., when all edges @ of a high-degree node @
the WWW it turns out that & Q1 whp [22]. For
are to be relaxed, then the value of D<7vD @ 1 must be made
available to all PUs that store outgoing edges of @ . Figure 3
such graphs the number of nodes £ 3
having in-degree at least
s ^ is approximately given by L B ® S- 1 , which for
depicts the role of spreading and grouping for load balanc-
r $ and £arbitrary s ^ º ? can be bounded by
ing. The grouping steps of the algorithm can be implemented
constant
£ ~
Cp 1 ~

® S! s' m º o s ^ - X 1 . Taking s ^ XEY one
expects È X<Y M - X 1 |K XEY 1 nodes with in-degree
by DMM integer-sorting algorithms, e.g., [3]. The choice
of the sorting algorithm determines how many supersteps are
at least s ^ . On the other hand, we expect at least º one node
needed to implement a phase of the PRAM algorithm, and
hence how many processors can be reasonably used.
st
with in-degree swx rq q 1 by solving ~ s- X .
Hence, assuming independent
!77 random edge weights uni-
3.5 Performance Gain on Power Law Graphs. formly distributed in , the average-case time for SSSP
Many sparse massive graphs such as the WWW graph and on WWW-like graphs can be estimated as follows: The previ-
ous parallel SSSP algorithms from [25, 26] that achieve linear
telephone call graphs share universal characteristics which
average-case work on these graph classes require wx q q 1
st
can be described by the so-called “power law” [2, 4, 21]: the
number of nodes, k , of a given in-degree is proportional time since they have to traverse wxGs 1 ¢wx q q 1 buckets
st
to S! for some constant . For most massive graphs, sequentially. Our new approach requires X<Y
$L ¦ : independently, Kumar et. al. [21] and Babarasi
M' (t XEY 1E1 : X<Y M' R Q1 average-case time using
et. al. [4] reported ml $ for the in-degrees of the WWW linear K û( 1 work. Hence, while retaining the linear work

1530-2075/02 $17.00 © 2002 IEEE
m
bound, the average-case o running
R
time could be improved by [11] E. W. Dijkstra. A note on two problems in connexion with
a factor of wx X<Y 'p- V Q1 . graphs. Num. Math., 1:269–271, 1959.
[12] J. R. Driscoll, H. N. Gabow, R. Shrairman, and R. E. Tarjan.
Relaxed heaps: An alternative to fibonacci heaps with appli-
4 Conclusions cations to parallel computation. Communications of the ACM,
We have given a new parallel SSSP algorithm together with 31, 1988.
a powerful split-free bucket data-structure. It facilitates im- [13] A. Formella, J. Keller, and T. Walle. HPP: A high performance
proved average-case time linear-work SSSP computations for PRAM. In Proc. Euro-Par 1996 Parallel Processing, vol-
many graph classes with small diameter. ume II of LNCS 1124, pages 425–434, Berlin, August 1996.
Springer.
In order to obtain fast SSSP algorithms with linear [14] A. Frieze and C. McDiarmid. Algorithmic theory of random
average-case work, one might also try to parallelize Gold- graphs. Random Structures and Algorithms, 10:5–42, 1997.
berg’s new sequential label-setting algorithm [16]. However, [15] A. M. Frieze and L. Rudolph. A parallel algorithm for all-
in its current form, the criterion used to detect nodes that can pairs shortest paths in a random graph. In Proc. 22nd Aller-
be scanned in arbitrary order is at most as efficient as the IN- ton Conference on Communication, Control and Computing,
approach in [10]. It can be shown that there are graph classes pages 663–670, 1985.
with random edge weights, maximum constant node degree [16] A. V. Goldberg. A simple shortest path algorithm with lin-
and maximum shortest-path weight & Q1 where this ear average time. In Proc. 9th Ann. European Symposium
criterion requires wx_[ WV M' Q1 phases. In contrast, our new on Algorithms (ESA), number 2161 in LNCS, pages 230–241.
R
approach runs in Q1 time and linear work for these
Springer, 2001.
[17] Q. P. Gu and T. Takaoka. A sharper analysis of a parallel al-
graphs. Still, there may be better label-setting criteria. gorithm for the all pairs shortest path problem. Parallel Com-
For the future it would be desirable to solve the SSSP puting, 16(1):61–67, 1990.
on Web-like graphs in poly-logarithmic time using at most [18] Y. Han, V. Pan, and J. Reif. Efficient parallel algorithms for
K L N( 1 work. Furthermore, any work-efficient al- computing all pair shortest paths in directed graphs. Algorith-
gorithm with sublinear running time that is independent of mica, 17(4):399–415, 1997.
the diameter would be of great interest. [19] J. Jájá. An Introduction to Parallel Algorithms. Addison-
Wesley, Reading, 1992.
[20] P. N. Klein and S. Subramanian. A randomized parallel algo-
Acknowledgements rithm for single-source shortest paths. Journal of Algorithms,
The author would like to thank Annamária Kovács for giving 25(2):205–220, Nov. 1997.
valuable comments on a previous version of this paper. [21] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins.
Trawling the web for emerging cyber-communities. In Proc.
8th International World-Wide Web Conference, 1999.
References [22] L. Lu. The diameter of random massive graphs. In Proc. 12th
[1] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network flows: Ann. Symp. on Discrete Algorithms, pages 912–921. ACM–
Theory, Algorithms and Applications. Prentice Hall, Engle- SIAM, 2001.
wood Cliffs, NJ, 1993. [23] U. Meyer. Heaps are better than buckets: Parallel shortest
[2] W. Aiello, F. Chung, and L. Lu. A random graph model for paths on unbalanced graphs. In Proc. Euro-Par 2001 Parallel
massive graphs. In Proc. 32nd Annual ACM Symposium on Processing, volume 2150 of LNCS, pages 343–351. Springer,
Theory of Computing, pages 171–180. ACM, 2000. 2001.
[3] D. A. Bader, D. R. Helman, and J. Jájá. Practical parallel al- [24] U. Meyer. Single-source shortest-paths on arbitrary directed
gorithms for personalized communication and integer sorting. graphs in linear average-case time. In Proc. 12th Ann. Symp.
Journal of Experimental Algorithms, 3(1):1–42, 1996. on Discrete Algorithms, pages 797–806. ACM–SIAM, 2001.
[4] A. Barabasi and R. Albert. Emergence of scaling in random [25] U. Meyer and P. Sanders. u -stepping: A parallel shortest
networks. Science, 286:509–512, 1999. path algorithm. In Proc. 6th Ann. European Symposium on
[5] H. Bast and T. Hagerup. Fast parallel space allocation, es- Algorithms (ESA), volume 1461 of LNCS, pages 393–404.
timation and integer sorting. Information and Computation, Springer, 1998.
123:72 – 110, 1995. [26] U. Meyer and P. Sanders. Parallel shortest path for arbitrary
[6] B. Bollobás. Random Graphs. Academic Press, 1985. graphs. In Proc. Euro-Par 2000 Parallel Processing, volume
[7] G. S. Brodal, J. L. Träff, and C. D. Zaroliagis. A parallel pri- 1900 of LNCS, pages 461–470. Springer, 2000.
ority queue with constant time operations. Journal of Parallel [27] J. Reif and P. Spirakis. Expected parallel time and sequential
and Distributed Computing, 49(1):4–21, 1998. space complexity of graph and digraph problems. Algorith-
[8] A. Clementi, J. Rolim, and E. Urland. Randomized parallel mica, 7:597–630, 1992.
algorithms. In Solving Combinatorial Problems in Parallel, [28] H. Shi and T. H. Spencer. Time–work tradeoffs of the
volume 1054 of LNCS, pages 25–50, 1996. single-source shortest paths problem. Journal of Algorithms,
[9] E. Cohen. Using selective path-doubling for parallel shortest- 30(1):19–32, 1999.
path computations. Journal of Algorithms, 22(1):30–56, Jan. [29] M. Thorup. Undirected single-source shortest paths with pos-
1997. itive integer weights in linear time. Journal of the ACM,
[10] A. Crauser, K. Mehlhorn, U. Meyer, and P. Sanders. A paral- 46:362–394, 1999.
lelization of Dijkstra’s shortest path algorithm. In 23rd Symp. [30] L. G. Valiant. A bridging model for parallel computation.
on Mathematical Foundations of Computer Science, volume Communications of the ACM, 33(8):103–111, 1990.
1450 of LNCS, pages 722–731. Springer, 1998.

1530-2075/02 $17.00 © 2002 IEEE

Buckets Strike Back - Improved Parallel Shortest-Paths

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Buckets Strike Back - Improved Parallel Shortest-Paths

Uploaded by

Copyright:

Available Formats

Buckets strike back: Improved Parallel Shortest-Paths

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

A CB#D @ 1 ¦ .D 7vD · G@ 1ò Û · V á ò(

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

Lemma 4 For

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

tentative distance is determined as well. Only this value needs M

all buckets of must be checked in order to find the new

K 1 times on average (Lemma 3), the total amount of oper-

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

You might also like

Buckets Strike Back - Improved Parallel Shortest-Paths

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Buckets Strike Back - Improved Parallel Shortest-Paths

Uploaded by

Copyright:

Available Formats

Buckets strike back: Improved Parallel Shortest-Paths

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

A CB#D  @ 1 ¦ .D 7vD · G@ 1ò Û  · V   á   ò(  

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

Lemma 4 For 

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

tentative distance is determined as well. Only this value needs M

all buckets of   must be checked in order to find the new

K 1 times on average (Lemma 3), the total amount of oper-

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’02)

You might also like

A CB#D @ 1 ¦ .D 7vD · G@ 1ò Û · V á ò(

Lemma 4 For

all buckets of must be checked in order to find the new

K 1 times on average (Lemma 3), the total amount of oper-