Multi-Core On-The-Fly SCC Decomposition
[Artifact Evaluated badges: Complete, Consistent, Well Documented, Easy to Reuse]
stack R is maintained independently of the program stack and invariantly contains a subset of the latter (in the same order). Note that the algorithm is called for an initial node v0 (Line 4), i.e. only vertices reachable from v0 are considered. W.l.o.g. we assume that all vertices are reachable from v0 (as defined with GI).

The algorithm partitions vertices in three sets, s.t. ∀v ∈ V, either:
a. v ∈ DEAD, implying that v is part of a completely explored (and reported) SCC,
b. v ∉ VISITED (hence also v ∉ DEAD), implying that v has not been encountered before by the algorithm, or
c. v ∈ LIVE, implying that v is part of a partial component, i.e. LIVE ≝ VISITED \ DEAD.

It can be shown that the algorithm ensures that all supernodes for the vertices on R are disjoint and together contain all LIVE vertices:

    ⊎_{v ∈ R} S(v) = LIVE    (1)

As a consequence, the LIVE vertices can be reconstructed using the map S and R, so we do not actually need a VISITED set except for ease of explanation. Furthermore, it can be shown that all LIVE vertices have a unique representative on R:

    {S(v) ∩ R | v ∈ LIVE} = {{r} | r ∈ R}    (2)

In other words, for each LIVE vertex v, its mapped supernode S(v) has exactly one r ∈ R ∩ S(v) s.t. S(v) = S(r). Both equations play a role in ensuring that the algorithm returns complete SCCs, explained next. However, we will also use them in the subsequent section to guide the parallelization of the algorithm.

From the above, we can illustrate how the algorithm decomposes SCCs. If a successor w of v is LIVE, it holds that ∃r ∈ R : S(r) = S(w). Such a LIVE vertex is handled by Lines 12-14, where the top two vertices from R are united, until r is encountered, or rather until it holds that S(r) = S(v) = S(w). Because r has a path to v (an inherited invariant of the program stack), all united components are indeed strongly connected, i.e. lie on a cycle. Moreover, because of r's uniqueness (Equation 2), there is no other LIVE component part of this cycle (eventually this guarantees completeness, i.e. that the algorithm reports a maximal strongly connected component).

A completed SCC S(v) is reported and marked DEAD if v remains on top of the R stack at Line 15, indicating that v could not be united with any other vertices in R (lower on R).

Union-find. The disjoint set union or union-find [54] is a data structure for storing and retrieving a disjoint partition of a set of nodes. It also supports the merging of these partitions, allowing supernodes to be united with a UNITE() operation.

3. A Multi-Core SCC Algorithm

The current section describes our multi-core SCC algorithm. The algorithm is based on the PRDFS concept, where P workers randomly process the graph starting from v0 and prune each other's search space by communicating parts of the graph that have been processed. (This dynamically, but not perfectly, partitions the graph across the workers, trading redundancy for communication.) To the best of our knowledge, the current approach of doing this is by 'removing' completed SCCs from the graph once identified by one worker [46]. The random exploration strategy then would take care of work load distribution. However, such a method would not scale for graphs consisting of a single large SCC. We demonstrate a method that is able to communicate partially identified SCCs and can therefore scale on more graphs.

The basis of the parallel algorithm is the sequential set-based algorithm, developed in the previous section, which stores strongly connected vertices in supernodes. For the time being, we do not burden ourselves with the exact data structures and focus on solutions that ease explanation. Afterwards, we show how the set operations are implemented.

Communication principles. We assume that each parallel worker p starts a PRDFS search with a local stack Rp and randomly searches the graph independently from v0. The upcoming algorithm is based on four core principles about vertex communication:

1. If a worker encounters a cycle, it communicates this cycle globally by uniting all vertices on the cycle to one supernode, i.e. S becomes a shared structure. As a consequence, worker p might unite multiple vertices that remain on the stack Rp′ of some other worker p′, violating Equation 2 for now.
2. Because the workers can no longer rely on the uniqueness property, the algorithm loses its completeness property: it may report partial SCCs because the collapsing stops early (before the highest connected supernode on the stack is encountered). To remedy this, we iterate over the contents of the supernodes until all of their vertices have been processed.
3. Since other workers constantly grow partial SCCs in the shared S, it is no longer useful to retain a local VISITED set. Equation 1 showed that indeed the LIVE vertices can be deduced from S and R in the sequential algorithm. We choose to use an implicit definition of LIVE so that a worker p only explores a vertex v if there is no r ∈ Rp s.t. v ∈ S(r); we say v is not part of an Rp supernode. Thus vertices connected to a supernode on Rp by some other worker p′ can be pruned by p.
4. If a worker p backtracks from a vertex, i.e. it finds that all successors are either DEAD or part of some Rp supernode, it
records said vertex in a global set DONE for all other workers to disregard. Here, a DONE vertex implies that it is fully explored. Once all vertices in an SCC are DONE, the SCC is marked DEAD.

The algorithm. The multi-core SCC algorithm, provided in Algorithm 2, is similar to the sequential set-based one (Algorithm 1) and follows the four discussed principles. Each worker p maintains a local search stack (Rp), while S is shared among all workers for globally communicating partial SCCs. We also globally share the DONE and DEAD sets. For now we assume that all lines in the algorithm can be executed atomically.

Algorithm 2 The UFSCC algorithm (code for worker p)
 1: ∀v ∈ V : S(v) := {v}                        [global S]
 2: DEAD := DONE := ∅                           [global DEAD and DONE]
 3: ∀p ∈ [1 . . . P] : Rp := ∅                  [local Rp]
 4: UFSCC1(v0) ∥ . . . ∥ UFSCCP(v0)
 5: procedure UFSCCp(v)
 6:   Rp.PUSH(v)
 7:   while v′ ∈ S(v) \ DONE do
 8:     for each w ∈ RANDOM(POST(v′)) do
 9:       if w ∈ DEAD then continue             [DEAD]
10:       else if ∄w′ ∈ Rp : w ∈ S(w′) then     [NEW]
11:         UFSCCp(w)
12:       else while S(v) ≠ S(w) do             [LIVE]
13:         r := Rp.POP()
14:         UNITE(S, r, Rp.TOP())
15:     DONE := DONE ∪ {v′}
16:   if S(v) ⊈ DEAD then DEAD := DEAD ∪ S(v); report SCC S(v)
17:   if v = Rp.TOP() then Rp.POP()

First of all, instead of performing only a DFS, the algorithm also iterates over vertices from S(v), at Lines 7-15, until every v′ ∈ S(v) is marked DONE (principles 2 and 4). Therefore, once this while-loop is finished, we have that S(v) ⊆ DONE, hence the SCC may be marked DEAD in Line 16. The if-condition at that line succeeds if worker p wins the race to add S(v) to DEAD, ensuring that only one worker reports the SCC. Thus, when UFSCC(v) terminates, we ensure that v ∈ DEAD and that v is removed from the local stack Rp (Line 17).

For every v′ ∈ S(v) \ DONE, the successors of v′ are considered in a randomized order via the RANDOM() function at Line 8, according to PRDFS' random search and prune strategy. After the for-loop from Lines 8-14, we infer that POST(v′) ⊆ (DEAD ∪ S(v)) and therefore v′ may be marked DONE (principle 4).

Compared to the VISITED set from Algorithm 1, this algorithm uses S and Rp to derive whether or not a vertex w has been 'visited' before (principle 3). Here, 'visited' implies that there exists a vertex w′ on Rp (hence also on the local search path) which is part of the same supernode as w. Section 4 shows how iteration on Rp can be avoided. The recursive procedure for w is called for a successor of v′ that is part of the same supernode as v. Assuming that ∀a, b ∈ S(v) : a ↔ b holds, we have that v → v′ → w and w is reachable from v.

Correctness sketch. The soundness defined here implies that reported SCCs are strongly connected, whereas completeness implies that they are maximal. We demonstrate soundness (in Theorem 1) by showing that vertices added to S(v) are always strongly connected with v (or any other vertex in S(v)). Completeness (see Theorem 2) follows from the requirements to mark an SCC DEAD (Lemmas 2-4). Complete correctness follows from termination in Theorem 3.

Lemma 1 (Stack). ∀r ∈ Rp : r →∗ Rp.TOP().

Proof. Follows from Rp being a subset of the program stack.

Theorem 1 (Soundness). If two vertices are in the same set, they must form a cycle: ∀v, w : w ∈ S(v) ⇒ v →∗ w →∗ v.

Proof. Initially the hypothesis holds, since ∀v : S(v) = {v}. Assuming that the theorem holds for S before executing a statement of the algorithm, we show it holds after, considering only the non-trivial case, i.e. Line 14. Before Line 12, we have S(v) = S(Rp.TOP()) (either v = Rp.TOP() or a recursive procedure has united v with the current Rp.TOP()). The while-loop (Lines 12-14) continuously pops the top vertex of Rp and unites it with the new Rp.TOP(). Since ∃w′ ∈ Rp : w ∈ S(w′) (the condition at Line 10 did not hold), the loop ends iff S(R.TOP()) = S(v) = S(v′) = S(w) = S(w′). Because v′ → w and w′ →∗ R.TOP() (Lemma 1), we have that the cycle v′ →∗ w →∗ v′ is merged, thus preserving the hypothesis in the updated S.

Lemma 2. After UFSCCp(v) terminates, v ∈ DEAD.

Proof. This follows directly from Line 16 and the fact that v is never removed from S(v).

Lemma 3. Successors of DONE vertices are DEAD or in a supernode on Rp: ∀v ∈ DONE : POST(v) ⊆ (DEAD ∪ (S(v) ∩ Rp)).

Proof. The only place where DONE is modified is in Line 15. Vertices are never removed from DEAD or S(v) for any v. In Lines 8-14, all successors w of v′ are considered separately:
1. w ∈ DEAD is discarded.
2. w is not in a supernode of Rp and UFSCC is recursively invoked for w. From Lemma 2, we obtain that w ∈ DEAD upon return of UFSCCp(w).
3. there is some w′ ∈ Rp s.t. S(w) = S(w′). Supernodes of vertices on Rp are united until S(v) = S(v′) = S(w).
All three cases thus satisfy the conclusion of the hypothesis before v′ is added to DONE at Line 15.

Lemma 4. Every vertex in a DEAD partial SCC set is DONE: ∀v ∈ DEAD : ∀v′ ∈ S(v) : v′ ∈ DONE.

Proof. Follows from the while-loop exit condition at Line 7.

Theorem 2 (Completeness). After UFSCCp(v) finishes, every reachable vertex t (a) is DEAD and (b) S(t) is a maximal SCC.

Proof. (a. every reachable vertex t is DEAD). Since v ∈ DEAD (Lemma 2), we have ∀v′ ∈ S(v) : v′ ∈ DONE (Lemma 4). By Lemma 3, we deduce that every successor of v′ is DEAD (as v′ ∈ S(v) implies v′ ∈ DEAD by Line 16). Therefore, by applying the above reasoning recursively on the successors of v′, we obtain that every vertex reachable from v is DEAD.
(b. S(t) is a maximal SCC). Assume by contradiction that for disjoint DEAD sets S(v) ≠ S(w) we have v →∗ w →∗ v. W.l.o.g., we can assume v → w. Since S(w) ≠ S(v), we deduce that w ∈ DEAD when this edge is encountered (Lemma 3). However, since w →∗ v we must also have v ∈ DEAD (Theorem 2a), contradicting our assumption. Hence, every DEAD S(t) is a maximal SCC.

Theorem 3 (Termination). Algorithm 2 terminates for finite graphs.

Proof. The while-loop terminates because vertices can only be added once to S(v) and due to a strictly increasing DONE in every iteration. The recursion stops because the vertices part of LIVE supernodes in Rp only increase, as per Line 6 and Line 14.
4. Implementation

For the implementation of Algorithm 2, we require the following:
- Data structures for Rp, S, DEAD and DONE.
- A mechanism to iterate over v′ ∈ S(v) \ DONE (Line 7).
- Means to check if a vertex v is part of (S(v) ∩ Rp) (Line 10).
- An implementation of the UNITE() procedure (Line 14).
- A technique to add vertices to DONE and DEAD (Lines 15, 16).
We explain how each aspect is implemented. For a detailed version of the complete algorithm, we refer the reader to Appendix A.

Rp can be implemented with a standard stack structure. S can be implemented with a variation of the wait-free union-find structure [2]. Its implementation employs the atomic Compare&Swap (CAS) instruction (c := CAS(x, a, b) atomically checks x = a, and if true updates x := b and c := TRUE, else just sets c := FALSE) to update the parent for a vertex without interfering with operations from other workers on the structure (FIND() and UNITE()). We implement this structure in an array, which we index using the (unique) hashed locations of vertices. In a UNITE(S, a, b) procedure, the new root of S(a) and S(b) is determined by selecting either the root of S(a) or S(b), whichever has the highest index, i.e. essentially random in our setting using hashed locations. This mechanism for uniting vertices preserves the quasi-constant time complexity for operations on the union-find structure [23].

The DEAD set is implemented with a status field in the union-find structure. We have that S(v) is DEAD if the status for the root of v is DEAD; thus marking an SCC DEAD indeed implements Line 16 of Algorithm 2 atomically. This marking is achieved in quasi-constant time (a FIND() call with a status update).

Worker set. We show how ∄w′ ∈ Rp : S(w′) = S(w) (Line 10) is implemented. In the union-find structure, for each node we keep track of a bitset of size P. This worker set contains a bit for each worker and represents which workers are currently exploring a partial SCC. We say that worker p has visited a vertex w if the worker set for the root of S(w) has the bit for p set to true. If otherwise worker p encounters an unvisited successor vertex w, it sets its worker bit in the root of S(w) using CAS and recursively calls UFSCC(w) (Line 10). In the UNITE(S, a, b) procedure, after updating the root, the worker sets for a and b are combined (with a logical OR, implemented with CAS) and stored on the new root to preserve that a worker bit is never removed from a set. The process is repeated if the root is updated while updating the worker set. Since only worker p can set worker bit p (aside from combining worker sets in the UNITE() procedure), only partial SCCs visited by worker p (on Rp) can and will contain the worker bit for p.

As an example to explain the worker set, consider Figure 2. Here, worker B has found the cycle a → b → c → d → a, which are all united: S(b) = {a, b, c, d}, and worker R explored a → e → f → e (with S(f) = {e, f}). Worker R now considers the edge f → c and wants to know if it has previously visited a vertex in S(c) (which would imply a cycle). Because the worker set is managed at the root of S(c) (which is node b), worker R simply has to check there if its worker bit is set. Since this is the case, the sets S(f) and S(c) can be united.

Cyclic linked list. The most challenging aspect of the implementation is to keep track of the DONE vertices and iterate over S(v) \ DONE. We do this by making the union-find iterable by adding a separate cyclic list implementation. That is, we extend the structure with a next pointer and a list status flag that indicates if a vertex is either BUSY or DONE (initially, the flag is set to BUSY). The idea is that the cyclic list contains all BUSY vertices. Vertices marked DONE are skipped for iteration, yet not physically removed from the cycle. Rather, they will be lazily removed once reached from their predecessor on the cycle (this avoids maintaining doubly linked lists). All DONE vertices maintain a sequence of next pointers towards the cyclic list, to ensure that the cycle of BUSY vertices is reachable from every vertex v′ ∈ S(v). If all vertices from S(v) are marked DONE, the cycle becomes empty (we end up with a DONE vertex directing to itself) and S(v) can be marked DEAD.

To illustrate the cyclic list structure, consider the example from Figure 3. In the partial SCC {a, b, c, d}, vertices a and c are marked DONE (gray), and vertices b and d are BUSY (white). We further assume that {e} is a DEAD SCC. The vertices a and c are marked DONE since their successors are in (DEAD ∪ S(a)) (see also Lemma 3). Note that b and d (and f) may not yet be marked DONE because their successors have not been fully explored yet. The right side depicts a possible representation for the cyclic list structure (where the arrows are next pointers).

In the algorithm, Line 7 can be implemented by recursively traversing v.next to eventually encounter either a vertex v′ ∈ S(v) \ DONE, or otherwise to detect that there is no such vertex, in which case the while-loop ends. Line 15 is implemented by setting the status for v′ to DONE in the union-find structure; the vertex is then lazily removed from the cyclic list.

The UNITE(S, a, b) procedure is extended to combine two cyclic lists into one (without losing any vertices). We show this procedure using the example from Figure 4. Here UNITE(S, a, f) is called (b becomes the new root of the partial SCC). The next pointers (solid arrows) for a and f are swapped with each other (note that this cannot be done atomically), which creates one list containing all vertices. It is important that both a and f are BUSY vertices (as otherwise part of the cycle may be lost), which is ensured by first searching and using light-weight, fine-grained locks (one per node) on BUSY vertices a′ and f′. The locks also protect the non-atomic pointer swap.

Figure 2: Illustration of the worker set in practice. The left side depicts the graph; the union-find structure with the worker set is shown right for workers R and B.

Figure 3: Illustration of the cyclic list in practice. The left side depicts the graph, the right side depicts the corresponding cyclic list structure. White nodes are BUSY and gray nodes are DONE.
Figure 4: Cyclic list before (a) and after (b) a UNITE(S, a, f) call. Dashed edges depict parent pointers and solid edges next pointers.

Figure 5: Cyclic list before (a) and after (b) a list 'removal'. Solid edges depict next pointers and gray nodes are DONE.

Removing a vertex from the cyclic list is a two-step process. First, the status for the vertex is set to DONE. Then, during a search for BUSY vertices, if two subsequent DONE vertices are encountered, the next pointer for the first one is updated (Figure 5). Note that the 'removed' vertex remains pointing to the cyclic list.

We use a lock on the union-find structure (only on the root for which the parent gets updated) and a lock on list nodes encoded as status flags. The structure for union-find nodes is shown in Figure 6. We chose to use fine-grained locking instead of a wait-free solution for two reasons: we do not require the stronger guarantees of the latter, and the pointer indirection needed for a wait-free design would decrease performance compared to the current node layout with its memory locality (as we learned from our previous work).

    pointer parent         [union-find parent]
    pointer next           [list next pointer]
    bit[p]  worker set     [worker set]
    bit[2]  uf status      [∈ {LIVE, LOCK, DEAD}]
    bit[2]  list status    [∈ {BUSY, LOCK, DONE}]

Figure 6: The node record of the iterable union-find data structure.

Runtime/work tradeoff. Our treatment of the complexity here is not as formal as in e.g. [55], and relies on practical assumptions. We argue that, while the theoretical complexity study of parallel algorithms offers invaluable insights, it is also remote from the practical solution that we focus on. E.g. we cannot within the foreseeable future obtain n parallel processors for n-sized graphs.

We assume that overhead from synchronization and successor randomization is bounded by a constant factor. Because the worker sets need to be updated, UNITE operations take O(p) time and work (instructions). Assuming again that p is small enough for the worker set to be stored in one or few words (e.g. 64), UNITE becomes quasi constant (now the FIND operation it calls dominates). Finding a BUSY vertex in the cyclic list is bounded by O(n) (the maximal sequence of DONE vertices before reaching a BUSY vertex). However, we suspect that due to the lazy removal of DONE nodes, similar to path compression [54], supernode iteration is amortized quasi constant under the condition that p ≪ n. Though if this is not the case, we could always use a more complicated doubly linked list instead [58, p. 63]. So amortized, all supernode operations are bounded by O(α) time and work, i.e. the inverse Ackermann function representing quasi-constant time.

Since every worker may visit every edge and vertex, performing a constant number of supernode operations on the vertex, the worst-case work complexity for the algorithm is bounded by O(α·p·(n+m)). As the inverse Ackermann function is typically bounded by a constant, we obtain O(p·(n+m)). The runtime, however, remains bounded by O(m+n) (ignoring the quasi factor), as useless work is done in parallel. The best-case runtimes from Table 1, O((n+m)/p), follow when assuming that approximately each vertex is processed by only one worker.¹

¹ We acknowledge that for fair comparison with Table 1, the impact of the worker set (a factor p) cannot be neglected, e.g. for large p approaching n, yielding O(p·max(α,p)·(n+m)) work and O(max(α,p)·(n+m)) runtime.

Memory usage. The memory usage of Algorithm 2, using the implementation discussed in the current section, comprises two aspects. First, each worker has a local stack Rp, which contains at most n items (e.g. vertices on a path or within the same SCC when drawn from the iterable union-find structure). Second, the global union-find contains: a parent pointer, a worker set, a next pointer, and status fields for both union-find and its internal list (Figure 6). Assuming that vertices and pointers in the structure can be stored in a constant amount of space, their storage requires 3 units. Also assuming again that p ≪ n, the worker set takes another unit, e.g. one 64-bit machine word for p = 64. Then, the algorithm uses at most (p · n) + 4n units (stacks plus shared data structures).

The worst-case space complexity becomes O(p · n) (also when p is not constant, e.g. approaching n, and the worker set cannot be stored in a constant anymore). The local replication on the stacks is the main cause of the worst-case memory usage of PRDFS algorithms as shown in Table 1 (other PRDFS algorithms, such as [18, 19, 32, 46], do not require a worker set). In practice however, the replication is proportional to the contention occurring on vertices, which is arguably low when the algorithm processes large graphs and p ≪ n. Hence the actual memory usage will be closer to the best-case complexity that occurs when stacks do not overlap, i.e. O(n) (unless, again, p is not constant and the worker set dominates memory usage). Section 6.1 confirms this.

5. Related Work

Parallel on-the-fly algorithms. Renault et al. [46] were the first to present a PRDFS algorithm that spawns multiple instances of Tarjan's algorithm and communicates completely explored SCCs via a shared union-find structure. We improve on this work by communicating partially found SCCs. Lowe [36] runs multiple synchronized instances of Tarjan's algorithm, without overlapping stacks. Instead a worker is suspended if it meets another's stack, and stacks are merged if necessary. While in the worst case this might lead to a quadratic time complexity, Lowe's experiments show decent speedups on model checking inputs, though not for large SCCs.

Fix-point based algorithms. Section 1 discussed several fix-point based solutions. A notable improvement to FB [20] is the trimming procedure [39], which removes trivial SCCs without fix-point recursion. OBF [4] further improves this by subdividing the graph in a number of independent sub-graphs. The Coloring algorithm [42] propagates prioritized colors, dividing the graph in disconnected sub-graphs whose backward slices identify SCCs. Barnat et al. [6] provide CUDA implementations for FB, OBF, and Coloring. Coppersmith et al. [13] present and prove an O(m · log n) serial runtime version of FB; Schudy [48] further extended this to obtain an expected runtime bounded by O(log² n) reachability queries.

Hong et al. [25] improved FB for practical instances, by also trimming SCCs of size 2. After the first SCC is found (which is assumed to be large in size), the remaining components are detected
(with Coloring) and decomposed with FB. Slota et al. [49] propose orthogonal optimization heuristics and combine them with Hong's work. Both Hong's and Slota's algorithm operate in quadratic time.

Parallel DFS. Parallel DFS is remotely related. It has long been researched intensively, though there are two types of works [21]: one discussing parallelizing DFS-like searches (often using terms like 'backtracking', load balancing and stack splitting), and theoretical parallelization of lexicographical (strict) DFS, e.g. [1, 55]. Recent work from Träff [56] proposes a strategy to process incoming edges, allowing them to be handled in parallel. The resulting complexity is O(m/p + n), providing speedup for dense graphs. Research into the characterization of search orders provides other new avenues to analyze these algorithms [7, 15].

Union-find. The investigation of Van Leeuwen and Tarjan settled the question which of many union-find implementations were superior [54]. Goel et al. [23] however later showed that a simpler randomized implementation can also be superior. A wait-free concurrent union-find structure was introduced in [2]. Other versions support deletion [29]. To the best of our knowledge, we are the first to combine both iteration and removal in the union-find structure. This allows the disjoint sets to be processed as a cyclic queue, which enables concurrent iteration, removal and merging.

6. Experiments

The current section presents an experimental evaluation of UFSCC. Results are compared against Tarjan's sequential algorithm and another on-the-fly PRDFS algorithm on model checking inputs and synthetic graphs, and against an offline, fix-point algorithm.

Experimental setup. All experiments were performed on a machine with 4 AMD Opteron™ 6376 processors, each with 16 cores, forming a total of 64 cores. There is a total of 512GB memory available. We performed every experiment at least 50 times and calculated their means and 95% confidence intervals.

We implemented UFSCC in the LTSmin model checker [28],² which we use for decomposing SCCs on implicitly given graphs. LTSmin's POST() implementation applies measures to permute the order of a vertex's successors so that each worker visits the successors in a different order. We compare against a sequential version of Tarjan's algorithm and the PRDFS algorithm from Renault et al. [46] (see Section 5), which we refer to as Renault. Both are also implemented in LTSmin to enable a fair comparison. Unfortunately, we were unable to implement Lowe's algorithm [36], nor were we successful in using its implementation for on-the-fly exploration.

For a direct comparison to offline algorithms, we also implemented UFSCC in the SCC environment provided by Hong et al. [25]. This implementation is compared against benchmarks of Tarjan and Hong's concurrent algorithm (both provided by Hong et al.). While Slota et al. [49] improve upon Hong's algorithm, we did not have access to an implementation. However, considering the fact that both are quadratic in the worst case, we can expect similar performance pitfalls as Hong's algorithm for some inputs.

We chose this more tedious approach because the results from on-the-fly experiments are hard to compare directly against offline implementations. Offline implementations benefit from having an explicit graph representation and can directly index based on the vertex numbering, whereas on-the-fly implementations both generate successor vertices, involving computation, and need to hash the vertices. This results in a difference of a factor of ±25 in vertices processed per second for the same graphs using the sequential version of the algorithms: too much for a meaningful comparison of speedups.

² All our implementations, benchmarks, and results are available via https://github.com/utwente-fmt/ppopp16.

Table 2: Graph characteristics for model checking, synthetic and explicit graphs. Here, M denotes millions and columns respectively denote the number of vertices, transitions and SCCs, the maximal SCC size and an approximation of the diameter.

graph            # vertices   # trans    # sccs    max scc   diameter
leader-filters.7     26.3M     91.7M      26.3M         1         71
bakery.6             11.8M     40.4M       2.6M      8.6M        176
cambridge.6           3.4M      9.5M      8,413      3.3M        418
lup.3                 1.9M      3.6M          1      1.9M        134
resistance.1         13.8M     32.1M          3     13.8M     60,391
sorter.3              1.3M      2.7M       1.2M       278        300
L1751L1751T1          9.2M     24.5M          3      3.1M       3501
L351L351T4            3.8M     11.3M         31   123,201        704
L5L5T16               3.3M      9.8M    131,071        25         24
Li10Lo200             4.0M     15.2M        100    40,000        416
Li50Lo40              4.0M     15.8M      2,500     1,600        176
Li200Lo10             4.0M     16.0M     40,000       100        416
livej                 4.8M     69.0M    971,232      3.8M         19
patent                3.8M     16.5M       3.8M         1         24
pokec                 1.6M     30.6M    325,893      1.3M         14
random               10.0M    100.0M        877     10.0M         11
rmat                 10.0M    100.0M       2.2M      7.8M          9
ssca2                 1.0M    157.9M         31      1.0M        133
wiki-links            5.7M    130.2M       2.0M      3.7M        284

Table 2 summarizes graph sizes from respectively: a selection of graphs from an established benchmark set for the model checker, synthetic graphs generated in the model checker, and explicitly stated graphs consisting of real-world and generated graphs. We also exported both model checking and synthetic graphs to Hong's framework, to use as inputs for the offline algorithms.

We validated the implementation by means of proving the algorithm and thoroughly examining that all invariants are preserved in each implemented sub-procedure. Moreover, obtained information from a graph search and the encountered SCCs were reported and compared with the results from Tarjan's algorithm.

Experiments on model checking graphs. We use model checking problems from the BEEM database [43], which consists of a set of benchmarks that closely resemble real problems in verification. From this suite, we select all inputs (62 in total) large enough for parallelization (with search spaces of at least 10⁶ vertices), and small enough to complete within a timeout of 100 seconds using our sequential Tarjan implementation. Though our implementation in reality generates the search space during exploration (on-the-fly), we refer to it simply as 'the graph'.

Figure 7 (left) compares the runtimes for UFSCC on 64 workers against Tarjan's sequential algorithm in a speedup graph. The x-axis represents the total time used for Tarjan's algorithm and the

Figure 7: Comparison of time and speedup for 62 model checking graphs for concurrent UFSCC against Tarjan (left) and concurrent Renault (right), where UFSCC and Renault use 64 workers. The red crosses represent the selected BEEM graphs in Figure 8. [x-axes: time (in sec) for Tarjan and for Renault; y-axis: speedup for UFSCC]
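The measurement methodology above (at least 50 repetitions, reporting means with 95% confidence intervals) corresponds to a standard normal-approximation interval. The helper below is a generic sketch under our own naming, not the benchmarking script used for the paper:

```python
from math import sqrt

def mean_ci95(samples):
    """Sample mean with a normal-approximation 95% confidence
    interval (z = 1.96), reasonable for the >= 50 repetitions
    performed per experiment."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    half = 1.96 * sqrt(var / n)                            # interval half-width
    return mean, (mean - half, mean + half)
```

For example, 50 runtimes alternating between 1.0 s and 2.0 s yield a mean of 1.5 s with an interval half-width of 0.14 s.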
Table 3: Comparison of sequential on-the-fly Tarjan with concur-
rent UFSCC and Renault on 64 cores for a set of model checking
inputs and synthetic graphs.

                        execution time (s)        UFSCC speedup vs
  graph               Tarjan  Renault   UFSCC     Tarjan  Renault
  leader filters.7    68.602    1.995   1.933     35.494    1.032
  bakery.6            40.747   68.526   1.464     27.825   46.795
  cambridge.6         15.162   19.965   0.716     21.176   27.884
  lup.3                5.645    7.941   0.473     11.944   16.803
  resistance.1        39.425   58.247   6.258      6.300    9.307
  sorter.3             2.808    0.618   0.687      4.088    0.900
  L1751L1751T1        28.994   36.088   2.526     11.478   14.287
  L351L351T4          12.189    4.793   0.928     13.138    5.166
  L5L5T16              8.095    1.255   0.782     10.357    1.606
  Li10Lo200           15.144    5.127   1.399     10.828    3.666
  Li50Lo40            13.473    3.973   1.892      7.122    2.100
  Li200Lo10           12.906    4.438   2.874      4.491    1.544

[Figure 8: Scalability of Tarjan, Renault, and UFSCC (speedup vs
Tarjan against the number of workers, 1-64) for the selected BEEM
graphs leader filters.7, bakery.6, cambridge.6, and lup.3.]

Figure 9: Scalability for selected synthetic graphs of UFSCC and
Renault relative to Tarjan's algorithm. [Panels: Li50Lo40 and
Li200Lo10; axes: speedup vs Tarjan against the number of workers
(1-64).]

Figure 10: Scalability for the offline UFSCC and Hong implemen-
tations relative to Tarjan's algorithm on a set of explicit graphs.
[Panels: patent and random; axes: speedup vs Tarjan (log scale)
against the number of workers (1-64).]
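The "UFSCC speedup vs" columns of Table 3 are simply wall-clock ratios of the measured execution times. As a quick illustration (values copied from the first two rows of the table, variable names ours), the following sketch recomputes them:

```python
# Recompute the "UFSCC speedup vs" columns of Table 3 as the ratios
# T_other / T_UFSCC, using the measured times from the first two rows.
times = {  # graph: (Tarjan, Renault, UFSCC), all in seconds
    "leader filters.7": (68.602, 1.995, 1.933),
    "bakery.6": (40.747, 68.526, 1.464),
}
for graph, (t_tarjan, t_renault, t_ufscc) in times.items():
    # e.g. 68.602 / 1.933 ~ 35.49, matching the table up to rounding
    print(graph, t_tarjan / t_ufscc, t_renault / t_ufscc)
```

The small deviations from the printed columns stem from the table's times being rounded to milliseconds.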
[Figure 11: Memory usage in bytes per vertex of UFSCC on 64 cores
for all graphs: (on-the-fly) model checking inputs (+) / synthetic
graphs (×) and the different types of explicit graphs (?). Axes:
bytes per vertex (40-43) against % re-explorations (0.5-512, log
scale). Median memory usage: 40.154 bytes per vertex (minimum 40);
median re-exploration rate: 14.6%. Annotated points: resistance.1
and lamport nonatomic.4.]

References
[1] Aggarwal, A., Anderson, R. J., & Kao, M. Y. (1989). Parallel depth-first search in general directed graphs. In Proceedings of the ACM Symposium on Theory of Computing (pp. 297-308). ACM.
[2] Anderson, R. J., & Woll, H. (1991). Wait-free parallel algorithms for the union-find problem. In Proceedings of the Twenty-third Annual ACM Symposium on Theory of Computing (pp. 370-380). ACM.
[3] Backstrom, L., Huttenlocher, D., Kleinberg, J., & Lan, X. (2006). Group formation in large social networks: Membership, growth, and evolution. In KDD.
[4] Barnat, J., Chaloupka, J., & van de Pol, J. (2009). Distributed algorithms for SCC decomposition. Journal of Logic and Computation.
[5] Barnat, J., Brim, L., & Ročkai, P. (2010). Parallel partial order reduction with topological sort proviso. In Software Engineering and Formal Methods (SEFM 2010) (pp. 222-231). IEEE.
[6] Barnat, J., Bauch, P., Brim, L., & Češka, M. (2011). Computing strongly connected components in parallel on CUDA. In Proc. 25th Int. Parallel and Distributed Processing Symp. (IPDPS) (pp. 544-555). IEEE.
[7] Berry, A., Krueger, R., & Simonet, G. (2005). Ultimate generalizations of LexBFS and LEX M. In Graph-Theoretic Concepts in Computer Science (pp. 199-213). Springer.
[8] Bloemen, V. (2015). On-the-fly parallel decomposition of strongly connected components. Master's thesis, University of Twente.
[9] Bondy, J. A., & Murty, U. S. R. (1976). Graph Theory with Applications (Vol. 290). London: Macmillan.
[10] Brim, L., Černá, I., Krčál, P., & Pelánek, R. (2001). Distributed LTL model checking based on negative cycle detection. In FST TCS 2001: Foundations of Software Technology and Theoretical Computer Science (pp. 96-107). Springer Berlin Heidelberg.
[11] Černá, I., & Pelánek, R. (2003). Distributed explicit fair cycle detection (set based approach). In Model Checking Software (pp. 49-73). Springer Berlin Heidelberg.
[12] Clarke, E. M., Grumberg, O., & Peled, D. (1999). Model Checking. MIT Press.
[13] Coppersmith, D., Fleischer, L., Hendrickson, B., & Pınar, A. (2005). A divide-and-conquer algorithm for identifying strongly connected components. Technical Report RC23744, IBM Research.
[14] Cormen, T. H. (2009). Introduction to Algorithms. MIT Press.
[15] Corneil, D. G., & Krueger, R. M. (2008). A unified view of graph searching. SIAM Journal on Discrete Mathematics, 22(4), 1259-1276.
[16] Courcoubetis, C., Vardi, M., Wolper, P., & Yannakakis, M. (1993). Memory-efficient algorithms for the verification of temporal properties. In Computer-Aided Verification (pp. 129-142). Springer US.
[17] Dijkstra, E. W. (1976). A Discipline of Programming (Vol. 1). Englewood Cliffs: Prentice-Hall.
[18] Evangelista, S., Laarman, A., Petrucci, L., & van de Pol, J. (2012). Improved multi-core nested depth-first search. In Automated Technology for Verification and Analysis (pp. 269-283). Springer Berlin Heidelberg.
[19] Evangelista, S., Petrucci, L., & Youcef, S. (2011). Parallel nested depth-first searches for LTL model checking. In Automated Technology for Verification and Analysis (pp. 381-396). Springer Berlin Heidelberg.
[20] Fleischer, L. K., Hendrickson, B., & Pınar, A. (2000). On identifying strongly connected components in parallel. In Parallel and Distributed Processing (pp. 505-511). Springer Berlin Heidelberg.
[21] Freeman, J. (1991). Parallel algorithms for depth-first search. Technical Report MS-CIS-91-71, University of Pennsylvania.
[22] Gabow, H. N. (2000). Path-based depth-first search for strong and biconnected components. Information Processing Letters, 74(3-4), 107-114.
[23] Goel, A., Khanna, S., Larkin, D. H., & Tarjan, R. E. (2014). Disjoint set union with randomized linking. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '14) (pp. 1005-1017). SIAM.
[24] Hong, S., Chafi, H., Sedlar, E., & Olukotun, K. (2012). Green-Marl: A DSL for easy and efficient graph analysis. In ASPLOS 2012.
[25] Hong, S., Rodia, N. C., & Olukotun, K. (2013). On fast parallel detection of strongly connected components (SCC) in small-world graphs. In High Performance Computing, Networking, Storage and Analysis (SC 2013) (pp. 1-11). IEEE.
[26] Hopcroft, J., & Tarjan, R. E. (1974). Efficient planarity testing. Journal of the ACM, 21(4), 549-568.
[27] Kahn, A. B. (1962). Topological sorting of large networks. Communications of the ACM, 5(11), 558-562.
[28] Kant, G., Laarman, A., Meijer, J., van de Pol, J., Blom, S., & van Dijk, T. (2015). LTSmin: High-performance language-independent model checking. In TACAS.
[29] Kaplan, H., Shafrir, N., & Tarjan, R. E. (2002). Union-find with deletions. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 19-28).
[30] Kaveh, A. (2014). Computational Structural Analysis and Finite Element Methods. Wien: Springer.
[31] Laarman, A. W. (2014). Scalable multi-core model checking. Ph.D. thesis, University of Twente.
[32] Laarman, A., & Faragó, D. (2013). Improved on-the-fly livelock detection. In NASA Formal Methods (pp. 32-47). Springer.
[33] Laarman, A., & Wijs, A. (2014). Partial-order reduction for multi-core LTL model checking. In Hardware and Software: Verification and Testing (pp. 267-283). Springer.
[34] Leskovec, J., Kleinberg, J., & Faloutsos, C. (2005). Graphs over time: Densification laws, shrinking diameters and possible explanations. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
[35] Leskovec, J. SNAP: Stanford network analysis project. http://snap.stanford.edu/index.html, last accessed 9 Sep 2015.
[36] Lowe, G. (2015). Concurrent depth-first search algorithms based on Tarjan's algorithm. International Journal on Software Tools for Technology Transfer, 1-19.
[37] Liu, Y., Sun, J., & Dong, J. S. (2009). Scalable multi-core model checking fairness enhanced systems. In Formal Methods and Software Engineering (pp. 426-445). Springer Berlin Heidelberg.
[38] Madduri, K., & Bader, D. A. GTgraph: A suite of synthetic graph generators. http://www.cse.psu.edu/~kxm85/software/GTgraph/, last accessed 9 Sep 2015.
[39] McLendon III, W., Hendrickson, B., Plimpton, S., & Rauchwerger, L. (2005). Finding strongly connected components in distributed graphs. Journal of Parallel and Distributed Computing, 65(8), 901-910.
[40] Munro, I. (1971). Efficient determination of the transitive closure of a directed graph. Information Processing Letters, 1(2), 56-58.
[41] Nuutila, E., & Soisalon-Soininen, E. (1993). Efficient transitive closure computation. Technical Report TKO-B113, Helsinki University of Technology, Laboratory of Information Processing Science.
[42] Orzan, S. M. (2004). On distributed verification and verified distribution. Ph.D. dissertation, Vrije Universiteit.
[43] Pelánek, R. (2007). BEEM: Benchmarks for explicit model checkers. In SPIN, volume 4595 of LNCS (pp. 263-267). Springer.
[44] Purdom Jr, P. (1970). A transitive closure algorithm. BIT Numerical Mathematics, 10(1), 76-94.
[45] Reif, J. H. (1985). Depth-first search is inherently sequential. Information Processing Letters, 20(5), 229-234.
[46] Renault, E., Duret-Lutz, A., Kordon, F., & Poitrenaud, D. (2015). Parallel explicit model checking for generalized Büchi automata. In Tools and Algorithms for the Construction and Analysis of Systems (pp. 613-627). Springer Berlin Heidelberg.
[47] Savage, C. (1982). Depth-first search and the vertex cover problem. Information Processing Letters, 14(5).
[48] Schudy, W. (2008). Finding strongly connected components in parallel using O(log² n) reachability queries. In Parallelism in Algorithms and Architectures (pp. 146-151). ACM.
[49] Slota, G. M., Rajamanickam, S., & Madduri, K. (2014). BFS and coloring-based parallel algorithms for strongly connected components and related problems. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International (pp. 550-559). IEEE.
[50] Spencer, T. H. (1991). More time-work tradeoffs for parallel graph algorithms. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures (pp. 81-93). ACM.
[51] Takac, L., & Zábovský, M. (2012). Data analysis in public social networks. In International Scientific Conference & International Workshop Present Day Trends of Innovations, May 2012, Lomza, Poland.
[52] Tarjan, R. E. (1972). Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2), 146-160.
[53] Tarjan, R. E. (1976). Edge-disjoint spanning trees and depth-first search. Acta Informatica, 6(2), 171-185.
[54] Tarjan, R. E., & van Leeuwen, J. (1984). Worst-case analysis of set union algorithms. Journal of the ACM, 31(2), 245-281.
[55] de la Torre, P., & Kruskal, C. (1991). Fast and efficient parallel algorithms for single source lexicographic depth-first search, breadth-first search and topological-first search. In International Conference on Parallel Processing (Vol. 3, pp. 286-287).
[56] Träff, J. L. (2013). A note on (parallel) depth- and breadth-first search by arc elimination. arXiv preprint arXiv:1305.1222.
[57] Vardi, M. Y., & Wolper, P. (1986). An automata-theoretic approach to automatic program verification. In 1st Symposium on Logic in Computer Science (LICS). IEEE Computer Society.
[58] Wyllie, J. C. (1979). The complexity of parallel computations. Technical Report TR79-387, Cornell University.
A. The Complete UFSCC Algorithm
The current appendix presents the full version of the UFSCC algo-
rithm and the iterable union-find data structure it relies on. We do
not provide detailed correctness arguments here (cf. [8]).

Algorithm 3 presents the UFSCC algorithm using the interface
of the union-find structure: for iteration it calls PickFromList,
and MakeClaim reports the successor status (NEW, FOUND, or
DEAD). The exact interface definition of the union-find structure
follows the conditions that can be found in Algorithm 2 (red lines).
E.g., RemoveFromList marks a vertex DONE. MakeClaim adds the
worker to the worker set of the supernode (with an atomic bitwise
OR on a bitvector). PickFromList, however, combines iteration
with adding SCCs to DEAD (Line 16 in Algorithm 2): it either
returns Null (marking the SCC DEAD instantly), or a vertex
v′ ∈ S(v) \ DONE (though another worker may have marked v′
DONE just before v′ is returned, which we allow).

Algorithm 4 shows the implementation of the different union-
find operations for creating, finding and uniting disjoint sets. The
former two remain similar to typical union-find implementations
and require no locking, allowing for high concurrency. That syn-
chronization is not required is a consequence of the particular re-
quirements of the UFSCC algorithm, which guarantees that
Unite and PickFromList (marking supernodes dead) globally
always occur after the last Find on the supernode, since at that point
all its edges have been processed. The Unite procedure
(1) selects a new root (Line 42),
(2) locks the other root vertices on both cycles (Lines 44-46),
(3) swaps the list next pointers to create one large cycle (Line 49),
(4) updates the parent for the 'non-root' (Line 50),
(5) updates the worker set (Lines 51-54), and
(6) finally releases the locks (Lines 55-57).
This order of execution is crucial for correctness, as even
interchanging steps (4) and (5) could lead to erroneous results. By
only locking the smaller SCC, we allow concurrent updates to the
larger SCC, which significantly improves performance.

Algorithm 5 provides the cyclic list implementation that allows
UFSCC to treat disjoint sets as queues while they are being merged
and searched. PickFromList traverses over DONE vertices until
a BUSY one is found at Lines 3-4. At Line 13, the procedure waits
(for Unite) until locked vertices become BUSY or DONE. At Line 15,
the list contracts two subsequent DONE vertices (see also Figure 5),
which resembles path halving [54]. No further synchronization is
required, as path halving is only performed on DONE vertices
(whereas Unite only merges BUSY list nodes). RemoveFromList
only needs to mark the vertex DONE; the path halving later
removes it from the list.

Algorithm 4 The concurrent, iterable union-find data structure
 1: ∀v ∈ V :                 [shared union-find structure implementing S]
 2:   UF[v].parent := v                             [union-find parent]
 3:   UF[v].workers := ∅                                  [worker set]
 4:   UF[v].next := v                              [list next pointer]
 5:   UF[v].uf_status := LIVE               [∈ {LIVE, LOCK, DEAD}]
 6:   UF[v].list_status := BUSY             [∈ {BUSY, LOCK, DONE}]
 7: procedure MakeClaim(a, p)
 8:   aRoot := Find(a)
 9:   if UF[aRoot].uf_status = DEAD then
10:     return CLAIM_DEAD                                 [empty list]
11:   if p ∈ UF[aRoot].workers then
12:     return CLAIM_FOUND                   [SCC contains worker ID]
13:   while p ∉ UF[aRoot].workers do                 [add worker ID]
14:     UF[aRoot].workers := UF[aRoot].workers ∪ {p}
15:     aRoot := Find(aRoot)             [ensure that worker is added]
16:   return CLAIM_NEW
17: procedure Find(a)
18:   if UF[a].parent ≠ a then
19:     UF[a].parent := Find(UF[a].parent)
20:   return UF[a].parent
21: procedure SameSet(a, b)
22:   aRoot := Find(a)
23:   bRoot := Find(b)
24:   if aRoot = bRoot then return True
25:   if UF[aRoot].parent = aRoot then return False
26:   return SameSet(aRoot, bRoot)
27: procedure LockRoot(a)
28:   if CAS(UF[a].uf_status, LIVE, LOCK) then
29:     if UF[a].parent = a then return True
30:     UnlockRoot(a)
31:   return False
32: procedure LockList(a)
33:   aList := PickFromList(a)
34:   if aList = Null then return Null                     [DEAD SCC]
35:   if CAS(UF[aList].list_status, BUSY, LOCK) then
36:     return aList                        [successfully locked list]
37:   return LockList(UF[aList].next) [aList is locked by another worker]
38: procedure Unite(a, b)
39:   aRoot := Find(a)
40:   bRoot := Find(b)
41:   if aRoot = bRoot then return                   [already united]
42:   r := max(aRoot, bRoot)              [largest index is new root]
43:   q := min(aRoot, bRoot)
44:   if ¬LockRoot(q) then return Unite(aRoot, bRoot)
45:   aList := LockList(a)
46:   bList := LockList(b)
47:   if aList = Null ∨ bList = Null then
48:     return                          [DEAD SCC ⇒ already united]
49:   swap(UF[aList].next, UF[bList].next)            [non-atomic]
50:   UF[q].parent := r                             [update parent]
51:   do                                        [update worker set]
52:     r := Find(r)
53:     UF[r].workers := UF[r].workers ∪ UF[q].workers
54:   while UF[r].parent ≠ r       [ensure that we update the root]
55:   UF[aList].list_status := BUSY           [unlock list on aList]
56:   UF[bList].list_status := BUSY           [unlock list on bList]
57:   UF[q].uf_status := LIVE                           [unlock q]
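To make the data-structure invariants concrete, the following is a sequential Python sketch of the iterable union-find of Algorithm 4: a union-find forest combined with one cyclic list per disjoint set, with path compression in find and path halving over DONE vertices in pick_from_list. All names (IterableUF, UFNode, etc.) are ours, and the concurrent machinery of Algorithm 4 (the LOCK states, CAS-based locking, and retry loops) is deliberately omitted, so this is only an illustration of the sequential behavior, not the lock-free implementation.

```python
# Sequential sketch of the iterable union-find (Algorithm 4, no locking).
LIVE, DEAD = "LIVE", "DEAD"                # uf_status values
BUSY, DONE = "BUSY", "DONE"                # list_status values


class UFNode:
    def __init__(self, v):
        self.parent = v                    # union-find parent (self => root)
        self.workers = set()               # worker IDs that claimed the set
        self.next = v                      # cyclic-list next pointer
        self.uf_status = LIVE
        self.list_status = BUSY


class IterableUF:
    def __init__(self, vertices):
        self.uf = {v: UFNode(v) for v in vertices}

    def find(self, a):                     # FIND with path compression
        if self.uf[a].parent != a:
            self.uf[a].parent = self.find(self.uf[a].parent)
        return self.uf[a].parent

    def make_claim(self, a, p):            # MAKECLAIM
        root = self.find(a)
        if self.uf[root].uf_status == DEAD:
            return "DEAD"
        if p in self.uf[root].workers:
            return "FOUND"
        self.uf[root].workers.add(p)
        return "NEW"

    def unite(self, a, b):                 # UNITE, without locking
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        r, q = max(ra, rb), min(ra, rb)    # largest index becomes new root
        al, bl = self.pick_from_list(a), self.pick_from_list(b)
        if al is None or bl is None:
            return                         # dead SCC => already united
        # Swapping two next pointers merges the two cycles into one.
        self.uf[al].next, self.uf[bl].next = self.uf[bl].next, self.uf[al].next
        self.uf[q].parent = r              # update parent of the non-root
        self.uf[r].workers |= self.uf[q].workers   # merge worker sets

    def pick_from_list(self, v):           # PICKFROMLIST with path halving
        while True:
            if self.uf[v].list_status == BUSY:
                return v
            w = self.uf[v].next
            if self.uf[w].list_status == BUSY:
                return w
            if w == v:                     # every vertex DONE: SCC complete
                self.uf[self.find(v)].uf_status = DEAD
                return None
            self.uf[v].next = self.uf[w].next   # contract two DONE vertices
            v = self.uf[v].next

    def remove_from_list(self, v):         # REMOVEFROMLIST
        self.uf[v].list_status = DONE
```

A typical usage cycle mirrors UFSCC: claim a vertex, unite supernodes when a cycle is found, and repeatedly pick a BUSY vertex from the list, marking it DONE once its edges are explored, until pick_from_list returns None and the SCC is marked DEAD.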