
Recursion in Distributed Computing

Eli Gafni¹ and Sergio Rajsbaum²

¹ University of California, Los Angeles, Computer Science Department, Los Angeles, CA 90095, eli@ucla.edu
² Instituto de Matemáticas, Universidad Nacional Autónoma de México, Ciudad Universitaria, D.F. 04510, Mexico, rajsbaum@math.unam.mx

Abstract. The benefits of developing algorithms via recursion are well known. However, recursion has seldom been used in distributed algorithms, even though recursive structuring principles for distributed systems have been advocated since the beginning of the field. We present several distributed algorithms in a recursive form, which makes them easier to understand and analyze. We also expose several interesting issues that arise in recursive distributed algorithms. Our goal is to promote the use and study of recursion in distributed algorithms.

1 Introduction

The benefits of designing and analyzing sequential algorithms using recursion are well known. Recursive algorithms are discussed in most textbooks, notably in Udi Manber's book [23]. However, recursion has seldom been used in distributed algorithms, even though recursive structuring principles for distributed systems have been advocated since the beginning of the field, e.g. [13,24], and have been used before, e.g. [12].

In this paper we describe simple and elegant recursive distributed algorithms for some important tasks, which illustrate the benefits of using recursion. We consider the following tasks: snapshots [1], immediate snapshots [6,27], renaming [4], and swap [2,29], and give a recursive distributed algorithm for each one. We work with a wait-free shared memory model where any number of processes can fail by crashing.

We hope to convince the reader that thinking recursively is a methodology that facilitates designing, analyzing and proving the correctness of distributed algorithms. We propose that studying recursion in a distributed setting is a worthwhile aim, although not without its drawbacks. There is no doubt that recursion should be covered starting with first-year introductory computer science courses, but it has been suggested that the teaching of recursive programming be postponed until iterative programs are well mastered, as recursion can lead to extremely
Supported by UNAM-PAPIIT.
S. Dolev et al. (Eds.): SSS 2010, LNCS 6366, pp. 362–376, 2010. © Springer-Verlag Berlin Heidelberg 2010


inefficient solutions, e.g. [28]. In distributed algorithms, a well-known example is the original Byzantine Agreement algorithm [21]. It shows that recursive distributed algorithms can be even more dangerous than in the centralized setting: the recursive algorithm looks simple and convincing, yet it took researchers a few years to really understand what it is doing, i.e., to unfold the recursion [5]. Even the seminal distributed spanning tree algorithm of [18], which is not recursive, can be viewed as a recursive algorithm that has been optimized [11]. Thus, one of the goals of this paper is to open the discussion of when and how recursion should be used in distributed algorithms.

Several interesting issues appear in recursive distributed algorithms that do not appear in sequential recursion.

Naming recursive calls. Consider the classical recursive binary search algorithm. It searches an ordered array for a single element by cutting the array in half with each pass. It picks a midpoint near the center of the array and compares the data at that point with the data being searched for. If the data is found, it terminates. Otherwise, there are two cases: (1) the data at the midpoint is greater than the data being searched for, and the algorithm is called recursively on the left part of the array, or (2) the data at the midpoint is less than the data being searched for, and the algorithm is called recursively on the right part of the array. Namely, only one recursive call is invoked, either (1) or (2), but not both. In contrast, in a distributed recursive algorithm there are several processes running concurrently, and it is possible that both recursive calls are performed, by different processes. Thus, we need to name the recursive calls so that processes can identify them.

Concurrent branching. In sequential computing, recursive functions can be divided into linear and branched ones, depending on whether they make only one recursive call to themselves (e.g. factorial), or more (e.g. Fibonacci). In the first case, the recursion tree is a simple path, while in the second, it is a general tree. The distributed algorithms we present here are all of linear recursion, yet the recursion tree may not be a simple path because, as explained above, different processes may invoke different recursive calls. See Figures 2, 4 and 9.

Iterated memory. As in sequential recursion, each distributed recursive call should be completely independent from the others. Even in the sequential recursive binary search algorithm, each recursive call (at least theoretically) operates on a new copy of the array, either the left side or the right side of the array. In a distributed algorithm, a process has local variables and shared variables. Both should be local to the recursive invocation. Thus, each recursive invocation has its own associated shared memory. There are no "side effects", in the sense that one recursive invocation cannot access the shared memory of another invocation. Thus, recursive distributed algorithms run on an iterated model of computation, where the shared memory is divided into sections. Processes run by accessing each section at most once, and in the same order. Iterated models have been exploited in the past, e.g. [14,11,16,26,25,19].


Shared objects. In the simplest case, the shared memory that is accessed in each iteration is a single-writer/multi-reader shared array. Namely, a recursive distributed algorithm writes to the array, reads each of its elements, and after some local computation, either produces an output or invokes the algorithm recursively. Assuming at least one process decides, fewer processes participate in each recursive invocation (on a new shared array), until at the bottom of the recursion only one process participates and decides. More generally, in each recursive invocation, processes communicate through a shared object that is more powerful than a single-writer/multi-reader shared array. Such an object is invoked at most once by each process. We specify the behavior of the object as a task, essentially its input/output relation (see Section 2).

Inductive reasoning. As there are no side effects among recursive invocations, we can logically imagine all processes going in lockstep from one shared object to the next, just varying the order in which they invoke each one. This ordering induces a structured set of executions that facilitates inductive reasoning and greatly simplifies the understanding of distributed algorithms. Also, the structure of the set of executions is recursive. This property has been useful to prove lower bounds and impossibility results, as it facilitates a topological description of the executions, e.g. [8,17,19]. We include Figure 5 to illustrate this point.

While recursion in sequential algorithms is well understood, in the distributed case we do not know much. For example, we do not know if in some cases side effects are unavoidable, and persistent shared objects are required to maintain information during recursive calls. We do not know if every recursive algorithm can be unfolded and executed in an iterated model (different processes could invoke tasks in different orders).

2 Model

We assume a shared memory distributed computing model, with processes $\Pi = \{p_1, \ldots, p_n\}$, in an asynchronous wait-free setting, where any number of processes can crash. Processes communicate through shared objects that can be invoked at most once by each process. The input/output specification of an object is given in terms of a task, defined by a set of possible input vectors, a set of possible output vectors, and an input/output relation $\Delta$. A more formal definition of a task is given e.g. in [20]. Process $p_i$ can invoke an instance of an object solving a task once with the instruction $Task_{tag}(x)$, and eventually gets back an output value, such that the vector of outputs satisfies the task's specification. The subindex $tag$ identifies the particular instance of the object, and $x$ is the input to the task.

The most basic task we consider is the write/scan task, which is a specification of a single-writer/multi-reader shared array. An instance $tag$ of this object is accessed with $WScan_{tag}(x)$. Each instance $tag$ has its own associated shared array $SM[1..n]$, where $SM[j]$ is initialized to $\bot$, for each $j$. When process $p_i$


invokes $WScan_{tag}(x)$, the value $x$ is written to $SM[i]$, then $p_i$ reads $SM[j]$ for each $j$, in arbitrary order, and the vector of values read is what the invocation of $WScan_{tag}(x)$ returns.

We are also interested in designing distributed algorithms that solve a given task. The processes start with private input values, and must eventually decide on output values. For an execution where only a subset of processes participate, an input vector $I$ specifies in its $i$-th entry, $I[i]$, the input value of each participating process $p_i$, and $\bot$ in the entries of the other processes. Similarly, an output vector $O$ contains a decision value $O[i]$ for a participating process $p_i$, and $\bot$ elsewhere. If an algorithm solves the task, then $O \in \Delta(I)$.

A famous task is consensus. Each process proposes a value, and the correct processes have to decide the same value, which has to be a proposed value. This task is not wait-free solvable using read/write registers only [22]. We introduce other tasks as we discuss them.
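The write/scan object described in this section can be illustrated with a minimal Python sketch. The class name `WScanObject`, the `BOTTOM` constant (standing in for the default value) and the 0-based process ids are assumptions of this sketch, not part of the model. Each invocation here runs without interleaving, which is only one of the schedules the asynchronous model allows.

```python
BOTTOM = None  # stands in for the default value of an entry (an assumption)


class WScanObject:
    """One shared array SM[0..n-1] per tag, allocated on first use."""

    def __init__(self, n):
        self.n = n
        self.arrays = {}      # tag -> shared array of that instance
        self.invoked = set()  # (tag, i) pairs, to enforce one-shot use

    def wscan(self, tag, i, x):
        """Write x to SM[i] of instance `tag`, then read every entry."""
        if (tag, i) in self.invoked:
            raise RuntimeError("a process may invoke an instance at most once")
        self.invoked.add((tag, i))
        sm = self.arrays.setdefault(tag, [BOTTOM] * self.n)
        sm[i] = x
        return list(sm)       # the vector of values read


obj = WScanObject(3)
first = obj.wscan("a", 0, "x0")   # process 0 arrives alone
second = obj.wscan("a", 2, "x2")  # process 2 arrives later and sees both
other = obj.wscan("b", 2, "y2")   # a different instance has a fresh array
```

Keying the arrays by tag mirrors the requirement that every instance has its own shared array.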

3 Recursive Distributed Algorithms

We start with two basic forms of recursive distributed algorithms: linear branching in Section 3.1, and binary branching in Section 3.2. Here we only analyze the behavior of these algorithms; in later sections we show how they can be used to solve specific tasks. A multi-way branching version is postponed to Section 5.2, and a more general multi-way branching version is in Section 6.

3.1 Linear Recursion

Consider the algorithm, called IS, of Figure 1. In line 1 processes communicate through a single-writer/multi-reader shared array, invoked with the operation WScan (we omit its tag, because the invocation is always local to the recursive call). Process $p_i$ writes its id $i$, and stores in $view$ the set of ids read in the array (the entries different from $\bot$). In line 2 the process checks if $view$ contains all $n$ ids. The last process to write to the array sees $n$ such values, i.e., $|view| = n$; it returns $view$ and terminates the algorithm. Namely, in each recursive call of the algorithm at least one process terminates, but perhaps more than one (all processes that see $|view| = n$). In line 3 processes that have $|view| < n$ invoke the algorithm recursively (each time a different shared array is used). When $n = 1$, the single process invoking the algorithm returns a view that contains only itself.

Algorithm IS($n$);
(1) $view \leftarrow$ WScan($i$);
(2) if $|view| = n$ then return $view$
(3) else return IS($n - 1$)

Fig. 1. Linear recursion (code for $p_i$)
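The algorithm of Figure 1 can be exercised with a small simulation. The following is a sketch under stated assumptions rather than the authors' implementation: processes are Python generators that yield before every read or write, a seeded random scheduler picks which process takes the next step, ids are 0-based, and the dictionary `memory` (a hypothetical name) gives each call IS(n) its own array.

```python
import random

BOTTOM = None  # stands in for the default value of an array entry


def wscan(array, i, value):
    """Write/scan as a generator: it yields before every shared-memory step."""
    yield                      # scheduling point before the write
    array[i] = value
    values = []
    for j in range(len(array)):
        yield                  # scheduling point before each read
        values.append(array[j])
    return values


def IS(n, i, memory, size):
    """Algorithm IS(n) for process i; memory maps n to that call's array."""
    array = memory.setdefault(n, [BOTTOM] * size)
    values = yield from wscan(array, i, i)
    view = {v for v in values if v is not BOTTOM}
    if len(view) == n:
        return view
    return (yield from IS(n - 1, i, memory, size))


def run_is(n, seed):
    """Run all n processes under a random interleaving; return their views."""
    rng = random.Random(seed)
    memory = {}
    running = {i: IS(n, i, memory, n) for i in range(n)}
    views = {}
    while running:
        i = rng.choice(sorted(running))
        try:
            next(running[i])   # let process i take one step
        except StopIteration as done:
            views[i] = done.value
            del running[i]
    return views


views = run_is(3, seed=1)
```

Different seeds give different interleavings, so the views returned vary from run to run, but no process crashes in this sketch.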


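[Figure 2: the recursion path IS(3), IS(2), IS(1). Processes 1, 2, 3 invoke IS(3), which outputs {1, 2, 3}; processes 1, 2 invoke IS(2), which outputs {1, 2}; process 1 invokes IS(1), which outputs {1}.]

Fig. 2. Linear recursion, for 3 processes 1, 2, 3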

As we shall see in Section 4, this algorithm solves the immediate snapshot task. For now, we describe only its behavior, as linear recursion. Namely, the recursion tree is a simple path; one execution of the algorithm is illustrated in Figure 2 for 3 processes, 1, 2 and 3. In Figure 2 all three processes invoke IS(3), each one with its own id. In this example, only process 3 sees all three values after executing WScan, and exits the algorithm with view {1, 2, 3}. Processes 1, 2 invoke IS(2), and a new read/write object instance through WScan. Namely, in an implementation of the write/scan task with a shared array, a first shared array is used in the first invocation IS(3), a second shared array is used in the second invocation IS(2), and finally, when process 1 invokes IS(1) alone, it uses a third shared array, sees only itself, and terminates with view {1}.

The total number of read and write steps by a process is $O(n^2)$. In more detail, a process $p_i$ that returns a view with $|view| = k$ executes $\Theta(n(n - k + 1))$ read/write steps. Process $p_i$ returns $view$ during the invocation of IS($k$). Thus, it executed a total of $n - k + 1$ task invocations, one for each recursive call, starting with IS($n$). Each WScan invocation involves one write and $n$ read steps.

3.2 Binary Branching

Let us now consider a small variation of algorithm IS, called algorithm BR, in Figure 3. It uses $tag \in \{L, R, \bot\}$. The first time, the algorithm is invoked by all $n$ processes, with $tag = \bot$. Recursive invocations are made with smaller values of $n$ each time, and with $tag$ equal to $L$ or to $R$, until at the bottom of the recursion the algorithm is invoked with $n = 1$ by only one process, which returns a view that contains only its own id. In line 3, after seeing all $n$ processes, process $p_i$ checks if its own id is the largest it saw in $view$. If so, it terminates the algorithm. If it is not the largest id, it invokes recursively an


instance of BR identified by $tag = R$, and size $n - 1$ (at most $n - 1$ processes invoke it). Line 5 is executed by processes that saw fewer than $n$ ids in their views obtained in line 1; they all invoke the same instance of BR, now identified by $tag = L$, and size $n - 1$ (at most $n - 1$ processes invoke it).

Algorithm BR$_{tag}$($n$);
(1) $view \leftarrow$ WScan($i$);
(2) if $|view| = n$ then
(3)     if $i = \max view$ then return $view$;
(4)     return BR$_R$($n - 1$);
(5) else return BR$_L$($n - 1$)

Fig. 3. Branching recursion algorithm (code for $p_i$)

This time the recursive structure is a tree, not a path. An execution of the algorithm for 4 processes is illustrated in Figure 4. Each node of the tree has its own associated shared array. In the first call, all processes write to the first shared array; processes 3, 4 see only themselves, hence invoke BR$_L$(3), while processes 1, 2 see all 4 processes, and as neither is the largest among them, they both invoke BR$_R$(3). The rest of the figure is self-explanatory.

Here we are not interested in the task solved by algorithm BR, only in the fact that it has the same recursive structure as the renaming algorithm of Section 5, and hence the same complexity. Each process executes at most $n$ recursive calls, and hence at most $n$ invocations of a write/scan task. Thus, the total number of read and write steps by a process is $O(n^2)$.
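The branching structure of algorithm BR can be observed in a short simulation. This is a sketch under assumptions that are not in the paper: processes are generators stepped by a seeded random scheduler, ids are 0-based, and each node of the recursion tree is named by the whole sequence of L and R tags leading to it (the string `path`), so that every node gets its own array.

```python
import random

BOTTOM = None  # stands in for the default value of an array entry


def wscan(array, i, value):
    """Write, then read every entry, yielding before each step."""
    yield
    array[i] = value
    values = []
    for j in range(len(array)):
        yield
        values.append(array[j])
    return values


def BR(n, i, path, memory, size, calls):
    """Algorithm BR for process i; `path` names the recursion-tree node."""
    calls[i] += 1
    array = memory.setdefault(path, [BOTTOM] * size)
    values = yield from wscan(array, i, i)
    view = {v for v in values if v is not BOTTOM}
    if len(view) == n:
        if i == max(view):
            return path, view
        return (yield from BR(n - 1, i, path + "R", memory, size, calls))
    return (yield from BR(n - 1, i, path + "L", memory, size, calls))


def run_br(n, seed):
    """Random interleaving of n processes; returns results and call counts."""
    rng = random.Random(seed)
    memory, calls = {}, [0] * n
    running = {i: BR(n, i, "", memory, n, calls) for i in range(n)}
    result = {}
    while running:
        i = rng.choice(sorted(running))
        try:
            next(running[i])
        except StopIteration as done:
            result[i] = done.value
            del running[i]
    return result, calls


result, calls = run_br(4, seed=7)
```

Each entry of `result` records the tree node where a process stopped, and `calls[i]` counts the invocations made by process i, which the analysis above bounds by n.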

4 Snapshots

In this section we describe a recursive immediate snapshot [6,27] algorithm. As the specification of the snapshot [1] task is a weakening of the immediate snapshot task, the algorithm solves the snapshot task as well.

Immediate snapshot task. An immediate snapshot task IS abstracts a shared array $SM[1..n]$ with one entry per process. The array is initialized to $[\bot, \ldots, \bot]$, where $\bot$ is a default value that cannot be written by a process. Intuitively, when a process $p_i$ invokes the object, it is as if it instantaneously executes a write operation followed by a snapshot of the whole shared array. If several processes invoke the task simultaneously, then their corresponding write operations are executed concurrently, followed by a concurrent execution of their snapshot operations. More precisely, in an immediate snapshot task, for each $p_i$, the result of its invocation satisfies the three following properties, where we assume $i$ is the value written by $p_i$ (without loss of generality) and $sm_i$ is the set of values, or view, it gets back from the task. If $SM[k] = \bot$, the value $k$ is not added to $sm_i$. We define $sm_i = \emptyset$ if the process $p_i$ never invokes the task. These properties are:


[Figure 4: recursion tree rooted at BR(4), with nodes BR$_L$(3), BR$_R$(3), BR$_L$(2), BR$_L$(1) and BR$_R$(1).]

Fig. 4. Branching recursion, for 4 processes

Self-inclusion. $\forall i : i \in sm_i$.
Containment. $\forall i, j : sm_i \subseteq sm_j \lor sm_j \subseteq sm_i$.
Immediacy. $\forall i, j : i \in sm_j \Rightarrow sm_i \subseteq sm_j$.

The immediacy property can be rewritten as $\forall i, j : i \in sm_j \land j \in sm_i \Rightarrow sm_i = sm_j$. Thus, concurrent invocations of the task obtain the same view. A snapshot task is required to satisfy only the first two properties.

The set of all possible views obtained by the processes after invoking an object implementing an immediate snapshot task can be represented by a complex, consisting of sets called simplexes, with the property that if a simplex is included in the complex, so are all its sub-simplexes. The set of all possible views, for 3 processes, is represented in Figure 5. Each vertex is labeled with a pair $(i, sm_i)$. The simplexes are the triangles, the edges, and the vertexes. The corners of each simplex are labeled with compatible views, satisfying the three previous properties. In the case of 4 processes, the complex would be 3-dimensional, including sets up to size 4, and so on for any number of processes. For more details about complexes and their use in distributed computing, see e.g. [20].

Recursive algorithm. A wait-free algorithm solving the immediate snapshot task was described in [6]. We include it in Appendix A for comparison. We encourage the reader to try to come up with a correctness proof before reading the recursive version, where the proof will be trivial. Actually, algorithm IS of Figure 1 solves the immediate snapshot task.

Theorem 1. Algorithm IS solves the immediate snapshot task for $n$ processes in $O(n^2)$ steps. Moreover, a process obtains a snapshot of size $k$ from the algorithm in $\Theta(n(n - k + 1))$ steps.

Proof. The complexity was analyzed in the previous section. Here we prove that IS solves the immediate snapshot task. Let $S$ be the set of processes that terminate the algorithm in line 2, namely, with $|view| = n$. Each $p_i \in S$ terminates

[Figure 5: the complex of immediate snapshot views for 3 processes, with vertices labeled (1,{1}), (2,{2}), (3,{3}), (1,{1,2}), (2,{1,2}), (1,{1,3}), (3,{1,3}), (2,{2,3}), (3,{2,3}), (1,{1,2,3}), (2,{1,2,3}), (3,{1,2,3}).]

Fig. 5. All immediate snapshot subdivision views, for 3 processes

the algorithm with a view $sm_i$ that contains all processes. Thus, for each such $p_i$, the self-inclusion property holds. Also, for any two $p_i, p_j$ in $S$, we have $sm_i = sm_j$, and the properties of containment and immediacy hold. By the induction hypothesis, the three properties also hold for the other processes, which recursively call IS($n - 1$). It remains to show that the last two properties of the immediate snapshot task hold for a $p_i \in S$ and a $p_j \notin S$. First, containment holds: clearly $sm_j \subseteq sm_i$, because $sm_i$ contains all processes. Finally, immediacy holds: $j$ is in $sm_i$ ($p_i$ sees every participating process), and we have already seen that $sm_j \subseteq sm_i$. And it is impossible that $i \in sm_j$, because $p_i$ does not participate in the recursive call.
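The three properties of the immediate snapshot task are easy to check mechanically on a given set of views. The sketch below is only an illustration: the function names and the two example dictionaries are hypothetical, and `views` maps each participating id $i$ to its set $sm_i$.

```python
def is_snapshot(views):
    """Self-inclusion and containment; views maps each id i to its set sm_i."""
    self_inclusion = all(i in sm for i, sm in views.items())
    containment = all(a <= b or b <= a
                      for a in views.values() for b in views.values())
    return self_inclusion and containment


def is_immediate_snapshot(views):
    """Adds immediacy: if i is in sm_j then sm_i is a subset of sm_j."""
    immediacy = all(views[i] <= sm_j
                    for sm_j in views.values()
                    for i in sm_j if i in views)
    return is_snapshot(views) and immediacy


# The execution of Figure 2: process 3 sees everybody, then 2, then 1.
figure2 = {1: {1}, 2: {1, 2}, 3: {1, 2, 3}}

# A snapshot that is not immediate: 1 is in sm_2, yet sm_1 is not inside sm_2.
late_scan = {1: {1, 2, 3}, 2: {1, 2}, 3: {1, 2, 3}}
```

The second example shows why the snapshot task is strictly weaker: it satisfies self-inclusion and containment but violates immediacy.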

5 Renaming

In the renaming task [4], each of $n$ processes that can only compare their ids must choose one of $2n - 1$ new distinct names, called slots. It was proved in [20] that renaming is impossible with fewer than $2n - 1$ slots, except for some special values of $n$ [10]. The algorithm of [4] solves the problem with $2n - 1$ slots, but is of exponential complexity [15]. Later on, [7] presented a recursive renaming algorithm based on immediate snapshots, of $O(n^3)$ complexity. We restate this algorithm in Section 5.2, and show that its complexity is actually $O(n^2)$. Also, we present a new renaming algorithm in Section 5.1 that is not based on immediate snapshots, also of $O(n^2)$ complexity; it is very simple, but requires knowledge of $n$.

To facilitate the description of a recursive algorithm, the slots are, given an integer $First$ and $Direction \in \{-1, +1\}$, the integers in the range $First + [0..2n - 2]$ if $Direction = +1$, or in the range $First + [-(2n - 2)..0]$ if $Direction = -1$. Combining the two, we get slots $First + Direction \cdot [0..2n - 2]$. Thus, the number of slots is $2n - 1$ as required; i.e., $|Last - First| + 1 = 2n - 1$, defining $Last = First + Direction \cdot (2n - 2)$.
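The slot ranges defined above translate directly into code. The helper names `slots` and `last_slot` are hypothetical, introduced only to illustrate the definition of $First$, $Direction$ and $Last$.

```python
def slots(first, direction, n):
    """Slots First + Direction * [0..2n-2], listed starting from First."""
    if direction not in (-1, +1):
        raise ValueError("direction must be -1 or +1")
    return [first + direction * offset for offset in range(2 * n - 1)]


def last_slot(first, direction, n):
    """Last = First + Direction * (2n - 2)."""
    return first + direction * (2 * n - 2)


up = slots(1, +1, 3)     # [1, 2, 3, 4, 5]
down = slots(5, -1, 3)   # [5, 4, 3, 2, 1]
```

Both calls describe the same five slots, traversed in opposite directions, which is what allows two recursive calls to fill one range from its two ends.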


5.1 Binary Branching Renaming Algorithm

The algorithm has exactly the same structure as the binary branching algorithm of Figure 3, using WScan. It partitions the set of processes into two subsets, and then solves renaming recursively on each subset. The algorithm Renaming($n$, $First$, $Direction$) is presented in Figure 6. It uses tags of the form $\{\leftarrow, \rightarrow\}$, to represent the intuition of renaming "left to right" and "right to left", as indicated by the value of $Direction$. Given $First$ and $Direction$, the algorithm is invoked by $k$ processes, where $k \le n$, and each process decides on a slot in the range $First + Direction \cdot [0..2k - 2]$. The algorithm ends in line 4, with a slot selected.

In the algorithm, the processes first write and scan a shared array, in line 1. According to the size of the view they get back, they are partitioned into 2 sets: the processes that get back a view of size $n$ and the processes that get back a view of size less than $n$. If $k$ processes, $k < n$, invoke it then, of course, nobody can get a view of size $n$. In this case they all go to line 6 and solve the problem recursively, executing Renaming$^{\rightarrow}$($n - 1$, $First$, $Direction$). Thus, such recursive calls will be repeated until $k = n$. The variant of the algorithm described in Section 5.2 evades these repeated calls, using immediate snapshots instead of write/scans.

Consider the case of $k = n$ processes invoking the algorithm. In this case, some processes will get a view of size $n$ in line 1. If one of these, say $p_i$, sees that it has the largest id $i$ in its view $S_i$ (line 4), it terminates the algorithm with slot $Last$. The other processes, $Y$, that get a view of size $n$ will proceed to solve the problem recursively, in line 5, renaming from slot $Last - 1$ down (reserving slot $Last$ in case it was taken by $p_i$), by calling Renaming$^{\leftarrow}$($n - 1$, $Last - 1$, $-Direction$). The processes $X$, that get a view of size less than $n$, solve the problem recursively in line 6, going up from position $First$, by calling Renaming$^{\rightarrow}$($n - 1$, $First$, $Direction$). Thus, we use the arrow in the superscript to distinguish the two distinct recursive invocations (of the same code).

The correctness of the algorithm, in Theorem 2, is a simple counting argument, which consists of the observation that the two ranges, going down and up, do not overlap.

Theorem 2. Algorithm Renaming solves the renaming task for $n$ processes, in $O(n^2)$ steps.

Proof. Clearly, the algorithm terminates, as it is called with smaller values of $n$ in each recursive call, until $n = 1$, when it necessarily terminates. A process executes at most $n$ recursive calls, and in each one it executes a write/scan. Each write/scan involves $O(n)$ read and write steps, for a total complexity of $O(n^2)$.

At the basis of the recursion, $n = 1$, we have $|S_i| = 1$ in line (1), so the algorithm terminates correctly in line (4), renaming into 1 slot, as $Last = First$ when $n = 1$. Assume the algorithm is correct for $k$ less than $n$. The induction hypothesis is that when $k'$ processes, $k' \le k$, invoke Renaming($k$, $First$, $Direction$), then they get new names in the range $First + Direction \cdot [0..2k' - 2]$.

Now, assume the algorithm is invoked as Renaming($n$, $First$, $Direction$), with $n > 1$, by $k \le n$ processes. Let $X$ be the set of processes that get a view


Algorithm Renaming($n$, $First$, $Direction$);
(1) $S_i \leftarrow$ WScan($i$);
(2) $Last \leftarrow First + Direction \cdot (2n - 2)$;
(3) if $|S_i| = n$ then
(4)     if $i = \max S_i$ then return $Last$;
(5)     return Renaming$^{\leftarrow}$($n - 1$, $Last - 1$, $-Direction$);
(6) else return Renaming$^{\rightarrow}$($n - 1$, $First$, $Direction$)

Fig. 6. Write/scan binary branching renaming (code for $p_i$)

smaller than $n$ in line (1), $|X| = x$. Notice that $0 \le x \le n - 1$. If $k < n$ then all get a view smaller than $n$, and they all return Renaming($n - 1$, $First$, $Direction$) in line (6), terminating correctly. So for the rest of the proof, assume $k = n$. Let $Y$ be the set of processes that get a view of size $n$ in line (1), excluding the process of largest id, $|Y| = y$. Thus, $0 \le y \le n - 1$. The processes in $X$ return from Renaming($n - 1$, $First$, $Direction$) in line (6), with slots in the range $First + [0..2x - 2]$. The processes in $Y$ return from Renaming($n - 1$, $Last - 1$, $(-1) \cdot Direction$) in line (5), with slots in the range $[Last - 1 - (2y - 2)..Last - 1]$. To complete the proof we need to show that $First + 2x - 2 < Last - 1 - (2y - 2)$. Recall $Last = First + Direction \cdot (2n - 2)$. Thus, taking $Direction = +1$, we need to show that $2x - 2 < 2n - 2 - 1 - (2y - 2)$, that is, $2(x + y) - 2 < 2n - 2 - 1 + 2$. As $x + y \le n$, the left-hand side is at most $2n - 2$, which is less than $2n - 1$, and we are done.
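The counting argument can be checked empirically with a simulation of the algorithm of Figure 6. This is a sketch under assumptions that are not in the paper: processes are generators stepped by a seeded random scheduler, ids are 0-based, each recursive call is named by the full sequence of arrows leading to it, and the recursive call of line 5 starts at `last - direction` (the figure writes $Last - 1$, which coincides when $Direction = +1$) so that the same code serves both directions.

```python
import random

BOTTOM = None  # stands in for the default value of an array entry


def wscan(array, i, value):
    """Write, then read every entry, yielding before each step."""
    yield
    array[i] = value
    values = []
    for j in range(len(array)):
        yield
        values.append(array[j])
    return values


def renaming(n, first, direction, i, path, memory, size):
    """Renaming(n, First, Direction) for process i; `path` names the call."""
    array = memory.setdefault(path, [BOTTOM] * size)
    values = yield from wscan(array, i, i)
    s_i = {v for v in values if v is not BOTTOM}
    last = first + direction * (2 * n - 2)
    if len(s_i) == n:
        if i == max(s_i):
            return last
        # Assumption: step back by `direction` rather than by a fixed 1.
        return (yield from renaming(n - 1, last - direction, -direction,
                                    i, path + "<", memory, size))
    return (yield from renaming(n - 1, first, direction,
                                i, path + ">", memory, size))


def run_renaming(n, seed):
    """Random interleaving of n processes, all starting with First = 1."""
    rng = random.Random(seed)
    memory = {}
    running = {i: renaming(n, 1, +1, i, "", memory, n) for i in range(n)}
    names = {}
    while running:
        i = rng.choice(sorted(running))
        try:
            next(running[i])
        except StopIteration as done:
            names[i] = done.value
            del running[i]
    return names


names = run_renaming(5, seed=3)
```

With $First = 1$ and $Direction = +1$, every run should give the $n$ processes distinct names in the range $1..2n - 1$.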

5.2 A Multi-way Branching Renaming Algorithm

In the previous Renaming$_n$ algorithm the set of processes, $X$, that get back a view of size less than $n$, will waste recursive calls, calling the algorithm again and again (with the same values for $First$ and $Direction$, but smaller values of $n$) until $n$ goes down to $n'$, with $n' = |X|$. In this recursive call, Renaming$_{n'}$, the processes that get back a view of size $n'$ will do something interesting; that is, one might get a slot, and the others will invoke Renaming recursively, but in the opposite direction. Therefore, using immediate snapshots, we can rewrite the Renaming algorithm in the form of Figure 7. This is the renaming algorithm presented in [7]. Notice that in isRenaming$_{tag}$ the subindex $tag$ is a sequence of integers: in line 4, the new tag $tag \cdot |S_i|$ is the old tag with $|S_i|$ appended at the end. These tags are a way of naming the recursive calls, so that a set of processes that should participate in the same execution of isRenaming can identify it using the same


Algorithm isRenaming$_{tag}$($First$, $Direction$);
(1) $S_i \leftarrow$ Immediate Snapshot($i$);
(2) $Last \leftarrow First + Direction \cdot (2|S_i| - 2)$;
(3) if $i = \max S_i$ then return $Last$;
(4) return isRenaming$_{tag \cdot |S_i|}$($Last - 1$, $-Direction$)

Fig. 7. Immediate snapshot multi-way branching renaming (code for $p_i$)

tag. The first time isRenaming is called, $tag$ should take some default value, say the empty set. Although the analysis in [7] bounded the number of steps by $O(n^3)$, we observe that a better bound can be given:

Theorem 3. Algorithm isRenaming solves the renaming task for $n$ processes, in $O(n^2)$ steps.

Proof. As explained above, the algorithm is equivalent to Renaming, and hence correctness follows from Theorem 2. Now, to show that the complexity is $O(n^2)$ steps we do an amortized analysis, based on Theorem 1: a process obtains a snapshot of size $s$ from the Immediate Snapshot algorithm for $n$ processes in $\Theta(n(n - s + 1))$ steps. Assume a process runs isRenaming until it obtains a slot, invoking the algorithm recursively $k$ times. In the $i$-th call, assume it gets a snapshot of size $s_i$ (in line 1). For example, if $s_1 = n$, then $k = 1$, and using the result of Theorem 1, the number of steps executed is $n$. In general, the number of steps executed by a process is $n$ times $[n - s_1] + [(n - s_1) - (n - s_2)] + [(n - s_2) - (n - s_3)] + \cdots + [(n - s_{k-1}) - (n - s_k)]$, which gives a total of $O(n(n - s_k))$.

6 Swap

Here we consider two tasks, Tournament$_\pi$ and Swap$_\pi$. A process can invoke these tasks with its input id, where $\pi$ is the id of a process that does not invoke the task, or 0, a special virtual id. Each process gets back another process id, or $\pi$. A process never gets back its own id. Exactly one process gets back $\pi$. We think of this process as the winner of the invocation. The induced digraph consists of all arcs $i \rightarrow j$ such that process $i$ received process $j$ as output from the task; it is guaranteed that the digraph is acyclic. We say that $j$ is the parent of $i$. As every vertex has exactly one outgoing arc, except for the root, $\pi$, which has none, there is exactly one directed path from each vertex to the root. The Swap task always returns a directed path, while the Tournament can return any acyclic digraph.


Afek et al. [2,29] noticed that these two tasks cannot be solved using read/write registers only, but can be solved if 2-consensus tasks are also available, namely, tasks that can be used to solve consensus for 2, but not for 3 processes.¹ They presented a wait-free algorithm that solves Swap using read/write registers and Tournament tasks. The following is a recursive version of this algorithm, of the same complexity.

The Swap algorithm is in Figure 8. Process $p_i$ invokes the algorithm with Swap$_{tag}$($i$), where $tag = 0$. In line 1 process $i$ invokes Tournament$_{tag}$($i$) and in case it wins, i.e., gets back $tag$, it returns $tag$. By the specification of the tournament task, one, and only one, process wins. All others recursively invoke a Swap task: all processes with the same parent $\pi$ invoke the same Swap$_\pi$($i$) task.

Algorithm Swap$_{tag}$($i$);
(1) $\pi \leftarrow$ Tournament$_{tag}$($i$);
(2) if $tag = \pi$ then return $\pi$;
(3) else return Swap$_\pi$($i$)

Fig. 8. The Swap algorithm (code for $p_i$)
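The recursion of Figure 8 can be traced with a toy model. Everything beyond the three lines of the figure is an assumption of this sketch: the `Tournament` class is a hypothetical sequential stand-in in which the first process to arrive at an instance wins and every later arrival is pointed at a randomly chosen earlier arrival (which keeps the digraph acyclic), and processes run to completion one after another, so only sequential schedules are covered. Process ids must be nonzero, since 0 is the virtual id.

```python
import random


class Tournament:
    """Hypothetical sequential stand-in for the Tournament task."""

    def __init__(self, rng):
        self.rng = rng
        self.arrived = {}  # tag -> ids in arrival order

    def invoke(self, tag, i):
        earlier = self.arrived.setdefault(tag, [])
        # The first arrival wins and gets tag; others get an earlier arrival.
        parent = self.rng.choice(earlier) if earlier else tag
        earlier.append(i)
        return parent


def swap(tournament, tag, i):
    """Algorithm Swap of Figure 8, run by one process without interruption."""
    parent = tournament.invoke(tag, i)
    if parent == tag:
        return parent
    return swap(tournament, parent, i)


def run_swap(order, seed):
    """Processes run to completion one after another, in the given order."""
    tournament = Tournament(random.Random(seed))
    return {i: swap(tournament, 0, i) for i in order}


outputs = run_swap([3, 1, 2], seed=0)
```

In this restricted setting each process ends up pointing at the process that ran just before it, so the outputs form a directed path ending at 0, whatever parents the tournaments hand out along the way.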

Assuming initially that $\pi_i = 0$ for each process $p_i$, we get an invariant: the induced digraph (arcs from $p_i$ to $\pi_i$) is a directed tree rooted in 0. Initially, all processes point to 0 directly. Each time Tournament$_{tag}$ is executed, it is executed by a set of processes pointing to $tag$. The result of the execution is that exactly one process $p$ keeps on pointing to $tag$, while the others induce a directed graph rooted at $p$.

An execution for 5 processes appears in Figure 9. Notice that although it is a tree, it has a unique topological order, as opposed to the tree of Figure 4. The result of executing Tournament$_0$ is that 1 wins, 2, 3, 4 point to 1, while 5 points to 3. The result of executing Tournament$_1$ is that 2 wins, while 3, 4 point to 2. The result of executing Tournament$_2$ is that 3 wins, while 4 points to 3. The result of executing Tournament$_3$ is that 4 wins, while 5 points to 4. Finally, 5 executes Tournament$_4$ by itself and wins.

For the following theorem, we count as a step either a read or write operation, or a 2-consensus operation, and assume a Tournament task can be implemented using $O(n)$ steps [2,29].

Theorem 4. Algorithm Swap solves the swap task for $n$ processes, in $O(n^2)$ steps.

Proof. The algorithm terminates, as in each invocation one process returns, the winner of the Tournament. Also, the basis is easy, as when only one process invokes the Tournament, it is necessarily the winner. Assume inductively that
¹ The task of Afek et al. is not exactly the same as ours; for instance, they require a linearizability property, stating that the winner is first.


[Figure 9: the chain of invocations Swap$_0$, Swap$_1$, Swap$_2$, Swap$_3$, Swap$_4$, which output 0, 1, 2, 3 and 4 respectively.]

Fig. 9. Branching deterministic recursion, for 5 processes

the algorithm solves swap for fewer than $n$ processes. Consider an invocation of Swap, where the winner in line 1 is some process $p$. Consider the processes $W$ that get back $p$ in this line. Every other process will get back a descendant of these processes. Thus, the only processes that invoke Swap$_{tag}$ with $tag = p$ are the processes in $W$. Moreover, it is impossible that two processes return the same tag, say $tag = x$, because a process that returns $x$ does so during the execution of Swap$_x$, after winning the tournament invocation, and only one process can win this invocation.

Acknowledgments. We thank Michel Raynal and the anonymous referees for their comments on an earlier version of this paper.

References
1. Afek, Y., Attiya, H., Dolev, D., Gafni, E., Merritt, M., Shavit, N.: Atomic Snapshots of Shared Memory. J. ACM 40(4), 873–890 (1993)
2. Afek, Y., Weisberger, E., Weisman, H.: A Completeness Theorem for a Class of Synchronization Objects (Extended Abstract). In: Proc. 12th Annual ACM Symposium on Principles of Distributed Computing (PODC), Ithaca, New York, USA, August 15-18, pp. 159–170 (1993)
3. Attiya, H., Bar-Noy, A., Dolev, D.: Sharing Memory Robustly in Message-Passing Systems. J. ACM 42(1), 124–142 (1995)
4. Attiya, H., Bar-Noy, A., Dolev, D., Peleg, D., Reischuk, R.: Renaming in an Asynchronous Environment. J. ACM 37(3), 524–548 (1990)


5. Bar-Noy, A., Dolev, D., Dwork, C., Strong, H.R.: Shifting Gears: Changing Algorithms on the Fly to Expedite Byzantine Agreement. Inf. Comput. 97(2), 205-233 (1992)
6. Borowsky, E., Gafni, E.: Generalized FLP Impossibility Results for t-Resilient Asynchronous Computations. In: Proc. 25th ACM Symposium on the Theory of Computing (STOC), pp. 91-100. ACM Press, New York (1993)
7. Borowsky, E., Gafni, E.: Immediate Atomic Snapshots and Fast Renaming (Extended Abstract). In: 12th Annual ACM Symposium on Principles of Distributed Computing (PODC), Ithaca, New York, USA, August 15-18, pp. 41-51 (1993)
8. Borowsky, E., Gafni, E.: A Simple Algorithmically Reasoned Characterization of Wait-Free Computations (Extended Abstract). In: Proc. 16th ACM Symposium on Principles of Distributed Computing (PODC 1997), Santa Barbara, California, USA, August 21-24, pp. 189-198 (1997)
9. Borowsky, E., Gafni, E., Lynch, N., Rajsbaum, S.: The BG Distributed Simulation Algorithm. Distributed Computing 14(3), 127-146 (2001)
10. Castañeda, A., Rajsbaum, S.: New combinatorial topology upper and lower bounds for renaming. In: Proceedings of the 27th Annual ACM Symposium on Principles of Distributed Computing (PODC), Toronto, Canada, August 18-21, pp. 295-304 (2008)
11. Chou, C.-T., Gafni, E.: Understanding and Verifying Distributed Algorithms Using Stratified Decomposition. In: Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing (PODC), Toronto, Ontario, Canada, August 15-17, pp. 44-65 (1988)
12. Coan, B.A., Welch, J.L.: Modular Construction of an Efficient 1-Bit Byzantine Agreement Protocol. Mathematical Systems Theory 26(1), 131-154 (1993)
13. Dobson, J., Randell, B.: Building Reliable Secure Computing Systems out of Unreliable Insecure Components. In: Proc. IEEE Conference on Security and Privacy, Oakland, USA, pp. 187-193 (1986)
14. Elrad, T., Francez, N.: Decomposition of Distributed Programs into Communication-Closed Layers. Sci. Comput. Program. 2(3), 155-173 (1982)
15. Fouren, A.: Exponential examples for two renaming algorithms (August 1999), http://www.cs.technion.ac.il/~hagit/publications/expo.ps.gz
16. Gafni, E.: Round-by-Round Fault Detectors, Unifying Synchrony and Asynchrony (Extended Abstract). In: Proc. 17th Annual ACM Symposium on Principles of Distributed Computing (PODC), Puerto Vallarta, Mexico, June 28-July 2, pp. 143-152 (1998)
17. Gafni, E., Rajsbaum, S., Herlihy, M.: Subconsensus Tasks: Renaming Is Weaker Than Set Agreement. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 329-338. Springer, Heidelberg (2006)
18. Gallager, R.G., Humblet, P.A., Spira, P.M.: A Distributed Algorithm for Minimum-Weight Spanning Trees. ACM Trans. Program. Lang. Syst. 5(1), 66-77 (1983)
19. Herlihy, M., Rajsbaum, S.: The Topology of Shared-Memory Adversaries. In: Proc. 29th ACM Symposium on Principles of Distributed Computing (PODC), Zurich, Switzerland, July 25-28 (to appear, 2010)
20. Herlihy, M.P., Shavit, N.: The Topological Structure of Asynchronous Computability. Journal of the ACM 46(6), 858-923 (1999)
21. Lamport, L., Shostak, R.E., Pease, M.C.: The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems 4(3), 382-401 (1982)
22. Loui, M.C., Abu-Amara, H.H.: Memory requirements for agreement among unreliable asynchronous processes. In: Preparata, F.P. (ed.) Advances in Computing Research, vol. 4, pp. 163-183. JAI Press, Greenwich (1987)


23. Manber, U.: Introduction to Algorithms: A Creative Approach. Addison-Wesley, Reading (1989)
24. Randell, B.: Recursively structured distributed computing systems. In: Proc. IEEE Symposium on Reliability in Distributed Software and Database Systems, pp. 3-11 (1983)
25. Rajsbaum, S., Raynal, M., Travers, C.: An impossibility about failure detectors in the iterated immediate snapshot model. Inf. Process. Lett. 108(3), 160-164 (2008)
26. Rajsbaum, S., Raynal, M., Travers, C.: The Iterated Restricted Immediate Snapshot Model. In: Hu, X., Wang, J. (eds.) COCOON 2008. LNCS, vol. 5092, pp. 487-497. Springer, Heidelberg (2008)
27. Saks, M., Zaharoglou, F.: Wait-Free k-Set Agreement is Impossible: The Topology of Public Knowledge. SIAM Journal on Computing 29(5), 1449-1483 (2000)
28. Stojmenovic, I.: Recursive algorithms in computer science courses: Fibonacci numbers and binomial coefficients. IEEE Trans. on Education 43(3), 273-276 (2000)
29. Weisman, H.: Implementing shared memory overwriting objects. Master's thesis, Tel Aviv University (May 1994)

Non-recursive Immediate Snapshots Algorithm

A wait-free algorithm solving the immediate snapshot task was described in [6]. We present it here for comparison with the recursive version presented in Section 4. The algorithm is in Figure 10.

Algorithm Immediate Snapshot($i$);
repeat
    $LEVEL[i] \leftarrow LEVEL[i] - 1$;
    for $j \in \{1, \ldots, n\}$ do $level_i[j] \leftarrow LEVEL[j]$ end for;
    $view_i \leftarrow \{j : level_i[j] \le LEVEL[i]\}$;
until ($|view_i| \ge LEVEL[i]$) end repeat;
return($\{j \text{ such that } j \in view_i\}$)

Fig. 10. Non-recursive one-shot immediate snapshot algorithm (code for $p_i$)

It is argued in [6] that this algorithm solves the immediate snapshot task, with $O(n^3)$ complexity.
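To experiment with the algorithm of Figure 10, the following minimal sketch simulates it in a single thread. It rests on several assumptions that are not part of the figure: each LEVEL[j] is taken to start at n + 1, each process is modeled as a Python generator whose every `yield` marks one shared-memory read or write, and the helper names run and check are hypothetical. The scheduler picks the next process at random, so each seed gives one possible interleaving. The function check asserts the three usual immediate snapshot properties (self-inclusion, containment, immediacy) on the returned views.

```python
import random


def immediate_snapshot(i, LEVEL, n, views):
    """Code for process i; each `yield` ends one shared-memory step."""
    while True:
        LEVEL[i] -= 1                  # write: descend one level
        yield
        local = {}
        for j in range(1, n + 1):
            local[j] = LEVEL[j]        # read LEVEL[j]
            yield
        view = {j for j in local if local[j] <= LEVEL[i]}
        if len(view) >= LEVEL[i]:
            views[i] = view            # process i returns its view
            return


def run(n, seed):
    """Run all n processes to completion under a random interleaving."""
    rng = random.Random(seed)
    LEVEL = {j: n + 1 for j in range(1, n + 1)}   # assumed initial value
    views = {}
    live = [immediate_snapshot(i, LEVEL, n, views) for i in range(1, n + 1)]
    while live:
        proc = rng.choice(live)
        try:
            next(proc)                 # let proc take one step
        except StopIteration:
            live.remove(proc)
    return views


def check(views):
    """Assert the immediate snapshot properties on the returned views."""
    for i, vi in views.items():
        assert i in vi                     # self-inclusion
        for j, vj in views.items():
            assert vi <= vj or vj <= vi    # containment
            if j in vi:
                assert vj <= vi            # immediacy


for seed in range(10):
    check(run(n=4, seed=seed))
```

Random interleavings only sample the space of executions, so a passing run is a sanity check rather than a substitute for the correctness argument of [6].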