
An overview of distributed MST algorithms

Arkanath Pathak, Buddha Prakash, Utpal, Palash Mittal


In this paper, we discuss the problem of finding a Minimum Spanning Tree (MST) over a weighted undirected graph. We also discuss and compare three classic distributed algorithms for the problem in the sections that follow. Finding a distributed algorithm for a minimum weight spanning tree is a fundamental problem in the field of distributed network algorithms. Trees and MSTs are used in a wide variety of algorithms in the distributed graph structure domain. Around 1983, when the influential and classic approach was provided by Gallager et al. [7], MST algorithms were already being used in broadcast algorithms for communication networks. With the help of a minimum cost tree, the cost of a broadcast can be reduced by a significant amount [2, 3]. Here, the edge weights correspond to the cost of using a channel in a specific direction. In addition to the broadcast application, there are many potential control problems for networks whose communication complexities are reduced by having a known spanning tree. Spanning trees themselves are essential components in classic distributed graph problems like Leader Election [10], network synchronization [1], Breadth-First-Search [3] and Deadlock Resolution [4].


Problem Definition

Let $G = (V, E)$ be an undirected graph, where $V$ is the finite nonempty set of nodes and $E \subseteq V \times V$ is the set of edges defined over $V$. A graph $G' = (V_G, E_G)$ is a spanning sub-graph of $G$ if $V_G = V$ and $E_G \subseteq E$. A spanning tree of a graph is a connected acyclic spanning sub-graph. The original graph $G$ is assumed to be connected for this problem domain, otherwise a spanning tree cannot be formed. A spanning tree $T = (V, E_T)$, $E_T \subseteq E$, is the minimum cost spanning tree if the sum of edge costs over $E_T$ is minimum. Formally,

$$T = \operatorname*{argmin}_{T' = (V, E_{T'})} \sum_{e \in E_{T'}} w(e),$$

where $T'$ ranges over the spanning trees of $G$ and $w(e)$ is the weight assigned to the edge $e$. The edges are assumed to have distinct weights, which makes the minimum weight spanning tree unique. This property can easily be ensured by various workarounds, e.g. appending node ids to the edge weights, which ensures a proper ordering in which ties are broken by the node ids. In the late 1950s, long before the importance of the problem in the communication networks domain was realized, the classic approaches towards finding the MST were proposed by Dijkstra [6], Prim [12] and Kruskal [9]. These approaches primarily relied on the following two lemmas. Note that the proofs are informal and they assume the edge costs to be distinct, hence ensuring the presence of a unique MST.
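The tie-breaking workaround mentioned above can be sketched as follows. This is a minimal illustration, assuming node ids are comparable values; the helper name `distinct_key` is hypothetical and not taken from any of the reviewed papers.

```python
def distinct_key(weight, u, v):
    """Return a totally ordered key for the undirected edge {u, v}.

    Edges with equal weight are ordered by their (sorted) endpoint ids,
    so no two different edges ever compare as equal.
    """
    a, b = (u, v) if u <= v else (v, u)
    return (weight, a, b)


# Three edges share the weight 4, yet all keys are distinct.
edges = [(4, "a", "b"), (4, "b", "c"), (4, "c", "a"), (2, "c", "d")]
keys = [distinct_key(w, u, v) for w, u, v in edges]
ordered = sorted(keys)
```

Any algorithm that compares edges through such keys instead of raw weights sees a strict total order, which is all the uniqueness argument requires.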

Lemma 1.1 (MST Cut Lemma) Let P and Q be two disjoint node sets whose union is the node set of the original graph. Then, among all the edges between P and Q, the edge with minimum cost must be present in the MST.
Proof Let T be an MST that does not contain e, the minimum-cost edge between P and Q. Adding e to T produces a cycle. Since e crosses from P to Q, the cycle must cross back through some other edge e' between P and Q, and e' costs more than e. Replacing e' with e gives a spanning tree that is cheaper than T, a contradiction.
Lemma 1.2 (MST Cycle Lemma) The costliest edge of any cycle in a connected undirected graph cannot be in the MST.
Proof Let T be an MST that contains e, the costliest edge of some cycle C. Removing e from T (T − e) gives two disjoint connected components. Traversing C, we find another edge of C that connects the two components and is cheaper than e; replacing e with that edge gives a cheaper spanning tree. Hence, a contradiction.
Either of the above two lemmas can be used to construct the MST. Prim's algorithm [12] is a nearly direct implementation of the Cut Lemma (Lemma 1.1). These approaches have a time complexity of O(E log V).
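To make the link between the Cut Lemma and Prim's algorithm concrete, here is a minimal sequential sketch. It assumes the graph is connected and given as an adjacency dictionary; the function name `prim_mst` and the sample graph are illustrative only.

```python
import heapq


def prim_mst(graph, start):
    """graph: dict mapping node -> list of (weight, neighbour) pairs.

    Repeatedly applies the Cut Lemma to the cut (visited, not visited):
    the cheapest edge crossing that cut is added to the tree.
    """
    visited = {start}
    frontier = [(w, start, v) for w, v in graph[start]]
    heapq.heapify(frontier)
    tree = []
    while frontier and len(visited) < len(graph):
        w, u, v = heapq.heappop(frontier)
        if v in visited:
            continue  # this edge no longer crosses the cut
        visited.add(v)
        tree.append((w, u, v))
        for w2, x in graph[v]:
            if x not in visited:
                heapq.heappush(frontier, (w2, v, x))
    return tree


graph = {
    "A": [(1, "B"), (4, "C")],
    "B": [(1, "A"), (2, "C"), (5, "D")],
    "C": [(4, "A"), (2, "B"), (3, "D")],
    "D": [(5, "B"), (3, "C")],
}
mst = prim_mst(graph, "A")
```

With a binary heap, each edge is pushed and popped at most a constant number of times, which is where the O(E log V) bound comes from.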
From here on, however, we focus on the distributed approaches to the problem in the static asynchronous network domain [1]. In this setting, the graph is assumed to represent a point-to-point communication network, where the set of nodes V represents the processors of the network and the set of edges E represents bidirectional non-interfering communication channels operating between neighbouring nodes [3]. Communication takes place only through the channels, and no memory is shared between the processes. However, a node does know the identity of its neighbours. Also, the approaches are confined to event-driven algorithms; hence, there is no global clock to help a process gain knowledge of events taking place in other processes.
For the distributed approaches, the complexity of an algorithm is usually measured either in terms of communication or in terms of time. Communication complexity is the number of messages sent during the algorithm. In some models, it is further assumed that a single message can contain at most O(log V) bits [3]. The definition of time complexity varies with the usage. It is usually based on either the number of rounds (relative to an algorithm) or time units (global clock).


Types Of Solutions

All three distributed algorithms for the MST problem that we review follow the same general scheme, which was introduced in the classical GHS algorithm by Gallager et al. [7]. There is no pre-specified root in the distributed network. Each node stores its own copy of the algorithm, and the algorithm is started independently by all the nodes. The algorithm proceeds by merging fragments (connected subgraphs of the MST). It is interesting to note that the algorithm by Garay et al. [8] also involves breaking fragments down into smaller fragments to ensure an upper bound on the diameter of each fragment. Initially, all fragments are singletons (consisting of only a single node). In the merge step, each fragment determines its best edge (the minimum-weight edge among all its outgoing edges), and the two fragments are merged along this selected edge. The edge is determined by propagating an initiate message to each node in the fragment. Each node gathers the reports of all its children and propagates the minimum outgoing edge among those reported by its children and the one it determined itself. Upon termination, the algorithm has found the unique MST of the distributed network and has propagated the information about it to all the nodes.
Even though all three algorithms follow the same general scheme, they differ in how they break down the problem and in the approaches adopted to solve the sub-problems. In a general sense, there are two methods for computing the MST.
1. Combining fragments: add edges until the MST is constructed.
2. Eliminating cycles: delete edges until only the MST is left.
Both Gallager's [7] and Awerbuch's [1] algorithms use the first approach, whereas Garay's [8] algorithm combines the two approaches.
Awerbuch's and Garay's algorithms achieve optimal performance, both in terms of message passing and time complexity. The algorithms proceed by breaking down the problem into three smaller sub-problems which are solved in three different phases. In the different phases, the algorithms establish a trade-off between message passing and time complexity.
Awerbuch's algorithm starts with the Counting phase, in which it determines the number of nodes in the network. The Counting phase is followed by a second phase in which the MST is developed by following the GHS algorithm until the fragments reach a large size (O(V / log V)). At this stage the algorithm switches to the third phase, which consists of two relatively complex procedures, Root Distance and Leader Distance, for adding the remaining edges of the MST. We discuss the implementation details and subtleties of the procedures later in the report.
Garay's algorithm starts with the Controlled GHS phase, which is a modified and controlled version of the GHS algorithm. The first phase produces a forest with a bounded number of fragments, where the diameter of each fragment is also bounded by an upper limit. This is followed by the second phase, which eliminates short cycles in the fragment graph. The third phase performs global edge elimination, in which the remaining fragment graph is reduced to the MST.



Before moving on to the individual algorithms, let us first review the typical issues faced in this domain. In the distributed domain, no node has knowledge of the state of the whole graph, which is required by both Prim's and Kruskal's algorithms. In the message-passing model, it is also difficult to process a single node at a time. Hence, the algorithms need to handle the different orderings of events that can take place at different nodes. Some distributed algorithms bear similarities to Borůvka's algorithm [5] for the classical MST problem. For the distributed approaches, there are issues to be tackled for each algorithm. One common issue arises from the assumption of known distinct edge costs, which does not always hold for the channels. However, as mentioned earlier, workarounds exist to ensure a proper ordering of the edges. Another common issue is identifying each fragment, so that internal edges can be told apart from outgoing edges. This issue is resolved by associating the identity of a fragment with its root node or edge.
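The fragment-merging scheme and its resemblance to Borůvka's algorithm can be illustrated with a sequential simulation. The sketch below is not any of the distributed algorithms reviewed here: it assumes a central view of the graph, distinct edge weights, and nodes numbered 0..n-1, and it identifies each fragment by a representative node, as discussed above. The name `boruvka_mst` is illustrative.

```python
def boruvka_mst(num_nodes, edges):
    """edges: list of (weight, u, v) with distinct weights."""
    fragment = list(range(num_nodes))  # parent pointers; a root identifies a fragment

    def find(x):
        # Follow pointers to the fragment representative, halving the path.
        while fragment[x] != x:
            fragment[x] = fragment[fragment[x]]
            x = fragment[x]
        return x

    tree = set()
    merged = True
    while merged:
        merged = False
        best = {}  # fragment representative -> cheapest outgoing edge
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # internal edge, not outgoing
            for r in (ru, rv):
                if r not in best or w < best[r][0]:
                    best[r] = (w, u, v)
        for w, u, v in best.values():
            ru, rv = find(u), find(v)
            if ru != rv:  # may already have been merged in this round
                fragment[ru] = rv
                tree.add((w, u, v))
                merged = True
    return tree


sample_edges = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 2), (5, 1, 3)]
sample_tree = boruvka_mst(4, sample_edges)
```

The distinct-weights assumption matters here: with ties, two fragments could pick different edges towards each other and close a cycle.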

Gallager et al. (1983) - GHS Algorithm

In this section, we describe the classical article by Gallager, Humblet and Spira published in 1983 [7]. The approach is built on the Cut Lemma (Lemma 1.1) and, as mentioned before, develops a set of fragments (or sub-trees) which ultimately converge into the MST. Before going into the working, we list the assumptions made by the algorithm:
- Each node has knowledge of all its neighbours, including the weights (or costs) of the edges (or channels) connecting to them.
- Each node has its own copy of the full algorithm in the initial state.
- The channels are assumed to be asynchronous, bidirectional, FIFO, and without any errors.
- The weights of the edges in the network are mutually distinct.
With these assumptions, we move on to the working of the algorithm.



The algorithm involves some complications due to the detailed logistics, so we first give an overview of the working and then build upon the specifics in the next section.
At any point in the algorithm, there is a set of sub-trees, which the authors call fragments. These fragments merge with each other to ultimately form the MST. Each fragment is hence formed of a set of nodes. Initially, a fragment is just a single node of the graph. Also, each node has some identifying information (to be defined later) for the fragment it belongs to, and is aware of the edge leading to the core (a special edge, to be defined later) of the fragment. By the Cut Lemma (Lemma 1.1), we know that a fragment should merge only along the outgoing edge which has the minimum edge cost. To reduce the complexity, the authors impose a further constraint: a fragment may merge only with a bigger fragment. Due to this asymmetry, this merging is often called absorbing or hooking into the bigger fragment. Merging into bigger sets is a classic optimization technique used in data structures like Union-Find. Hence, the edge must also connect to a bigger fragment. Such an edge, if found, is called the best-edge for the fragment. Therefore, each fragment gets merged along its best-edge. It follows that finding the best-edge for a fragment forms a crucial module of the algorithm.
The core of a fragment, introduced in the previous paragraph, can be treated as the root of the fragment, since each node maintains an inbound edge leading to the core. Also, the identification (introduced in the previous paragraph) of a fragment is nothing but the weight of the core edge. Since the weights are distinct, the identification is also unique. The definition of the core edge relies on an attribute called the level, defined for each fragment. The level provides a lower bound on the logarithm of the number of nodes in the fragment; we shall shortly see how. The level of a fragment is initialized to 0 (when the fragment consists of a single node). Each fragment gets absorbed into a fragment of higher or equal level. If the fragment gets absorbed into a fragment of a higher level, it just becomes a part of the bigger fragment and assumes the identity of the bigger fragment. However, if the levels of the two fragments are equal, a new fragment is formed with the level incremented by 1. We can now define the core edge, whose weight serves as the ID of the fragment. When a fragment has level 0 (single node), there is no core edge. However, at each increment of the level (caused by the merging of fragments of equal levels), the core is updated to the edge along which the merge took place (the edge connecting the two fragments). Note that when two fragments of equal level merge, the best-edge has to be determined afresh for the combined fragment. It follows that a level L + 1 fragment always contains at least two level L fragments, and hence each level L fragment contains at least $2^L$ nodes. Thus, the level of a fragment provides a lower bound on the log of the number of nodes.
To give a one-line summary of the GHS algorithm: it maintains a set of mutually exclusive fragments, each fragment has a core edge, at any point in time each fragment is waiting to get merged along its best-edge (which may not be known at all times), and these fragments ultimately merge to give a single fragment, the MST.
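The level rule and the resulting size bound can be checked with a small sketch. It is a simplification under stated assumptions: a fragment is modelled only as a (level, size) pair, and an equal-level merge is assumed to happen along a common best-edge. The function name `combine` is hypothetical.

```python
def combine(frag_a, frag_b):
    """frag_a connects to frag_b; each fragment is a (level, size) pair.

    Returns the resulting (level, size), or None when frag_a has to wait
    because frag_b is still at a lower level.
    """
    level_a, size_a = frag_a
    level_b, size_b = frag_b
    if level_a < level_b:
        return (level_b, size_a + size_b)      # absorbed, level unchanged
    if level_a == level_b:
        return (level_a + 1, size_a + size_b)  # merge along a new core edge
    return None


single = (0, 1)
level1 = combine(single, single)    # two singletons merge
level2 = combine(level1, level1)    # two level-1 fragments merge
absorbed = combine(single, level2)  # a singleton is absorbed
```

In every result the size is at least 2 raised to the level, which is the invariant that bounds the number of levels by log2 V.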


Algorithm: Specifics

We now discuss the specifics of the algorithm, including the types of messages that are sent and the actions taken by each node. During the course of the algorithm, every node is in one of three states: Sleeping, Find and Found. Initially, every node is in the Sleeping state. It is then awakened either by itself or by another node, after which it never returns to the Sleeping state. At any time after the node has awakened, it is either in the Find state or in the Found state; in fact, it switches between the two until the MST is found.
We now describe the search for the best-edge of a fragment. If the fragment consists of a single node, the best-edge is simply the least-cost outgoing edge which leads to a fragment of higher or equal level. To find it, the node iterates over all of its adjacent edges, chooses the edge with the minimum cost, and sends a Connect message to the adjacent node. It subsequently goes into the state Found and waits for a response from the fragment at the other end. Now, let us consider the algorithm for a fragment with more than one node (non-zero level). The type of message used for this case is Initiate. The process is started whenever two fragments of level L − 1 merge to form a bigger fragment with a new core. The two nodes forming the core edge broadcast the Initiate message to the other nodes of the fragment. This broadcast is done along the outward edges of the tree (those pointing away from the core, as opposed to the inbound edges). The Initiate message also carries the information < ID, Level >, where ID is the weight of the core. When a node receives the Initiate message, it changes its state to Find. This essentially starts the process of finding the minimum-weight adjacent edge for each node. Each node labels each of its adjacent edges with one of the following three classes: Branch, Rejected and Basic. Branch means the edge is in the fragment tree. Rejected means the edge has been discovered to lead to a node of the same fragment. The Basic edges are the remaining edges, which are labelled neither Branch nor Rejected. Now, to find its best-edge candidate, the node finds its minimum-weight Basic edge and sends a Test message to the node on the other side of the edge. The Test message contains the < ID, Level > information of the fragment. If the node on the other side has the same ID (same fragment), it replies with a Reject message, and both nodes label the edge as Rejected. Note that if a node has sent a Test message and receives a Test message with the same identity from the other side, it need not send a Reject message as a reply and can just label the edge as Rejected. If the node on the other side has a different identity, it either responds by sending an Accept message (if its level is greater than or equal to the received level), or it delays making any response until its fragment reaches a level greater than or equal to the one received in the message. Since the response is delayed, the node which sent the message is blocked, and hence the whole fragment is blocked. This essentially means that a fragment will finish finding the best-edge if and only if none of the outgoing best-edge candidates of the fragment lead to a fragment with a lower level.
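The three possible reactions to a Test message described above can be written as a small pure function. This is only a sketch of the decision rule; the function name and the string return values are illustrative, and queueing of delayed messages is left out.

```python
def respond_to_test(own_id, own_level, sender_id, sender_level):
    """Decide how a node reacts to Test(<sender_id, sender_level>)."""
    if own_id == sender_id:
        return "REJECT"  # same fragment: the edge is internal
    if own_level >= sender_level:
        return "ACCEPT"  # different fragment at a sufficient level
    return "DELAY"       # wait until the own level catches up


same_fragment = respond_to_test(own_id=7, own_level=2, sender_id=7, sender_level=2)
higher_level = respond_to_test(own_id=3, own_level=2, sender_id=7, sender_level=1)
lower_level = respond_to_test(own_id=3, own_level=0, sender_id=7, sender_level=1)
```

The "DELAY" branch is the one that can block the sender, and with it the sender's whole fragment, until the receiving fragment has grown.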
When the individual nodes have found their respective minimum-weight edges (the candidates for best-edge), the nodes need to cooperate to find the best-edge of the fragment. This is achieved by propagating Report messages towards the core. A Report(W) message is sent on the inbound edge, where W is the weight of the minimum-weight outgoing edge encountered so far. W is ∞ when no outgoing edge has been encountered yet. A node waits to receive Report messages from all its branches except the inbound edge, and then takes the minimum of the reported weights and the weight of the minimum-weight outgoing edge it found itself. This minimum is then propagated in a Report message on its inbound edge. When a node sends a Report message, it changes its state to Found. Furthermore, each node saves the edge leading to the best-edge candidate so that the path can be traced back.
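The Report step is a convergecast: every node forwards the minimum over its own candidate and the values reported by its subtree. A minimal recursive sketch follows, assuming a single root in place of the two core nodes and a tree given as a `children` dictionary; all names are illustrative.

```python
import math


def report(node, children, local_best):
    """Return (weight, node) for the minimum outgoing edge in node's subtree.

    local_best maps a node to the weight of its own accepted outgoing edge;
    nodes without such an edge default to infinity, as in Report(W).
    """
    best = (local_best.get(node, math.inf), node)
    for child in children.get(node, []):
        best = min(best, report(child, children, local_best))
    return best


children = {"core": ["x", "y"], "x": ["z"]}
local_best = {"x": 7.0, "y": 5.0, "z": 3.0}
fragment_best = report("core", children, local_best)
```

Returning the node together with the weight plays the role of the saved BestEdge pointers: it records where the minimum came from so it can be traced back.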
Ultimately, when both nodes of the core edge have exchanged Report messages, these nodes act to inform the node which has the minimum-weight outgoing edge. Also, it is now certain that the fragment must merge along that best-edge; hence the inbound edges need to be reoriented towards that node, since the new core edge will either be in the adjoining fragment or it will be the best-edge itself. To do so, a Change-core message is propagated towards the node with the best-edge. For each edge that is encountered along this path, the inbound edge is reversed to point towards the best-edge.
Finally, the node with the best-edge sends a Connect(L) message over the best-edge. Here, L is the level of the fragment of the sending node. It may happen that the fragment on the other side has the same level. In that case the best-edge becomes the new core of a newly formed fragment with level L + 1. To achieve this, Initiate messages are broadcast with the new < ID, Level > information to both fragments. This serves two purposes: sending the information update to all the nodes, and initiating a new search, since the level has changed. On the other hand, if the connecting fragment has level L' > L, the fragment needs to get absorbed into the connecting fragment. To do so, an Initiate message is broadcast only to the joining fragment (the one with the smaller level). This both updates the < ID, Level > information of the joining fragment and initiates a search in the joining fragment, since its level is now updated. Note that the nodes in the connecting fragment remain unchanged, since they were already blocked in the search for the best-edge.
We now formalize these details by giving pseudo-code for the algorithm.



In Figure 1, we have shown an example run of the algorithm on a graph with 5 nodes.


Algorithm: Pseudo-code

We give pseudo-code, inspired by the one originally provided by the authors [7], in Algorithm 1. Comments (beginning with ▷) are also present for some lines of the pseudo-code.
Notations used in Algorithm 1 (GHS):
- Any procedure named RESPONSE-X is automatically triggered when a message X is received at a node.
- Procedure WAKE-UP can be triggered automatically if a node is sleeping.
- n is used as the default notation for the node at which the procedure executes.
- NodeState(n): State of the node n, enumerator variable with possible values SLEEPING, FIND and FOUND. Initialized with SLEEPING.
- EdgeState(e): State of the edge e, enumerator variable with possible values BRANCH, REJECTED and BASIC. Initialized with BASIC.
- Weight(e): Variable storing the weight of the edge e.
- ID(n): Variable storing the identity of the fragment which contains node n.
- Level(n): Variable storing the level of the fragment which contains node n.
- BestEdge(n): Variable storing the edge leading to the best-edge of the fragment which contains node n. Used for tracing back from the core.
- BestWeight(n): Variable storing the weight of the best-edge of the fragment which contains node n.
- TestEdge(n): Variable storing the outgoing edge on which a Test message has been sent. Set back to nil after the response is received.
- InboundEdge(n): Variable storing the inbound edge which leads to the core of the fragment.
- FindCount(n): Variable storing the count of Initiate messages sent by the node n. The node must receive Report messages from all of these before reporting on its inbound edge.
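The per-node variables listed above could be grouped as follows. This is a hypothetical layout for illustration only: the class and field names are assumptions, edges are identified by the neighbour's name, and no message handling is included.

```python
from dataclasses import dataclass, field
from enum import Enum
import math
from typing import Dict, Optional


class NodeState(Enum):
    SLEEPING = "sleeping"
    FIND = "find"
    FOUND = "found"


class EdgeState(Enum):
    BASIC = "basic"
    BRANCH = "branch"
    REJECTED = "rejected"


@dataclass
class GHSNode:
    weights: Dict[str, float]                 # Weight(e), keyed by neighbour
    state: NodeState = NodeState.SLEEPING     # NodeState(n)
    fragment_id: Optional[float] = None       # ID(n): weight of the core edge
    level: int = 0                            # Level(n)
    best_edge: Optional[str] = None           # BestEdge(n)
    best_weight: float = math.inf             # BestWeight(n)
    test_edge: Optional[str] = None           # TestEdge(n)
    inbound_edge: Optional[str] = None        # InboundEdge(n)
    find_count: int = 0                       # FindCount(n)
    edge_state: Dict[str, EdgeState] = field(default_factory=dict)

    def __post_init__(self):
        # Every adjacent edge starts out as BASIC.
        for neighbour in self.weights:
            self.edge_state.setdefault(neighbour, EdgeState.BASIC)

    def min_basic_edge(self) -> Optional[str]:
        """Return the neighbour across the cheapest BASIC edge, or None."""
        basic = [n for n, s in self.edge_state.items() if s is EdgeState.BASIC]
        return min(basic, key=self.weights.get, default=None)


node = GHSNode(weights={"B": 3.0, "C": 1.0})
first_choice = node.min_basic_edge()
node.edge_state["C"] = EdgeState.REJECTED
second_choice = node.min_basic_edge()
```

The `min_basic_edge` helper corresponds to the recurring pseudo-code step "find the adjacent edge m such that w(m) is minimum and EdgeState(m) = BASIC".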

(a) Initial graph, all nodes are in sleeping state.
(b) Node B spontaneously wakes up, sends Connect message to A.
(c) Node A wakes up and connects with B. Initiate message with new < ID, Level > information is sent to both. Node C also wakes up independently.
(d) Edge A-B becomes the core. Nodes A and B send Test messages. Node E wakes up and merges with Node C.
(e) C accepts the Test message since both have Level 1. D wakes up and sends the Connect message. D rejects the Test message since level is lower (0 < 1).
(f) B responds with Initiate message to D. A and B report to each other and Change-Core message is sent towards the node with the best-edge.
(g) Change-Core reaches A (node with best-edge) and Initiate message with incremented ID is sent to both fragments. D's Test message will be rejected later.
(h) During the propagation of Initiate messages (not shown), inbound edge direction for E is reversed. MST is formed since there are no outgoing edges.

Figure 1: Example run for GHS with a graph containing 5 nodes and 6 edges


Communication and Time Analysis

The authors show that the total number of messages is upper-bounded by $5V \log_2 V + 2E$, where V is the number of nodes and E is the number of edges. The proof goes as follows. The Reject message is sent at most once in each direction of an edge; hence, there are at most 2E Reject messages. Each node sends at most one Accept, Initiate, Report, Connect and Test message unless the fragment it belongs to changes its ID (and hence its level). Since there can be at most $\log_2 V$ level changes, the number of messages other than Reject is bounded by $5V \log_2 V$.
The time complexity is shown to be O(V log V) time units. The proof rests on the fact that it takes at most O(lV) time units until all nodes reach level l. This can be proved by induction on the number of levels, since the propagation of cooperation signals within a fragment requires O(V) time units. Since the level l is upper-bounded by log V, the total number of time units is bounded by O(V log V).
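To get a feel for the message bound, it can be evaluated for concrete network sizes. The helper below simply computes the formula quoted above; its name is illustrative.

```python
import math


def ghs_message_bound(num_nodes, num_edges):
    """Upper bound 5 V log2(V) + 2 E on the number of GHS messages."""
    return 5 * num_nodes * math.log2(num_nodes) + 2 * num_edges


small_network = ghs_message_bound(8, 12)  # 5 * 8 * 3 + 2 * 12
```

For dense graphs the 2E term from the Reject messages dominates, while for sparse graphs the V log V term does.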

Awerbuch (1987)

In this section, we describe the classical article by Awerbuch [4] published in 1987. We give a
brief overview of the algorithm followed by the algorithm specifics.



The main contribution of this work over past works is a linear-time algorithm for finding the Minimum Spanning Tree in an asynchronous network; the best previous algorithm had O(E + V log V) message complexity and took O(V log V) time. The GHS algorithm explained
Algorithm 1 GHS Algorithm

procedure WAKE-UP    ▷ Called at spontaneous waking of a node
    Level(n) := 0
    FindCount(n) := 0
    NodeState(n) := FOUND
    Find the adjacent edge m such that w(m) is minimum
    EdgeState(m) := BRANCH
    Send Connect(0) on m

procedure RESPONSE-INITIATE(< ID, Level >, State) received through edge j
    Level(n) := Level
    ID(n) := ID
    NodeState(n) := State
    InboundEdge(n) := j    ▷ Initiate message comes from the path leading to core edge
    BestEdge(n) := nil    ▷ Initially, no best-edge, set only after a valid best-edge found
    BestWeight(n) := ∞
    Send Initiate(< ID, Level >, State) on all adjacent edges (except j) of n which have state set as BRANCH    ▷ Broadcast along the branch edges
    if State = FIND then
        FindCount(n) := number of adjacent edges (except j) of n which have state set as BRANCH
        Find the adjacent edge m such that w(m) is minimum and EdgeState(m) = BASIC
        if there is no such edge m then    ▷ No edges to send Test to, good to go
            TestEdge(n) := nil
            Execute REPORT
        else
            TestEdge(n) := m
            Send Test(< ID(n), Level(n) >) on m

procedure RESPONSE-TEST(< ID, Level >) received through edge j
    if NodeState(n) = SLEEPING then
        Execute WAKE-UP
    if Level(n) ≥ Level and ID(n) ≠ ID then
        Send Accept on j
    if ID(n) = ID then
        Send Reject on j
        EdgeState(j) := REJECTED
    if Level(n) < Level and ID(n) ≠ ID then
        Delay the response by placing the received message on end of queue

procedure REPORT    ▷ Checks if got response from all the branches as well as outgoing edges
    if FindCount(n) = 0 and TestEdge(n) = nil then
        NodeState(n) := FOUND    ▷ Now need to propagate the results back to core
        Send Report(BestWeight(n)) on InboundEdge(n)

procedure RESPONSE-REPORT(Weight) received through edge j
    if j ≠ InboundEdge(n) then
        FindCount(n) := FindCount(n) − 1
        if Weight < BestWeight(n) then
            BestWeight(n) := Weight
            BestEdge(n) := j
        Execute REPORT    ▷ Check if Report received from all branches
    else if NodeState(n) = FIND then    ▷ Own search not finished yet
        Delay the response by placing the received message on end of queue
    else if Weight > BestWeight(n) then    ▷ Received at core, best-edge lies on this side
        Execute CHANGE-CORE
    else if Weight = BestWeight(n) = ∞ then    ▷ No outgoing edges
        halt, MST found

procedure RESPONSE-ACCEPT received through edge j
    TestEdge(n) := nil    ▷ Good to go
    if Weight(j) < BestWeight(n) then    ▷ Check with branches who have already reported
        BestEdge(n) := j
        BestWeight(n) := Weight(j)
    Execute REPORT

procedure RESPONSE-REJECT received through edge j
    if EdgeState(j) = BASIC then
        EdgeState(j) := REJECTED
    Find the adjacent edge m such that w(m) is minimum and EdgeState(m) = BASIC
    if there is no such edge m then
        TestEdge(n) := nil
        Execute REPORT
    else
        TestEdge(n) := m
        Send Test(< ID(n), Level(n) >) on m

procedure CHANGE-CORE    ▷ Best-edge found, inform the node with best-edge
    if EdgeState(BestEdge(n)) = BRANCH then
        Send Change-Core on BestEdge(n)    ▷ Propagate
    else    ▷ Reached node with best-edge, send Connect
        Send Connect(Level(n)) on BestEdge(n)
        EdgeState(BestEdge(n)) := BRANCH

procedure RESPONSE-CHANGE-CORE received through edge j
    InboundEdge(n) := j
    Execute CHANGE-CORE    ▷ Recursive implementation

procedure RESPONSE-CONNECT(Level) received through edge j
    if NodeState(n) = SLEEPING then
        Execute WAKE-UP
    if Level(n) > Level then    ▷ Absorb the lower-level fragment
        EdgeState(j) := BRANCH
        Send Initiate(< ID(n), Level(n) >, NodeState(n)) on j
        if NodeState(n) = FIND then
            FindCount(n) := FindCount(n) + 1
    else if EdgeState(j) = BASIC then
        Delay response by placing the message at the end of queue
    else    ▷ New core, note that Connect will be received by sending node as well
        Send Initiate(< Weight(j), Level(n) + 1 >, FIND) on j

in the previous section presented the fundamental ideas and concepts for doing so. The best previous algorithms were given by Chin and Ting and by Gafni. They are suboptimal in time by a logarithmic factor, taking O(V log V) time. The reason is that small trees sometimes wait for bigger trees, and these waiting relations between trees lead to a complex combinatorial structure.
The improvement in the performance of the MST algorithm in this paper is primarily due to two innovations: Root Update and Test Distance. The algorithm consists of two stages: the Counting stage, which computes the number of nodes in the network, and the Minimum Spanning Tree stage, which uses this information to find the MST. Both are optimal in communication and time. The Counting stage first finds some spanning tree and elects a leader in the network, which helps to compute the number of nodes in the system.
In most distributed MST algorithms, a spanning forest of rooted trees is maintained, each tree being a subtree of the MST. Initially, every tree consists of a single node. In the course of the algorithm, each subtree tries to find its best edge (the minimum-weight edge among all edges leading to other trees). The best edge is guaranteed to be in the MST given that the weights are unique. The tree then hooks itself onto the other side of that edge, becoming a sub-tree of the bigger tree. Hooking is a sequence of manipulations of father pointers. Core edges (two trees hooking onto each other) create a cycle of length two in the pointer graph; in that case the root with the bigger identity is unhooked and becomes the root of the combined tree. A naive implementation of this algorithm requires O(V^2) messages and time; in the worst case, a tree of size V/2 is hooked onto other trees V/2 times, each hooking requiring linear work.
The classical idea for improving this is borrowed from the Union-Find algorithm: the combined tree should be at least double the size of the hooked tree each time the pointer of a node is changed. Since each node then undergoes at most $\log_2 V$ pointer changes, we can achieve a communication complexity of O(E + V log V) and a time complexity of O(V log V) if we ensure that the best edge of a tree leads to a bigger or equal tree. To achieve this, the GHS paper [7] introduces the technique of levels. The reason that its time complexity is O(V log V) rather than linear is that there might be a number of sub-trees of the same level (say l), each hooked onto the next one in a chain, resulting in a tree of level l + 1 regardless of the length of the chain. A tree of level 1 with V/2 nodes may be created, which may undergo log V − 1 changes in level, each needing Ω(V) time. Chin and Ting and Gafni addressed this problem by updating the level to the logarithm of the cardinality of the tree each time the computation of the best edge is performed. However, the time complexity remained the same. The log V factor is due to the fact that the update of the level of a long chain comes too late. Dropping the minimum-weight requirement helps to achieve a linear-time algorithm because then, instead of hooking itself onto its minimum-weight edge, each tree can hook itself onto an edge leading to a neighbouring tree of maximum level. This is the main idea behind the Counting stage of Awerbuch's algorithm.
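The doubling argument behind the $\log_2 V$ bound on pointer changes can be checked with a small sequential simulation. It is a sketch under simplifying assumptions: trees are merged in a random order, the smaller tree is always hooked onto the bigger one, and a "pointer change" is counted for every node of the hooked tree. The function name is illustrative.

```python
import math
import random


def max_pointer_changes(num_nodes, seed=0):
    """Merge singleton trees pairwise, always re-pointing the smaller tree.

    Returns the largest number of pointer changes any single node underwent.
    """
    rng = random.Random(seed)
    members = {i: [i] for i in range(num_nodes)}  # root -> nodes of its tree
    changes = [0] * num_nodes
    while len(members) > 1:
        a, b = rng.sample(sorted(members), 2)
        small, big = (a, b) if len(members[a]) <= len(members[b]) else (b, a)
        for node in members[small]:
            changes[node] += 1  # the tree containing this node at least doubles
        members[big].extend(members.pop(small))
    return max(changes)


worst_case = max_pointer_changes(100)
```

Whatever the merge order, a node whose pointer changed k times sits in a tree of at least 2^k nodes, so k can never exceed log2 V.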


Algorithm Specifics

The algorithm operates in two stages: the Counting stage and the MST stage. We first explain the MST stage and then the Counting stage.


MST stage

The MST stage is performed in two phases. The first phase runs an algorithm similar to the GHS algorithm [7] described above, the only difference being that it is terminated when all trees reach a size on the order of V / log V.
The second phase brings new algorithmic ideas: levels are updated aggressively and accurately, so that small trees are prevented from waiting for big trees, which speeds up the algorithm. The Counting stage is needed in order to know the number of nodes V. The details of the algorithm are as follows:
1) Root initialisation: In the course of the algorithm (second phase, MST stage), trees coalesce and hence new nodes become roots of the resulting trees. As soon as a node r becomes the root, with level l, of a tree T, it broadcasts an initialisation message containing the parameters (r, l) over T, which is further relayed to trees that hook themselves onto T.
Upon receiving the initialisation message, an internal node j stores those parameters in the local variables Level_j and Root_j, and starts executing a local search procedure.
2) Local candidate selection: The local search procedure tries to find the minimum-weight edge outgoing from node j to a node in a separate tree whose level is at least l. To do so, node j scans its incident edges one by one. For each edge, it passes a special test message to the node k on the other side of the edge and waits for the reply from k. Three broad cases arise:
a) Root_k = r: k is in T. The reply is negative.
b) Level_k < l: k delays its response to that message until Level_k reaches l. If this level increase at k is due to the hooking of k's tree onto T, then k will have Root_k = r. Hence, the reply is negative.
c) Level_k ≥ l: Edge (j, k) is declared to be a local candidate for the best edge of T.
3) Best edge selection: The local candidates are collected at the root. The root waits until all the nodes have received replies from all their neighbours and all the possible candidates have reached it. If there is no local candidate, the algorithm terminates, since the tree spans the network and hence is the MST.
Otherwise, the root selects the minimum weight candidate edge (v, w), with v being the internal endpoint. The root sends a special message (the pointer reversal message) to v, reversing all the father pointers on the path from r to v, so that v becomes the new root. Two cases arise:
a) (v, w) is a core edge and v is its bigger endpoint: w hooks itself onto v and v becomes the root of the resulting tree. The level of v increases by 1 and step (1), Root initialisation, is performed.
b) (v, w) is not a core edge, or v is not the bigger endpoint: v hooks itself onto w. T becomes a subtree of the bigger tree.
If w has received an initialisation message, then v relays it over T, making T participate in the best edge selection of the entire tree. Until best edge selection happens, the Test-Distance procedure is iterated by v. This is where the aggressive update of levels takes place and where the innovation of the paper lies. Upon each invocation of Test-Distance, node v sends an exploration token to its father w. The token initially carries the counter value $2^{l(v)+1}$. Upon arrival of the token at a node, the node subtracts its number of sons from the counter. If the counter is positive and the receiving node is not the root, the node forwards the token to its father. Thus, moving up, either the counter drops to 0 or below and the token dies, or a positive counter reaches the root. If the token reaches the root alive, an acknowledgment is sent back from the root to v, upon which v sends a special message over T that causes every node in T to increase its level by 1. The Test-Distance procedure is invoked again and again with the increased level until the token does not die. Test-Distance keeps running until a new root of the tree is decided.
4) Root Update Procedure: This procedure is activated either when an initialisation message has travelled a distance greater than $2^{m+1}$, or when some node has detected more than $2^{m+1}$ internal edges during local candidate selection, m being the level of the tree root. In either case, best edge selection and the Test-Distance procedure are interrupted, the level of the root is increased by 1, and the Root initialisation procedure is invoked again.
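
The walk of a single exploration token described in step 3 can be illustrated with a short sequential sketch. This is only an illustration under assumptions: the tree is given as hypothetical `father` and `sons` dictionaries, message passing is replaced by a loop, and the function name `token_reaches_root` is not taken from the paper. The sketch follows the prose description (each visited node subtracts its number of sons before the counter is checked).

```python
from typing import Dict, List, Optional


def token_reaches_root(father: Dict[int, Optional[int]],
                       sons: Dict[int, List[int]],
                       v: int,
                       level: int) -> bool:
    """Simulate one Test-Distance token sent by v.

    Returns True if a positive counter reaches the root, False if the token dies.
    `father` maps each node to its father (None for the root); `sons` maps a node
    to the list of its sons. Both structures are assumptions of this sketch.
    """
    counter = 2 ** (level + 1)      # initial counter value 2^(l(v)+1)
    node = father[v]                # the token is first sent to v's father w
    while node is not None:
        # Each node that receives the token subtracts its number of sons.
        counter -= len(sons.get(node, []))
        if counter <= 0:
            return False            # the token dies on the way up
        if father[node] is None:
            return True             # a positive counter has reached the root
        node = father[node]         # otherwise forward the token to the father
    return False                    # v is itself the root: no token is sent


# Example tree: 0 is the root, 4 is the node v that hooked onto w = 3.
father = {0: None, 1: 0, 2: 0, 3: 1, 4: 3}
sons = {0: [1, 2], 1: [3], 3: [4]}
alive_at_level_2 = token_reaches_root(father, sons, 4, 2)
```

In this example the nodes on the path from w to the root have 1, 1 and 2 sons, so a token starting with counter 2 or 4 dies, while a token starting with counter 8 reaches the root.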

Algorithm: Pseudo-code

We give pseudo-code for phase 2 of the MST stage of the algorithm. Comments are also present for some lines of the pseudo-code.
Notation used in Algorithm 2 (MST, Phase 2):
Any procedure named RESPONSE-X is automatically triggered when a message X is received at a node.
Procedure WAKE-UP can be triggered on a node at the start to initialise all the variables.
n is used as the default notation for the node at which the procedure executes.
Weight(e): Variable storing the weight of the edge e.
ID: Variable storing the identity of n.
Level: Variable storing the level of n.
BestWeight(n): Variable storing the weight of the best edge of the fragment which contains node n.
Root: Variable storing the root of the tree of n.
count(best_edge): Count of the number of best edges.
local_cand_arr: Array storing all the local candidate edges with their weights.
parent: Variable storing the parent id of the node.
V: Variable storing the total number of nodes.


(a) Broadcast: level l is reset or r is made root.
(b) Local candidate selection, case 1: $Root_j = Root_k$, $Level_j = Level_k$.
(c) Local candidate selection, case 2: $Level_k < Level_j$.
(d) Local candidate selection, case 3: $Level_k > Level_j$; (j, k) stored as local candidate.
(e) Best candidate selection, step 1: all local candidates sent up to the root; minimum edge v-w selected.
(f) Best candidate selection, step 2, case 1: $Level_v = Level_w$ (core edge), $ID_v = ID_w$, or $Level_v < Level_w$.
(g) Best candidate selection, step 2, case 2: $Level_v > Level_w$, or $Level_v = Level_w$ (core edge), $ID_w > ID_v$.
(h) Test distance, step 1: sum of degrees of nodes on the w-root path $< 2^{Level(v)+1}$.
(i) Test distance, step 2: broadcast to increase the level of all nodes of the subtree of v.
(j) Root update.

Figure 2: MST Stage, Awerbuch's Algorithm

Figure 3: Counting Stage: Link Search, Awerbuch's Algorithm


Algorithm 2 Phase 2, MST Stage (Awerbuch)

procedure WAKE-UP(id, V)    ▷ Called at each node at the start of the algorithm
    Level := 0
    Root := n
    count := dict
    local_cand_arr := []
    ID := id
    count(best_edge) := 0
    parent := id
    V := V
procedure ROOT-INITIALISATION    ▷ Called just after phase 1 of the MST stage, or after the level of the root or the root node is reset
    Broadcast Initiate(<ID, Level>, 0) over the tree
procedure RESPONSE-INITIALISATION(Initiate(<id, level>, val)) received through edge
    if Initiate(<id, level>, val) received for the first time then
        Level := level
        Root := id
        val := val + 1
        if val > 2^(level+1) then
            Send Root_Update message to father, up the path to the root
        Call procedure LOCAL-CANDIDATE-SELECTION
        Broadcast Initiate(<ID, Level>, val) over the tree
procedure LOCAL-CANDIDATE-SELECTION
    no_internal_edge := 0
    for each edge k incident to n do
        Send Test_Message(<Root, Level>) to k
        Receive Response_Test_Message(<Reply>) from k
        if Reply > 0 then
            add <<n, k>, weight(<n, k>)> to local_cand_arr
        else    ▷ negative reply: the edge is internal
            no_internal_edge := no_internal_edge + 1
            if no_internal_edge > 2^(level+1) then
                Send Root_Update message to father, up the path to the root
    end for
    Send Best_Edge<local_cand_arr> to parent
procedure BEST-EDGE-SELECTION(Best_Edge<local_cand_arr>) received from son i
    if Root != ID then
        Send Best_Edge<local_cand_arr> to parent, storing the path
    else
        if count of total number of best-edge arrays received = V - 1 then
            Select minimum edge v-w, v being the internal node
            Remove first node in path of v from its sons and set it as its parent
            Send Pointer_Reversal<v, w> to first node in path of v
        count(best_edge) := count(best_edge) + 1
procedure RECEIVE-POINTER-REVERSAL(Pointer_Reversal<v, w>) received through edge i
    Add i to set of sons
    if ID != v then
        Set first node in path to v as parent
        Send Pointer_Reversal<v, w> to first node in path of v
    else    ▷ n = v is the new root
        Set father as empty
        Send Joining<Root, ID, Level> to w
        Receive <join> from w
        if join > 0 then    ▷ v hooks itself onto w
            Set w as father
            Call procedure TEST-DISTANCE(2^(Level+1), v)
        else    ▷ w hooks itself onto v
            Add w to its list of sons
            Level := Level + 1
            Call procedure ROOT-INITIALISATION
procedure RECEIVE-JOINING(Joining<Root, ID, Level>) received from v
    if (Level == this.Level and ID < this.ID) or (Level < this.Level) then
        Add father to list of sons
        Reverse nodes till the path of root so as to reset the father and son pointers
        Set v as father
        Send -1 to v    ▷ non-positive reply: n has hooked itself onto v
    else
        Add v to its list of sons
        Send 1 to v
procedure TEST-DISTANCE(val, v) called, or on receiving Test-Distance<val, v> through edge i
    if Root == ID and val > 0 then
        Send <Ack_Test> to v using the saved path
    else if val > 0 then
        temp := val - deg(n)    ▷ subtract the number of sons of n
        Send Test-Distance<temp, v> to father
procedure TEST-DISTANCE-UPDATE(<Ack_Test>) received
    Level := Level + 1
    Broadcast an increase in level of 1 in the entire subtree
    Send Test-Distance<2^(Level+1), v> to father
procedure ROOT-UPDATE(Root_Update) received through son i
    if Root == ID then
        Level := Level + 1
        Call procedure ROOT-INITIALISATION
    else
        Send Root_Update to its parent
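
The pointer reversal used above (making v the new root by reversing the father pointers on the path from the old root to v) can be sketched sequentially as follows. This is a minimal illustration, assuming the tree is stored as a hypothetical `father` dictionary; in the distributed algorithm the same effect is obtained by forwarding the Pointer_Reversal message hop by hop.

```python
from typing import Dict, Optional


def reroot(father: Dict[int, Optional[int]], new_root: int) -> Dict[int, Optional[int]]:
    """Return a copy of `father` in which `new_root` is the root of the tree.

    Only the pointers on the path from `new_root` up to the old root change:
    each node on that path adopts as its father the node the reversal came from.
    """
    reversed_father = dict(father)
    previous: Optional[int] = None   # the new root has no father
    node: Optional[int] = new_root
    while node is not None:
        old_father = father[node]    # remember where the path continues
        reversed_father[node] = previous
        previous, node = node, old_father
    return reversed_father


# Example: old root 0, path 2 -> 1 -> 0 is reversed, nodes 3 and 4 are untouched.
tree = {0: None, 1: 0, 2: 1, 3: 1, 4: 0}
rerooted = reroot(tree, 2)
```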


Counting stage

The algorithm again maintains a forest of trees which ultimately combine to form a spanning tree.
The root of each tree is the leader of the tree. Initially, each node in the network forms a tree
consisting of one node. A level is maintained for each tree and is intended to reflect the size of the
tree. The level of a tree containing a single node is 0. As the algorithm proceeds, trees of bigger level
invade trees of smaller level, capturing their nodes. Schematically, the algorithm has 3 basic procedures:
a) Link Search: This procedure is called after each increase in level. Each node scans its links in increasing order of weight until it finds an edge leading to another tree of bigger or equal level; this edge is called the feasible link of the node at that level.
Exploration messages are sent along the edges to implement the scanning. Links already known to be internal are not scanned again. If a node of smaller level is detected in the course of scanning, the bigger tree starts invading the smaller tree through that link, while the node in the bigger tree continues its search for a feasible link, attempting to find a link to another tree of bigger or equal level. Once a node finishes its search for a feasible link, it reports the result to the root of the tree. The reporting is done through a convergecast: a leaf node sends its report as soon as it finishes the search, while an internal node does so after receiving reports from all of its children. The report contains either the identity of the feasible link and the label of the tree on the other side of the link, or a statement that no feasible link has been found.
The root node collects such reports from all nodes of its tree, including the nodes that have just been captured or are going to be captured.
b) Level Update: The Link Search procedure is interrupted whenever it has been running for too long. Specifically, whenever a node is detected such that the sum of its height and its degree in the tree exceeds $2^{k+1}$, where k is the level of the tree, Link Search is interrupted and Level Update is invoked.
The procedure succeeds only if the tree is not being absorbed by a tree of bigger level, and aborts otherwise. It operates similarly to two-phase commit protocols: it locks the nodes which have not been captured by some other tree. Locking takes place in 2 phases, each involving one broadcast and one convergecast.
The first broadcast informs the nodes that the locking mechanism has started. A node receiving the first broadcast is locked if it has not been invaded by another tree. Once a node is locked, all incoming exploration messages are buffered and are processed as soon as the node is unlocked.
This is followed by a convergecast, through which the leader finds out whether all locks have been obtained. The locking succeeds if all the locks have been obtained. If it succeeds, the new level is computed as the integer part of the logarithm of the number of nodes (cardinality) of the tree.
The second broadcast informs all the nodes whether the locking was successful. If it was, each node updates its level. In either case, the nodes become unlocked.
The second convergecast is needed for synchronisation, that is, to ensure that all the nodes have completed the procedure.
If Level-Update aborts, the leader becomes inactive and no further procedure is executed in its tree. This is because the leader can never again become the network leader, as its tree is being absorbed by trees of bigger level. Upon termination of Level-Update, either the level has increased or the tree has become inactive.
Thus, 2 events may take place: either Link-Search executes uninterrupted, or the tree is invaded by another tree. In the latter case, the tree leader is killed.
In the former case, if none of the tree nodes found a feasible link, then the tree must cover the entire network and the algorithm terminates, as the spanning tree has been found. The root is declared the leader, its name is broadcast over the network, and the total number of nodes is counted.
Otherwise, some feasible links have been found. Two cases arise:
1) If all feasible links lead to trees of the same level, then the feasible link of minimum weight is elected as the preferred link; the tree on the other side of that edge is called the preferred tree.
2) If there exists a tree of bigger level on the other side, the tree becomes inactive.
c) Marriage Procedure: If the tree is still active at this point, that is, if all feasible links lead to trees of the same level, then the Marriage Procedure merges pairs of trees of the same level that have the same preferred link. In such a pair, the tree with the bigger identity conquers the tree with the smaller identity.
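
Two local computations of the Counting stage lend themselves to a small sketch: the scan for a feasible link in Link Search and the level computed in Level Update. The code below is a sequential illustration under assumptions, not the message-passing procedure itself: links are passed as hypothetical `(weight, name, other_level)` tuples that are already known to be non-internal, and the logarithm is assumed to be base 2, which matches the $2^{k+1}$ threshold but is not stated explicitly in the text.

```python
from typing import List, Optional, Tuple

# (weight, link name, level of the tree on the other side); an assumed layout.
Link = Tuple[int, str, int]


def find_feasible_link(links: List[Link], own_level: int) -> Optional[str]:
    """Scan links by increasing weight and return the first one that leads to
    a tree of bigger or equal level, or None if no feasible link exists."""
    for _weight, name, other_level in sorted(links):
        if other_level >= own_level:
            return name
        # A smaller level on the other side would trigger an invasion of that
        # tree; the scan itself simply continues with the next link.
    return None


def level_from_size(tree_size: int) -> int:
    """Integer part of log2 of the number of nodes in the tree (assumed base 2)."""
    if tree_size < 1:
        raise ValueError("a tree contains at least one node")
    return tree_size.bit_length() - 1


feasible = find_feasible_link([(4, "a", 1), (2, "b", 0), (7, "c", 3)], own_level=1)
```

Here the lightest link `b` leads to a tree of smaller level, so the scan moves on and settles on link `a`.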

Garay et al. (1998)

In this section, we describe the paper by Garay, Kutten and Peleg published in 1998 [8], which presents a sub-linear time distributed algorithm for the MST problem. A lot of work on distributed network algorithms is focused on achieving an O(n) bound on an n-vertex distributed network. Both the algorithms we have previously discussed try to achieve this optimal O(V) bound. In the paper, the authors try to identify inherent graph parameters which accurately describe the behavior of distributed MST construction, and then propose an optimal algorithm with a bound in terms of these inherent graph parameters.
They identify the diameter of the graph as one such inherent parameter in the construction of an MST and present a distributed MST algorithm whose time complexity is sub-linear in V and linear in Diam. The motivation for the work comes from the fact that there exist trivial O(Diam) algorithms for various other important distributed network problems, such as Leader Election and Breadth-First-Search tree construction. Another motivation is that in most real large area networks Diam $\ll$ V, so any such improvement would hugely improve performance in real-world distributed systems.
For the algorithm to run within the stated time and message complexity we need to make a few assumptions. We keep all the assumptions made by the GHS algorithm that we listed before. In addition, we make the following assumptions:
1. The size of the messages has an upper bound of O(log V).
2. A node may send at most one message on each edge in each time unit.
3. Edge weights are polynomial in V, so an edge weight can be sent in a single message.
With these assumptions, we move on to the working of the algorithm.


The Sub-linear Algorithm for MST Construction

The algorithm involves many complications due to its detailed logistics; we therefore first give an overview of how it works and then build upon the specifics in the subsections that follow.

Phase I: Controlled-GHS

Overview: Garay et al. give an innovative three-phase algorithm which is much more complex than the algorithms we have previously discussed and involves a number of subtleties. The first phase of the algorithm is the Controlled-GHS phase, a modified, controlled version of the GHS algorithm. As with the basic GHS algorithm, Controlled-GHS starts simultaneously from all the nodes as singleton fragments, and executes a total of I phases. Each of the I phases consists of the following two stages:
1. In the first stage the basic GHS algorithm is executed up to the point where each fragment in the network has found its minimum weight outgoing edge (an outgoing edge of a fragment F being an edge with one endpoint in F and the other at a node outside it). At the end of this stage, we get a forest structure of fragments, which is referred to as FF in the remainder of the section.
2. Each of the trees in the resulting forest is broken down into small trees of O(1) depth, and the merge operation of the GHS algorithm is performed only on these small trees. The trees are broken down by computing a dominating set M(T) on each tree T of the fragment forest FF, and then the merge operation is carried out with each fragment $F \in M$ picking one neighboring fragment $F' \notin M$ and merging with it. The breaking down of the trees in this stage ensures that the diameter of each fragment remains small.
Small-Dom-Set Construction: A procedure Small-Dom-Set is used for computing a small dominating set on each tree of the fragment forest. Given a tree T with vertex set V(T), the procedure finds a set of vertices M(T) such that:
1. M dominates V(T)
2. $|M| \le \lceil |V(T)|/2 \rceil$

The procedure is based on the following. For a vertex v in a tree T, let Child(v) denote the set of v's children in T. We use a depth function L(v) on the nodes, defined as follows:
$$L(v) = \begin{cases} 0, & \text{if } v \text{ is a leaf,} \\ 1 + \min_{u \in Child(v)} L(u), & \text{otherwise.} \end{cases}$$
Also, we denote the set of tree nodes at depth i as $L_i$. Now we can give the algorithm of the procedure.
Algorithm: Small-Dom-Set
1. Mark the nodes of T with depth numbers L(v) = 0, 1, 2.
2. Select an MIS, Q, in the set R of unmarked nodes;
3. Then, $M := Q \cup L_1$;
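
The three steps of Small-Dom-Set can be tried out with a sequential sketch. It is an illustration under assumptions rather than the distributed procedure used by the algorithm: the rooted tree is given as a hypothetical `children` dictionary, and the MIS is chosen greedily in increasing order of node id instead of by a distributed MIS algorithm. The function names are also not from the paper.

```python
from typing import Dict, List, Set


def depth_labels(children: Dict[int, List[int]], root: int) -> Dict[int, int]:
    """Compute L(v): 0 for a leaf, otherwise 1 + the minimum L(u) over v's children."""
    labels: Dict[int, int] = {}
    stack = [(root, False)]          # iterative post-order traversal
    while stack:
        v, expanded = stack.pop()
        kids = children.get(v, [])
        if expanded or not kids:
            labels[v] = 1 + min(labels[u] for u in kids) if kids else 0
        else:
            stack.append((v, True))  # revisit v once all its children are labelled
            stack.extend((u, False) for u in kids)
    return labels


def small_dom_set(children: Dict[int, List[int]], root: int) -> Set[int]:
    """Return M = Q union L_1, where Q is a greedy MIS on the nodes with L(v) > 2."""
    labels = depth_labels(children, root)
    parent = {u: v for v, kids in children.items() for u in kids}
    unmarked = sorted(v for v, depth in labels.items() if depth > 2)   # the set R
    mis: Set[int] = set()
    for v in unmarked:
        neighbours = list(children.get(v, []))
        if v in parent:
            neighbours.append(parent[v])
        if not any(u in mis for u in neighbours):
            mis.add(v)               # v has no neighbour in Q yet, so it joins Q
    level_one = {v for v, depth in labels.items() if depth == 1}
    return mis | level_one


# Path 0 - 1 - 2 - 3 - 4 - 5 - 6 rooted at node 0.
path_children = {i: [i + 1] for i in range(6)}
dominating = small_dom_set(path_children, 0)
```

On the 7-node path the leaf 6 gets depth 0, node 5 depth 1 and node 4 depth 2; the greedy MIS on the remaining nodes {0, 1, 2, 3} picks 0 and 2, giving M = {0, 2, 5}, which dominates the path and has 3 elements.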
Output of Controlled-GHS: For the computation of the dominating sets we use a distributed implementation of procedure Small-Dom-Set. The algorithm employs the distributed Maximal Independent Set (MIS) algorithm of Panconesi and Srinivasan [?] for computing the MIS.
It is important to note that the first phase of the algorithm achieves the following.
Lemma 4.1 In each phase of Controlled-GHS,
1. the number of fragments at least halves;
2. Diam(F) increases by a factor of at most 3, for every fragment F.
Lemma 4.2 When algorithm Controlled-GHS is activated for I phases, it takes $O(3^I \cdot 2\log V)$ time, and yields up to $N = V/2^I$ fragments, of diameter at most $d = 3^I$.
The above results can be proved easily from the basic properties of the fragment breakdown and small dominating set procedures used in Controlled-GHS.

Phase II: Small Cycle Elimination on Fragment Graph

Overview: Improving the time bound of MST construction requires solving one key problem: since the diameter of the MST of the network may be as high as O(V), we cannot afford to communicate over the tree structure itself, as that would require O(V) time.
The algorithm solves this problem by deviating from the GHS algorithm at an appropriately chosen point and switching to an algorithm which eliminates edges that are definitely not going to be part of the MST.


Let FG denote the fragment graph that is the outcome of Phase I. The vertices of this graph are the fragments constructed in Phase I, and its edges are all the inter-fragment edges. Cycles and multiple edges (from different nodes belonging to the same fragment) connecting any two fragments may exist in this graph. The algorithm uses a complex procedure to identify and eliminate these cycles.

Cycle Elimination procedure


For eliminating cycles the procedure depends on the following lemma.

Lemma 4.3 Given a weighted graph G = (V, E), if e is a bottleneck edge of G (the heaviest edge on some cycle of G) then $e \notin MST(G)$.
In each fragment F, one of the nodes is distinguished as the fragment's center r(F), which is also considered the root of the fragment. The procedure eliminates all short cycles of length at most l and also concentrates in r(F), via T(F), all the information pertaining to every other fragment up to distance l from F.
The procedure starts by eliminating all cycles of length 2 and then goes on to eliminate all cycles of length at most l. We first consider the elimination of cycles of length 2, since extending the procedure to longer cycles is then much easier. The nodes of the fragment collect information on the edges connecting F to the adjacent fragments and send it upwards on the tree T(F) to the center r(F). To do so, each fragment node v creates a record Path(F') containing edge information for each fragment $F' \in FG$ adjacent to F. It is important to note that, out of all the records concerning fragments adjacent to nodes in its subtree, a node sends exactly one record for each such fragment. It is easy to verify the following basic property of this pipelining policy.
Lemma 4.4 Each node $v \in F$ sends to its parent exactly one record $Path_l(F')$ for each fragment F' that is adjacent to nodes in v's subtree in T(F); these records are sent up in increasing order of fragment id.
For eliminating the remaining small cycles, the algorithm essentially repeats the procedure described above for l phases.
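
The effect of eliminating cycles of length 2 can be illustrated centrally: among several edges joining the same pair of fragments, every edge except the lightest is the heaviest edge on a 2-cycle of the fragment graph and can be dropped. The sketch below is an assumption-laden simplification, not the pipelined record passing on T(F): edges are hypothetical `(weight, u, v)` tuples, `fragment_of` is an assumed node-to-fragment map, and it is assumed that the single record kept per adjacent fragment is the lightest edge.

```python
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[int, Hashable, Hashable]      # (weight, endpoint u, endpoint v)


def lightest_edge_per_fragment_pair(
        edges: List[Edge],
        fragment_of: Dict[Hashable, int]) -> Dict[Tuple[int, int], Edge]:
    """Keep one edge, the lightest, for every pair of adjacent fragments."""
    best: Dict[Tuple[int, int], Edge] = {}
    for weight, u, v in edges:
        fu, fv = fragment_of[u], fragment_of[v]
        if fu == fv:
            continue                       # internal edge, not part of the fragment graph
        pair = (fu, fv) if fu <= fv else (fv, fu)
        if pair not in best or weight < best[pair][0]:
            best[pair] = (weight, u, v)    # lighter parallel edge found
    return best


fragment_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
edges = [(5, "a", "c"), (3, "b", "d"), (7, "b", "e"), (2, "a", "b")]
kept = lightest_edge_per_fragment_pair(edges, fragment_of)
```

In the example, fragments 0 and 1 are joined by two edges of weights 5 and 3; only the edge of weight 3 survives.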

Phase III: Global edge Elimination

At the end of the second phase of the algorithm, we obtain a subgraph which contains all the edges of the final MST but also a small number of additional edges which were not eliminated in Phase II. To remove these remaining edges and reduce the total number of edges to the required V - 1, we follow these steps:
1. Build a breadth-first search tree B on G, the original graph.
2. From every fragment's center r(F), upcast on B the list of (uneliminated) external edges adjacent to F.
3. The final computation (elimination of edges) is performed centrally at B's root, which then broadcasts the resulting MST to all nodes over the tree B.
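
The central computation at B's root is not tied to a particular sequential method in the description above. As a minimal sketch, assuming the root has received all uneliminated inter-fragment edges as hypothetical `(weight, fragment_a, fragment_b)` tuples with fragments numbered from 0, Kruskal's algorithm with a union-find structure could be used:

```python
from typing import List, Tuple

FragmentEdge = Tuple[int, int, int]        # (weight, fragment a, fragment b)


def central_elimination(num_fragments: int,
                        edges: List[FragmentEdge]) -> List[FragmentEdge]:
    """Return the inter-fragment edges kept by Kruskal's algorithm."""
    leader = list(range(num_fragments))    # union-find forest over fragment ids

    def find(x: int) -> int:
        while leader[x] != x:
            leader[x] = leader[leader[x]]  # path halving
            x = leader[x]
        return x

    kept: List[FragmentEdge] = []
    for weight, a, b in sorted(edges):     # scan edges by increasing weight
        root_a, root_b = find(a), find(b)
        if root_a != root_b:               # edge joins two different components
            leader[root_a] = root_b
            kept.append((weight, a, b))
        # otherwise the edge is the heaviest on a cycle and is eliminated
    return kept


upcast_edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (3, 2, 3), (5, 1, 3)]
mst_edges = central_elimination(4, upcast_edges)
```

With 4 fragments, the edges of weight 4 and 5 close cycles and are eliminated, leaving 3 inter-fragment edges.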



In Figure 4 we show a snapshot of the system running the Controlled-GHS phase of the algorithm on
a distributed network. In Figure 5, we show the edge elimination procedure of the second phase of
the algorithm.

(a) Stage I: A particular fragment tree formed after merging of smaller fragments.
(b) Label each node on the fragment tree with its level.
(c) Find the dominating set by taking the union of the MIS and the first-level nodes in the fragment tree.
(d) Break down the fragment tree on the basis of its dominating set.

Figure 4: Example run for Stage I (Controlled-GHS) with a graph containing 15 nodes and 22 edges


Message Passing and Time Complexity

Figure 5: Maximum weight edge elimination in the Fragment Graph. All small cycles are detected and edges eliminated.

The complexities of all three parts of the algorithm, for the given parameter I specified in the first part, are as follows:

Part I: $3^I \cdot O(2\log V)$
Part II: $3^I + \frac{V}{2^I}\log V^2$
Part III: $Diam(G) + \frac{V}{2^I}$

To optimize the running time we choose I such that $3^I = \frac{V}{2^I}$, i.e. $I = \frac{\ln V}{\ln 6}$.
For this value of I we get a bound of $O(Diam(G) + V^{0.614})$ on the time complexity.
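
The choice of I can be checked numerically. Setting $3^I = V/2^I$ gives $6^I = V$, so $3^I = V^{\ln 3/\ln 6}$, and the exponent $\ln 3/\ln 6$ is about 0.613, which the stated bound rounds up to 0.614. The helper names below are illustrative only.

```python
import math
from typing import Tuple


def balanced_phase_count(v: int) -> float:
    """I = ln V / ln 6, the value for which 3^I equals V / 2^I."""
    return math.log(v) / math.log(6)


def fragment_bounds(v: int, i: int) -> Tuple[int, float]:
    """Diameter bound 3^I and fragment-count bound V / 2^I after I phases."""
    return 3 ** i, v / 2 ** i


def time_exponent() -> float:
    """Exponent of V in the balanced bound: ln 3 / ln 6."""
    return math.log(3) / math.log(6)


# For V = 6^5 = 7776 the two terms balance at I = 5: both equal 243.
diameter_bound, fragment_count = fragment_bounds(7776, 5)
```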


In this paper, we discussed three distributed algorithms which solve the problem of finding the Minimum Spanning Tree of a connected asynchronous network. In a classic work, Angluin [?] showed that there exists no deterministic distributed algorithm that solves the MST problem with a bounded number of messages if the distributed network graph has neither distinct edge weights nor distinct node identifiers. Therefore, we assume that each edge is associated with a distinct weight known to its adjacent nodes. Even though having distinct edge weights is not an essential requirement, we assume it as it guarantees a unique MST in the network. All the algorithms also operate under the condition that the size of messages is upper bounded by O(log V). With these assumptions, the algorithms attempt to optimally solve the problem of finding the MST of a distributed network. All three algorithms use ideas from the GHS algorithm, and additionally involve other complex procedures and subtleties, to achieve optimal performance in terms of message and time complexity. The classical algorithm by Gallager et al. has an optimal message complexity of O(E + V log V) but a suboptimal running time of O(V log V). The algorithm by Awerbuch [3] achieved the optimal running time and communication complexity by breaking down the problem into three parts and solving the sub-problems in three phases. The different phases represent a trade-off between the demands of the initial part of the problem (involving large numbers of small fragments, where bounding the number of messages is most important) and the last part (involving a small number of large fragments, where we need to bound the running time). Garay et al., in contrast, identified the diameter of the graph as an inherent parameter in the construction of the MST and presented an algorithm whose time complexity is sub-linear in V and linear in Diam. The motivation for the algorithm comes from the fact that there are several O(Diam) algorithms for various other important network problems [11].

[1] Baruch Awerbuch. Complexity of network synchronization. Journal of the ACM (JACM), 32(4):804–823, 1985.
[2] Baruch Awerbuch. Reliable broadcast protocols in unreliable networks. 16(4):381–396, 1986.
[3] Baruch Awerbuch. Optimal distributed algorithms for minimum weight spanning tree, counting, leader election, and related problems. In Proceedings of the nineteenth annual ACM symposium on Theory of computing, pages 230–240. ACM, 1987.
[4] Baruch Awerbuch and Silvio Micali. Dynamic deadlock resolution protocols. In Foundations of Computer Science, 1986., 27th Annual Symposium on, pages 196–207. IEEE, 1986.
[5] Cuneyt F Bazlamacc and Khalil S Hindi. Minimum-weight spanning tree algorithms: a survey and empirical study. Computers & Operations Research, 28(8):767–785, 2001.
[6] Edsger W Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271, 1959.
[7] Robert G. Gallager, Pierre A. Humblet, and Philip M. Spira. A distributed algorithm for minimum-weight spanning trees. ACM Transactions on Programming Languages and Systems (TOPLAS), 5(1):66–77, 1983.
[8] Juan A Garay, Shay Kutten, and David Peleg. A sublinear time distributed algorithm for minimum-weight spanning trees. SIAM Journal on Computing, 27(1):302–316, 1998.
[9] Joseph B Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.
[10] Navneet Malpani, Jennifer L Welch, and Nitin Vaidya. Leader election algorithms for mobile ad hoc networks. In Proceedings of the 4th international workshop on Discrete algorithms and methods for mobile computing and communications, pages 96–103. ACM, 2000.
[11] David Peleg. Time-optimal leader election in general networks. Journal of Parallel and Distributed Computing, 8(1):96–99, 1990.
[12] Robert Clay Prim. Shortest connection networks and some generalizations. Bell System Technical Journal, 36(6):1389–1401, 1957.