
IEEE TRANSACTIONS ON COMPUTERS, VOL. 59, NO. 2, FEBRUARY 2010

Distributed Recovery from Network Partitioning in Movable Sensor/Actor Networks via Controlled Mobility
Kemal Akkaya, Member, IEEE, Fatih Senel, Aravind Thimmapuram, and Suleyman Uludag, Member, IEEE
Abstract—Mobility has been introduced to sensor networks through the deployment of movable nodes. In movable wireless networks, network connectivity among the nodes is a crucial factor in order to relay data to the sink node, exchange data for collaboration, and perform data aggregation. However, such connectivity can be lost due to a failure of one or more nodes. Even a single node failure may partition the network, and thus, eventually reduce the quality and efficiency of the network operation. To handle this connectivity problem, we present PADRA to detect possible partitions, and then, restore the network connectivity through controlled relocation of movable nodes. The idea is to identify in advance, in a distributed manner, whether or not the failure of a node will cause partitioning. If a partitioning is to occur, PADRA designates a failure handler to initiate the connectivity restoration process. The overall goal in this process is to localize the scope of the recovery and minimize the overhead imposed on the nodes. We further extend PADRA to handle multiple node failures. This approach, namely, MPADRA, strives to provide a mutual exclusion mechanism in repositioning the nodes to restore connectivity. The effectiveness of the proposed approaches is validated through simulation experiments.

Index Terms—Movable sensors and actors, relocation, fault tolerance, connectivity, node failure, partitioning.

1 INTRODUCTION
In recent years, there has been growing attention to deploying heterogeneous movable nodes within wireless sensor networks (WSNs) for various purposes. The types of these nodes vary from small mobile motes, such as Robomote [1], [2], to powerful actors [3] which can take certain actions. While the former has given rise to Movable/Mobile Sensor Networks (MSNs) ([1], [4]), where all the sensors can move on demand in addition to their sensing capabilities, the latter has introduced the networking of mobile actors with static sensors, called Wireless Sensor and Actor Networks (WSANs) [3]. Actors in WSANs are movable units such as robots, rovers, and unmanned vehicles which can process the sensed data, make decisions, and then, perform appropriate actions. The response of the actors mainly depends on their capabilities and the application. For instance, actors can be used in lifting debris to search for survivors, extinguishing fires, chasing an intruder, etc. Examples of WSAN applications include facilitating/conducting Urban Search And Rescue (USAR), detecting and countering pollution in coastal areas, detection and deterring of terrorist threats to ships in ports, destruction of land and underwater mines, etc.

. K. Akkaya and A. Thimmapuram are with the Department of Computer Science, Southern Illinois University Carbondale, 1000 Faner Dr. Mailcode 4511, Carbondale, IL 62901. E-mail: kemal@cs.siu.edu, taravind@siu.edu.
. F. Senel is with the Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, Baltimore, MD 21250. E-mail: fsenel1@umbc.edu.
. S. Uludag is with the Department of Computer Science, University of Michigan-Flint, Flint, MI 48502. E-mail: uludag@umich.edu.
Manuscript received 2 July 2008; revised 12 Feb. 2009; accepted 11 Mar. 2009; published online 24 July 2009. Recommended for acceptance by S. Fahmy. For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number TC-2008-07-0327. Digital Object Identifier no. 10.1109/TC.2009.120.
0018-9340/10/$26.00 © 2010 IEEE

In both MSNs and WSANs, connectivity of the network is crucial throughout the lifetime of the network in order to meet the desired application-level requirements. For instance, in MSNs, sensors need to periodically, or in response to a query, send their data to the sink node so that all spots of the region can be monitored accurately. In addition, they need to perform aggregation/fusion on the data they receive from their neighbors and relay the fused information toward the sink. This requires that the whole network with the sink node and sensors should be connected throughout the lifetime of the network. Similarly, as far as WSANs are concerned, in most application setups, actors need to coordinate with each other in order to share and process the sensors' data, plan an optimal response, and pick the most appropriate subset of actors for executing such a plan. For instance, in a forest monitoring application, in case of a fire, the actors should collaboratively decide the best possible solution in terms of the number of actors to employ, their traveling time, and distance to the fire spot. Again, this process requires that all the actors should form and maintain a connected interactor network at all times to become aware of the current states of others.

In such connected MSNs and WSANs, failure of one or multiple nodes (i.e., sensors and actors, respectively) may cause the loss of multiple internode communication links, partition the network if alternate paths among the affected nodes are not available, and stop the actuation capabilities of the node, if any. Such a scenario will not only hinder the nodes' collaboration but also has very negative consequences for the considered applications.
Therefore, MSNs and WSANs should be able to tolerate the failure of mobile nodes and self-recover from them in a distributed, timely, and energy-efficient manner: First, the recovery should be distributed
Published by the IEEE Computer Society

since these networks usually operate autonomously and unattended. Second, rapid recovery is desirable in order to maintain the responsiveness to detected events. And finally, the energy overhead of the recovery process should be minimized to extend the lifetime of the network.

In this paper, we present a distributed PArtition Detection and Recovery Algorithm (PADRA) to determine possible partitioning in advance and self-restore the connectivity in case of such failures with minimized node movement and message overhead. Since partitioning is caused by the failure of a node which is serving as a cut-vertex (i.e., a gateway for multiple nodes), each node determines whether it is a cut-vertex or not in advance in a distributed manner, with minimal messaging cost. This is achieved by utilizing the Connected Dominating Set (CDS) of the whole network. Once such cut-vertex nodes are determined, each node designates the appropriate neighbor to handle its failure when such a contingency arises in the future. The designated neighbor picks a node, called dominatee, whose absence does not lead to any partitioning of the network, to replace the failed node when it actually fails. Such a dominatee is found through either a greedy algorithm by following the closest neighbors or a dynamic programming approach by exploring all possible paths at the expense of increased message cost. The replacement is done through a cascaded movement where all the nodes from the dominatee to the cut-vertex are involved. The goal is to share the movement load so that the energy of the selected dominatees will not drain quickly as a result of a long mechanical motion.

PADRA assumes that only one node fails at a time and no other nodes fail until the connectivity is restored. However, concurrent failure of multiple nodes is still possible, although the probability is lower. Such a case is very challenging and warrants further investigation. In this case, two important issues are: 1) how to deal with a loss of multihop links if the failed nodes are neighbors and 2) how to coordinate multiple recovery efforts if the failed nodes are located in different parts of the network. We therefore further extend PADRA to handle multiple simultaneous node failures. The approach, namely MPADRA, can coordinate the execution of multiple recovery efforts by introducing mutual exclusion on the use of nodes. The idea is to reserve the nodes before they actually move to restore the connectivity so that they cannot be moved for other recovery purposes. In the case of unsuccessful reservation, another failure handler (i.e., a secondary failure handler) is involved to restore the connectivity.

Our contributions in this paper are as follows: 1) We propose a new distributed cut-vertex detection algorithm with a low false alarm ratio; the proposed algorithm is both distributed and proactive, and is thus message efficient. 2) We use the connected dominating set and the cascaded movement idea together in order to restore the failure of cut-vertices. 3) We propose a novel algorithm to handle multiple simultaneous failures. Simulation results confirm that both PADRA and MPADRA can perform very close to the optimal solution in terms of travel distance with only local knowledge.

2 RELATED WORK

Node mobility has been exploited in wireless networks in order to improve various performance metrics such as network lifetime, throughput, coverage, and connectivity [5]. Two types of mobility were considered in these efforts: inherent and controlled mobility. Inherent mobility can be further classified as random and predictable depending on the travel path; the mobile nodes may be passing by cars, animals, humans, and other similar devices. In controlled mobility, nodes move only when needed and follow a prescribed travel path. Due to its practicality, the bulk of the published work in this area has pursued controlled mobility. As we exploit such mobility to maintain connectivity in this paper, the discussion in this section will be limited to schemes that consider network connectivity.

Basically, mobile nodes are used as carriers to relay data from sources to the destination [6]. This may be due to the lack of communication paths between the source and the destination or simply to preserve the very limited energy supply aboard the source node. A robot, called Packbot, has been introduced in [7] to serve as a mobile relay in WSNs. A Packbot comes close to sensors to collect their data, and then, carries all data reports to the base station. The Packbot's proximity to the sensor nodes significantly reduces the energy consumed in wireless transmission, and thus, lengthens the sensors' lifetime. In addition, the use of Packbot enables reaching isolated nodes or blocks (network partitions) and links them to the rest of the network. While Packbots can, in principle, repair disconnected networks, the use of a mobile relay is not practical in wireless networks, given the expected latency in data delivery caused by the travel time while touring sources and by the relatively slow mechanical motion relative to wireless transmissions. An algorithm has been proposed for determining the trajectory of the Packbot to serve multiple nodes; with multiple Packbots, the latency still stays high even if a distinct Packbot is designated for every individual data source. A similar work that employs mobile relay nodes is presented in [13]. Unlike [7], these mobile relays, often referred to as Mobile Ubiquitous LAN Extensions (MULEs), stay within at most 2 hops of the sink, and thus, they do not need to travel around the network.

Controlled mobility of the internal nodes within the WSN has mostly been exploited to improve the network lifetime and coverage [9], [15]. For example, in [9], nodes on a data route are repositioned in order to minimize the total transmission energy while still preserving the connectivity with the base station. Some studies considered connectivity as a constraint while striving to improve other performance metrics. COCOLA [10] deals with the effect of moving a node on the network connectivity; more specifically, the goal is to avoid breaking any internode links. Meanwhile, the objective of C2AP [12] is to maximize the coverage of actor nodes while maintaining a connected interactor topology. In all of these approaches, either external mobile nodes are introduced into the system [7], [8] or existing (internal) movable nodes in the network are used [12], [15]. None of these approaches pursues the repositioning of existing nodes to restore the network connectivity that gets severed by the failure of a node.

The closest work to ours is reported in [16]. This work presents DARA, which also strives to restore the connectivity when a cut-vertex node fails. The idea is similar to ours in the sense that it explores cascading movement when replacing the failed node. However, there are many differences from our work. First, DARA does not provide a mechanism to detect cut-vertices. It is assumed that this information is available at the node, which may require the knowledge of the whole topology. Our approach, on the other hand, determines the cut-vertices in advance through the underlying CDS. Second, in DARA, the selection of the node to replace the failed node is done based on the neighbors' node degree and distance to the failed node, which may require excessive replacements until a leaf node is found. Our approach looks for a dominatee rather than a leaf node to replace the failed cut-vertex. Note that the idea of cascaded movement is similar to that of [15], where a cascaded movement is proposed if there are a sufficient number of sensors on the way. The idea is to determine intermediate sensor nodes along the path and to replace those nodes gradually.

As the actors employed in WSANs can also be robots, some of the research in mobile robot networks can be applied to WSANs. One example of providing fault tolerance in such networks has been studied in [17]. The approach is based on moving a subset of the robots to restore the 2-connectivity that has been lost due to the failure of a robot. While the idea of movement of robots is similar to ours, it is performed based on block movements and requires a centralized approach. The same problem is also studied in [19], which rather presented a distributed approach. Unlike [17] and [19], our paper focuses on providing 1-connectivity, which provides more room for coverage when compared to 2-connectivity.

Meanwhile, we do not consider the coverage holes formed as a result of node failures, as this problem has already been extensively studied in [14]. This work identifies some spare sensors from different parts of the network that can be repositioned in the vicinity of the holes; however, connectivity is not considered in this work. Regarding the handling of multiple node failures causing partitioning, our approach MPADRA is, to the best of our knowledge, the first work to address this challenging problem. We also note that a preliminary version of this paper has appeared in [18]. This paper extends that work by providing more analysis, performance enhancements, and handling of multiple simultaneous failures.

3 PARTITION DETECTION AND RECOVERY

3.1 System Model and Assumptions
We consider a WSAN or an MSN. In the case of WSANs, an interactor network is considered; no sensor nodes are involved. In the case of MSNs, we consider the network with the sink node and the sensors to be connected. We assume that the nodes (i.e., actors or sensors) are randomly deployed in an area of interest and form a connected network. The nodes have the ability to move, but they are not moving most of the time as in Mobile Ad Hoc Networks (MANETs); the mobility capability is only exploited whenever needed. The nodes in such networks have a limited on-board energy supply, and thus, the movement is assumed to be more costly than message transmission [15].

The radio range of a node, denoted by r, refers to the maximum euclidean distance that its radio can reach; r is assumed to be constant and the same for all nodes. The interference range for each node is assumed to be 2r. Each node periodically broadcasts heartbeat messages to its neighbors. If such a heartbeat message is not received within a certain amount of time, the node is assumed to have failed. We also assume that there is no obstacle on the path of a moving node and that the nodes can reach their exact locations by maintaining a constant speed. Finally, we assume that all the nodes have the same sensing capabilities, and thus, the movement of these nodes would not cause sensing coverage holes in terms of some sensing modalities. Nonetheless, a joint consideration of connectivity and coverage holes is left as a future work.

3.2 Problem Definition
When the lost node is a leaf node, no other nodes will be affected. However, when the failed node serves as a gateway node in the network, a serious damage to the network connectivity will be inflicted. Such nodes are referred to as cut-vertices. The loss of a cut-vertex partitions the network into disjoint subnetworks. Determining such nodes and how they will be repositioned are our second goal.

Our problem can be defined as follows: "n mobile nodes that know their locations are randomly deployed in an area of interest and form a connected network G. In the case of a failure of a particular node u, our goal is twofold: 1) Determine locally whether such a failure causes any partitioning within the network G, and 2) if there is a partitioning, determine (again locally) the set of movements to restore the connectivity with minimum travel distance." With travel distance, we mean two different metrics: 1) the Total Movement distance of all the Nodes (TMN): $\sum_{i \in S} M_i$, and 2) the Maximum of the Movement distance of all Individual nodes (MMI): $\max_{i \in S} M_i$, where S denotes the set of nodes in the network and $M_i$ denotes the total movement distance for a particular node i.

While our goal is to minimize TMN, if a particular node moves very long distances to fix the connectivity, then this node may deplete all of its energy and can die quicker than the rest of the nodes in the network. Since moving a node for a relatively long distance can drain a significant amount of energy, it is also important to share this movement load among all the moving nodes evenly so that fairness can be achieved. Therefore, we introduce another metric, namely, MMI, and would like to minimize this metric first. The rationale is to put a cap on MMI (e.g., r), and then, try to minimize TMN so as to extend the lifetime of the whole network. Obviously, this means that in some cases, TMN will be sacrificed in order to provide a better MMI.

To address this problem, we present a CDS-based approach which informs a particular node u in advance whether a partitioning will occur or not in the case of its failure. If u's failure does not cause any partitioning, no handling will be needed. Otherwise, if the failure of u leads to partitioning, it designates a set of nodes to be repositioned so that the connectivity is restored in the network. As a result, each node will know in advance how its failure will be handled.
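The two travel-distance metrics in the problem definition can be illustrated with a small sketch. This is not part of the paper's algorithms; the node names, positions, and moves below are made-up examples for clarity.

```python
import math

def tmn(moves):
    """Total Movement distance of all Nodes: sum of per-node travel distances."""
    return sum(moves.values())

def mmi(moves):
    """Maximum Movement distance of any Individual node."""
    return max(moves.values()) if moves else 0.0

def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Example: three nodes each relocate once (hypothetical coordinates).
moves = {
    "A3": dist((0, 0), (1, 0)),  # 1.0
    "A5": dist((2, 0), (1, 0)),  # 1.0
    "A8": dist((2, 1), (2, 0)),  # 1.0
}
print(tmn(moves))  # 3.0
print(mmi(moves))  # 1.0
```

A direct relocation of one far-away node would typically give a smaller TMN but a much larger MMI, which is exactly the trade-off the MMI cap is meant to control.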

3.3 Cut-Vertex Determination
Determining whether a node is a cut-vertex or not can be easily done by using depth first search (DFS) trees. However, this approach requires flooding the whole network and can be costly in terms of the message overhead. Therefore, we follow a distributed approach for such a purpose. Our approach is based on the concept of the CDS of the network. We use a distributed algorithm [20] in order to determine the CDS of a given network G. Note that this algorithm requires only local information: a node needs to know only its 2-hop neighbors, which can be done by transmitting two messages. As a result of running Dai's algorithm [20], each node will know whether it is a dominator (an element of the CDS) or a dominatee 1 hop away from a dominator. As every node can reach the nodes in a CDS, the connectivity of the network can be maintained as long as the CDS is connected. When this information is available, the determination of a cut-vertex will be handled as follows:

1. If a node u is a dominatee, then it cannot be a cut-vertex, since its absence will not violate the connectivity of the network. For instance, node A1 is a dominatee and cannot be a cut-vertex in Fig. 1a.

2. If a node u is a dominator, there can be two cases:

a. u may have at least one dominatee v as its neighbor. If v does not have any other neighbors, u will declare itself as a cut-vertex. This is shown in Fig. 1a, where A2 is a dominator with at least one dominatee (i.e., A1) which does not have any other neighbors. Thus, A2 is a cut-vertex.

b. Otherwise, all the neighbors of u will be either dominators or dominatees which have some neighbors. In that case, the determination of the cut-vertex is challenging, since the dominators/dominatees can be connected via some other nodes within the network. In order to decide whether u is a cut-vertex or not, one of the neighbors of u should start a local DFS to look for all the remaining neighbors of u. If all of them can be reached, this indicates that there are alternative paths which can be used during the failure of u to maintain the connectivity of the network, and thus, u will not be a cut-vertex. Otherwise, u will declare itself as a cut-vertex. For instance, in Fig. 1b, A3 is a dominator with neighbors A2 and A4 which have neighbors. Therefore, a DFS is required to check whether A3 is a cut-vertex or not.

Fig. 1. (a) Node A1 is a dominatee and cannot be a cut-vertex; A2 is a dominator with a dominatee A1 which is not connected to any other node, and thus, A2 is a cut-vertex. A3 is also a cut-vertex in (a) but will not be a cut-vertex in (b).

We would like to note here that in some cases, especially for the applications where message transmission is a concern (due to security reasons, network size, etc.), the local DFS can be partially or completely eliminated. For partial elimination, DFS will only be performed for the dominators which fall into category 2b but do not have any dominatees within 2 hops. For complete elimination, any dominator which falls into category 2b will be assumed to be a cut-vertex without performing any local DFS. Obviously, in the latter case there is a high probability that u will be a cut-vertex; however, the failure of u may not cause any partitioning, and thus, the unnecessary movement of its failure handler is inevitable. Even in the case of such a false alarm, the recovery process can still help to alleviate these problems without incurring any partitioning in the network. If DFS is used, we may have an increased message overhead depending on the topology of the network; this approach will be referred to as PADRA+ hereafter.

3.4 Handling Partitioning: Greedy Approach

3.4.1 In-Advance Designation of a Failure Handler
PADRA achieves optimized recovery of a node failure by preplanning the failure handling process. The rationale for such a proactive approach is that the neighbors of a cut-vertex Ai will not be able to communicate after the failure of Ai. Therefore, for each cut-vertex Ai, PADRA identifies a failure handler (FH) within the network that would start the recovery process when Ai fails. Each cut-vertex Ai picks an FH, denoted AFHi, among its neighbors. How to pick the FH is explained in connection with the recovery process next. Right after the network is deployed, the CDS is constructed and cut-vertices are identified as explained in Section 3.3. In this way, the network recovery time is minimized, as each node in the network will know how to react to a failure before it happens.

3.4.2 Recovery Process
When a node Ai fails (i.e., no heartbeat messages can be heard within a certain period of time), its FH, AFHi, will initiate the recovery process. Note that this can be a false alarm due to Ai not being able to send the heartbeat message; for instance, delaying the heartbeat message indicates a problem either in the channel or at the node due to high load. In the first case, the node has not failed and recovery would not be needed. In the second case, Ai and AFHi can talk to each other on the type of the problem and can take further actions. Further, in case of the existence of other wireless devices in the environment which can potentially interfere with Ai, interference/jamming issues should be carefully considered during the deployment phase of the network. These details are beyond the scope of this paper.

The idea in our recovery is to find the closest dominatee and use it as a replacement for the failed node so that the connectivity is restored.
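The distributed rules of Section 3.3 approximate what a global cut-vertex test would compute. As a point of reference only (this brute-force check is an assumed baseline, not part of PADRA), a node u is a cut-vertex iff removing it disconnects the remaining graph:

```python
# Brute-force baseline: remove u and check whether the rest stays connected.
# adj maps node -> set of neighbors (undirected graph); names are illustrative.
def is_cut_vertex(adj, u):
    rest = [v for v in adj if v != u]
    if len(rest) <= 1:
        return False
    seen, stack = {rest[0]}, [rest[0]]   # DFS over the graph with u removed
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w != u and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) < len(rest)         # some node became unreachable

# Chain A1-A2-A3: the middle node is a cut-vertex, the endpoints are not.
adj = {"A1": {"A2"}, "A2": {"A1", "A3"}, "A3": {"A2"}}
print(is_cut_vertex(adj, "A2"))  # True
print(is_cut_vertex(adj, "A1"))  # False
```

PADRA's contribution is to avoid this global computation: the dominatee/dominator rules decide most cases from 2-hop information alone, at the cost of occasional false alarms.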

The basic idea of our recovery mechanism is as follows: If there is a dominatee among the neighbors of the cut-vertex, the recovery will be simplified immensely, since it can move to the location of Ai and keep the network connected. Therefore, the best FH choice for Ai is a dominatee. If none of the neighbors of Ai is a dominatee, Ai prefers the closest neighbor that has a dominatee among its own neighbors (Ai will know that from the 2-hop neighbor list that it has). If the neighbors of Ai are all dominators without such dominatees, Ai just picks the closest node as its FH. Obviously, the quality of the FH selection improves with the consideration of 3-hop neighbors, 4-hop neighbors, etc., at the expense of increased message overhead.

In case of such a failure, the closest dominatee in terms of distance to the failed cut-vertex node should be determined. We propose a greedy approach for determining the closest dominatee for a cut-vertex node. The approach works as follows: The cut-vertex node designates its closest neighbor to handle its failure. When it actually fails, the closest neighbor will apply the same idea in order to find the closest dominatee to itself. That is, it picks its closest neighbor, and this continues until a dominatee is hit. For example, for node A2 in Fig. 2, the closest neighbor will be A3 if the distance between A2 and A3, denoted as |A2A3|, is the minimum among all the other links of A2. Once A2 determines its closest neighbor, it sends a message to A3 and designates it as the node to handle its failure. A3 will pick A5, and A5 will find A8 as the closest dominatee to A3. Note that if AFHi is itself a dominatee, it will be designated as the node to replace the cut-vertex upon failure, since it will be the closest dominatee to the failed node. Note also that with this approach, only the nodes along the path to the dominatee transmit and receive messages, which reduces the message traffic significantly.

3.4.3 Relocation of the Closest Dominatee
Moving the closest dominatee directly to the location of the failed cut-vertex will definitely restore the connectivity with the minimum TMN; TMN will be the sum of all the distances on the shortest path from the dominatee to the failed node. However, since the movement distance can be very large, MMI will be unacceptably high. In addition, the approach does not provide fairness and may deplete the moving node's energy rather quickly as compared to other nodes in the network. A possible solution to further reduce MMI here could be to move the partition as a block toward the failed node, where the closest neighbor of the failed node will be the leading node in this movement and there is no risk of getting cycles. However, it will significantly increase TMN, since all the nodes in that partition will be moving.

Therefore, we propose a hybrid solution which will combine the advantages of both approaches. The idea is to use cascaded movements from the closest dominatee to the failed node in order to maintain a maximum of r units of movement for the individual nodes (i.e., MMI of at most r), where r is the radio range of the nodes. That is, the closest dominatee will start a cascaded movement toward the location of the failed node. The idea is to share the load of movement overhead among the nodes on the path to the failed cut-vertex node in order to extend the lifetime of each node, and thus, the whole network. An example is given in Fig. 2: A3 will replace A2, A5 will replace A3, and finally, A8 will replace A5. In other words, one can come up with a better MMI and at the same time reduce the number of moving nodes, as will be shown shortly, by sacrificing from TMN. It can be easily shown that when the cascaded movement idea is used and MMI is kept at most r, connectivity is preserved. This is apparent in the TMN-MMI example of Fig. 3: if node A3 moves directly to the location of the failed node, TMN, and thus, MMI will be 2r. The approach can then be formulated as an optimization problem as follows:

$\min \sum_{i \in S} M_i$, subject to $M_i \le r\ \ \forall i \in S$.

Fig. 2. A sample execution of PADRA. (a) Node A2 fails; its FH A3 starts the recovery process. (b) A3 replaces A2, A5 replaces A3, and finally, A8 replaces A5. Black nodes are dominators and white nodes are dominatees.

Fig. 3. TMN-MMI example.

3.4.4 Handling Cycles
Note that the greedy approach may not work when the network contains a cycle of dominators. Since the existence of a cycle may cause a series of replacements which will never be able to pick a dominatee node, a loop avoidance mechanism should be defined to stop the replacements of the nodes. We now describe how we address this problem. The idea is as follows: In order to replace itself, node Ai should pick a node Aj which has not moved before. For such a purpose, we introduce an extra confirmation message ACK to be received before a node Ai can start to move, at the expense of increased message overhead. If node Aj has moved before, it will send a negative acknowledgment back to Ai and will not move anymore. Node Ai will understand that this indicates a cycle. In that case, a new neighbor other than Aj should be picked. If there are no other neighbors, then Ai will move once and no further movements will be performed so that the cycle can be broken.

Fig. 4. (a) Initial network. (b) Cycle detection. (c) Cycle elimination.
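The cascaded movement described above can be sketched in a few lines. This is an illustrative simulation only, not the paper's message protocol; the chain and the coordinates are invented, with nodes spaced one hop (r = 1) apart.

```python
import math

def cascaded_moves(path, positions):
    """path: [failed_node, n1, ..., dominatee]; each node on the path moves
    into the spot vacated by its predecessor. Returns per-node travel
    distances; positions is updated in place."""
    moves = {}
    target = positions[path[0]]      # vacant spot left by the failed node
    for node in path[1:]:
        start = positions[node]
        moves[node] = math.dist(start, target)
        positions[node] = target     # node takes over the vacant spot
        target = start               # its old spot becomes the next vacancy
    return moves

# Fig. 2-style chain A2 (failed) <- A3 <- A5 <- A8, hypothetical coordinates.
positions = {"A2": (0.0, 0.0), "A3": (1.0, 0.0), "A5": (2.0, 0.0), "A8": (3.0, 0.0)}
moves = cascaded_moves(["A2", "A3", "A5", "A8"], positions)
print(moves)                # each node moves one hop
print(sum(moves.values()))  # TMN = 3.0
print(max(moves.values()))  # MMI = 1.0
```

Compare with moving A8 directly to A2's spot: TMN would drop to 3.0 as well in this straight-line example, but MMI would jump to 3.0, which is what the per-node cap of r is designed to prevent.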

There is no risk of getting cycles in this search, since it terminates as soon as a dominatee is hit.

3.4 Handling Partitioning: Dynamic Programming

The approach described so far is a greedy CDS-based approach: it utilizes greedy heuristics in the selection of the FH, and the determination of the closest dominatee is also based on a greedy approach in PADRA and PADRA+. We only consider 2-hop neighbor information in the selection of the FH; the optimal solution would consider all the neighbors and pick the one that will trigger the least number of movements. Therefore, in some circumstances, selecting the closest neighbor dominator may not always provide the optimal solution in terms of TMN. We argue that we can find the optimal solution at the expense of increased message complexity by using a dynamic programming approach. We will refer to this approach as PADRA-DP hereafter, to differentiate it from PADRA using the greedy approach.

The idea is to find the least-cost path to the closest dominatee. If a dominatee does not exist within the 2-hop neighborhood, the FH node starts a search process among the subtrees of its neighbors in order to find the closest dominatee. Each caller can compute the minimum cost of finding a dominatee node by using the formula given in (1):

  c_i = min_{j∈E_i} {d_ij},        if E_i ≠ ∅,
  c_i = min_{j∈D_i} {d_ij + c_j},  if E_i = ∅,        (1)

where i is a dominator, c_i is the total travel distance to replace i, d_ij is the Euclidean distance between i and j, and E_i and D_i are the sets of dominatee and dominator neighbors of node i, respectively.

With dynamic programming, each subtree returns its cost of reaching a dominatee to the FH. The FH picks the least-cost dominatee among these options rather than using a greedy approach; it then starts replacing the node with the ID of minindex, which is done recursively until the dominatee is replaced. With this approach, we also eliminate the "ACK" message, since there will be no cycles.

3.5 Detailed Pseudocode

Due to space constraints, we skip the detailed pseudocode for designating an FH for a node. Basically, every cut-vertex determines an FH for itself: if there is a dominatee among its neighbors, that dominatee is designated as the FH; otherwise, the selection is based on a greedy approach which designates the closest neighbor as the FH. The algorithm, named Recovery and shown in Algorithm 1, is run on the FH of the failed node i.

Algorithm 1. Recovery(FH, i) // FH: failure handler designated upfront for node i
1  if i fails then
2    if isDominatee(FH) = true then
3      Move(FH, null)
4    else if isDominator(FH) = true then // check if FH has a dominatee neighbor
5      if ∃ j ∈ N(FH) ∧ isDominatee(j) = true then
6        Move(FH, j)
7      else
8        j ← ClosestNeighbor(FH)
9        Move(FH, j) // since FH replaced i, j needs to replace FH
10 end

In case of PADRA-DP, the node in line 8 of Algorithm 1 instead starts a recursive call to all of its neighbors asking for a dominatee, as seen in Algorithm 2.

Algorithm 2. ClosestDominatee(i)
1  if i is a dominatee then
2    cost ← 0
3  else if i is a dominator and |E_i| > 0 then
4    cost ← min(Dist(Dominatee(i)))
5  else
6    Broadcast(i, 'CLOSEST_DOMINATEE')
7    forall j ∈ N(i) do
8      cost ← Dist(i, j) + ClosestDominatee(j)
9      Receive(i, cost, 'DOMINATEE_FOUND')
10     putintoMap(i, cost, map)
11   end
12 minindex ← getIndexOfMinCost(map)
13 return minindex

The Move(.) function is called when a node is moving. In order to replace itself, the node first informs its predecessor node by sending a "LEAVING" message. It then moves to the new location, broadcasts an "ARRIVED" message, and updates its CDS accordingly.

4 ALGORITHM ANALYSIS

As summarized in Table 1, there are three factors which affect the performance of the different algorithms: the number of false alarms in cut-vertex determination, the selection of the FH, and the determination of the closest dominatee. While PADRA+ ensures zero false alarms in determining the cut-vertices, it utilizes greedy approaches in the selection of the FH, which is also true for PADRA and PADRA-DP; PADRA, in addition, suffers from false alarms and poor FH selection. Obviously, the variants of PADRA will not provide the optimal (i.e., the minimum) TMN. The optimal solution will outperform all the approaches in terms of TMN on average, but it would require either a centralized approach or excessive and unnecessary flooding of the whole network by all the nodes in advance.

TABLE 1. Factors Affecting the TMN Performance.

4.1 Travel Distance Analysis

When the TMN is considered, we show that in the worst case, the TMN for the optimal solution is the same as that of the variants of PADRA.

Theorem 1. The worst case TMN for PADRA, PADRA+, PADRA-DP, and the optimal solution is O(nr).

Proof. The worst case topology for TMN is a line, as depicted in Fig. 5, where the distance between the nodes is equal to the maximum value of the transmission range r. In this topology, assuming that A4 fails, the FH would be A5, which triggers movement until An is hit. Thus, the TMN for PADRA, PADRA+, and PADRA-DP is given by Σ_{i=1}^{n−4} r = (n−4)r = O(nr).
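The cost recurrence in (1) lends itself to a simple recursive evaluation; the sketch below is an assumption-laden illustration (the toy positions, roles, and adjacency are made up), with the parent argument mirroring the restriction of the search to subtrees of the caller.

```python
import math

# Hypothetical toy line topology; positions, roles, and adjacency are
# illustrative assumptions. Only the recurrence itself follows eq. (1).
pos = {'A': (0, 0), 'B': (1, 0), 'C': (2, 0), 'D': (3, 0)}
dominatees = {'D'}  # all other nodes are dominators
adj = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B', 'D'], 'D': ['C']}

def dist(i, j):
    return math.dist(pos[i], pos[j])

def cost(i, parent=None):
    """c_i from eq. (1): 0 for a dominatee, the closest dominatee neighbor
    if E_i is nonempty, otherwise the cheapest detour d_ij + c_j through a
    dominator neighbor in the subtree (the caller is excluded)."""
    if i in dominatees:
        return 0.0
    e_i = [j for j in adj[i] if j in dominatees]
    if e_i:
        return min(dist(i, j) for j in e_i)
    d_i = [j for j in adj[i] if j != parent]
    if not d_i:
        return math.inf  # dead end: no dominatee reachable this way
    return min(dist(i, j) + cost(j, i) for j in d_i)

print(cost('C'), cost('B'), cost('A'))  # 1.0 2.0 3.0
```

On the toy line, each node's cost is simply its hop distance to the dominatee D, which is what the recurrence accumulates.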

For the optimal solution, the worst case is when the node in the middle of the line topology in Fig. 5 (i.e., Aj such that j = (n+1)/2) fails. Regardless of whether A_{j−1} or A_{j+1} is selected as the FH, TMN_Optimal = Σ_{i=1}^{(n+1)/2−1} r = ((n−1)/2)r, which is also O(nr). □

Fig. 5. Worst-case topology for TMN with dominatees at two ends.

4.2 Message Complexity Analysis

Before calculating the message complexity based on the worst case topology, we summarize how many messages are needed at each step of the algorithm; the type and count of the messages used for each approach are provided in Table 2. These steps are as follows:

1. CDS determination. Each node sends four messages in order to determine whether it is a dominator or a dominatee [20].
2. Cut-vertex determination. In PADRA and PADRA-DP, each node just checks its 2-hop neighborhood table set up in step 1 and decides whether it is a cut-vertex or not, so no messages are sent; this is also true for dominatees in PADRA+. For each dominator in PADRA+, however, a DFS is performed in one of the subtrees led by the closest neighbor of the node, which costs traversing each node in that subtree.
3. FH determination. Once a node decides that it is a cut-vertex, it determines an FH for itself; a node which is not a cut-vertex does not designate an FH. Given that H denotes the number of neighbors, the node sends one broadcast message and receives H replies from its neighbors. In PADRA-DP, an additional 2T messages are needed to search the subtree and get the optimal result, where T is the number of nodes in the subtree; thus, (T + H + 1) messages are needed.
4. Finding the closest dominatee and replacement. The FH node determines the closest dominatee by a greedy search: each time, the closest neighbor is replaced until a dominatee is found. This requires three messages (including the ACKs) at each node, i.e., 3k messages in total, where k is the number of hops to the closest dominatee. In PADRA-DP, 2T messages are needed for the subtree search and 2k messages for the replacements, i.e., a total of 2T + 2k messages.

TABLE 2. Type and Count of Messages Used.

Based on such a topology, we introduce the following theorems:

Theorem 2. Worst-case message complexity of PADRA is O(n).

Proof. The worst case behavior of PADRA can again be observed when the topology is a line, as shown in Fig. 5, assuming that the distance between the nodes is equal to the maximum value of the transmission range r. Let us assume that A3 failed and A4 was the failure-handling node. A4 will start a replacement process until An is hit; 2(n−3) messages (i.e., "LEAVING" and "ARRIVED" messages per hop) are sent to restore the connectivity. Adding 4n messages for CDS determination and n for FH designation, the total number of messages in the worst case for PADRA is 7n−6, which is O(n). □

Theorem 3. Worst-case message complexity of PADRA+ is O(n²).

Proof. In PADRA+, a DFS is performed for each cut-vertex. On the line topology, each dominator from A3 to A_{n−2} performs a DFS which, in the worst case (e.g., a DFS through A4), costs traversing each node until An. Each such DFS thus requires O(n) messages, and there are O(n) dominators, so the cut-vertex determination step alone requires O(n²) messages; the remaining steps add only O(n) messages. Thus, the worst-case message complexity of PADRA+ is O(n²). □

Theorem 4. Worst-case message complexity of PADRA-DP is O(n).

Proof. Let us again assume that A3 failed and A4 is the FH, with respect to Fig. 5. A4 will start a search based on dynamic programming to determine the closest dominatee; T messages are needed to reach the nodes within the subtree led by the closest neighbor, where T = n−3 and H = 1 here. This means that a total of (n−3) messages will be transmitted and come back, totaling 2(n−3) messages. Once the closest dominatee and the path to that dominatee are available at node A4, it starts a replacement by following the next hop on the path; since there will be no cycles, (n−3) further messages suffice to restore the connectivity. Adding 4n messages for CDS determination and n for FH designation, the total number of messages in the worst case for PADRA-DP is 8n−9, which is O(n). □
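The totals in Theorems 2 and 4 can be double-checked with straightforward bookkeeping; the helpers below simply encode the per-step counts quoted above (4n for CDS determination, n for FH designation, plus the restoration messages on the worst-case line), so they are a sanity check rather than a protocol simulation.

```python
def padra_msgs(n):
    """PADRA worst case on a line of n nodes: 4n (CDS) + n (FH designation)
    + 2(n - 3) restoration messages ("LEAVING"/"ARRIVED" per hop)."""
    return 4 * n + n + 2 * (n - 3)

def padra_dp_msgs(n):
    """PADRA-DP adds the recursive search: (n - 3) requests out and (n - 3)
    replies back, then (n - 3) replacement messages (no ACKs, no cycles)."""
    return 4 * n + n + 2 * (n - 3) + (n - 3)

for n in (10, 50, 100):
    assert padra_msgs(n) == 7 * n - 6      # Theorem 2 total
    assert padra_dp_msgs(n) == 8 * n - 9   # Theorem 4 total
print("totals consistent")
```

Both totals grow by a constant per added node, which is the O(n) behavior the theorems claim.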

4.3 Time Complexity Analysis

The period of time it takes to restore the connectivity is also a key concern, since the network will be disconnected during this transition time. Such delay may not be tolerated in mission-critical wireless sensor/actor networks. The main factor here is the time it takes for a node to reach its final destination, which depends on the speed of the node; MMI will thus directly affect the time for the network to be recovered. Messaging is also a concern, but it brings minimal overhead when compared to movement.

Theorem 5. The worst-case time complexity of PADRA, PADRA+, and PADRA-DP is O(r/s + (p+t)n), where r is the distance represented by the transmission range, s is the speed of the nodes, p is the propagation delay for distance r, and t is the transmission delay.

Proof. Assuming a failure occurred, the FH will send a message to its closest neighbor, get an ACK, and move to replace the failed node; at each node, three messages are sent, so the total time per hop is 3(p+t). Since there can be at most (n−4) hops, the time for the message to reach the dominatee is 3(n−4)(p+t). As soon as the message is received by a dominatee, the replacements can be done in parallel: while the first node is moving, its predecessor will also start moving, and it takes each node at most r/s time to replace its successor. Adding the movement time, the total is 3(n−4)(p+t) + r/s, which is O(r/s + (p+t)n). For PADRA-DP, the closest dominatee is first found in 2(n−4)(p+t) (i.e., 2T) time; the replacements are then done in r/s time, since no ACK is needed. The total is again O(r/s + (p+t)n). □

Theorem 6. The total number of messages sent in PADRA+ in the worst case is less than that of the optimal solution.

Proof. In the optimal cascading case, each node needs to know the whole topology. In case of flooding, this requires (n−1) broadcasts for A1 and An and (n−2) broadcasts for the rest; the total messaging cost is thus (n−1) + (n−2) + ... + (n−2) + (n−1), totaling (n²−2n+2), with two more messages for the replacement. In PADRA+, we assume that A_{n−3} will start the DFS; the total messaging costs are (n−2) and (n−3) for A4 and A5, respectively, and the remaining nodes introduce an even smaller number of messages. This can be expressed as 2[Σ_{i=1}^{(n−4)/2} (n−i)], which reduces to ((3n−10)(n−8))/4 and a worst-case message complexity of O(n²). Since (n²−2n+2) > ((3n−10)(n−8))/4, PADRA+ has less messaging cost when compared to the optimal solution. □

5 MULTIPLE SIMULTANEOUS NODE FAILURES

PADRA assumes that only one node fails at a time and that no other nodes fail until the connectivity is restored after a particular cut-vertex failure. However, there may be situations (e.g., thunderstorms) where two or more cut-vertices fail at around the same time, which requires the execution of PADRA simultaneously by the FHs of the failed nodes. Such simultaneous execution may fail to restore the connectivity, since the nodes do not have up-to-date state information about their neighbors and race conditions may occur. One possible solution is to wait until one of the failures is fixed and continue from where the recovery process stopped. However, this not only requires an alerting framework to inform the nodes about the end of the recovery process, but also causes unnecessary delay in restoring the connectivity. In this section, we explain in detail how PADRA should be modified to handle simultaneous failures of multiple nodes. We name this approach Multiple PADRA (MPADRA) hereafter.

5.1 MPADRA Overview

Depending on the location of the failures, the recovery process may introduce race conditions where the same node, say Ai, may be requested to replace multiple nodes. That is, multiple FHs, say Aj and Ak, may compete to access the same set of nodes to replace their failed nodes. In that case, Ai will not know which one to replace; even if it replaces one, the other will not be able to proceed in the cascaded relocation process, and the connectivity will not be restored despite a number of replacements having already been made. Therefore, a mutual exclusion mechanism is needed to make sure that only one of the nodes among Aj and Ak can use Ai for replacing itself. Even worse, even if a First-Come-First-Served mechanism is applied, some deadlock situations may occur when two nodes wait for each other. Therefore, a mechanism is needed which handles race conditions, updates the network state appropriately, and provides parallel replacements.

We present a solution which is based on two phases: In the first phase, an FH reserves the nodes to be replaced before any replacements are made, as opposed to PADRA. That is, we strive to lock all the nodes on the path through the closest dominatee so that no other recovery process can use those nodes for replacement. The goal of this reservation is similar to the RSVP protocol [21], where the nodes on a path are reserved for a certain connection. As a result of such competition, the node which sent the reservation message first has priority and is able to lock that node for itself; the other node backs off and looks for an alternative dominatee. Note that such reservation may cause the other FH to reserve a dominatee over a longer path; however, with only local information available at the nodes, this is inevitable. During this first phase, race conditions occur if a node receives a reservation message from two nodes at similar times. Once the nodes to be replaced are reserved for the FHs, in the second phase, the FHs apply the cascaded motion as in PADRA.
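The first-come-first-served locking of Phase I can be sketched as a tiny per-node state machine; the class and method names below are illustrative assumptions (and, unlike the paper's protocol, where acceptance is silent, an explicit 'ACK' is returned here for clarity).

```python
class Node:
    """Sketch of the per-node reservation state used in phase 1.
    The first RESERVE locks the node; later ones are refused with a NACK."""
    def __init__(self, nid):
        self.nid = nid
        self.state = 'unlocked'
        self.owner = None            # FH currently holding the lock

    def on_reserve(self, fh):
        if self.state == 'unlocked':
            self.state = 'locked'
            self.owner = fh
            return 'ACK'             # the paper's accept is silent; ACK for clarity
        return 'NACK'                # already reserved by another recovery

    def on_release(self, fh):
        if self.state == 'locked' and self.owner == fh:
            self.state = 'unlocked'
            self.owner = None

# Two failure handlers compete for the same node: the first one wins.
shared = Node('Ai')
print(shared.on_reserve('Aj'))   # ACK
print(shared.on_reserve('Ak'))   # NACK: Ak must back off and try elsewhere
shared.on_release('Aj')
print(shared.on_reserve('Ak'))   # ACK
```

This is exactly the mutual exclusion the overview calls for: at most one recovery process can hold a given node at a time.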

We introduce the notation in Table 3 for representing the primary FH (PFH), the secondary FH (SFH), the closest dominatees, and the paths to the closest dominatees of the failed nodes.

TABLE 3. Notation for the Nodes Involved in MPADRA.

5.2 MPADRA Details

Let us assume that two nodes, u and v, fail at approximately the same time. As our approach is not centralized, MPADRA works in the two phases mentioned above:

Phase I: Path reservation. As soon as a PFH u_PFH, which is a 1-hop neighbor of the failed node u, detects that u has failed, it starts the first phase of the recovery process by sending a "RESERVE" message to its closest dominatee. If there is no such dominatee, it sends it to its closest dominator. A node receiving a "RESERVE" message will either accept the request (i.e., not respond) or send a NACK message (i.e., a negative acknowledgment), depending on its state. If the node has not already been reserved by another node (i.e., it is unlocked), it will not send any messages but will change its state to locked (see Fig. 6a). Otherwise, it will send a NACK message back to the originator of the "RESERVE" message. Once a node is locked, it applies the same procedure and tries to lock a dominatee, if any. This process continues until a dominatee has been locked. In the meantime, the other PFH v_PFH also tries to lock a dominatee in the same way. Basically, whichever makes the reservation and locks the nodes first will be able to recover its failed node.

Fig. 6. (a) The states of a node in MPADRA. (b) Case 1: u and v execute in parallel without facing any race condition.

Depending on the location of the failed nodes u and v, the following cases are possible:

Case 1: The paths to the closest dominatees of u_PFH and v_PFH, namely P(u_PFH, u_CD) and P(v_PFH, v_CD), do not share any common links/nodes, as seen in Fig. 6b. This is the simplest case, which does not require any special treatment: both u_PFH and v_PFH independently and simultaneously execute Phase I of MPADRA and are able to reserve u_CD and v_CD (and any nodes on the paths to them).

Case 2: P(u_PFH, u_CD) and P(v_PFH, v_CD) share some common links/nodes. In this case, there are race conditions where u_PFH and v_PFH compete to reserve a node. For instance, in Fig. 7a, while u_PFH is trying to reserve u_CD, v_PFH is also trying to reserve u_PFH and u_CD. If u fails first, u_PFH and u_CD are locked, and thus v_PFH gets stuck and cannot reserve a dominatee; in that case, it backs off and tries the second closest neighbor y, through which it can reach and reserve dominatee x. If v fails first, v_PFH locks nodes u_PFH and u_CD, and then u_PFH will not be able to reserve any node for handling the failure of u. When an FH gets stuck, the nodes which are locked (if any) should be unlocked, starting backward from the node which got stuck. To do that, the node which got stuck starts sending a "RELEASE" message back to the node that tried to reserve it before; any node receiving a "RELEASE" message in locked state goes back to unlocked state.

Note that Figs. 7a, 7b, 7c, and 7d illustrate special cases of Case 2 where the PFHs are not able to restore the connectivity. In such cases, the only way to restore the connectivity is to use a secondary FH (SFH), i.e., the second closest node, which looks for another dominatee on another path. To decide when an SFH should be involved, we define a time-out value τ for SFHs. For instance, node v_SFH will wait for time τ to see v_PFH replacing v. If such a replacement is not done within τ, then v_SFH assumes that v_PFH failed to restore the connectivity; in that case, it starts a new recovery process for restoring the connectivity due to the failure of v (in Fig. 7b, for example, the involved SFH times out and starts a new recovery process). This also applies to the failure of u. Selection of the value of τ can be based on an estimate of the time it takes a PFH node to replace the failed node: the time to replace the failed node will be r/s, and the round-trip time required for reservation of the path will, in the worst case, be 2(n−2)(p+t). Therefore, τ should be set at least to the sum of the travel time and the round-trip time: τ > (r/s + 2(p+t)(n−2)).
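The timer bound τ > r/s + 2(p+t)(n−2) and the SFH's doubling back-off can be sketched numerically; the function names and all sample values below are arbitrary assumptions for illustration.

```python
def sfh_timeout(r, s, p, t, n):
    """Lower bound on the SFH timer: one replacement move (r/s) plus the
    worst-case round-trip reservation latency over (n - 2) hops."""
    return r / s + 2 * (p + t) * (n - 2)

def backoff_schedule(base, retries):
    """Ethernet-style exponential back-off: the timer doubles per failed try."""
    return [base * (2 ** k) for k in range(retries)]

# Arbitrary sample values: r = 100 m, s = 1 m/s, p = t = 0.5 s, n = 50 nodes.
tau = sfh_timeout(r=100.0, s=1.0, p=0.5, t=0.5, n=50)
print(tau)                       # 100.0 + 2 * 1.0 * 48 = 196.0
print(backoff_schedule(tau, 3))  # [196.0, 392.0, 784.0]
```

With these numbers, movement time dominates the messaging term, which matches the earlier observation that messaging overhead is small compared to motion.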

In Fig. 7c, the PFHs for u and v are the same node (i.e., u_PFH = v_PFH), which thus cannot reserve a path for both u and v. In some cases, a node and its corresponding PFH can even fail at the same time. In Fig. 7d, two nodes in the middle fail and their PFHs are in opposite directions: P(u_PFH, u_CD) passes through v and P(v_PFH, v_CD) passes through u. This is a situation where u_PFH fails to reserve v_PFH and vice versa. The failures are then handled by the SFHs of u and v: both u_SFH and v_SFH need to get involved and recover the failed nodes u and v, respectively. Assuming one of the PFHs fails to restore connectivity, the corresponding SFH needs to wait for τ = (r/s + 2(n−2)(p+t)) before it times out, as explained before. In this way, eventually, all the partitions will be recovered.

Finally, we would like to note that SFHs may get stuck in reserving a path when the number of failed nodes is more than two. For such cases, we adapt and use the idea of the exponential back-off mechanism of the traditional Ethernet algorithm: an SFH backs off and retries reserving the nodes on the path after a certain amount of time, with the hope that some of the failures have already been fixed. If it still cannot reserve the path in the second trial, its timer is doubled.

Phase II: Replacements. Once the paths are reserved successfully (i.e., a dominatee is locked) for each PFH (or SFH), the replacements can be done safely without running into race conditions. The replacements are done starting from the dominatee node: the dominatee broadcasts a "LEAVING" message and starts moving toward its new location. As soon as such a "LEAVING" message is received at the node to be replaced, it also broadcasts a "LEAVING" message and starts moving. In this way, some parallelism can be exploited, as each node knows where to go. As soon as a node reaches its new location, it broadcasts an "ARRIVED" message to its neighbors for updating the network state, as in PADRA. The connectivity is restored when the PFH replaces the failed node.

Detailed pseudocode. The pseudocode for MPADRA is given in Algorithm 3. This is generic code which works at any node during the execution of MPADRA: it is invoked at a PFH which detects a failure, at a node which receives a "RESERVE"/"NACK"/"LEAVING" message, or at an SFH which has timed out after the failure of its invoker. We skip the remaining details of the algorithm due to space constraints.

Algorithm 3. Recovery(invoker, i) // i is the node that runs this procedure; for a PFH/SFH, invoker is null since it is dead
1  if state == unlocked then
2    state ← locked
3    if isDominatee(i) == true then
4      Move(i, invoker)
5    else if isDominator(i) == true then // check if i has a dominatee among its neighbors
6      if ∃ j ∈ N(i) ∧ isDominatee(j) == true then
7        Unicast(i, j, 'RESERVE') // send message to j; j runs Recovery(i, j)
8      else if ∃ j ∈ N(i) ∧ isDominator(j) == true then
9        Unicast(i, j, 'RESERVE') // send message to j; j runs Recovery(i, j)
10   else if ∀ j ∈ N(i), j is FAILED or LOCKED then // i is stuck
11     state ← unlocked // reservation failed
12     if SFH(i) == true then Backoff and double timer
13     else Unicast(i, invoker, 'RELEASE')
14 else if state == locked and 'NACK' msg is received then
15   /* try other dominatees and dominators; if there is no available neighbor */
16   state ← unlocked
17   Unicast(i, invoker, 'RELEASE')
18 else if state == locked and 'RESERVE' msg is received then
19   Unicast(i, invoker, 'NACK') // already reserved
20 else if state == locked and 'LEAVING' msg is received then
21   Move(i, invoker)

5.3 MPADRA Analysis

Theorem 7. The TMN in the worst case is r(n−4) in MPADRA.

Proof. Assuming a line topology as in Fig. 7d, the worst case is two independent recovery processes running in parallel. Each introduces a TMN of r((n−2)/2 − 1), totaling r(n−4) for restoring the connectivity. □

Theorem 8. Worst-case message complexity of MPADRA is O(n).

Proof. Assuming the line topology in Fig. 7d, both SFHs u_SFH and v_SFH get involved. Each node on the reservation path needs to send one "RESERVE" message; if (n−2)/2 nodes are involved, a total of (n−2)/2 messages are needed to reserve a path, assuming no race conditions, and (n−2)/2 messages to release the locks (i.e., "RELEASE") are required. Replacements require (n−2) messages (i.e., "LEAVING" and "ARRIVED" messages for each moving node). The total is 2(n−2) for one SFH; for two SFHs, the number of messages required is 4n−8. If we add the cost of CDS determination (4n), PFH designation (n), and SFH designation (n), the total adds up to 10n−8, which is O(n). □

Theorem 9. The worst-case time complexity of MPADRA is O(r/s + (p+t)n).

Proof. The connectivity is restored when the PFH (or SFH) replaces the failed node. A PFH first needs to reserve the path, which requires (n−2)(p+t) time, and then start the replacements, which requires another (n−2)(p+t) time; the time it takes to reach the new location is at most r/s. When an SFH must take over, it additionally waits for τ = (r/s + 2(n−2)(p+t)) before it times out. Thus, in the worst case, the total time needed is at most 2r/s + 4(n−2)(p+t), which is O(r/s + (p+t)n). □
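The backward unlocking in Algorithm 3 (a stuck node sending "RELEASE" toward its invoker) can be sketched as unwinding a partially reserved path; the node IDs and the dict representation are made up for illustration.

```python
def release_path(locked_path):
    """Unwind a partially reserved path: starting from the node that got
    stuck, each node unlocks itself and passes RELEASE back one hop."""
    released = []
    while locked_path:
        node = locked_path.pop()          # last-locked node releases first
        node['state'] = 'unlocked'
        released.append(node['id'])
    return released

# Hypothetical path reserved in order A4 -> A5 -> A6 before getting stuck.
path = [{'id': 'A4', 'state': 'locked'},
        {'id': 'A5', 'state': 'locked'},
        {'id': 'A6', 'state': 'locked'}]
print(release_path(path))  # ['A6', 'A5', 'A4']
```

Releasing in reverse order ensures that, at every instant, the locked nodes still form a contiguous prefix of the path, so a competing recovery never observes a hole in someone else's reservation.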

Fig. 8. Simulation results: (a) percent of nodes which are falsely identified as cut-vertices in PADRA; (b) TMN with varying number of nodes; (c) and (d) TMN with varying node transmission range (40 and 60 nodes); (e) total number of messages with varying number of nodes; (f) total movement distance of the nodes in MPADRA; (g) total movement distance of the nodes with varying radio range in MPADRA; (h) total coverage change in MPADRA (r = 100 m).

6 EXPERIMENTAL EVALUATION

6.1 Experiment Setup and Performance Metrics

In the experiments, we considered a forest monitoring application where movable robots with fire extinguishers are deployed among sensors. Each robot collects data from the sensors and needs to be in communication with the other robots. We created connected networks of these robots consisting of varying numbers of nodes randomly placed in an area of interest of size 700 m x 700 m. The dominatees and dominators are determined by running the distributed algorithm in [20]. We evaluated the movement performance of our approach by varying the number of nodes (20-100) and the transmission range (50-200 m). For each topology, one of the cut-vertices is picked to fail in such a way that there will be no dominatees among the neighbors of the cut-vertex. We compared PADRA, PADRA-DP, and PADRA+ with DARA [16] and with the optimal cascaded movement solution, which provides the least travel distance. We used TMN, total number of messages, and the number of false alarms in cut-vertex detection as the performance metrics. Each simulation is run for 30 different network topologies and the average performance is reported.

6.2 Performance Evaluation of PADRA

False alarms for cut-vertices. We first checked the effectiveness of our cut-vertex detection algorithm. In order to perform this experiment, we created random topologies and, for each topology, counted the number of cut-vertices found by PADRA and PADRA+. The results in Fig. 8a show that PADRA can detect all the nodes which really are cut-vertices, but it also falsely identifies some nodes as cut-vertices although they are not, particularly when the node transmission range and the number of nodes are smaller. For instance, for 60 nodes and a 50 m transmission range, the false alarm ratio is around 15 percent, whereas at a 200 m transmission range it is almost negligible. In general, if the degree of connectivity is high and the network contains cycles, the number of false alarms will be higher for PADRA; such false alarm ratios are, however, not very high. Based on these observations, we can conclude that in our target application of forest monitoring, PADRA+ can be employed, as the transmission range of the robots will be higher when compared to normal sensors; since the nodes are assumed to be more powerful and the determination of cut-vertices is done upfront, the messaging cost/delay is not of concern. PADRA, on the other hand, can be used in sensor networks where the transmission range is smaller and the messaging energy cost can be significant.

TMN. Fig. 8b shows that the different versions of PADRA performed very close to optimal cascading while significantly outperforming DARA. Optimal cascading performs slightly better than PADRA and PADRA+, as it determines the shortest path to the closest dominatee each time. Our approach maintains similar TMNs even when the network size grows, indicating its scalability. The reason for such good scalability is the termination of the replacements when a dominatee is hit: since the dominatees can be anywhere in the network (i.e., not just the leaf nodes), there is a high probability of reaching one earlier than a leaf node, and this is independent of the network size. Note that this is not the case in DARA, which needs to look for a leaf node to stop; since the path for DARA replacements is longer, PADRA outperformed DARA. This is also the case with PADRA-DP, since it does not employ a greedy approach for determining the closest dominatee.

We also conducted a similar experiment by varying the transmission range (50-200 m), as seen in Figs. 8c and 8d. As far as PADRA-DP is concerned, we note that its performance is comparable to that of PADRA+: if the messaging cost/delay is not of concern, then PADRA-DP will reduce the travel time compared to PADRA+ as a result of its combined use of greedy and dynamic programming approaches.
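Measuring false alarms requires the true set of cut-vertices as ground truth; a standard DFS articulation-point computation (Hopcroft-Tarjan) can provide it centrally. The adjacency list below is a made-up toy graph, not one of the simulated topologies.

```python
def articulation_points(adj):
    """Classic DFS low-link computation; returns the set of true cut-vertices."""
    disc, low, ap = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:                      # back edge
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= disc[u]:
                    ap.add(u)                  # removing u disconnects v's subtree
        if parent is None and children > 1:
            ap.add(u)                          # root with multiple DFS children

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return ap

# Toy graph: a triangle {0,1,2} with a tail 2-3-4; 2 and 3 are cut-vertices.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
print(sorted(articulation_points(adj)))  # [2, 3]
```

Comparing a distributed detector's output against this exact set gives both the detection rate and the false alarm ratio reported in the experiments.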

As expected, PADRA, PADRA+, and PADRA-DP keep the same rate of increase as the optimal solution with the increasing number of nodes and radio range, as seen in Fig. 8c. When the radio range is increased with a fixed number of nodes, the travel distance decreases. This is due to having more connectivity, and thus, closer nodes to replace the failed nodes. PADRA outperformed DARA due to the same reason of being able to find closer nodes for replacement, while DARA performs worse at higher transmission ranges: since the path for DARA replacements is longer, its travel distance gets even worse with the increased transmission range. We also observed that, on the average, the movement distance with the optimal solution is slightly better. Definitely, the optimal solution is a centralized approach and requires lots of information exchange, as will be explained later.

There are two more observations from the graphs. First, note that as the transmission range increases, surprisingly the travel distance also increases. One explanation for this increase is the increased transmission range itself: cut-vertices are rare at higher transmission ranges since the network connectivity is improved, but if there is a cut-vertex, it usually connects blocks that are far apart, which eventually increases the travel distance from a node to another. Second, such a disadvantage can be compensated through the use of shortest paths to dominatees, as seen in Fig. 8d.

We also compared the number of messages sent when determining the cut-vertices. We observed that the number of messages in PADRA-DP is significantly less than that of PADRA+, and that PADRA-DP performs better as it minimizes the error in identifying the cut-vertices, although a failed node may still be falsely identified as a cut-vertex (i.e., a false alarm) in PADRA-DP. PADRA+, on the other hand, assumes all the dominators as cut-vertices, and thus, introduces false alarms in identifying cut-vertices while doubling the message cost compared to PADRA, as seen in Fig. 8e. Due to the same reason of not getting many cut-vertices as the transmission range increases, the false alarms for PADRA-DP will be higher, which makes PADRA+ slightly outperform PADRA-DP at such ranges.

6.3 Performance Evaluation of MPADRA

We also performed the same experiments in order to assess the performance of MPADRA. This time, two random nodes are selected to fail simultaneously. These nodes are picked among the cut-vertices, and PADRA is used to determine these cut-vertices. Note that there may be some cases where the simultaneous failure of two noncut-vertices may also cause partitioning in the network. Since we are using PADRA, any picked pair can be handled. Even for PADRA+, which assumes all the dominators as cut-vertices, this holds: if the two failed nodes are both dominatees, then no action will be needed as no partitioning will occur; if only one of them is a dominator, then running PADRA will be enough; and if there are two dominators, then they will be assumed as cut-vertices and MPADRA will be executed. For the experiments, we picked every pair of cut-vertices in a topology and tried 30 different topologies, and we varied both the number of nodes and the radio range. We compared the performance to the optimal solution in terms of TMN, and we also assess the message overhead and network coverage change during these experiments. The coverage range of a node is assumed to be 50 m.

The results in Figs. 8f and 8g show that MPADRA restores the connectivity of the network successfully for every pair of failed nodes and maintains the same TMN performance ratio to the optimal solution even if the network scales. We observe that the distributed solution to fix double failures costs more as compared with single node failure costs in terms of TMN. This is not surprising since it is like running PADRA twice to restore connectivity; MPADRA, on the other hand, has the ability to perform the handling in parallel. In addition, with the increasing number of nodes and radio range, the TMN decreases or keeps stable, and the performance gap between MPADRA and the optimal solution starts to decrease since there will be more dominatees to pick when the connectivity is improved. This also reduces the chance for the failure of path reservation in our approach.

We also counted the number of messages sent during connectivity restoration in MPADRA (i.e., reserving the path, designating primary failure handlers, and replacing the failed nodes) in order to verify its consistency with the analytical results. First, we increased the number of nodes and counted the messages. The number of messages for each node count is shown in Table 4a. Note that the ratio of the number of messages to the number of nodes keeps linear, indicating a linear message complexity. These results are consistent with the message complexity O(n), as proved in Theorem 9, and also with Theorems 3 and 4 proved in Section 4. We also conducted a similar experiment by varying the transmission range (50-200 m). In this case, the number of messages does not change significantly, as seen in Table 4b. There is only a minor decrease when the radio range gets larger: with a larger range, the length of the path to send reservation and replacement messages is decreased, which helps to save some messages. The simulation results confirmed that our approach requires a significantly smaller number of messages for the whole failure handling process when compared to optimal cascading.

TABLE 4: Number of Messages in MPADRA for (a) Node Size and (b) Radio Range
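The case analysis above (both dominatees, one dominator, two dominators) can be sketched as a simple classification over the CDS roles of the two failed nodes. This is a minimal illustration of the decision rule described in the text, not the paper's implementation; the function and variable names are ours.

```python
def recovery_action(node_a, node_b, dominators):
    """Classify a simultaneous two-node failure based on CDS roles.

    dominators: set of node ids that belong to the connected dominating
    set; every other node is a dominatee. Treating every dominator as a
    potential cut-vertex mirrors the conservative PADRA+ assumption.
    """
    failed_dominators = sum(1 for n in (node_a, node_b) if n in dominators)
    if failed_dominators == 0:
        # Both failed nodes are dominatees: the CDS stays intact,
        # so no partitioning can occur and no movement is needed.
        return "no action"
    if failed_dominators == 1:
        # Only one potential cut-vertex failed: single-failure
        # recovery with PADRA suffices.
        return "run PADRA"
    # Two potential cut-vertices failed: run the multi-failure
    # protocol with mutually exclusive path reservations.
    return "run MPADRA"

# Example: nodes 3 and 7 fail simultaneously; only node 7 is a dominator.
print(recovery_action(3, 7, dominators={2, 5, 7}))  # run PADRA
```

Because the rule only inspects locally known CDS roles, each failure handler can evaluate it without any global topology information, which is the property the evaluation relies on when it states that any picked pair can be handled.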

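The network coverage metric recorded in these experiments (initial versus final coverage under a 50 m sensing range) can be approximated with a simple Monte Carlo estimate. The sketch below is illustrative only: the 50 m range comes from the experiment setup, while the field size, sample count, and function names are our assumptions.

```python
import random

def estimate_coverage(nodes, sensing_range=50.0, field=(1000.0, 1000.0),
                      samples=10000, seed=1):
    """Estimate the fraction of a rectangular field covered by sensing disks.

    nodes: list of (x, y) positions; a sampled point counts as covered
    when it lies within sensing_range of at least one node.
    """
    rng = random.Random(seed)
    width, height = field
    r2 = sensing_range ** 2
    covered = 0
    for _ in range(samples):
        px, py = rng.uniform(0.0, width), rng.uniform(0.0, height)
        if any((px - x) ** 2 + (py - y) ** 2 <= r2 for x, y in nodes):
            covered += 1
    return covered / samples

# Coverage before and after a relocation can then be compared directly:
before = estimate_coverage([(100.0, 100.0), (200.0, 200.0)])
after = estimate_coverage([(100.0, 100.0), (180.0, 180.0)])
print(before, after)
```

Comparing the two estimates for the pre-failure and post-recovery positions yields the coverage change reported in the evaluation; since only dominatees move, the difference stays small.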
To assess the effect of movements on coverage, we recorded the initial and final network coverage for different numbers of nodes. The results show that the coverage change due to movements is very small in MPADRA, as seen in Fig. 8h. This is not surprising since the topology change is very minimal (i.e., only dominatee nodes are deleted from the topology), and thus, the effect on coverage is negligible with two failures. However, when a lot of failures are considered, this may necessitate a topology adjustment in terms of coverage using [14] if the initial network is more than 1-covered.

Note that we have not made a comparison to the number of messages for the optimal solution. A distributed optimal solution is not possible even if we assume that each node knows the whole topology (i.e., with O(n^2) total message complexity). This is because each node would need to flood the network before it starts its replacements; otherwise, disconnections and race conditions can occur, and this may introduce further partitioning problems as the PFHs will not be able to coordinate. Therefore, the optimal solution is only possible through a central server which will know all the nodes and be able to communicate with them remotely. In this case, each PFH will have to communicate with the central server, which needs to send a message to the appropriate nodes about when and where to move. However, if the network is partitioned, the message complexity of the optimal solution is not applicable here. For instance, in target tracking applications where actors can communicate through an unmanned aerial vehicle (UAV) intermittently, the optimal solution can be computed at the UAV and communicated to the appropriate actors. MPADRA, on the other hand, does not require access to the whole topology information, and thus, is suitable for our targeted forest monitoring application since the actors can only know their local neighborhood. While the centralized optimal solution performs better in terms of travel distance, the messaging complexity of MPADRA is linear.

7 CONCLUSION AND FUTURE WORK

In this paper, we presented a local, distributed, and movement-efficient protocol PADRA to handle the failure of any node in a connected WSAN. Basically, we provided a technique based on the CDS of the network which can decide whether a node is a cut-vertex or not before the failure happens. That is, if a node finds out that it is a cut-vertex, the closest dominatee/neighbor is delegated to perform failure recovery on behalf of that node. The failure recovery is done by determining the closest dominatee node and replacing it with the failed node in a cascaded manner so that the movement load is shared among the nodes that sit on the path to the closest dominatee.

We further extended PADRA to handle multiple simultaneous failures in a distributed manner. As it is unknown at the time of failure whether such failures would cause a partitioning or not, disconnections and race conditions can occur. The proposed approach, MPADRA, therefore first reserves the nodes on the path to the closest dominatee before the replacements are performed. This is needed since a node may be requested to move by both failure handlers. Once the nodes are reserved for handling a certain failure, the replacements can be done safely. We also maintained a state diagram at each node in order to eliminate race conditions.

We analyzed the performance of PADRA and MPADRA both analytically and experimentally. Simulation results confirmed that PADRA performed very close to the optimal solution in terms of travel distance while keeping the approach local, and thus, minimizing the message complexity. In addition, our approach outperformed another approach, DARA, which requires the knowledge of 2-hop neighbors for each node, in terms of travel distance. Simulation results for MPADRA confirmed that it can restore connectivity in a distributed manner. These results were consistent with the analytical results derived.

As future work, we will be testing the performance of the approach in a real setup consisting of a network of mobile robots. A small prototype network consisting of three mobile PDX Robots [22] has already been created. We plan to extend this network by adding more sensors to test all the proposed protocols in this paper. In addition, we plan to work on heuristics which will improve the TMN performance of MPADRA. We also plan to look at coverage issues in conjunction with connectivity, when the movement of a subset of nodes would cause sensing coverage holes in terms of some sensing modalities. We plan to study a new approach which can handle both connectivity and coverage at the same time at the expense of slightly increased movement and messaging cost.

REFERENCES

[1] G.T. Sibley, M.H. Rahimi, and G.S. Sukhatme, "Robomote: A Tiny Mobile Robot Platform for Large-Scale Ad-Hoc Sensor Networks," Proc. IEEE Int'l Conf. Robotics and Automation (ICRA), 2002.
[2] M.B. McMickell, B. Goodwine, and L.A. Montestruque, "Micabot: A Robotic Platform for Large-Scale Distributed Robotics," Proc. IEEE Int'l Conf. Robotics and Automation (ICRA), pp. 1600-1605, May 2003.
[3] I.F. Akyildiz and I.H. Kasimoglu, "Wireless Sensor and Actor Networks: Research Challenges," Elsevier Ad Hoc Network J., vol. 2, no. 4, pp. 351-367, 2004.
[4] M. Mysorewala, D. Popa, V. Giordano, and F. Lewis, "Deployment Algorithms and In-Door Experimental Vehicles for Studying Mobile Wireless Sensor Networks," Proc. Sixth Int'l Conf. Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing and First Int'l Workshop Self-Assembling Wireless Networks (ACIS-SAWN), June 2006.
[5] M. Ma and Y. Yang, "Adaptive Triangular Deployment Algorithm for Unattended Mobile Sensor Networks," IEEE Trans. Computers, vol. 56, no. 7, pp. 946-958, July 2007.
[6] R.C. Shah, S. Roy, S. Jain, and W. Brunette, "Data Mules: Modeling a Three-Tier Architecture for Sparse Sensor Networks," Proc. IEEE Workshop Sensor Network Protocols and Applications (SNPA '03), May 2003.
[7] A. Kansal, A.A. Somasundara, D.D. Jea, M.B. Srivastava, and D. Estrin, "Intelligent Fluid Infrastructure for Embedded Networks," Proc. MobiSys '04, June 2004.
[8] E. Lloyd and G. Xue, "Relay Node Placement in Wireless Sensor Networks," IEEE Trans. Computers, vol. 56, no. 1, pp. 134-138, Jan. 2007.
[9] D. Goldenberg, J. Lin, A.S. Morse, B. Rosen, and Y.R. Yang, "Towards Mobility as a Network Control Primitive," Proc. MobiHoc '04, 2004.
[10] K. Akkaya and M. Younis, "Coverage and Latency Aware Actor Placement Mechanisms in Wireless Sensor and Actor Networks," Int'l J. Sensor Networks, vol. 3, no. 3, pp. 152-164, 2008.
[11] M. Younis and K. Akkaya, "Strategies and Techniques for Node Placement in Wireless Sensor Networks: A Survey," Elsevier Ad Hoc Network J., vol. 6, no. 4, pp. 621-655, June 2008.
[12] K. Akkaya and M. Younis, "Coverage-Aware and Connectivity-Constrained Actor Positioning in Wireless Sensor and Actor Networks," Proc. IEEE Int'l Performance, Computing, and Comm. Conf. (IPCCC '07), Apr. 2007.

[13] W. Wang, V. Srinivasan, and K.C. Chua, "Using Mobile Relays to Prolong the Lifetime of Wireless Sensor Networks," Proc. Mobicom, 2005.
[14] J. Wu and S. Yang, "SMART: A Scan-Based Movement Assisted Sensor Deployment Method in Wireless Sensor Networks," Proc. IEEE INFOCOM, Mar. 2005.
[15] G. Wang, G. Cao, T. La Porta, and W. Zhang, "Sensor Relocation in Mobile Sensor Networks," Proc. IEEE INFOCOM, Mar. 2005.
[16] A. Abbasi, K. Akkaya, and M. Younis, "A Distributed Connectivity Restoration Algorithm in Wireless Sensor and Actor Networks," Proc. IEEE Conf. Local Computer Networks (LCN '07), Oct. 2007.
[17] P. Basu and J. Redi, "Movement Control Algorithms for Realization of Fault-Tolerant Ad Hoc Robot Networks," IEEE Network, vol. 18, no. 4, pp. 36-44, July/Aug. 2004.
[18] K. Akkaya, A. Thimmapuram, F. Senel, and S. Uludag, "Distributed Recovery of Actor Failures in Wireless Sensor and Actor Networks," Proc. IEEE Wireless Comm. and Networking Conf. (WCNC), Mar. 2008.
[19] S. Das, H. Liu, A. Kamath, A. Nayak, and I. Stojmenovic, "Localized Movement Control for Fault Tolerance of Mobile Robot Networks," Proc. First IFIP Int'l Conf. Wireless Sensor and Actor Networks (WSANs), Sept. 2007.
[20] F. Dai and J. Wu, "An Extended Localized Algorithm for Connected Dominating Set Formation in Ad Hoc Wireless Networks," IEEE Trans. Parallel and Distributed Systems, vol. 15, no. 10, pp. 908-920, Oct. 2004.
[21] R. Braden, L. Zhang, S. Berson, S. Herzog, and S. Jamin, "Resource ReSerVation Protocol (RSVP)—Version 1 Functional Specification," RFC 2205, Sept. 1997.
[22] "General Purpose Robot, Pioneer dx," http://www.mobilerobots.com.

Kemal Akkaya received the PhD degree in computer science from the University of Maryland Baltimore County in 2005. Currently, he is an assistant professor in the Department of Computer Science at Southern Illinois University Carbondale. His research interests include energy-aware routing, security, and quality of service issues in ad hoc wireless and sensor networks. He is a member of the IEEE.

Fatih Senel received the MS degree in computer science from Southern Illinois University Carbondale in 2008. He is currently working toward the PhD degree at the Department of Computer Science, University of Maryland Baltimore County. His research interests include clustering, topology control, and fault tolerance in wireless sensor/actor networks.

Aravind Thimmapuram received the MS degree in computer science from Southern Illinois University Carbondale in 2007. He is currently with Object Technology Solutions, Inc. His research interests include relocation and fault tolerance in wireless sensor/actor networks.

Suleyman Uludag received the PhD degree from DePaul University in 2007. He is an assistant professor at the University of Michigan—Flint. His research interests include guaranteed and stochastic routing in wired, wireless mesh, and sensor networks, topology aggregation, and channel assignment in wireless mesh networks. He is a member of the IEEE.