
Cooperative Keep-Alives - An Efficient Outage Detection Algorithm for Overlay Networks

Ivan Dedinski, Hermann De Meer

University of Passau, Germany

Chair for Computer Networks and Computer Communication
Faculty of Mathematics and Informatics

Supported by EuroNGI
Supported by EPSRC, contract No. GR/S69009/01

Abstract: One of the remaining challenges of today's overlay systems is scalability. A
key issue in almost every P2P architecture, for example, is the connection count per
single peer. If the connection count is too high, the management overhead for the
connections (in terms of keep-alive messages and routing overhead) increases. If the
connection count per node is too low, the resilience of the system against net
splits decreases and the system can hardly route in an optimal way. And if keep-alive
messages are not sent frequently enough, outdated information may be propagated,
which again could cause net splits. This paper presents a new keep-alive algorithm
that reduces the cost of sending keep-alive messages and, at the same time, pre-
serves the effectiveness and reliability of the standard keep-alive mechanisms in
today's overlay networks. The algorithm can be used to increase the number of
connections per node and thus improve the connectivity and routing efficiency of the
network while keeping the keep-alive overhead low. When used without increasing
the connection count, the algorithm drastically reduces the keep-alive traffic.

1 Introduction
Overlay connection topologies (e.g. P2P systems) are a hot topic for many network researchers and
Internet users, mainly because they demonstrate in a fascinating way the power of decentralized and
self-organized resource location and usage. Decentralization and self-organization, however, do not come
without a price. Overlay networks usually have to manage a high number of connections (compared to
client-server networks) to be able to stay consistent and provide the service they are designed for. Depending
on the connection topology, an overlay application may or may not be able to perform search
requests efficiently, adapt to the properties of the physical infrastructure, or guarantee a certain degree of
resilience against node failures. Overlay applications usually keep TCP connections open to ensure the
connectivity between two network nodes and also the responsiveness of the nodes. TCP achieves this by sending
keep-alive messages at regular intervals if no data is transferred. An overlay connection can also be
implemented with connection-less protocols like UDP, where the application has to take care of keep-alives.

Despite the additional overhead, UDP keep-alives are more flexible because their timeout intervals can
be dynamically adjusted. Flexible timeout intervals for keep-alives are required by the algorithm presented
in this paper. Because of the dynamic nature of self-organizing overlay networks (nodes often join
and leave the network), an overlay network topology should be able to recover quickly from inconsistencies
such as connections to dead nodes (dead nodes are nodes that are not responding, e.g., because they have left
the network without informing the nodes connected to them). Otherwise the routing performance could
degrade, or the network could even become fragmented, which may be hard to repair automatically. Currently,
the common way to achieve fast recovery is to send frequent keep-alives. Sending frequent keep-alives over each
existing overlay connection, however, limits the scalability of the system. One can handle this problem by limiting
the number of connections allowed per node, but this approach also limits the flexibility of the
overlay application. For example, additional overlay connections are needed for the implementation of
efficient routing strategies such as topology-aware routing (e.g. Tapestry [1], Pastry [2]). Also, the probability
of a P2P node getting disconnected from the network is higher for nodes with fewer connections. Last
but not least, with more connections available, shorter overlay routes between nodes can be discovered.
This paper presents a cooperative keep-alive algorithm (CKA) that reduces the keep-alive
overhead while preserving most of the quality of previous keep-alive techniques. The algorithm can be
used to increase the number of connections per node by reducing the overhead for existing connections.
The CKA algorithm (presented in Section 2) requires knowledge about the exact reason for a connection
loss for optimal performance. A connection loss may be caused by a node failure or by a link
failure along the connection route. In the second case the node may still be reachable by other nodes in
the system. For many P2P overlays, one can assume that the major cause of connection losses is node
failure. Overlay nodes are usually run by end users that join and leave the network relatively
often, and there is no guarantee that an overlay node leaves the P2P network gracefully by informing the
nodes connected to it about its imminent departure. Internet physical links, on the other hand, are more
stable than end nodes. A network provider is usually responsible for a whole network and cares about
the proper functioning of the link infrastructure, so link outages caused by technical malfunction are
relatively rare. The probability of falsely detecting a node outage due to a congestion-induced link
outage can be reduced by re-sending possibly lost keep-alive messages. A more reliable way to distinguish
between link and node failures is to use the traceroute utility and discover the exact network node
where packets get lost. This method, together with the eventual retransmission of keep-alive messages, allows
node and link failures to be distinguished with higher accuracy. The complexity (in terms of
keep-alive messages sent in a certain period of time) of the normal keep-alive mechanism, from the viewpoint
of the network, is O(c), where c is the number of overlay connections. The number c can be expressed
as n ∗ d, where n is the number of all nodes in the P2P network and d is the average number of
connections per node. With an increasing number of nodes, the average connection count per node becomes
more important for the produced keep-alive traffic. Note that the message count does not describe
the cost of the keep-alive algorithm precisely. Sending a keep-alive message to a node in a UDP
packet is inefficient if no other information is transferred within the packet. One keep-alive transmission
always involves a keep-alive request and a keep-alive reply of at least 64 bytes each (the minimum frame
size in Ethernet), while only about 8 bytes of useful payload are needed per message - 4 bytes for the IP address of
the sender and 4 bytes for the IP address of the receiver. Even the minimal frame thus carries 87.5 percent
overhead. So not only reducing the number of keep-alive messages, but also bundling messages, would
increase efficiency.
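As a back-of-the-envelope check, the 87.5 percent figure follows directly from the assumed sizes (a 64-byte minimum Ethernet frame carrying only 8 bytes of useful payload). The class and method names below are ours, purely for illustration:

```java
// Back-of-the-envelope check of the keep-alive overhead estimate.
// Assumed sizes from the text: 64-byte minimum Ethernet frame, 8 bytes payload.
public class KeepAliveOverhead {
    static final int MIN_FRAME_BYTES = 64; // minimum Ethernet frame size
    static final int PAYLOAD_BYTES = 8;    // 4-byte sender IP + 4-byte receiver IP

    // Fraction of each keep-alive frame that is pure overhead.
    static double overheadFraction() {
        return (MIN_FRAME_BYTES - PAYLOAD_BYTES) / (double) MIN_FRAME_BYTES;
    }

    public static void main(String[] args) {
        System.out.printf("overhead: %.1f%%%n", overheadFraction() * 100); // prints "overhead: 87.5%"
    }
}
```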
Section 2 introduces the basic idea of the CKA algorithm. Section 3 addresses optimizations and
details of the algorithm. Section 4 discusses the reaction of the algorithm to node outages. In Section 5
the results of a simulative study of the algorithm's performance are presented. Section 6 compares this
work with other related research. Section 7 provides a summary of the paper.

2 The Cooperative Keep-Alives Algorithm - Basic Idea

In the usual keep-alive mechanism each connection is managed independently of all other connections
in the network. For example, two nodes X and Y, both connected to a third node Z, do not share any
information regarding their connection status. Since the same management task is performed twice, it is
possible to share the keep-alive load between X and Y to achieve the common goal: outage detection of Z.
If X and Y cooperate, they can decrease their keep-alive rate. Cooperation means that when X detects the
outage of Z it informs Y, and vice versa. Note that node X and node Y do not need to maintain a permanent
connection between them. If X detects an outage, it does its best to inform Y. Even if Y does not receive
the message from X, it still has the chance to detect the outage of Z, though not as fast. The role of
node Z in this scheme is to inform X and Y that they are both connected to Z, and to determine the new
keep-alive intervals for X and Y so that it continues to receive keep-alive messages at a certain rate.
The effect of the cooperative approach described above is that it reduces the network traffic overhead by
introducing additional overhead at the end nodes in terms of computing power and memory usage.
This paragraph describes the CKA algorithm in more detail and provides the necessary definitions.
Each node in an overlay network is directly connected to one or more other nodes over bidirectional
connections. These connections are called hot, since they can be used for routing and data transfer
purposes and their status has to be monitored via keep-alive messages. The CKA algorithm introduces
another type of connection, called cold; these connections do not need to be managed by keep-alives
and are used (produce traffic) only in case of outages. The existence of a cold connection between two
nodes implies that the nodes know each other's network addresses and can exchange messages if
necessary. Figure 2, part a), shows an example overlay topology of five overlay nodes and their respective
hot connections. The set of all nodes directly hot connected to a certain node X is called the cold set of
X. Figure 2, part c), shows node X, its cold set (node A, node B and node Y), and the hot connections
of node X (all other hot connections are hidden). Note from Figure 2, part a), that node C is not directly hot
connected to node X and is thus not part of the cold set of X. Node X manages a table containing various
information about all nodes of its cold set, called the cold set view. X keeps each node in the cold
set informed about the existence of all other nodes in the set; the cold set of X is in this way fully interconnected by
cold connections. Note that node X does not have a server role in the overlay network. Figure 2, part b),
shows the same network, but from the perspective of node Y, node X being in the cold set of Y. In fact,
each node in an overlay has a cold set and is part of the cold sets of other nodes.
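The cold set view can be pictured as a per-neighbor table. The following minimal sketch (all identifiers are ours; the paper does not prescribe a data structure) stores the cold set itself plus per-neighbor bookkeeping:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of a cold set and cold set view as described in the text.
// All identifiers are illustrative, not taken from the paper.
public class ColdSetView {
    // The cold set of this node: all directly hot-connected neighbors.
    private final Set<String> coldSet = new HashSet<>();
    // Per-neighbor bookkeeping: which cold-set members each neighbor already knows,
    // so that later replies only need to carry the changes.
    private final Map<String, Set<String>> knownByNeighbor = new HashMap<>();

    void addHotConnection(String address) {
        coldSet.add(address);
        knownByNeighbor.putIfAbsent(address, new HashSet<>());
    }

    void removeConnection(String address) {
        coldSet.remove(address);
        knownByNeighbor.remove(address);
    }

    Set<String> coldSet() {
        return new HashSet<>(coldSet); // defensive copy
    }
}
```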
Since the nodes in the cold set of X (Figure 2, part c)) are fully interconnected with each other by cold
connections, every one of them can reach the whole set by a broadcast. Node X chooses, for each of its
cold set nodes, a random time in the interval [t, t + K ∗ N], where K is the previous (non-CKA) keep-alive
interval in seconds and N is the number of nodes currently connected to node X. At this random time the
node should send a keep-alive request message to X. This approach ensures that, on average, node X will
receive messages every K seconds. If one of the nodes discovers the outage of node X, it floods the cold
set of X by using the cold connections. Some of the cold connections may already be outdated (nodes
could have disconnected), but due to the cold set flooding there is a high probability of reaching all cold set
nodes. Figure 1 shows a simplified Java-like pseudo-code of the CKA algorithm. All methods run at each
overlay node. A keep-alive process consists of a keep-alive request (ping) and a keep-alive reply (pong),
sent and received by the sendPing, sendPong, receivePing and receivePong methods, respectively. Figure
1 does not show what happens in case of a pong timeout: if after a predefined number of subsequent ping
messages (the retry count) no pong is received, the node starts the flooding algorithm shown in Figure 3.
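The randomized scheduling step can be sketched as follows: node X draws, for each of its N cold-set neighbors, a uniform random ping time in [t, t + K*N), so it receives on average one keep-alive every K seconds. The identifiers are illustrative, not taken from the paper's figure:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the randomized keep-alive scheduling described in the text.
public class CkaScheduler {
    // Assign each of the n cold-set neighbors a random ping time in [t, t + k*n),
    // so that node X receives on average one keep-alive every k seconds.
    static List<Double> assignPingTimes(double t, double k, int n, Random rng) {
        List<Double> times = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            times.add(t + rng.nextDouble() * k * n); // uniform in [t, t + k*n)
        }
        return times;
    }
}
```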

Figure 1: Java-like pseudo-code for the basic CKA methods

The first time a node sends a keep-alive message to another node X running CKA, it gets as a reply a
packet containing the IP addresses of all nodes that are currently in the cold set of X. Subsequent keep-
alive responses transfer only the changes of the cold set to save bandwidth. To be able to do this, node X
has to store information in its cold set view about the changes since the last keep-alive message received
from each node.
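The incremental replies can be sketched as a set difference between the current cold set and the view a neighbor last received. The representation below is our own; the paper does not specify a wire format:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the incremental cold-set update: only the changes since the
// neighbor's last known view are sent. Identifiers are illustrative.
public class ColdSetDiff {
    // Addresses that joined the cold set since the last update sent to a neighbor.
    static Set<String> added(Set<String> current, Set<String> lastSent) {
        Set<String> d = new HashSet<>(current);
        d.removeAll(lastSent);
        return d;
    }

    // Addresses that left the cold set since the last update.
    static Set<String> removed(Set<String> current, Set<String> lastSent) {
        Set<String> d = new HashSet<>(lastSent);
        d.removeAll(current);
        return d;
    }
}
```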

Figure 2: Outage detection teamwork: a) Example topology, b) Cold set and connections of node Y, c)
Cold set and connections of node X

3 Improvements of the Basic Algorithm

The reliability of the algorithm can be further improved by using the cold set view at the
overlay nodes. In the prototype implementation used to produce the evaluation results in Section 5, the
cold set view is used extensively. If a node does not send its keep-alive message to another node X on
time, X regards it as disconnected and removes it from its cold set view. The change is then propagated
to the rest of the nodes in the cold set as soon as they send their keep-alive messages. Usually, until a
node collects a stable set of other nodes to connect to (e.g. finishes its bootstrap procedure), it may often
voluntarily connect to and disconnect from intermediate nodes. Each time a node disconnects, it creates
invalid connections in the cold connection set of the node it just left. Disconnecting nodes can reduce
the probability of leaving invalid connections in the cold set by sending a disconnect message every time
they disconnect voluntarily. The message does not need to be confirmed, since it is not essential for the
functionality of the algorithm.
Changes to the cold set view of a node due to frequent connects and disconnects during a bootstrap
or stabilization phase may be very dynamic. On the other hand, changes may propagate too slowly to the
still connected nodes in the cold set, since the keep-alive frequency of a node is inversely proportional
to the number of nodes in the set. To keep its cold set properly interconnected, a node may send a
non-confirmed warning message to any other node from its cold set, containing the new keep-alive time
and the updated cold set. Another technique used to avoid inconsistency of a cold set due to frequent
connects and disconnects is the use of a so-called cold set quarantine. When a node connects to another
node X, it is not immediately added to the cold set of X. Instead, X waits to receive two subsequent
keep-alive messages from that node at the standard (non-CKA) keep-alive frequency; only afterwards
is the node included in the cold set. Two keep-alive messages are necessary to exclude all the nodes that are just
passing by (jumping from node to node).
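The quarantine rule amounts to a per-newcomer counter: a node is admitted only after two timely keep-alives at the standard interval. A minimal sketch (names are ours, not from the paper):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the cold set quarantine: a newly connected node is admitted to the
// cold set only after two keep-alives at the standard interval.
public class ColdSetQuarantine {
    static final int REQUIRED_PINGS = 2;
    private final Map<String, Integer> pingsSeen = new HashMap<>();

    // Record a timely keep-alive; returns true once the node leaves quarantine.
    boolean recordPing(String nodeAddress) {
        return pingsSeen.merge(nodeAddress, 1, Integer::sum) >= REQUIRED_PINGS;
    }

    // A late or missing keep-alive resets the quarantine counter.
    void reset(String nodeAddress) {
        pingsSeen.remove(nodeAddress);
    }
}
```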

4 Discussion on Cold Set Flooding

After a node detects an outage of node X, it has to inform all other nodes in the cold set of X, which
is a non-trivial task. Since the cold set is strongly interconnected, a simple broadcast would cause a lot
of redundant messages [9]. Also, the broadcasting node may have been misled by congestion into falsely
detecting an outage, so disconnecting immediately after receiving an outage detection message is risky. In
the worst case the whole cold set of node X could disconnect while X is still alive. To solve
these problems, we suggest a simple flooding scheme, driven only by the node that detected the outage
(sequential flooding). That node first sends an outage message to its known cold set neighbors. The neighbors
reply with a list of their known immediate cold set neighbors and switch to normal (high-frequency, non-CKA)
keep-alive intervals to check node X as fast as possible. A small random jitter is inserted to avoid
all nodes sending their keep-alives at the same time and overloading node X. The initiating node manages
a list of nodes to which it has already sent the message. If it receives a reply containing nodes not already
in the list, it sends the message to the new nodes and inserts them into the list. The Java-like pseudo-code
of the cold set flooding is shown in Figure 3. The three procedures run simultaneously (in their
own threads) at each CKA node.
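The initiator's bookkeeping is essentially a graph traversal that contacts each cold set member exactly once. In the sketch below (identifiers are ours), message transport is abstracted away: `neighborsOf` models the neighbor list a contacted node returns in its reply.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Sketch of the "sequential flooding" driven solely by the detecting node.
public class SequentialFlood {
    static Set<String> flood(Set<String> knownNeighbors,
                             Function<String, Set<String>> neighborsOf) {
        Set<String> informed = new HashSet<>();        // nodes already contacted
        Deque<String> pending = new ArrayDeque<>(knownNeighbors);
        while (!pending.isEmpty()) {
            String node = pending.poll();
            if (!informed.add(node)) continue;         // skip duplicates
            // sendOutageMessage(node) would go here; the reply carries node's
            // own cold-set neighbor list, which may contain new nodes.
            for (String n : neighborsOf.apply(node)) {
                if (!informed.contains(n)) pending.add(n);
            }
        }
        return informed;
    }
}
```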

5 Evaluation Results
Methodology: The evaluation framework used for CKA consists of three components - a parallel discrete
event network simulator, a Chord [5] like P2P application implementation and a CKA implementation,
which can be deactivated when producing comparative results with the normal keep-alive mechanism.
The parallel network simulator was developed at the University of Passau as a general test bed for networking
applications. Simulators with similar functionality are Parsec [3] and NS2 [8] (the latter not parallel). The
simulator provides an IP-packet level simulation, including IP routing, packet queues at the routers,
packet drops and RTT variation due to congestion. It allows packet communication between a relatively
high number of peers, which is essential for obtaining statistical results. The network simulation contains
many details important for the behavior of an overlay system. An alternative simulation approach would
be to exclude the physical topology from the simulation [15]. When the simulated nodes are lightweight,
this approach can reach more than 100000 nodes per single physical machine. The first reason not to use
it for the evaluation of the CKA performance is that variation of transmission delays, dropped packets,
routing effects, etc. can strongly influence the CKA behaviour and thus cannot be ignored in the simulation.
The second reason is that a CKA node is no longer lightweight, since it has to manage a cold set
view and information about every connection. Although the memory overhead is negligible for one single
CKA node, it overloads a physical machine when simulating some hundreds of nodes. By parallelizing
the simulation, 1000 - 4000 nodes could be reached with 4 physical machines. In this case the overhead
of simulating the network infrastructure makes up only a small portion of the total overhead and can be
neglected. Obviously, a simulation of 100000 nodes can provide more solid statistical results than
a simulation of 1000 nodes. The evaluation results, however, although limited by the available hardware
resources, reveal important properties of the CKA algorithm and indicate its performance benefits.
The CKA algorithm was tested with a Chord-like implementation. It is an enhanced Chord, containing
additional features such as the ability of peers to choose their connections (sloppy Chord). But

Figure 3: Java-like pseudo-code for the flooding broadcast

for the CKA experiments the application was configured to act as standard Chord. Chord was chosen
because of its dependency on network connections. If many Chord peers do not have the optimal set
of connections, the routing in the Chord network becomes inefficient. The probability of a net split is
much higher than in Gnutella [6], [7] for example, since Chord is less strongly interconnected. Thus,
Chord makes it easier to test whether the CKA algorithm is flexible enough to preserve the network
connectivity. Evaluating CKA with other overlay applications would also be interesting. Higher performance
benefits are to be expected when using CKA in Gnutella, because of the much higher number of
connections per node. However, this is future work and is not covered in this paper.

Figure 4: Network topology used for the simulation (256 nodes per subnet)

For the CKA tests, a simulated network of 1024 peers is constructed. This network consists of 4
subnetworks of 256 peers each, interconnected as shown in Figure 4. The hash space is exactly 10
bits wide, so that the peers build an ideal Chord ring (where each peer has an equal number of incoming
and outgoing connections). This allows the experiment to be repeated under similar simulated network
conditions many times. Each test is always started twice, once with and once without CKA. During all
experiments, the peers first bootstrap and build the Chord ring, which takes about 50 s of simulation time¹.
The normal keep-alive interval is 10 s; if a keep-alive message is lost, up to 5 retransmissions are started
at 5 s intervals. 100 s after the beginning of the experiment the Chord ring is completely constructed. At that
point each of the participating peers has 10 incoming and 10 outgoing connections. Each packet leaves
its originating node after a random delay of up to 5 s. All tests were run 20 times to see whether
there are significant variations of the CKA performance caused by that random noise. Although the
number 20 was chosen because of resource limitations (a total real simulation time of many days) and
cannot be regarded as high enough to provide solid statistical conclusions, the measurements outline the
confidence intervals that could be expected with higher repetition counts.
Results: Table 1 shows all three experiment sets (multiple experiments per set) that were
performed. The first column of Table 1 indicates the time when nodes start shutting down. During
the experiments in the first set, different numbers of nodes are shut down² at once. The nodes have
close hash IDs (all are located in the same subnet), so there are many connections between them. This
experiment set evaluates the CKA performance when nodes are shut down together with large portions of
their cold sets. The second set of experiments evaluates the CKA performance under continuous outages:
different numbers of nodes with close hash IDs (subnet 1) are shut down at constant intervals. During
the last experiment set, half of the nodes in the whole system are shut down at once (subnets 1, 2, 3 and 4),
evaluating the CKA performance under heavy outage conditions. For space reasons, only the results
of one representative experiment of each set are presented here.

¹ All times specified in the following are simulation times; they are valid in the simulation and have no relation to real time.
A simulated second can take more than one real hour if the machines running the simulations are heavily loaded.
² All nodes are shut down without having a chance to inform their neighbors.

start of          shutdown        number of          subnets      description
shutdown [s]      interval [s]    nodes down         involved
100               0               10, 20, 40, 80,    1            Shut down at once a group of
                                  100, 128, 255                   nodes in subnet 1.
100               5, 10, 20       20, 64, 100        1            Shut down a group of nodes at 5 s,
                                                                  10 s or 20 s intervals in subnet 1.
100               0               512                1, 2, 3, 4   Shut down half of all nodes (128
                                                                  per subnet) in subnets 1, 2, 3 and 4.

Table 1: Experiments performed

Figure 5: 100 nodes shut down at once, first subnet (keep-alive packet count over time)

Figure 6: 128 nodes shut down at once, all subnets (keep-alive packet count over time)

One can see in Figures 5, 6 and 10 that in the first 50 s of the experiment the number of packets
is almost equal in the CKA and non-CKA cases. Since the network has to be constructed in the first
50 s, the packets for searching appropriate neighbors (bootstrapping) dominate the keep-alive
packets. After 50 s, the packets are mainly keep-alive messages, and one can see from Figure 5 that the
CKA algorithm causes about 10 times fewer packets than non-CKA, as expected. The signalling traffic
caused by the shutdown of the 100 nodes is so small that it can hardly be noticed in the graph. In Figures
6 and 10 the impact of shutting down half of the network is visible at about 100 s. Compared to Figure
5, much less keep-alive traffic is produced (note the vertical scale). However, after the system stabilizes
again, the ratio between CKA and noCKA remains constant. In Figure 10, which is a zoomed version of
Figure 6, it can be observed that the noCKA traffic always grows faster than the CKA traffic.
Figures 7 and 8 show how fast and how reliably invalid connections are discovered in the system. As
expected, with the normal keep-alive mechanism all peers connected to a node discover its outage after
a maximum of 10 + 5 ∗ 5 = 35 seconds, so almost all outages are discovered before the 240th second. A positive
effect of the CKA algorithm is that it also discovers all outages. It is close to noCKA, but somewhat
slower, which is a consequence of the randomization of the ping time intervals for the cold set nodes.
Since a cold set of only 10 nodes is not large, the ping intervals are clustered rather than evenly
spread by the random function. Thus, outage discovery delays of 1-2 times 10 s on average may
occur. One or two outages (less than 0.05%) are discovered very slowly, which can be explained by
the outage of larger parts of their cold set, so that nobody can inform the rest of the cold set nodes. In
this case the remaining nodes will discover the outage when they send their own keep-alive messages, which
can take a long time depending on the size of the cold set but will definitely happen. The second positive effect

Figure 7: 100 nodes shut down at once, first subnet (invalid connection count over time)

Figure 8: 128 nodes shut down at once, all subnets (invalid connection count over time)

is that Figures 7 and 8 look very similar, although the number of invalid connections discovered in the first experiment
(100 shutdowns) is about 700, and in the second experiment (512 shutdowns, 128 in each subnet) about 2800.
This indicates the stability of the CKA algorithm under different outage conditions.

Figure 9: 100 nodes shut down at 5 s intervals, first subnet (invalid connection count over time, CKA vs. noCKA)

Figure 10: 128 nodes shut down at once in each subnet, zoom (keep-alive packet count over time, CKA vs. noCKA)

Figure 9 shows the behavior of CKA compared to noCKA under continuous outages over longer
periods of time. Once again CKA detects all outages, although slightly slower than noCKA. The count of
invalid connections not yet discovered by CKA closely follows that of noCKA, which attests to the robustness
of CKA against dynamic changes in the network. All of the experiments described above were performed
multiple (20) times to see whether the random variation of the packet sending times has an influence on
the stability of CKA. No significant variations could be discovered in either the CKA or the noCKA case.
Figures 11 and 12 show the results of repeating the biggest experiment (half of the nodes shut
down in each subnet) in the CKA and noCKA cases.
The overhead of the CKA algorithm in terms of memory space is the memory required at each cold set
node to store the addresses of all other cold nodes for each hot connection. Each node also has to
manage a cold set view for all nodes connected to it. For example, if each node in the
overlay manages 200 hot connections, the storage overhead per hot connection would be about 200 ∗
4 bytes (IP addresses) = 800 bytes. For all 200 connections this yields 160 KB. In addition, each node
has to store the cold set view, which yields another 160 KB (an upper limit, if the data is not stored in a diff format).
The memory usage per node thus grows quadratically with the number of connections. Most of this data, however,
is used only in the case of an outage recovery or connection update, so it does not need to reside in memory and can be placed
on disk. Last but not least, the "sequential" flooding algorithm produces 2 ∗ N messages for N nodes
that have to be informed. On the other hand, with the normal keep-alive mechanism each node sends
1 + k messages, where k is the configured retry count. So if only one message were sufficient for a node
to disconnect, the flooding algorithm would be more efficient in terms of message count. In reality, a node
which receives an outage notification from another node has to verify it, otherwise the entire cold set of
a node that is still online could disconnect. So the real overhead formula (including the keep-alive
messages for the verification) is (2 + k) ∗ N, which is still close to the overhead of the standard keep-alive mechanism.

Figure 11: 128 nodes shut down at once in each subnet, CKA, multiple experiment repetitions (runs 0-19)

Figure 12: 128 nodes shut down at once in each subnet, CKA, multiple experiment repetitions (runs 0-19)
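The message-count and memory figures above can be checked with a small calculation. The class is ours; the parameter values in `main` (k = 5 retries, 200 connections, 4-byte addresses) are the example settings used in the text:

```java
// Back-of-the-envelope check of the overhead figures in this section.
public class CkaOverheadEstimates {
    // Sequential flooding: one outage message and one reply per informed node,
    // plus k verification keep-alives before a node actually disconnects.
    static int floodingMessages(int n, int k) { return (2 + k) * n; }

    // Standard keep-alives: each of the n nodes sends 1 + k messages itself.
    static int standardMessages(int n, int k) { return (1 + k) * n; }

    // Cold-set storage: one address per cold-set member, for each of the
    // d hot connections of a node, i.e. quadratic in d.
    static long coldSetBytes(int d, int addressBytes) {
        return (long) d * d * addressBytes;
    }

    public static void main(String[] args) {
        System.out.println(floodingMessages(100, 5)); // 700
        System.out.println(standardMessages(100, 5)); // 600
        System.out.println(coldSetBytes(200, 4));     // 160000 bytes ~ 160 KB
    }
}
```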

6 Related Work and Novelty

The idea of reducing traffic by message bundling and duplicate elimination is not new, but is widely used
in today's distributed systems. Cache systems are a typical example of reducing server load and unnecessary
traffic. By introducing caches into the network, data paths can be shortened; caches on
the local system can even prevent a connection from being created at all. Works that deal with caching techniques in
P2P systems are [12] and [13]. A DNS system based on Chord, which uses search path caching, is described
in [11]. However, these works focus mainly on the data traffic exchanged in a P2P network, not on the
keep-alive traffic. Another concept that is more flexible than a plain cache is the AVP (Active Virtual Peer)
[10]. It introduces components in a P2P network (virtual peers) which can add functionality
such as message bundling, query routing optimization, interest group forming, etc. The AVP concept could
be used to reduce keep-alive traffic by eliminating duplicate requests to a server. Another specialized
caching mechanism for keep-alive traffic is introduced in [14]; it uses intelligent message bundling and
reduces the signalling traffic in the Gnutella network. Some overlay architectures, like [1] and [2],
construct physical-neighborhood-aware topologies. A nice side effect of the neighborhood awareness
is the reduction of the keep-alive traffic, since the average hop count of the messages is reduced (the
messages travel shorter paths). This approach targets all kinds of overlay traffic in general. It does not
provide the efficiency of the CKA algorithm, but can be combined with it to achieve even better results.
The current traffic optimization research for P2P systems underestimates the importance of the keep-alive
mechanisms used. Keep-alive messages are assumed to produce only a modest traffic load, so
too little attention is given to this topic. The research concentrates more on finding the right connections
for a node than on reducing the costs per connection. A fundamental difference between all approaches
presented above in this section and the CKA algorithm is that they reduce the existing keep-alive traffic
by eliminating unnecessary messages which are already travelling through the network. The CKA
algorithm, on the other hand, prevents the creation of unnecessary keep-alive traffic in the first place.

7 Conclusions and Future Work

In this paper we presented a new algorithm for reducing the connection overhead caused by keep-alive
messages in overlays. The idea of the algorithm is to share the keep-alive load among the nodes that
are connected to a common node. It effectively trades network traffic for a relatively small amount of
additional memory at the end nodes. A system that uses the algorithm can benefit from the
increased number of connections per node at low memory and CPU costs in various ways - improved
resilience to net splits, increased routing potential. The properties of the CKA algorithm were evaluated
using a discrete network simulation. Although the CKA algorithm was evaluated within a Chord-like
P2P system, we suggest that the approach is general enough to be applied effectively to a variety of
other large-scale overlay systems with integrated tolerance against node faults. Although significantly
fewer nodes were simulated than is the case in real-world P2P networks, the evaluation results are
promising and indicate the good performance of the CKA algorithm. Evaluating CKA with at least one
additional widely deployed P2P application, like Gnutella or eDonkey [4], and with a much higher number of
nodes, and calculating statistically significant confidence intervals for the CKA performance, is
ongoing research.

References

[1] B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. Kubiatowicz, "Tapestry: A
Resilient Global-scale Overlay for Service Deployment", IEEE Journal on Selected Areas in Communications,
January 2004, Vol. 22, No. 1.

[2] A. Rowstron and P. Druschel, “Pastry: Scalable, distributed object location and routing for large-
scale peer-to-peer systems”, IFIP/ACM International Conference on Distributed Systems Platforms
(Middleware), Heidelberg, Germany, pages 329-350, November, 2001.

[3] R. Bagrodia, R. Meyer, et al., "PARSEC: A Parallel Simulation Environment for Complex Systems",
Computer Magazine, vol. 31, no. 10, pages 77-85, 1998.

[4] I. Gosling: “eDonkey/ed2k: Study of A Young File Sharing Protocol”, SANS Institute Technical
Report available at gosling GCIH.pdf, 2003.

[5] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. “Chord: A scalable peer-to-
peer lookup service for Internet applications.” Technical Report TR-819, MIT, March 2001.
[6] S. Saroiu, P. K. Gummadi, and S. D. Gribble. “A Measurement Study of Peer-to-Peer File Sharing
Systems.” In Proceedings of Multimedia Computing and Networking, 2002.

[7] T. Klingberg, and R. Manfredi, “The Gnutella protocol version 0.6 draft”, Gnutella developer forum
( gdf/files/Development/), 2002

[8] The Network Simulator - ns-2, last visited 28.11.2005.

[9] F. Mattern, “Verteilte Basisalgorithmen.” Springer-Verlag, 1989

[10] H. De Meer, K. Tutschku, and P. Tran-Gia, “Dynamic Operation in Peer-to-Peer Overlay Net-
works”, Praxis der Informationsverarbeitung und Kommunikation, (PIK Journal), Special Issue on
Peer-to-Peer Systems, June 2003.

[11] R.Cox, A. Muthitacharoen, R.T. Morris, “Serving DNS using a Peer-to-Peer Lookup service”,
IPTPS ’01: Revised Papers from the First International Workshop on Peer-to-Peer Systems,
Springer-Verlag, pages 155-165, 2002

[12] F.-U. Andersen, H. de Meer, I. Dedinski, C. Kappler, A. Mäder, J. O. Oberender, K. Tutschku, "An
Architecture Concept for Mobile P2P File Sharing Services." GI Jahrestagung (2), 2004.

[13] N. Leibowitz, M. Ripeanu, A. Wierzbicki, “Deconstructing the Kazaa Network”, in 3rd IEEE Work-
shop on Internet Applications (WIAPP’03). 2003. Santa Clara, CA.

[14] C. Rohrs, V. Falco, "Limewire Ping Pong Scheme", last visited 28.11.2005.

[15] G. Kunzmann, A. Binzenhöfer, R. Henjes, "Analysis of the Stability of the Chord Protocol under
High Churn Rates", 6th Malaysia International Conference on Communications (MICC) in conjunction
with International Conference on Networks (ICON), November 2005, Kuala Lumpur, Malaysia.