Professional Documents
Culture Documents
Table 1 presents the characteristics of the Consider a small world of C clusters, each
graphs of users who shared data within various comprising, on average, G nodes. A cluster is
time intervals ranging from 1 day to 30 days. The defined as a community with overlapping data
small-world pattern is evident when comparing interests, independent of geographical or
the clustering coefficient and average path length administrative proximity. Clusters are linked
with those of a random graph of the same size together in a connected network. In this structure,
(same number of nodes and edges): the clustering we combine information dissemination techniques
coefficient of a small-world graph is significantly with request-forwarding search mechanisms:
larger than that of a similar random graph, while location information is propagated aggressively
the average path length is about the same within clusters, while inter-cluster search uses
request forwarding techniques.
3. Locating files in small-world We chose gossip [10] as the information
networks dissemination mechanism: nodes gossip location
We consider an environment with potentially information to other nodes within the cluster.
hundreds of thousands of geographically Eventually, with high probability, all nodes will
distributed nodes that provide location learn about all other nodes in the cluster. They
information as <logical filename, physical location> will also know, with high probability, all location
pairs. information provided by all nodes within the
Locating files in this environment is cluster. Hence, a request addressed to any node in
challenging because of scale and dynamism: the the cluster can be satisfied at that node, if the
number of nodes, logical files, requests, and answer exists within the cluster.
concurrent users (seen as file location requestors) A request that cannot be answered by the local
may all be large. The system has multiple sources node is forwarded to other cluster(s), by unicast,
of variation over time: files are created and multicast, or flooding. Ideally, clusters can
removed frequently; nodes join and leave the organize themselves dynamically in search-
system without a predictable pattern. In such a optimized structures, thus allowing a low cost
system with a large number of components (nodes inter-cluster file retrieval. Since any node in a
and files), even a low variation rate at the cluster has all information provided in that cluster,
individual level may aggregate into frequent the search space reduces from C*G to C.
group level changes. In this context, nodes need to store the total
We exploit the two environmental amount of information provided by the cluster to
characteristics introduced in Section 2—group which they belong. In order to reduce storage
and time locality—to advance our performance costs, we use a compact, probabilistic
objective of minimizing file location latency. We representation of information based on Bloom
also build on our assumption that small-world Filters (Section 4.2). Nodes can trade off the
structures eventually emerge in P2P scientific amount of memory used for the accuracy in
collaborations. representing information.
3
Each node needs to have sufficient topology functions, h1, h2,,…,hk with hi:X {1..m}. For a
knowledge to forward requests outside the cluster. fixed size (n) of the set to be represented, the
Not every node needs to be connected to nodes tradeoff between accuracy and space (m bits) is
from remote clusters, but, probabilistically, every controlled by the number of hash functions used
node needs to know a local node that has external (k). The probability of a false positive is:
connections. The question of how to form and − kn k
not advertised for some time is considered is a constant. Zipf distributions are widely present
removed. in the Internet world. For example, the popularity
of documents requested from an Internet proxy
4.2. Bloom Filters cache (with 0.65 < < 0.85), Web server
Bloom filters [7] are compact data structures document popularity (0.75 < < 0.85), and
used for probabilistic representation of a set in Gnutella query popularity (0.63 < < 1.24) all
order to support membership queries (“Is element exhibit Zipf distributions. For our problem, we
x in set X?”). The cost of this compact assume that file popularity in each cluster (group)
representation is a small rate of false positives: follows a Zipf distribution.
the structure sometimes incorrectly recognizes an Locality of interests. As discussed above,
element as member of the set. clusters are formed based on shared interest. We
Bloom filters describe membership of a set A therefore assume that information on the most
by using a bit vector of length m and k hash popular files is available within the cluster and
4
only requests for not-so-popular files are small-world properties that exist at the social
forwarded. level? There are at least two ways to attempt to
With these assumptions, we can estimate the answer this question. The first approach is to look
fraction of file requests served by the group as a at existing small worlds and to identify the
function of the distribution parameter and the characteristics that foster the small-world
fraction of files about which the group maintains phenomenon. The second approach is to start from
information. For example, as Figure 1 shows, 68% theoretical models that generate small worlds [15]
of all requests are served by the group when and mirror them into protocol design.
information about only top 1% most popular files The Gnutella network is an interesting case
is available at group level, for =1. Figure 1 study as it is a P2P self-configuring technological
strongly emphasizes the need for efficient, interest network that exhibits small-world characteristics
based, cluster creation. [9]. How are the small-world characteristics
generated? One possible answer is that the social
100% network formed by the Gnutella users reflects its
Requests served locally (%)
80%
small-world patterns onto the technological
network. While this is not impossible, we observe
60% 50% that a user has a very limited contribution to the
10% Gnutella network topology. Hence, we believe the
40%
1% social influence on the Gnutella topology is
20% 0.1% insignificant.
0.01% More significant for the small-world
0%
phenomenon may be Gnutella’s network
0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7
alpha exploration protocol based on PING and PONG
messages: a PING is sent to all neighbors and
Figure 2: Fraction of requests served locally (by one each neighbor forwards it further to its own
member of a group) assuming various values for
neighbors, and so on. The PONG messages return
on the same path, allowing a node to learn of its
We estimate 100s of clusters with 1,000s of neighbor’s neighbors, and hence to improve
nodes in a cluster, sharing information on about clustering. However, the influence of this
10 million files per cluster. Using Bloom filters, mechanism is limited by the (comparatively)
for 0.1% false positives rate, each node needs 2 small number of connections per node. This fact
bytes per file or 20MB of memory to store explains why, despite an aggressive exploration of
information about all files available in the cluster. the network, the clustering coefficient in Gnutella
Assuming a 10-day average lifetime for a file at a is not large (e.g., it is an order of magnitude lower
node, and a self-imposed threshold of 0.1% false than the clustering coefficients in coauthorship
positives, then the generated traffic needed to networks).
maintain this accuracy level within the cluster can A theoretical model for building small-world
be estimated at about 24 KBps at each node. graphs [15] starts from a highly clustered graph
False negatives may have two sources: the (e.g., a lattice) and randomly adds or rewires
probabilistic information dissemination edges to connect different clusters. This
mechanism and inaccuracy in the inter-cluster methodology would be relevant to us if we had
search algorithm. By appropriately tuning the the clusters already formed and connected.
gossip periodicity and fanout, the system can Allowing clusters to form dynamically based on
control the rate of false negatives by increasing shared interests, allowing them to learn about each
communication costs. others, to adapt to users’ changing interests (e.g.,
5. Creating a small world divide or merge with other clusters) are parts of
The question raised and not answered in this the problem we formulate and continue
paper is: what protocols should be used for investigating. However, if these problems are
allowing a self-configuring network to reflect the solved, possible approaches for transforming a
5
loosely connected graph of clusters into a small 4. Sloan Digital Sky Survey,
world (hence, with small average path length) are: http://www.sdss.org/sdss.html
The hands-off approach: random graphs have 5. Albert, R. and Barabasi, A.-L. Statistical
small average path length. It is thus intuitive Mechanics of Complex Networks. Rev. Mod.
Phys., in press.
that “randomly” connected clusters will form a
6. Avery, P. and Foster, I., The GriPhyN Project:
small world. Towards Petascale Virtual-Data Grids. GriPhyN
The centralized approach at the cluster level: in TR2001-14, http://www.griphyn.org.
each cluster, one or multiple nodes are assigned 7. Bloom, B. Space/time Trade-offs in Hash Coding
the task of creating external connections. with Allowable Errors. Communications of the
The agent-based approach: allow an agent to ACM, 13 (7). 422-426.
explore the network and rewire it where 8. Clarke, I., Sandberg, O., Wiley, B. and Hong,
necessary. This approach is usually rejected due T.W., Freenet: A distributed anonymous
to associated security issues. information storage and retrieval system. In ICSI
Workshop on Design Issues in Anonymity and
6. Summary Unobservability, (Berkeley, California, 2000).
We studied the file location problem in 9. Jovanovic, M.A., Annexstein, F.S. and Berman,
decentralized, self-configuring P2P networks K.A., Scalability Issues in Large Peer-to-Peer
associated with scientific data sharing Networks - A Case Study of Gnutella. University
of Cincinnati.
collaborations. A qualitative analysis of the
10. Kermarrec, A., Massoulie, L. and Ganesh, A.,
characteristics of these collaborations, quantitative Reliable probabilistic communication in large-
analysis of file sharing information from one such scale information dissemination systems. MSR-
collaboration, and previous analyses of various TR-2000-105, Microsoft Research Cambridge.
social networks lead us to speculate that a P2P 11. Loebel-Carpenter, L., Lueking, L., Moore, C.,
scientific collaboration may benefit from a small- Pordes, R., Trumbo, J., Veseli, S., Terekhov, I.,
world topology. We sketch a mechanism for low- Vranicar, M., White, S. and White, V., SAM and
latency file retrieval that benefits from the the Particle Physics Data Grid. In Proceedings of
particularities of the scientific collaboration Computing in High-Energy and Nuclear Physics,
environments and a small-world topology. While (Beijing, China, 2001).
12. Mitzenmacher, M., Compressed Bloom Filters. In
we do not provide a solution for building topology
Twentieth ACM Symposium on Principles of
protocols flexible enough to resemble the Distributed Computing (PODC 2001), (Newport,
dynamics and patterns of social interactions, we Rhode Island, 2001).
stress the relevance of this problem and we 13. Ratnasamy, S., Francis, P., Handley, M., Karp, R.
discuss some possible directions for research. and Shenker, S., A Scalable Content-Addressable
Network. In SIGCOMM, (San Diego USA, 2001).
7. Acknowledgements 14. Stoica, I., Morris, R., Karger, D., Kaashoek, M.F.
We are grateful to John Weigand, Gabriele and Balakrishnan, H., Chord: A Scalable Peer-to-
Garzoglio, and their colleagues at Fermi National peer Lookup Service for Internet Applications. In
Accelerator Laboratory for their generous help. SIGCOMM 2001, (San Diego, USA, 2001).
This work was supported by the National Science 15. Watts, D.J. Small Worlds. The Dynamics of
Foundation under contract ITR-0086044. Networks between Order and Randomness.
Princeton University Press, Princeton, New
8. References Jersey, 1999.
1. Clip2, Gnutella Protocol Specifications v0.4, 16. Zhao, B.Y., Kubiatowicz, J.D. and Joseph, A.D.,
http://www.clip2.com Tapestry: An Infrastructure for Fault-tolerant
2. The DØ Experiment, http://www-d0.fnal.gov/ Wide-area Location and Routing. CSD-01-1141,
3. Human Genome Project, Berkeley.
http://www.nhgri.nih.gov