Systems
1 Introduction
Peer-to-peer (P2P) systems have recently become a very active research area, due
to the popularity and widespread use of P2P systems today, and their potential
uses in future applications. Recently, P2P systems have emerged as a popular
way to share huge amounts of data (e.g., [1, 16, 17]). In the future, the advent
of large-scale ubiquitous computing makes P2P a natural model for interaction
between devices (e.g., via the web services [18] framework).
P2P systems are popular because of the many benefits they offer: adaptation,
self-organization, load-balancing, fault-tolerance, availability through massive
replication, and the ability to pool together and harness large amounts of
resources. For example, file-sharing P2P systems distribute the main cost of
sharing data (bandwidth and storage) across all the peers in the network, thereby
allowing them to scale without the need for powerful, expensive servers.
Despite their many strengths, however, P2P systems also present several
challenges that are currently obstacles to their widespread acceptance and usage
(e.g., security, efficiency, and performance guarantees like atomicity and
transactional semantics). The P2P environment is particularly challenging to work
in because of the scale of the network and unreliable nature of peers
characterizing most P2P systems today. Many techniques previously developed for
distributed systems of tens or hundreds of servers may no longer apply; new
techniques are needed to meet these challenges in P2P systems.
In this paper, we consider research problems associated with search and security
in data-sharing P2P systems. Though data-sharing P2P systems are capable
of sharing enormous amounts of data (e.g., 0.36 petabytes on the Morpheus [17]
network as of October 2001), such a collection is useless without a search
mechanism allowing users to quickly locate a desired piece of data (Section 2).
Furthermore, to ensure proper, continued operation of the system, security
measures must be in place to protect against availability attacks, unauthentic
data, and illegal access (Section 3). In this paper, we highlight several
important and open research issues within both of these topics.
Note that this paper is not meant to be an exhaustive survey of P2P research.
First, P2P can be applied to many domains outside of data-sharing; for example,
computation (e.g., [19, 20]), collaboration (e.g., [21]), and infrastructure
systems (e.g., [22]) are all popular applications of P2P. Each application faces
its own unique challenge (e.g., job scheduling in computation systems), as well as
common issues (e.g., resource discovery). In addition, within data-sharing systems
there exists important research outside of search and security. Good examples
include resource management issues such as fairness and administrative ease.
Finally, due to space limitations, the issues we present within search and
security are not comprehensive, but illustrative. Examples are also chosen with a
bias towards work done at the Stanford Peers group [23], because it is the
research that the authors know best.
2 Search
2.1 Overview
In a data-sharing P2P system, users submit queries and receive results (such
as data, or pointers to data) in return, via the search mechanism. Data shared
in the system can be of any type. In most cases users share files, such as music
files, images, news articles, web pages, etc. Other possibilities include data
stored in a relational DBMS, or a queryable spreadsheet. Queries may take any form
that is appropriate given the type of data shared. For example, in a file-sharing
system, queries might be keywords with regular expressions, and the search may
be defined over different portions of the document (e.g., header, title, metadata).
A search mechanism defines the behavior of peers in three areas:
- Topology: Defines how peers are connected to each other. In some systems
  (e.g., Gnutella [1]), peers may connect to whomever they wish. In other
  systems, peers are organized into a rigid structure, in which the number and
  nature of connections is dictated by the protocol. Defining a rigid topology
  may increase efficiency, but will restrict autonomy.
- Data placement: Defines how data or metadata is distributed across the
  network of peers. For example, in Gnutella, each node stores only its own
  collection of data. In Chord [2], data or metadata is carefully placed across
  nodes in a deterministic fashion. In super-peer networks [12], metadata for
  a small group of peers is centralized onto a single super-peer.
- Message routing: Defines how messages are propagated through the network.
  When a peer submits a query, the query message is sent to a number
  of the peer's "neighbors" (that is, nodes to whom the peer is connected),
  who may in turn forward the message sequentially or in parallel to some of
  their neighbors, and so on. When, and to whom, messages are sent is dictated
  by the routing protocol. Often, the routing protocol can take advantage
  of known patterns in topology and data placement, in order to reduce the
  number of messages sent.
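To make the interplay of the three areas concrete, the following is a minimal single-process sketch of Gnutella-style flooding: an unconstrained topology, purely local data placement, and a routing rule that forwards each query to all neighbors until a hop limit is reached. The function name `flood_query`, the `ttl` parameter, and the dictionary layout are assumptions made for illustration; they are not taken from any protocol specification.

```python
from collections import deque

def flood_query(topology, data, start, keyword, ttl):
    """Breadth-first flood of a query from `start`, limited to `ttl` hops.

    topology: dict mapping peer -> list of neighbor peers (hypothetical layout)
    data:     dict mapping peer -> set of keywords that peer can answer
    Returns (set of peers holding a match, number of query messages sent).
    """
    seen = {start}
    queue = deque([(start, ttl)])
    hits, messages = set(), 0
    while queue:
        peer, hops_left = queue.popleft()
        if keyword in data.get(peer, set()):
            hits.add(peer)
        if hops_left == 0:
            continue
        for neighbor in topology.get(peer, []):
            messages += 1              # every forward costs one message
            if neighbor not in seen:   # duplicates are received but dropped
                seen.add(neighbor)
                queue.append((neighbor, hops_left - 1))
    return hits, messages

# Illustrative five-peer network; only D and E hold a match.
topology = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
            "D": ["B", "C", "E"], "E": ["D"]}
data = {"D": {"jazz"}, "E": {"jazz"}}
```

Raising `ttl` from 2 to 3 in this toy network reaches one more matching peer at the cost of more messages, which is the kind of trade-off a routing protocol tries to improve by exploiting known topology and data placement.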
In an actual system, the general model described above takes on a different
form depending on the requirements of the system. Requirements are specified
in several main categories:
- Expressiveness: The query language must be able to describe the desired data
  in sufficient detail. Key lookups are not expressive enough for IR searches
  over text documents, and keyword queries are not expressive enough to search
  structured data such as relational tables.
- Comprehensiveness: In some systems, returning any single result is sufficient
  (e.g., anycast), whereas in others, all results are required. The latter
  type of system requires a comprehensive search mechanism, in which all
  possible results are returned.
- Autonomy: Every search mechanism must define peer behavior with respect
  to topology, data placement, and message routing. However, autonomy of a
  peer is restricted when the mechanism limits behavior that a peer could
  reasonably expect to control. For example, a peer may wish to only connect
  to its friends or other trusted peers in the same organization, or the peer
  may wish to control which nodes can store its data (e.g., only nodes on the
  intranet), and how much of other nodes' data it must store. Depending on
  the purpose and users of the system, the search mechanism may be required
  to meet a certain level of autonomy for peers.
In this paper, we assume the additional requirement that the search mechanism
be decentralized. A P2P system may have centralized search, and indeed, such
"hybrid systems" have been very useful and popular in practice (e.g., [16]).
However, centralized systems have been well-studied, and it is desirable that the
search mechanism share the same benefits of P2P mentioned in Section 1; hence,
here we focus only on decentralized P2P solutions.
While a well-designed search mechanism must satisfy the requirements specified
by the system, it should also seek to maximize the following goals:
- Efficiency
- Quality of Service
- Robustness
By placing current work in the framework of requirements and goals above, we
can identify several areas in which research is much needed. In the following
section, we mention just a few of these areas.
2.2 Expressiveness
In order for P2P systems to be useful in a wide range of applications, they must
be able to support query languages of varying levels of expressiveness. Thus far,
work in P2P search has focused on answering simple queries, such as key lookups.
An important area of research therefore lies in developing mechanisms for richer
query languages. Here, we list a few examples of useful types of queries, and
discuss the related work and challenges in supporting them.
- Key lookup: The simplest form of query is an object lookup by key or
  identifier. Protocols directly supporting this primitive have been widely
  studied, and efficient solutions exist (e.g., [2-4]). Ongoing research is
  exploring how to make these protocols more efficient and robust [5].
- Keyword: While much research has focused on search techniques for keyword
  queries (e.g., [11, 10, 6]), all of these techniques have been geared towards
  efficient, partial (not comprehensive) search; e.g., all music-sharing
  systems currently only support partial search. Partial search is acceptable in
  those applications where a few keywords can usually uniquely identify the
  desired file (e.g., music-sharing systems, as opposed to web page
  repositories), because the first few matches are likely to satisfy the user's
  request. Techniques for partial search can always be made comprehensive simply
  by sending the query message to every peer in the network; however, such an
  approach is prohibitively expensive. Hence, designing techniques for efficient,
  comprehensive search remains an open problem.
- Ranked keyword: If many results are returned for comprehensive keyword
  search, users will need results to be ranked and filtered by relevance. While
  the query language for ranked keyword search remains the same, the additional
  information in the results (i.e., the relevance ranking) poses additional
  challenges and opportunities. For example, ranked search can be built on top
  of regular search by retrieving all results and sorting locally; however,
  state-of-the-art ranking functions usually require global statistics over the
  total collection of documents (e.g., document frequency). Collecting and
  maintaining these statistics in a robust, efficient, and distributed manner is
  a challenge. At the same time, ranked results allow the system to return "top
  k" results, which provides the opportunity to optimize search if k is much
  less than the total number of results (which is generally the case, for
  example, in web searches). Techniques for ranked search exist for distributed
  systems of moderate scale (e.g., [7]), but future research must extend these
  techniques to support much larger systems.
- Aggregates: A user may sometimes be interested in knowing aggregate
  properties of the system or data collection as a whole, rather than locating
  specific data. For example, to collect global statistics to support ranked
  keyword search mentioned earlier, a user could submit several SUM queries
  to sum the number of documents that contain a particular term. Ongoing
  research [8] addresses COUNT queries defined over a predicate, for example,
  counting the number of nodes that belong to the stanford.edu domain.
  Further research is needed to extend these techniques into more expressive
  aggregates like SUM, MAX, and MEDIAN.
- SQL: As a complex language defined over a rich data model, SQL is the most
  difficult query language to support among the examples listed. Current
  research on supporting SQL in P2P systems is very preliminary. For example,
  the PIER project [9] supports a subset of SQL over a P2P framework, but
  they report significant performance "hotspots" in their preliminary
  implementation. A great deal of additional research is needed to advance
  current work into a search mechanism with reasonable performance, and to
  investigate alternative approaches to the problem.
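The dependence of ranked keyword search on a global statistic can be illustrated with a small sketch. It assumes a plain TF-IDF score (one possible ranking function, chosen here only for illustration) and simulates all peers inside one process; in a real P2P system the document frequencies would have to be gathered by distributed SUM queries, which is exactly the open problem noted above. The names `global_document_frequency` and `top_k` and the data layout are hypothetical.

```python
import heapq
import math
from collections import Counter

def global_document_frequency(peers):
    """SUM aggregate: for each term, count the documents containing it."""
    df = Counter()
    for docs in peers.values():
        for terms in docs.values():
            df.update(set(terms))      # each document counts once per term
    return df

def top_k(peers, query_terms, k):
    """Score every document with a simple TF-IDF sum and keep the k best."""
    df = global_document_frequency(peers)
    total_docs = sum(len(docs) for docs in peers.values())
    scored = []
    for peer, docs in peers.items():
        for doc_id, terms in docs.items():
            tf = Counter(terms)
            score = sum(tf[t] * math.log(total_docs / df[t])
                        for t in query_terms if df[t])
            if score > 0:
                scored.append((score, peer, doc_id))
    return [(peer, doc_id) for _, peer, doc_id in heapq.nlargest(k, scored)]

# Illustrative collection: peer -> {document id -> list of terms}.
peers = {
    "p1": {"d1": ["chord", "lookup", "lookup"], "d2": ["gnutella", "flood"]},
    "p2": {"d3": ["lookup", "flood"], "d4": ["chord", "chord", "ring"]},
}
```

Here the inverse document frequency of a term cannot be computed by any single peer from its own documents, which is why the statistic must be collected across the network before results can be ranked consistently.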
Autonomy, efficiency and robustness are all desirable features in any system.
These features conceptually define an informal space of P2P systems, as shown in
Figure 1a, where a point in the space represents a system with the corresponding
"values" for each feature. Note that the value of a system with respect to a
feature only provides a partial order, since features can be measured along several
metrics (e.g., efficiency can be measured by bandwidth, processing power, and
storage). Hence, Figure 1 illustrates the space by showing just a few points for
which the relative order (and not the actual coordinates) along each feature is
fairly obvious.
The space defined by autonomy, efficiency and robustness is not fully explored;
in particular, there appears to be some correlation between autonomy
and efficiency (Figure 1b), and autonomy and robustness (Figure 1c). A partial
explanation for the first correlation is that less autonomy allows the search
mechanism to specify a data placement and topology such that:
- There exists a deterministic way to locate data within bounded cost (e.g.,
  Chord)
- There is a small set of nodes that is guaranteed to hold the answer, if it
  exists (e.g., super-peer networks, concept clusters [13])
- There is an increased chance of finding results on a random node (e.g.,
  replication [6]).
Fig. 1. The space of systems defined by autonomy, efficiency and robustness (a).
Looking at a few example systems within this space, there appears to be a
relationship between autonomy and efficiency (b), and autonomy and robustness (c).
Systems plotted: Gnutella, super-peer networks, super-peer redundancy, Chord,
Viceroy.
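The first property in the list above, deterministic location of data, can be sketched with successor placement on a hashed identifier ring, the general idea underlying Chord [2]. This is a simplified illustration, not the Chord protocol: it assumes one process with a global view of all nodes, whereas Chord resolves the same mapping through finger tables in a logarithmic number of hops. The 16-bit ring size and the function names are assumptions.

```python
import hashlib
from bisect import bisect_left

RING_BITS = 16  # small identifier space, chosen only for illustration

def ring_id(name):
    """Hash a node name or a key onto the identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)

def successor(node_ids, key_id):
    """First node id at or after key_id in sorted node_ids, wrapping around."""
    index = bisect_left(node_ids, key_id)
    return node_ids[index % len(node_ids)]

def locate(nodes, key):
    """Return the node responsible for `key` under successor placement."""
    by_id = {ring_id(n): n for n in nodes}
    return by_id[successor(sorted(by_id), ring_id(key))]

nodes = ["peer-a", "peer-b", "peer-c"]  # hypothetical node names
```

Because the responsible node is a pure function of the key and the node set, any peer computes the same answer, but peers give up control over which data they store, which is the loss of autonomy discussed above.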
a range of peers on which a document may be stored (e.g., all peers within the
stanford.edu domain). At one extreme, if the range is always limited to a single
peer, then user autonomy is high, but the system ceases to be P2P and loses good
properties such as load-balancing and self-organization. At the other extreme,
if the range always includes all peers in the network, SkipNet functions as a
traditional P2P lookup system with low autonomy, but other good properties.
While SkipNet does not push beyond existing tradeoffs, its value lies in allowing
users to choose the point along the tradeoff that meets their needs.
3 Security
Securing P2P data sharing applications is challenging due to their open and
autonomous nature. Compared to a client-server system in which servers can be
relied upon or trusted to always follow protocols, peers in a P2P system may
provide no such guarantee. The environment in which a peer must function is a
hostile one in which any peer is welcome to join the network; these peers cannot
necessarily be trusted to route queries or responses correctly, store documents
when asked to, or serve documents when requested. In this part of the paper, we
outline a number of security issues that are characteristic of P2P data sharing
systems, discuss a few examples of research that has taken place to address some
of these issues, and suggest a number of open research problems.
We organize the security requirements of P2P data sharing systems into
four general areas: availability, file authenticity, anonymity, and access control.
Today's P2P systems rarely address all of the necessary requirements in any
one of these areas, and developing systems that have the flexibility to support
requirements in all of these areas is expected to be a research challenge for quite
some time.
For each of these areas, it will be important to develop techniques that
prevent, detect, manage, and are able to recover from attacks. For example, since
it may be difficult to prevent a denial-of-service attack against a system's
availability, it will be important to develop techniques that are able to 1) detect
when a denial-of-service attack is taking place (as opposed to there just being a
high load), 2) manage an attack that is "in-progress" such that the system can
continue to provide some (possibly reduced) level of service to clients, and 3)
recover from the attack by disconnecting the malicious nodes.
3.1 Availability
There are a number of different node and resource availability requirements that
are important to P2P file sharing systems. In particular, each node in a P2P
system should be able to accept messages from other nodes, and communicate
with them to offer access to the resources that it contributes to the network.
A denial-of-service (DoS) attack attempts to make a node and its resources
unavailable by overloading it. The most obvious DoS attack is targeted at using
up all of a node's bandwidth. This type of attack is similar to traditional
network-layer DoS attacks (e.g., [31]). If a node's available bandwidth is used up
transferring useless messages that are directly or indirectly created by a
malicious node, all of the other resources that the node has to offer (including
CPU and storage) will also be unavailable to the P2P network.
A specific example of a DoS attack against node availability is a chosen-victim
attack in Gnutella that an adversary constructs as follows: a malicious
super-node maneuvers its way into a "central" position in the network and then
responds to every query that passes through it, claiming that the victim node has
a file that satisfies the query (even though it does not). Every node that receives
one of these responses then attempts to connect to the victim to obtain the
file that it was looking for, and the number of these requests overloads the
bandwidth of the victim such that any other node seeking a file that the victim
does have is unable to communicate with the victim.
The key aspect to note here is that in our example the attacker exploited a
vulnerability of the Gnutella P2P protocol (namely, that any node can respond
to any query claiming that any file could be anywhere). In the future, P2P
protocols need to be designed to make it hard for adversaries to construct DoS
attacks by taking advantage of loosely constrained protocol features.
Attackers that construct DoS attacks typically need to find and take advantage
of an "amplification mechanism" in the network to cause significantly more
damage than they could with only their own resources. In addition, if they would
like to have control over how their attack is carried out, they must also find or
create a back-door communication channel to communicate with "zombie" hosts
that they infiltrate using manual or automatic means. It is important to design
future P2P protocols such that they do not open up new opportunities for
attackers to use as amplifiers and back-door communication channels.
Some research has taken place to date to specifically address DoS attacks in
P2P networks. In particular, [38] addresses DoS attacks based on query-floods
in the Gnutella network. However, more research is necessary to understand the
effects of other types of DoS attacks in various P2P networks.
Aside from DoS attacks, node availability can also be attacked by malicious
users that infiltrate victim nodes and induce their failure. These types of attacks
can be modeled as fail-stop or Byzantine failures, which could potentially be dealt
with using many techniques that have already been developed (e.g., [34]).
However, these techniques have typically not been popular due to their
inefficiency, unusually high message overhead, and complexity. In addition, these
techniques often assume complete and secure pairwise connectivity between nodes,
which is not the case in most P2P networks. Further research will be necessary to
make these or similar techniques acceptable from a performance and security
standpoint in a P2P context.
In addition, there are many proposals to provide significant levels of
fault-tolerance in the face of node failure, including CAN [3], Chord [2],
Pastry [4], and Viceroy [14]. Security analyses of these types of proposals can be
found in [43] and [36]. The IRIS [25] project seeks to continue the investigation
of these types of approaches.
A malicious node can also directly attack the availability of any of the
particular resources at a node. The CPU availability at a node can be attacked by
sending a modest number of complex queries to bog down the CPU of a node
without consuming all of its bandwidth. The available storage could be attacked
by malicious nodes who are allowed to submit bogus documents for storage. One
approach to deal with this is to allocate storage to nodes in a manner
proportional to the resources that a node contributes to the network, as proposed
in [28].
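The idea of contribution-proportional storage can be sketched as follows. This is an illustrative quota rule only, not the scheme of [28]; the contribution metric, the integer rounding, and the names `storage_quotas` and `accept_document` are assumptions.

```python
def storage_quotas(contributions, total_storage):
    """Split total_storage among nodes in proportion to what each contributes.

    contributions: dict node -> units of resource contributed (hypothetical metric)
    """
    total = sum(contributions.values())
    if total == 0:
        return {node: 0 for node in contributions}
    return {node: total_storage * amount // total
            for node, amount in contributions.items()}

def accept_document(used, quotas, node, size):
    """Store a document only if it fits within the submitting node's quota."""
    if used.get(node, 0) + size > quotas.get(node, 0):
        return False
    used[node] = used.get(node, 0) + size
    return True
```

Under such a rule, a node that contributes nothing receives a quota of zero, so it cannot fill the network's storage with bogus documents, while a contributing node can still only consume space in proportion to what it offers.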
We might like to ensure that all files stored in the system are always available
regardless of which nodes in the network are currently online. File availability
copy of that document. For example, if Charles Darwin was the first author to
ever submit a document with the title "Origin of Species," then his document
would be considered to be an authentic match for a query looking for "Origin of
Species" as the title. Any documents that were submitted with the title "Origin
of Species" after Charles Darwin's submission would be considered unauthentic
matches to the query even if we decided to store these documents in the system.
Timestamping systems (e.g., [35]) can be helpful in constructing file authenticity
systems based on this approach.
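The first-submission rule in the example above reduces to picking the earliest timestamp among documents that share a title. A minimal sketch, assuming timestamps come from a trusted timestamping service and that a submission is a `(timestamp, author, title)` tuple (a layout invented for this illustration):

```python
def authentic_match(submissions, title):
    """Return the earliest submission carrying `title`, or None if there is none.

    submissions: iterable of (timestamp, author, title) tuples; timestamps
    are assumed to be trustworthy and comparable.
    """
    matches = [s for s in submissions if s[2] == title]
    return min(matches, key=lambda s: s[0]) if matches else None

# Illustrative data only; the integers stand in for issued timestamps.
submissions = [
    (1900, "Impostor", "Origin of Species"),
    (1859, "Charles Darwin", "Origin of Species"),
    (1871, "Charles Darwin", "The Descent of Man"),
]
```

The rule is only as strong as the timestamps: if an adversary can forge or backdate one, the later copy would be selected as authentic.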
Expert-Based. In this approach, a document would be deemed authentic by
an "expert" or authoritative node. For example, node G may be an expert that
keeps track of signatures for all files ever authored by any user of G. If a user
searching for documents authored by any of G's users is ever concerned about
the potential authenticity of a file received as a response to a query, node G
can be consulted. Of course, if node G is unavailable at any particular time
due to a transient or permanent failure, is infiltrated by an attacker, or is itself
malicious, it may be difficult to properly verify the authenticity of files that G's
users authored. Offline digital signature schemes (e.g., RSA) can be used to verify
file authenticity in the face of node failures, but are limited by the lifetime and
security of public/private keys.
Voting-Based. To deal with the possible failure of G or a compromised key
in our last approach, our third definition of authenticity takes into account the
"votes" of many experts. The expert nodes may be nodes that are run by human
experts qualified to study and assess the authenticity of particular types of
files, and the majority opinion of the human experts can be used to assess the
authenticity of a file. Alternatively, the expert nodes may simply be "regular"
nodes that store files, and will vote that a particular file is authentic if they
store a copy of it. In this scheme, users are expected to delete files that they do
not believe are authentic, and a file's authenticity is determined by the number of
copies of the file that are distributed throughout the system. The key technical
issues with this approach are how to prevent spoofing of votes, of nodes, and of
files.
Reputation-Based. Some experts might be more trustworthy than others (as
determined by past performance), and we might weight the votes of more
trustworthy experts more heavily. The weights in this approach are a simple
example of "reputations" that may be maintained by a reputation system. A
reputation system is responsible for maintaining, updating, and propagating such
weights and other reputation information [41]. Reputation systems may or may
not choose to use voting in making their assessments. There has been some study
of reputation systems in the context of P2P networks (e.g., [33, 24]), but no such
system has been commercially successful.
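The voting-based and reputation-based definitions differ only in how votes are weighted, which a short sketch makes explicit. The majority threshold of one half, the default weight, and the name `weighted_vote` are assumptions for illustration; the sketch does not address the spoofing of votes or nodes.

```python
def weighted_vote(votes, reputation, default_weight=1.0):
    """Decide authenticity from expert votes weighted by reputation.

    votes:      dict expert -> True (authentic) or False (not authentic)
    reputation: dict expert -> non-negative weight (hypothetical scale)
    Returns True when the weighted 'authentic' share exceeds one half.
    """
    total = in_favor = 0.0
    for expert, says_authentic in votes.items():
        weight = reputation.get(expert, default_weight)
        total += weight
        if says_authentic:
            in_favor += weight
    return total > 0 and in_favor > total / 2

votes = {"e1": True, "e2": False, "e3": False}  # illustrative votes
```

With an empty `reputation` mapping every expert has equal weight and the function is plain majority voting; supplying weights lets a single highly reputed expert outweigh several unknown ones.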
3.3 Anonymity
There is much work that has taken place on anonymity in the context of the
Internet both at the network-layer (e.g., [30]) as well as at the application-layer
Type of Anonymity: Author, Server, Reader, Document
If the benefits of P2P systems are to be realized, we need to explore the
feasibility of and the technical approaches to instrumenting them with appropriate
mechanisms to allow for the management of intellectual property.
4 Conclusion
Many of the open problems in P2P data sharing systems surround search and
security issues. The key research problem in providing a search mechanism is
how to provide for maximum expressiveness, comprehensiveness, and autonomy
with the best possible efficiency, quality-of-service, and robustness. The key to
securing a P2P network lies in designing mechanisms that ensure availability, file
authenticity, anonymity, and access control. In this paper, we have illustrated
some of the trade-offs at the heart of search and security problems in P2P data
sharing systems, and outlined several major areas of importance for future work.
References
1. Gnutella website. http://www.gnutella.com
2. Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., Balakrishnan, H.: Chord: A
   scalable peer-to-peer lookup service for internet applications. Proc. ACM
   SIGCOMM (2001)
3. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A scalable
   content addressable network. Proc. ACM SIGCOMM (2001)
4. Rowstron, A., Druschel, P.: Pastry: Scalable, distributed object location and
   routing for large-scale peer-to-peer systems. Proc. of the 18th IFIP/ACM Intl.
   Conf. on Distributed Systems Platforms (2001)
5. Ratnasamy, S., Shenker, S., Stoica, I.: Routing Algorithms for DHTs: Some Open
   Questions. Proc. IPTPS (2002)
6. Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S.: Search and replication in
   unstructured peer-to-peer networks. Proc. of Intl. Conf. on Supercomputing (2002)
7. Cuenca-Acuna, F. M., Peery, C., Martin, R. P., Nguyen, T. D.: PlanetP: using
   gossiping to build content addressable peer-to-peer information sharing
   communities. Technical Report DCS-TR-487, Dept. of Computer Science, Rutgers
   Univ. (2002)
8. Bawa, M., Garcia-Molina, H., Gionis, A., Motwani, R.: Estimating the size of a
   peer-to-peer network (2002)
9. Harren, M., Hellerstein, M., Huebsch, R., Loo, B., Shenker, S., Stoica, I.:
   Complex Queries in DHT-based Peer-to-Peer Networks. Proc. IPTPS (2002)
10. Crespo, A., Garcia-Molina, H.: Routing indices for peer-to-peer systems. Proc.
    28th Intl. Conf. on Distributed Computing Systems (2002)
11. Yang, B., Garcia-Molina, H.: Improving search in peer-to-peer systems. Proc.
    28th Intl. Conf. on Distributed Computing Systems (2002)
12. Yang, B., Garcia-Molina, H.: Designing a super-peer network. Proc. ICDE (2003)
13. Schlosser, M., Sintek, M., Decker, S., Nejdl, W.: A scalable and ontology-based
    P2P infrastructure for semantic web services (2002)
14. Malkhi, D., Naor, M., Ratajczak, D.: Viceroy: a scalable and dynamic emulation
    of the butterfly. Proc. PODC (2002)
15. Harvey, N., Jones, M., Saroiu, S., Theimer, M., Wolman, A.: SkipNet: a scalable
    overlay network with practical locality properties (2002)
16.
17.
18.
19.
20.
21.
22.