
Open Problems in Data-Sharing Peer-to-Peer Systems

Neil Daswani, Hector Garcia-Molina, and Beverly Yang

Stanford University, Stanford CA 94305, USA,
{daswani, hector, byang}@db.stanford.edu,

http://www-db.stanford.edu
Abstract. In a Peer-To-Peer (P2P) system, autonomous computers pool
their resources (e.g., files, storage, compute cycles) in order to
inexpensively handle tasks that would normally require large costly servers.
The scale of these systems, their "open nature," and the lack of centralized
control pose difficult performance and security challenges. Much research
has recently focused on tackling some of these challenges; in this paper,
we propose future directions for research in P2P systems, and highlight
problems that have not yet been studied in great depth. We focus on
two particular aspects of P2P systems, search and security, and suggest
several open and important research problems for the community
to address.
1 Introduction
Peer-to-peer (P2P) systems have recently become a very active research area, due
to the popularity and widespread use of P2P systems today, and their potential
uses in future applications. Recently, P2P systems have emerged as a popular
way to share huge amounts of data (e.g., [1, 16, 17]). In the future, the advent
of large-scale ubiquitous computing makes P2P a natural model for interaction
between devices (e.g., via the web services [18] framework).
P2P systems are popular because of the many benefits they offer: adaptation,
self-organization, load-balancing, fault-tolerance, availability through
massive replication, and the ability to pool together and harness large amounts of
resources. For example, file-sharing P2P systems distribute the main cost of
sharing data (bandwidth and storage) across all the peers in the network, thereby
allowing them to scale without the need for powerful, expensive servers.
Despite their many strengths, however, P2P systems also present several
challenges that are currently obstacles to their widespread acceptance and usage,
e.g., security, efficiency, and performance guarantees like atomicity and
transactional semantics. The P2P environment is particularly challenging to work in
because of the scale of the network and unreliable nature of peers characterizing
most P2P systems today. Many techniques previously developed for distributed
systems of tens or hundreds of servers may no longer apply; new techniques are
needed to meet these challenges in P2P systems.
In this paper, we consider research problems associated with search and security
in data-sharing P2P systems. Though data-sharing P2P systems are capable
of sharing enormous amounts of data (e.g., 0.36 petabytes on the Morpheus [17]
network as of October 2001), such a collection is useless without a search
mechanism allowing users to quickly locate a desired piece of data (Section 2).
Furthermore, to ensure proper, continued operation of the system, security measures
must be in place to protect against availability attacks, unauthentic data, and
illegal access (Section 3). In this paper, we highlight several important and open
research issues within both of these topics.
Note that this paper is not meant to be an exhaustive survey of P2P research.
First, P2P can be applied to many domains outside of data-sharing; for example,
computation (e.g., [19, 20]), collaboration (e.g., [21]), and infrastructure systems
(e.g., [22]) are all popular applications of P2P. Each application faces its own
unique challenge (e.g., job scheduling in computation systems), as well as
common issues (e.g., resource discovery). In addition, within data-sharing systems
there exists important research outside of search and security. Good examples
include resource management issues such as fairness and administrative ease.
Finally, due to space limitations, the issues we present within search and security
are not comprehensive, but illustrative. Examples are also chosen with a bias
towards work done at the Stanford Peers group [23], because it is the research
that the authors know best.
2 Search
A good search mechanism allows users to effectively locate desired data in a
resource-efficient manner. Designing such a mechanism is difficult in P2P systems
for several reasons: scale of the system, unreliability of individual peers, etc. In
this section, we outline the basic architecture, requirements and goals of a search
mechanism for P2P systems, and then suggest several areas of open research.
2.1 Overview
In a data-sharing P2P system, users submit queries and receive results (such
as data, or pointers to data) in return, via the search mechanism. Data shared
in the system can be of any type. In most cases users share files, such as music
files, images, news articles, web pages, etc. Other possibilities include data stored
in a relational DBMS, or a queryable spreadsheet. Queries may take any form
that is appropriate given the type of data shared. For example, in a file-sharing
system, queries might be keywords with regular expressions, and the search may
be defined over different portions of the document (e.g., header, title, metadata).
A search mechanism defines the behavior of peers in three areas:
- Topology: Defines how peers are connected to each other. In some systems
(e.g., Gnutella [1]), peers may connect to whomever they wish. In other
systems, peers are organized into a rigid structure, in which the number and
nature of connections is dictated by the protocol. Defining a rigid topology
may increase efficiency, but will restrict autonomy.
- Data placement: Defines how data or metadata is distributed across the
network of peers. For example, in Gnutella, each node stores only its own
collection of data. In Chord [2], data or metadata is carefully placed across
nodes in a deterministic fashion. In super-peer networks [12], metadata for
a small group of peers is centralized onto a single super-peer.
- Message routing: Defines how messages are propagated through the network.
When a peer submits a query, the query message is sent to a number
of the peer's "neighbors" (that is, nodes to whom the peer is connected),
who may in turn forward the message sequentially or in parallel to some of
their neighbors, and so on. When, and to whom, messages are sent is dictated
by the routing protocol. Often, the routing protocol can take advantage
of known patterns in topology and data placement, in order to reduce the
number of messages sent.
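To make the three areas concrete, the following is a minimal sketch of query flooding over an unstructured topology in which each peer stores only its own data. The function name flood_query, the hop limit ttl, and the substring match are assumptions made for illustration; this is a simplified model, not the Gnutella protocol itself.

```python
from collections import deque

def flood_query(topology, data, source, keyword, ttl):
    """Flood a query from `source` for at most `ttl` hops.

    topology: dict mapping peer -> list of neighbor peers
    data:     dict mapping peer -> set of item names stored locally
    Returns (hits, messages): peers holding a match, and messages sent.
    """
    seen = {source}
    hits = []
    messages = 0
    frontier = deque([(source, ttl)])
    while frontier:
        peer, remaining = frontier.popleft()
        # Each peer answers from its own collection only.
        if any(keyword in item for item in data.get(peer, ())):
            hits.append(peer)
        if remaining == 0:
            continue
        for neighbor in topology.get(peer, ()):
            messages += 1  # every forwarded copy costs bandwidth
            if neighbor not in seen:  # duplicate copies are dropped on arrival
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return hits, messages
```

Raising ttl reaches more peers and finds more results, but the message count grows with every additional hop, which is the cost that smarter routing protocols try to reduce.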
In an actual system, the general model described above takes on a different
form depending on the requirements of the system. Requirements are specified
in several main categories:
- Expressiveness: The query language used for a system must be able to
describe the desired data in sufficient detail. Key lookups are not expressive
enough for IR searches over text documents, and keyword queries are not
expressive enough to search structured data such as relational tables.
- Comprehensiveness: In some systems, returning any single result is
sufficient (e.g., anycast), whereas in others, all results are required. The latter
type of system requires a comprehensive search mechanism, in which all
possible results are returned.
- Autonomy: Every search mechanism must define peer behavior with respect
to topology, data placement, and message routing. However, autonomy of a
peer is restricted when the mechanism limits behavior that a peer could
reasonably expect to control. For example, a peer may wish to only connect
to its friends or other trusted peers in the same organization, or the peer
may wish to control which nodes can store its data (e.g., only nodes on the
intranet), and how much of other nodes' data it must store. Depending on
the purpose and users of the system, the search mechanism may be required
to meet a certain level of autonomy for peers.
In this paper, we assume the additional requirement that the search mechanism
be decentralized. A P2P system may have centralized search, and indeed, such
"hybrid systems" have been very useful and popular in practice (e.g., [16]).
However, centralized systems have been well-studied, and it is desirable that the
search mechanism share the same benefits of P2P mentioned in Section 1; hence,
here we focus only on decentralized P2P solutions.
While a well-designed search mechanism must satisfy the requirements
specified by the system, it should also seek to maximize the following goals:
- Efficiency: We measure efficiency in terms of absolute resources consumed:
bandwidth, processing power, storage, etc. An efficient use of resources
results in lighter overhead on the system, and hence, higher throughput.
- Quality of Service: We can measure quality of service (QoS) along many
different metrics depending on the application: number of results, response
time, etc. Note the distinction between QoS and efficiency: QoS focuses on
user-perceived qualities, while efficiency focuses on the resource cost (e.g.,
bandwidth) to achieve a particular level of service.
- Robustness: We define robustness to mean stability in the presence of
failures: quality of service and efficiency are maintained as peers in the system
fail or leave. Robustness to attacks is a separate issue discussed in Section 3.
By placing current work in the framework of requirements and goals above, we
can identify several areas in which research is much needed. In the following
section, we mention just a few of these areas.
2.2 Expressiveness
In order for P2P systems to be useful in a wide range of applications, they must
be able to support query languages of varying levels of expressiveness. Thus far,
work in P2P search has focused on answering simple queries, such as key lookups.
An important area of research therefore lies in developing mechanisms for richer
query languages. Here, we list a few examples of useful types of queries, and
discuss the related work and challenges in supporting them.
- Key lookup: The simplest form of query is an object lookup by key or
identifier. Protocols directly supporting this primitive have been widely studied,
and efficient solutions exist (e.g., [2-4]). Ongoing research is exploring how
to make these protocols more efficient and robust [5].
- Keyword: While much research has focused on search techniques for keyword
queries (e.g., [11, 10, 6]), all of these techniques have been geared towards
efficient, partial (not comprehensive) search; e.g., all music-sharing
systems currently only support partial search. Partial search is acceptable in
those applications where a few keywords can usually uniquely identify the
desired file (e.g., music-sharing systems, as opposed to web page repositories),
because the first few matches are likely to satisfy the user's request.
Techniques for partial search can always be made comprehensive simply by
sending the query message to every peer in the network; however, such an
approach is prohibitively expensive. Hence, designing techniques for efficient,
comprehensive search remains an open problem.
- Ranked keyword: If many results are returned for comprehensive keyword
search, users will need results to be ranked and filtered by relevance. While
the query language for ranked keyword search remains the same, the additional
information in the results (i.e., the relevance ranking) poses additional
challenges and opportunities. For example, ranked search can be built on top
of regular search by retrieving all results and sorting locally; however,
state-of-the-art ranking functions usually require global statistics over the total
collection of documents (e.g., document frequency). Collecting and maintaining
these statistics in a robust, efficient, and distributed manner is a
challenge. At the same time, ranked results allow the system to return "top
k" results, which provides the opportunity to optimize search if k is much
less than the total number of results (which is generally the case, for example,
in web searches). Techniques for ranked search exist for distributed
systems of moderate scale (e.g., [7]), but future research must extend these
techniques to support much larger systems.
- Aggregates: A user may sometimes be interested in knowing aggregate
properties of the system or data collection as a whole, rather than locating
specific data. For example, to collect global statistics to support ranked
keyword search mentioned earlier, a user could submit several SUM queries
to sum the number of documents that contain a particular term. Ongoing
research [8] addresses COUNT queries defined over a predicate; for example,
counting the number of nodes that belong to the stanford.edu domain.
Further research is needed to extend these techniques into more expressive
aggregates like SUM, MAX, and MEDIAN.
- SQL: As a complex language defined over a rich data model, SQL is the most
difficult query language to support among the examples listed. Current
research on supporting SQL in P2P systems is very preliminary. For example,
the PIER project [9] supports a subset of SQL over a P2P framework, but
they report significant performance "hotspots" in their preliminary
implementation. A great deal of additional research is needed to advance current
work into a search mechanism with reasonable performance, and to investigate
alternative approaches to the problem.
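The dependence of ranked keyword search on global statistics can be illustrated with a small sketch. It assumes a toy TF-IDF score and computes the global document frequency as a SUM aggregate over per-peer counts. All names here (local_document_frequency, global_statistics, top_k) are hypothetical, and the statistics are gathered centrally for clarity; doing so robustly across a large network is the open problem described above.

```python
import math
from collections import Counter

def local_document_frequency(docs):
    """Per-peer statistic: number of local documents containing each term."""
    df = Counter()
    for text in docs.values():
        df.update(set(text.lower().split()))
    return df

def global_statistics(peers):
    """SUM aggregate over all peers: total document count and global DF."""
    total_docs = sum(len(docs) for docs in peers.values())
    global_df = Counter()
    for docs in peers.values():
        global_df.update(local_document_frequency(docs))
    return total_docs, global_df

def top_k(peers, query, k):
    """Score every document with a simple TF-IDF sum and keep the best k."""
    total_docs, global_df = global_statistics(peers)
    terms = query.lower().split()
    scored = []
    for peer, docs in peers.items():
        for name, text in docs.items():
            tf = Counter(text.lower().split())
            score = sum(
                tf[t] * math.log(total_docs / global_df[t])
                for t in terms
                if global_df[t] > 0
            )
            if score > 0:
                scored.append((score, peer, name))
    # Highest score first; ties are broken by peer and document name.
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))
    return [(peer, name) for _, peer, name in scored[:k]]
```

Without the global document frequency, each peer could only rank its own documents against its local collection, and scores from different peers would not be comparable.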
2.3 Autonomy, Efficiency and Robustness
Autonomy, efficiency and robustness are all desirable features in any system.
These features conceptually define an informal space of P2P systems, as shown in
Figure 1a, where a point in the space represents a system with the corresponding
"values" for each feature. Note that the value of a system with respect to a
feature only provides a partial order, since features can be measured along several
metrics (e.g., efficiency can be measured by bandwidth, processing power, and
storage). Hence, Figure 1 illustrates the space by showing just a few points for
which the relative order (and not the actual coordinates) along each feature is
fairly obvious.
The space defined by autonomy, efficiency and robustness is not fully
explored; in particular, there appears to be some correlation between autonomy
and efficiency (Figure 1b), and autonomy and robustness (Figure 1c). A partial
explanation for the first correlation is that less autonomy allows the search
mechanism to specify a data placement and topology such that:
- There exists a deterministic way to locate data within bounded cost (e.g.,
Chord)
- There is a small set of nodes that is guaranteed to hold the answer, if it
exists (e.g., super-peer networks, concept clusters [13])
- There is an increased chance of finding results on a random node (e.g.,
replication [6]).

Fig. 1. The space of systems defined by autonomy, efficiency and robustness (a).
Looking at a few example systems within this space, there appears to be a relationship
between autonomy and efficiency (b), and autonomy and robustness (c).
[Figure not reproduced; the systems plotted are Gnutella, super-peer networks,
super-peer redundancy, Chord, and Viceroy.]
At the same time, these rigidly organized networks can be difficult or expensive
to maintain, especially as peers join and leave the network at the rapid rate
characteristic of many P2P systems. As a result, robustness is also correlated
with autonomy.
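The deterministic placement discussed above can be sketched with a hash ring, on which each key is stored at the first peer clockwise from the key's position. This is a simplified illustration in the spirit of Chord's placement rule only; it omits Chord's routing tables, and the names Ring, ring_id, owner and leave are assumptions made for this example.

```python
import bisect
import hashlib

def ring_id(name, bits=32):
    """Map a name to a point on a 2**bits identifier ring (SHA-1 based)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

class Ring:
    def __init__(self, peers):
        # Peers sorted by their position on the ring.
        self.points = sorted((ring_id(p), p) for p in peers)

    def owner(self, key):
        """Return the first peer clockwise from the key's ring position."""
        ids = [point for point, _ in self.points]
        index = bisect.bisect_left(ids, ring_id(key))
        return self.points[index % len(self.points)][1]

    def leave(self, peer):
        """Remove a departing peer; its keys fall to its successor."""
        self.points = [(i, p) for i, p in self.points if p != peer]
```

Every peer computes the same owner for a key, so lookups have bounded cost, but no peer chooses what it stores, and each join or leave requires the structure to be repaired. This is the loss of autonomy and the maintenance burden noted above.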
One important area of research is finding techniques that push beyond the
current tradeoffs between efficiency, autonomy and robustness. Decoupling
efficiency from autonomy seems to be the greatest challenge, since existing
techniques almost uniformly sacrifice autonomy to achieve efficiency. However, the
potential gain is the greatest: a search mechanism that is efficient, robust, and
preserves peer autonomy. Decoupling autonomy from robustness is also
important, because it allows greater flexibility in choosing the desired properties of
the mechanism. For example, a search mechanism that is robust, but has low
peer autonomy, can be desirable if the lack of autonomy leads to efficiency, and
peer autonomy is not a requirement of the system.
Several research projects have tackled the autonomy/robustness tradeoff. For
example, the Viceroy [14] network construction maintains a low level of peer
autonomy, but increases robustness and efficiency by reducing the cost of
maintaining the network structure to a constant term, for each join/leave of a peer.
In comparison, most distributed hash tables (DHTs) with the same
functionality have logarithmic maintenance cost. As another example, super-peer
redundancy [12] imposes slightly stricter rules on topology and data placement within
a cluster of peers, but this decrease in autonomy results in greater robustness of
the super-peer and improved efficiency in the overall network.
Another interesting area of research is providing fine-granularity tuning of
the tradeoff between autonomy and efficiency within a single system. A single
user may have varying needs; for example, a company may have a few sensitive
files that must remain on the intranet, but the remaining files can be stored
anywhere. A single system that can be tuned to support all of these needs is more
desirable than requiring users to use different systems for different purposes. A
good example of a tunable system is SkipNet [15]. SkipNet allows users to specify
a range of peers on which a document may be stored (e.g., all peers within the
stanford.edu domain). At one extreme, if the range is always limited to a single
peer, then user autonomy is high, but the system ceases to be P2P and loses good
properties such as load-balancing and self-organization. At the other extreme,
if the range always includes all peers in the network, SkipNet functions as a
traditional P2P lookup system with low autonomy, but other good properties.
While SkipNet does not push beyond existing tradeoffs, its value lies in allowing
users to choose the point along the tradeoff that meets their needs.
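The idea of a user-specified storage range can be sketched as a placement function that hashes a document over only the peers allowed to hold it. This is a hypothetical illustration of the tuning knob, not SkipNet's actual algorithm; the names eligible_peers and place, and the use of a domain suffix to express the range, are assumptions.

```python
import hashlib

def eligible_peers(peers, domain_suffix=None):
    """Peers allowed to store a document; None means any peer."""
    if domain_suffix is None:
        return sorted(peers)
    return sorted(
        p for p in peers
        if p == domain_suffix or p.endswith("." + domain_suffix)
    )

def place(document, peers, domain_suffix=None):
    """Pick a storage peer deterministically among the eligible ones."""
    candidates = eligible_peers(peers, domain_suffix)
    if not candidates:
        raise ValueError("no peer satisfies the placement constraint")
    digest = hashlib.sha256(document.encode("utf-8")).digest()
    return candidates[int.from_bytes(digest, "big") % len(candidates)]
```

A range naming a single peer gives full control over placement but no load-balancing, while omitting the range spreads documents across the whole network.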
2.4 Quality of Service

In the previous discussion, we implicitly assume a fixed level of service (e.g.,
number of results per query) that must be maintained as other factors (e.g.,
autonomy) are varied. However, quality of service (QoS) can be measured with
many different metrics, depending on the application, and a spectrum of
acceptable performance exists along each metric. Examples of service metrics include
number of results (e.g., in partial-search systems), response time, and relevance
(e.g., precision and recall in ranked keyword searches). A constant challenge in
designing P2P systems is achieving a desired level of QoS as efficiently as
possible. Because metrics and applications differ so widely, this challenge must often
be tackled on a per-case basis.
As an example, the number of results returned is an important QoS
metric for partial-search systems like Gnutella. However, in systems where there
is high autonomy (such as Gnutella), there is a clear and unavoidable tradeoff
between number of results and cost; hence, the interesting problem is to get as
close as possible to the lower bounds of the tradeoff. For example, the directed
BFS technique in [11] attempts to minimize cost by sending messages to
"productive" nodes (e.g., nodes with large collections). Concept-clustering networks
(e.g., [13]) cluster peers together according to "interest" (e.g., music genre), and
send queries to the cluster that best matches the queries' area of interest. These
techniques do improve the tradeoff between cost and number of results, but are
clearly not optimal: performance of directed BFS depends on the ad-hoc
topology and is therefore unpredictable, while concept-clustering only works well if
queries and interests fall cleanly into single categories. Can there exist a general
technique that can guarantee (with high probability) that the cost/QoS tradeoff
is optimal?
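The neighbor selection step behind directed BFS can be sketched as follows, using collection size as the measure of productivity since that is the example given above. The function names and the fanout parameter are assumptions; the technique in [11] may rely on other heuristics as well.

```python
def pick_productive(neighbors, collection_size, fanout):
    """Choose the `fanout` neighbors with the largest known collections."""
    ranked = sorted(neighbors, key=lambda n: (-collection_size.get(n, 0), n))
    return ranked[:fanout]

def first_hop_cost(neighbors, collection_size, fanout):
    """Messages sent on the first hop: (plain BFS, directed BFS)."""
    selected = pick_productive(neighbors, collection_size, fanout)
    return len(neighbors), len(selected)
```

The saving is immediate, but which results are found depends on which neighbors a peer happens to have, which is why the performance of this approach is hard to predict.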
With other metrics of QoS, there is not such an obvious tradeoff between
quality and cost. In these cases, the goal is to maintain the same level of service
while decreasing cost. For example, consider the "satisfaction" metric, which is
binary and is true when a threshold number of results is found. Satisfaction is an
important metric in partial-search systems where only the first k results are
displayed to the user (e.g., [16, 1]). Reference [11] shows that, compared to current
techniques, cost can be drastically reduced while maintaining satisfaction.
Furthermore, even better performance is probably possible if we discard this work's
requirement of peer autonomy and simplicity. Additional research is required to
explore this space further.
3 Security
Securing P2P data sharing applications is challenging due to their open and
autonomous nature. Compared to a client-server system in which servers can be
relied upon or trusted to always follow protocols, peers in a P2P system may
provide no such guarantee. The environment in which a peer must function is a
hostile one in which any peer is welcome to join the network; these peers cannot
necessarily be trusted to route queries or responses correctly, store documents
when asked to, or serve documents when requested. In this part of the paper, we
outline a number of security issues that are characteristic to P2P data sharing
systems, discuss a few examples of research that has taken place to address some
of these issues, and suggest a number of open research problems.
We organize the security requirements of P2P data sharing systems into
four general areas: availability, file authenticity, anonymity, and access control.
Today's P2P systems rarely address all of the necessary requirements in any
one of these areas, and developing systems that have the flexibility to support
requirements in all of these areas is expected to be a research challenge for quite
some time.
For each of these areas, it will be important to develop techniques that
prevent, detect, manage, and are able to recover from attacks. For example, since
it may be difficult to prevent a denial-of-service attack against a system's
availability, it will be important to develop techniques that are able to 1) detect
when a denial-of-service attack is taking place (as opposed to there just being a
high load), 2) manage an attack that is "in-progress" such that the system can
continue to provide some (possibly reduced) level of service to clients, and 3)
recover from the attack by disconnecting the malicious nodes.
3.1 Availability
There are a number of different node and resource availability requirements that
are important to P2P file sharing systems. In particular, each node in a P2P
system should be able to accept messages from other nodes, and communicate
with them to offer access to the resources that it contributes to the network.
A denial-of-service (DoS) attack attempts to make a node and its resources
unavailable by overloading it. The most obvious DoS attack is targeted at using
up all of a node's bandwidth. This type of attack is similar to traditional
network-layer DoS attacks (e.g., [31]). If a node's available bandwidth is used up
transferring useless messages that are directly or indirectly created by a
malicious node, all of the other resources that the node has to offer (including CPU
and storage) will also be unavailable to the P2P network.
A specific example of a DoS attack against node availability is a
chosen-victim attack in Gnutella that an adversary constructs as follows: a malicious
super-node maneuvers its way into a "central" position in the network and then
responds to every query that passes through it, claiming that the victim node has a
file that satisfies the query (even though it does not). Every node that receives
one of these responses then attempts to connect to the victim to obtain the
file that it was looking for, and the number of these requests overloads the
bandwidth of the victim such that any other node seeking a file that the victim
does have is unable to communicate with the victim.
The key aspect to note here is that in our example the attacker exploited a
vulnerability of the Gnutella P2P protocol (namely, that any node can respond
to any query claiming that any file could be anywhere). In the future, P2P
protocols need to be designed to make it hard for adversaries to construct DoS
attacks by taking advantage of loosely constrained protocol features.
Attackers that construct DoS attacks typically need to find and take
advantage of an "amplification mechanism" in the network to cause significantly more
damage than they could with only their own resources. In addition, if they would
like to have control over how their attack is carried out, they must also find or
create a back-door communication channel to communicate with "zombie" hosts
that they infiltrate using manual or automatic means. It is important to
design future P2P protocols such that they do not open up new opportunities for
attackers to use as amplifiers and back-door communication channels.
Some research has taken place to date to specifically address DoS attacks in
P2P networks. In particular, [38] addresses DoS attacks based on query-floods
in the Gnutella network. However, more research is necessary to understand the
effects of other types of DoS attacks in various P2P networks.
Aside from DoS attacks, node availability can also be attacked by malicious
users that infiltrate victim nodes and induce their failure. These types of attacks
can be modeled as fail-stop or byzantine failures, which could potentially be dealt
with using many techniques that have already been developed (e.g., [34]).
However, these techniques have typically not been popular due to their inefficiency,
unusually high message overhead, and complexity. In addition, these techniques
often assume complete and secure pairwise connectivity between nodes, which
is not the case in most P2P networks. Further research will be necessary to
make these or similar techniques acceptable from a performance and security
standpoint in a P2P context.
In addition, there are many proposals to provide significant levels of
fault-tolerance in the face of node failure, including CAN [3], Chord [2], Pastry [4],
and Viceroy [14]. Security analyses of these types of proposals can be found in
[43] and [36]. The IRIS [25] project seeks to continue the investigation of these
types of approaches.
A malicious node can also directly attack the availability of any of the
particular resources at a node. The CPU availability at a node can be attacked by
sending a modest number of complex queries to bog down the CPU of a node
without consuming all of its bandwidth. The available storage could be attacked
by malicious nodes who are allowed to submit bogus documents for storage. One
approach to deal with this is to allocate storage to nodes in a manner
proportional to the resources that a node contributes to the network as proposed in
[28].
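As a rough illustration of contribution-proportional allocation, the sketch below splits a fixed storage capacity among peers according to what each contributes. The details of the scheme in [28] are not given here, so the function names, the integer rounding, and the quota check are all assumptions.

```python
def storage_quotas(contributed, total_capacity):
    """Split `total_capacity` in proportion to each peer's contribution."""
    total = sum(contributed.values())
    if total == 0:
        return {peer: 0 for peer in contributed}
    return {
        peer: total_capacity * amount // total
        for peer, amount in contributed.items()
    }

def may_store(peer, size, used, quotas):
    """Accept a document only if the submitter stays within its quota."""
    return quotas.get(peer, 0) >= used.get(peer, 0) + size
```

Under this rule a node that contributes nothing receives no quota, so it cannot fill other nodes' disks with bogus documents.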
We might like to ensure that all files stored in the system are always available
regardless of which nodes in the network are currently online. File availability
ensures that files can be perpetually preserved, regardless of factors such as
the popularity of the files. Systems such as Gnutella and Freenet provide no
guarantees about the preservation of files, and unpopular files tend to disappear.
Even if files can be assured to physically exist and are accessible, a DoS attack
can still be made against the quality-of-service with which they are available.
In this type of DoS attack, a malicious node makes a file available, but when
a request to download the file is received, it serves the file so slowly that the
requester will most likely lose patience and cancel the download before it
completes. The malicious node could also claim that it is serving the file requested
but send some other file instead. As such, techniques such as hash trees [26]
could be used by the client to incrementally ensure that the server is sending
the correct data, and that data is sent at a reasonable rate.
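A hash tree lets a client check each block as it arrives against a single trusted root hash. The following is a minimal sketch assuming SHA-256, a non-empty list of blocks, and duplication of the last hash on odd-sized levels; these choices and all function names are illustrative and are not taken from [26]. The sketch covers correctness of the data only, not the rate at which it is sent.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def build_levels(blocks):
    """Return the tree as a list of levels, leaf hashes first, root last."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last hash on odd-sized levels
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof_for(levels, index):
    """Sibling hashes needed to recompute the root from leaf `index`."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], index > sibling))
        index //= 2
    return proof

def verify_block(block, proof, root):
    """Recompute the path to the root; a wrong block yields a wrong root."""
    digest = h(block)
    for sibling, sibling_is_left in proof:
        digest = h(sibling + digest) if sibling_is_left else h(digest + sibling)
    return digest == root
```

The client needs only the root in advance, for example from a trusted source, and can reject a substituted block immediately instead of after downloading the whole file.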
3.2 File Authenticity

File authenticity is a second key security requirement that remains largely
unaddressed in P2P systems. The question that a file authenticity mechanism answers
is: given a query and a set of documents that are responses to the query, which
of the responses are "authentic" responses to the query? For example, if a peer
issues a search for "Origin of Species" and receives three responses to the query,
which of these responses are "authentic"? One of the responses may be the
exact contents of the book authored by Charles Darwin. Another response may be
the content of the book by Charles Darwin with several key passages altered. A
third response might be a different document that advocates creationism as the
theory by which species originated.
Note that the problem of file authenticity is different from the problem of file
(or data) integrity. The goal of file integrity is to ensure that documents do not
get inadvertently corrupted due to communication failures. Solutions to the file
integrity problem usually involve adding some type of redundancy to messages in
the form of a "signature." After a file is sent from node A to node B, a signature
of the file is also sent. There are many fewer bits in the signature than in the file
itself, and every bit of the signature is dependent on every bit of the file. If the file
arrived at node B corrupted, the signature would not match. Techniques such
as CRCs (cyclic redundancy checks), hashing, MACs (message authentication
codes), or digital signatures (using symmetric or asymmetric encryption) are
well-understood solutions to the file integrity problem.
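The integrity checks listed above are available in the Python standard library, and a short sketch shows how node B could verify what it received from node A. The helper names are assumptions; SHA-256 and HMAC-SHA-256 are used only as representative choices of hash and MAC.

```python
import hashlib
import hmac
import zlib

def crc(data):
    """CRC-32 checksum: detects accidental corruption only."""
    return zlib.crc32(data)

def digest(data):
    """Cryptographic hash of the file contents."""
    return hashlib.sha256(data).hexdigest()

def mac(key, data):
    """Message authentication code under a key shared by A and B."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()

def arrived_intact(data, expected_digest):
    """Node B recomputes the signature and compares it with the one sent."""
    return hmac.compare_digest(digest(data), expected_digest)
```

All three produce a short value that depends on every bit of the file. None of them says whether the file is the right answer to a query, which is the separate authenticity problem.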
The problem of file authenticity, however, can be viewed as: given a query,
what is (or are) the "authentic" signature(s) for the document(s) that satisfy
the query? Once some file authenticity algorithm is used to determine what is
(or are) the authentic signatures, a peer can inspect responses to the query by
checking that each response has an authentic signature.
In our discussion until this point, we have not defined what it means for a
file to be authentic. There are a number of potential options: we will outline four
reasonable ones.
Oldest Document. The first definition of authenticity considers the oldest
document that was submitted with a particular set of metadata to be the authentic
copy of that document. For example, if Charles Darwin was the first author to
ever submit a document with the title "Origin of Species," then his document
would be considered to be an authentic match for a query looking for "Origin of
Species" as the title. Any documents that were submitted with the title "Origin
of Species" after Charles Darwin's submission would be considered unauthentic
matches to the query even if we decided to store these documents in the system.
Timestamping systems (e.g., [35]) can be helpful in constructing file authenticity
systems based on this approach.
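The oldest-document rule can be sketched as a registry that remembers, for each title, the signature of the earliest submission. The class and method names are hypothetical, and the timestamps are assumed to come from a trustworthy timestamping service, which is the hard part in practice.

```python
import hashlib

class OldestDocumentRegistry:
    def __init__(self):
        self._first = {}  # title -> (timestamp, signature of oldest copy)

    def submit(self, title, content, timestamp):
        """Record a submission; only the oldest one per title is kept."""
        signature = hashlib.sha256(content).hexdigest()
        current = self._first.get(title)
        if current is None or current[0] > timestamp:
            self._first[title] = (timestamp, signature)
        return signature

    def is_authentic(self, title, content):
        """A response is authentic if it matches the oldest submission."""
        record = self._first.get(title)
        if record is None:
            return False
        return record[1] == hashlib.sha256(content).hexdigest()
```

Later submissions under the same title may still be stored elsewhere in the system, but they never replace the recorded signature.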
Expert-Based. In this approach, a document would be deemed authentic by
an "expert" or authoritative node. For example, node G may be an expert that
keeps track of signatures for all files ever authored by any user of G. If a user
searching for documents authored by any of G's users is ever concerned about
the potential authenticity of a file received as a response to a query, node G
can be consulted. Of course, if node G is unavailable at any particular time
due to a transient or permanent failure, is infiltrated by an attacker, or is itself
malicious, it may be difficult to properly verify the authenticity of files that G's
users authored. Offline digital signature schemes (e.g., RSA) can be used to verify
file authenticity in the face of node failures, but are limited by the lifetime and
security of public/private keys.
Voting-Based. To deal with the possible failure of G or a compromised key
in our last approach, our third definition of authenticity takes into account the
"votes" of many experts. The expert nodes may be nodes that are run by
human experts qualified to study and assess the authenticity of particular types of
files, and the majority opinion of the human experts can be used to assess the
authenticity of a file. Alternatively, the expert nodes may simply be "regular"
nodes that store files, and will vote that a particular file is authentic if they store
a copy of it. In this scheme, users are expected to delete files that they do not
believe are authentic, and a file's authenticity is determined by the number of
copies of the file that are distributed throughout the system. The key technical
issues with this approach are how to prevent spoofing of votes, of nodes, and of
files.
Reputation-Based. Some experts might be more trustworthy than others (as
determined by past performance), and we might weight the votes of more
trustworthy experts more heavily. The weights in this approach are a simple
example of "reputations" that may be maintained by a reputation system. A
reputation system is responsible for maintaining, updating, and propagating such
weights and other reputation information [41]. Reputation systems may or may
not choose to use voting in making their assessments. There has been some study
of reputation systems in the context of P2P networks (e.g., [33, 24]), but no such
system has been commercially successful.
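Weighting votes by reputation can be illustrated with a short sketch. The function name `weighted_verdict`, the 0.5 threshold, and the rule that experts with no recorded reputation receive weight zero are all assumptions made for illustration; a real reputation system would also have to update and propagate these weights, which is not shown.

```python
from typing import Dict


def weighted_verdict(votes: Dict[str, bool],
                     reputation: Dict[str, float],
                     threshold: float = 0.5) -> bool:
    """Deem a file authentic if the reputation-weighted share of
    'authentic' votes exceeds threshold.

    Experts without a recorded reputation get weight 0.0, so unknown
    voters cannot sway the outcome in this sketch.
    """
    total = sum(reputation.get(expert, 0.0) for expert in votes)
    if total == 0:
        return False  # no trusted opinion is available
    in_favor = sum(reputation.get(expert, 0.0)
                   for expert, authentic in votes.items() if authentic)
    return in_favor / total > threshold


# One highly reputed expert outweighs two poorly reputed ones.
votes = {"expert-1": True, "expert-2": False, "expert-3": False}
reputation = {"expert-1": 0.9, "expert-2": 0.2, "expert-3": 0.1}
verdict = weighted_verdict(votes, reputation)
```

With these hypothetical weights the file is deemed authentic even though a simple majority votes against it, which shows how reputation changes the outcome relative to plain voting.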
3.3 Anonymity
There is much work that has taken place on anonymity in the context of the
Internet both at the network-layer (e.g., [30]) as well as at the application-layer
(e.g., [40]). In this section we specifically focus on application-layer anonymity
in P2P data sharing systems.

Type of Anonymity   Difficult for Adversary to Determine:
Author              Which users created which documents?
Server              Which nodes store a given document?
Reader              Which users access which documents?
Document            Which documents are stored at a given node?

Table 1. Types of Anonymity

While some would suggest that many users are interested in anonymity because
it allows them to illegally trade copyrighted data files in an untraceable
fashion, there are many legitimate reasons for supporting anonymity in a P2P
system. Anonymity can enable censorship resistance, freedom of speech without
the fear of persecution, and privacy protection. Malicious parties can be
prevented from deterring the creation, publication, and distribution of documents.
For example, such a system may allow an Iraqi nuclear scientist to publish a
document about the true state of Iraq's nuclear weapons program to the world
without the fear that Saddam Hussein's regime could trace the document back to
him or her. Users who access documents could also have their privacy protected
in such a system. An FBI agent could access a company's public information
resources (e.g., web pages, databases, etc.) anonymously so as not to arouse
suspicion that the company may be under investigation.
There are a number of different types of anonymity that can be provided in a
P2P system. Each type makes it difficult for an adversary to determine the
answer to a different question. Table 1 summarizes a few types of
anonymity discussed in [39].
We would ideally like to provide anonymity while maintaining other desirable
search and security features such as efficiency, decentralization, and peer
discovery. Unfortunately, providing various types of anonymity often conflicts
with these design goals for a P2P system.
To illustrate one of these conflicting goals, consider the natural trade-off
between server anonymity and efficient search. If we are to provide server anonymity,
it should be impossible to determine which nodes are responsible for storing a
document. On the other hand, if we would like to be able to efficiently search for
a document, we should be able to tell exactly which nodes are responsible for
storing a document. A P2P system such as Free Haven that strives to provide
server anonymity resorts to broadcast search, while others such as Freenet [27]
provide for efficient search but do not provide for server anonymity. Freenet does,
however, provide author anonymity. Nevertheless, supporting server anonymity
and efficient search concurrently remains an open issue.
There exists a middle ground: we might be able to provide some level of server
anonymity by assigning pseudonyms to each server, albeit at the cost of search
efficiency. If an adversary is able to determine the pseudonym for the server of
a controversial document, the adversary is still unable to map the pseudonym
to the publisher's true identity or location. The document can be accessed in
such a way as to preserve the server's anonymity by requiring that a reader (a
potential adversary) never directly communicate with a server. Instead, readers
only communicate with a server through a chain of intermediate proxy nodes that
forward requests from the reader to the server. The reader presents the server's
pseudonym to a proxy to request communication with the server (thereby hiding
a server's true identity), and never obtains a connection to the actual server for
a document (thereby hiding the server's location). Reader anonymity can also
be provided using a chain of intermediate proxies, as the server does not know
who the actual requester of a document is, and each proxy does not know if the
previous node in the chain is the actual reader or is just another proxy. Of course,
in both these cases, the anonymity is provided based on the assumption that
proxies do not collude. The degradation of anonymity protocols under attacks
has been studied in [44], and this study suggests that further work is necessary
in this area.
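The forwarding scheme described above can be sketched in a few lines. This is an illustration only, not the protocol of Free Haven, Crowds, or any other system: the classes `Server` and `Proxy`, the pseudonym string `srv-7f3a`, and the choice to let the last proxy hold the pseudonym-to-server mapping are all hypothetical simplifications, and no encryption or collusion resistance is modeled.

```python
from typing import Dict, List, Optional


class Server:
    """Stores documents and records which node each request came from."""

    def __init__(self, documents: Dict[str, str]) -> None:
        self.documents = documents
        self.seen_senders: List[str] = []  # everything the server learns

    def handle(self, sender: str, pseudonym: str, doc_id: str) -> Optional[str]:
        self.seen_senders.append(sender)
        return self.documents.get(doc_id)


class Proxy:
    """Forwards a request to the next hop; it knows only its neighbors."""

    def __init__(self, name: str, next_hop: Optional["Proxy"] = None,
                 directory: Optional[Dict[str, Server]] = None) -> None:
        self.name = name
        self.next_hop = next_hop          # None marks the last proxy
        self.directory = directory or {}  # pseudonym -> Server, last hop only

    def handle(self, sender: str, pseudonym: str, doc_id: str) -> Optional[str]:
        # The sender may be the reader or another proxy; the proxy cannot tell.
        if self.next_hop is not None:
            return self.next_hop.handle(self.name, pseudonym, doc_id)
        server = self.directory.get(pseudonym)
        if server is None:
            return None
        return server.handle(self.name, pseudonym, doc_id)


server = Server({"report": "contents"})
last = Proxy("p3", directory={"srv-7f3a": server})
chain = Proxy("p1", next_hop=Proxy("p2", next_hop=last))
result = chain.handle("reader", "srv-7f3a", "report")
```

After the request, `server.seen_senders` contains only `p3`, so the server never learns the reader's identity, and the reader holds only a pseudonym and a reference to `p1`. If `p1`, `p2`, and `p3` pooled what they know, both identities would be exposed, which is the collusion assumption noted above.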
Free Haven and Crowds [40] are examples of systems that use forwarding
proxies to provide various types of anonymity with varying strength. These
systems differ in how the level of anonymity degrades as more and more
potentially colluding malicious nodes take on the responsibilities of proxies. Other
techniques that are commonly found in systems that provide anonymity include
mix networks (e.g., [32]), and using cryptographic secret-sharing techniques to
split files into many shares (e.g., [42]).
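As a concrete instance of secret sharing, the following sketch implements a (k, n) threshold scheme in the style of [42]: a random polynomial of degree k-1 over a prime field hides the secret in its constant term, and any k shares recover it by interpolation. The function names, the choice of the prime 2^127 - 1, and the representation of the secret as a single integer are assumptions of this sketch; splitting a file would additionally require encoding it as a sequence of field elements, which is not shown.

```python
import secrets
from typing import List, Tuple

PRIME = 2**127 - 1  # a Mersenne prime; the field size is an assumption here


def split_secret(secret: int, k: int, n: int) -> List[Tuple[int, int]]:
    """Split secret into n shares such that any k of them recover it."""
    if not (0 <= secret < PRIME and 1 <= k <= n < PRIME):
        raise ValueError("invalid parameters")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule, highest degree first
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares: List[Tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 from k distinct shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # Modular inverse via Fermat's little theorem (PRIME is prime).
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret


shares = split_secret(123456789, k=3, n=5)
recovered = recover_secret(shares[:3])
```

Any three of the five shares reconstruct the original value, while fewer than three reveal nothing about it, so shares can be spread across nodes without any single node holding the content.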
3.4 Access Control

Intellectual property management and digital rights management issues can be
cast as access control problems. We want to restrict the accessibility of
documents to only those users that have paid for that access. P2P systems currently
cannot be trusted to successfully enforce copyright laws or carry out any form of
such digital rights management, especially since few assumptions can be made
about key management infrastructure. This has led to blatant violation of
copyright laws by users of P2P systems, and has also led to lawsuits against companies
that build P2P systems.
The trade-offs involved in enforcing access control in a P2P data sharing
system are challenging because if a system imposes restrictions over what types of
data it shares (i.e., only copy-protected content), then its utility will be limited.
On the other hand, if it imposes no such restrictions, then it can be used as a
platform to freely distribute any content to anyone that wants it [37].
Further effort must go into exploring whether it is reasonable to have
the P2P network enforce access control, or whether enforcement should take place
at the endpoints of the network. In either case, only users that own (or have
paid for) the right to download and access certain files should be able to do so
to legally support data sharing applications.
If the benefits of P2P systems are to be realized, we need to explore the
feasibility of and the technical approaches to instrumenting them with appropriate
mechanisms to allow for the management of intellectual property.
Conclusion
Many of the open problems in P2P data sharing systems surround search and
security issues. The key research problem in providing a search mechanism is
how to provide for maximum expressiveness, comprehensiveness, and autonomy
with the best possible efficiency, quality-of-service, and robustness. The key to
securing a P2P network lies in designing mechanisms that ensure availability, file
authenticity, anonymity, and access control. In this paper, we have illustrated
some of the trade-offs at the heart of search and security problems in P2P data
sharing systems, and outlined several major areas of importance for future work.
References
1. Gnutella website. http://www.gnutella.com
2. Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., Balakrishnan, H.: Chord: A scalable peer-to-peer lookup service for internet applications. Proc. ACM SIGCOMM (2001)
3. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A scalable content addressable network. Proc. ACM SIGCOMM (2001)
4. Rowstron, A., Druschel, P.: Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. Proc. of the 18th IFIP/ACM Intl. Conf. on Distributed Systems Platforms (2001)
5. Ratnasamy, S., Shenker, S., Stoica, I.: Routing Algorithms for DHTs: Some Open Questions. Proc. IPTPS (2002)
6. Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S.: Search and replication in unstructured peer-to-peer networks. Proc. of Intl. Conf. on Supercomputing (2002)
7. Cuenca-Acuna, F. M., Peery, C., Martin, R. P., Nguyen, T. D.: PlanetP: using gossiping to build content addressable peer-to-peer information sharing communities. Technical Report DCS-TR-487, Dept. of Computer Science, Rutgers Univ. (2002)
8. Bawa, M., Garcia-Molina, H., Gionis, A., Motwani, R.: Estimating the size of a peer-to-peer network (2002)
9. Harren, M., Hellerstein, M., Huebsch, R., Loo, B., Shenker, S., Stoica, I.: Complex Queries in DHT-based Peer-to-Peer Networks. Proc. IPTPS (2002)
10. Crespo, A., Garcia-Molina, H.: Routing indices for peer-to-peer systems. Proc. 28th Intl. Conf. on Distributed Computing Systems (2002)
11. Yang, B., Garcia-Molina, H.: Improving search in peer-to-peer systems. Proc. 28th Intl. Conf. on Distributed Computing Systems (2002)
12. Yang, B., Garcia-Molina, H.: Designing a super-peer network. Proc. ICDE (2003)
13. Schlosser, M., Sintek, M., Decker, S., Nejdl, W.: A scalable and ontology-based P2P infrastructure for semantic web services (2002)
14. Malkhi, D., Naor, M., Ratajczak, D.: Viceroy: a scalable and dynamic emulation of the butterfly. Proc. PODC (2002)
15. Harvey, N., Jones, M., Saroiu, S., Theimer, M., Wolman, A.: SkipNet: a scalable overlay network with practical locality properties (2002)
16. Napster website. http://www.napster.com
17. Morpheus website. http://www.musiccity.com
18. W3C website on Web Services. http://www.w3.org/2002/ws
19. SetiHome website. http://setiathome.ssl.berkeley.edu
20. DataSynapse website. http://www.datasynapse.com
21. Groove Networks website. http://www.groove.net
22. Stoica, I., Adkins, D., Zhuang, S., Shenker, S., Surana, S.: Internet Indirection Infrastructure. Proc. SIGCOMM (2002)
23. Stanford Peers group website. http://www-db.stanford.edu/peers
24. Reputation research network home page. http://databases.si.umich.edu/reputations/
25. Iris: Infrastructure for resilient internet systems. http://iris.lcs.mit.edu/
26. Personal communication with Dan Boneh.
27. Clarke, I., Sandberg, O., Wiley, B., Hong, T.W.: Freenet: A distributed anonymous information storage and retrieval system. Workshop on Design Issues in Anonymity and Unobservability, pages 46-66 (2000)
28. Cooper, B., Garcia-Molina, H.: Peer to peer data trading to preserve information. ACM Transactions on Information Systems (2002)
29. Dingledine, R., Freedman, M.J., Molnar, D.: The free haven project: Distributed anonymous storage service. Workshop on Design Issues in Anonymity and Unobservability, pages 67-95 (2000)
30. Freedman, M.J., Morris, R.: Tarzan: A peer-to-peer anonymizing network layer. Proc. 9th ACM Conference on Computer and Communications Security, Washington, D.C. (2002)
31. Garber, L.: Denial-of-service attacks rip the internet. Computer, pages 12-17 (April 2000)
32. Hill, R., Hwang, A., Molnar, D.: Approaches to mixnets.
33. Lethin, R.: Chapter 17: Reputation. Peer-to-Peer: Harnessing the Power of Disruptive Technologies, ed. Andy Oram, O'Reilly and Associates (2001)
34. Lynch, N.A.: Distributed algorithms. Morgan Kaufmann (1996)
35. Maniatis, P., Baker, M.: Secure History Preservation Through Timeline Entanglement. Proc. 11th USENIX Security Symposium, SF, CA, USA (2002)
36. Ganesh, A., Rowstron, A., Castro, M., Druschel, P., Wallach, D.: Security for structured peer-to-peer overlay networks. Proc. 5th OSDI, Boston, MA (2002)
37. Peinado, M., Biddle, P., England, P., Willman, B.: The darknet and the future of content distribution. http://crypto.stanford.edu/DRM2002/darknet5.doc
38. Daswani, N., Garcia-Molina, H.: Query-flood DoS Attacks in Gnutella. Proc. Ninth ACM Conference on Computer and Communications Security, Washington, DC (2002)
39. Molnar, D., Dingledine, R., Freedman, M.: Chapter 12: Free haven. Peer-to-Peer: Harnessing the Power of Disruptive Technologies, ed. Andy Oram, O'Reilly and Associates (2001)
40. Reiter, M.K., Rubin, A.D.: Crowds: anonymity for Web transactions. ACM Transactions on Information and System Security, 1(1):66-92 (1998)
41. Resnick, P., Zeckhauser, R., Friedman, E., Kuwabara, K.: Reputation systems. Communications of the ACM, pages 45-48 (2000)
42. Shamir, A.: How to share a secret. Communications of the ACM, 22:612-613 (1979)
43. Sit, E., Morris, R.: Security considerations for peer-to-peer distributed hash tables. IPTPS '02, http://www.cs.rice.edu/Conferences/IPTPS02/173.pdf (2002)
44. Wright, M., Adler, M., Levine, B., Shields, C.: An analysis of the degradation of anonymous protocols. Technical Report, Univ. of Massachusetts, Amherst (2001)
