
Building a Hierarchical Content Distribution Network with Unreliable Nodes
Jared Friedman and Erik Garrison
Harvard University

Abstract

Running a popular website is expensive, due to hardware and bandwidth costs. For some websites, users might want to donate their idle computing resources to help host the website, reducing the site’s hosting costs. Unfortunately, there is currently no easy way for a website to spread its hosting burden to ordinary PCs, which are geographically dispersed, unreliable, possibly malicious, and may have low system resources and be behind NAT. We present a design and simulation for Jellyfish, a distributed web caching system which is intended to run across ordinary unreliable PCs while maintaining very high performance. We propose solutions for the problems of security, diverse performance characteristics, NAT, and hotspots. We also take a detailed look at the object placement problem, propose a new algorithm for object placement, and show through simulation that it has better performance than the commonly used CARP algorithm.

1 Introduction
Running a popular website is an expensive proposition. Wikipedia, globally the 17th most popular website,¹ requires 200 dedicated servers to keep the site online. The hardware and bandwidth costs associated with a high profile site like Wikipedia run into the hundreds of thousands of dollars. For websites run by small companies, not-for-profit organizations, or individuals, these costs may prove quite burdensome, and it is only by running constant fund-raising drives that they can maintain an acceptable quality of service to the community which they serve. Computer resource sharing projects like SETI@home [4], Folding@home [2], BOINC [1], and Google Compute [3] have demonstrated that many people are willing to donate their idle computer resources to projects with common benefit.

In light of the success of these projects, it is probable that many people would be willing to donate their spare computing resources to help keep popular not-for-profit websites online. However, currently no suitable content distribution network (CDN) exists to enable such activity.

1.1 Supporting websites cooperatively
There are a variety of methods by which the hosting of a website can be shared amongst the resources of a number of individuals:

¹ Alexa ranking, as of May 2006

1.1.1 Mirroring
Volunteers often run mirror sites to help overloaded primary sites host data. Unfortunately, mirror sites generally need to be dedicated server-class machines with static IP addresses, which greatly restricts the pool of volunteers who can contribute to website hosting in this way. Given that typical access to web objects follows a Zipf-like distribution [6], blindly mirroring content without respect to access frequency can be quite wasteful relative to caching schemes which selectively cache the most popular content [5, 13].

Such caching systems require a degree of coordination greater than that typically associated with mirroring arrangements; at the same time they effectively lower the barrier to entry in a cooperative caching system. Even a donated system capable of caching a handful of the most popular objects can aid the cause of the cooperative system it is allied to. We believe that problems associated with security and reliability have prevented such systems from developing.

1.1.2 Commercial CDNs
Commercial Content Distribution Networks like Akamai and Digital Island have shown that it is possible to distribute the hosting of a high profile website around the world, with widely dispersed servers each caching small pieces of the website, and that doing so can greatly improve performance and absorb flash crowds. These services are defined by their closed architectures and high cost (some estimates have placed costs-per-byte as high as 1.75 times that of non-CDN bandwidth [20]).

1.1.3 Cooperative caching
A wide variety of cooperative caching systems have been proposed; some are in use today. Two principal classes of cooperative storage systems exist: those designed primarily for data storage, such as Freenet [10] (a distributed storage system aimed at anonymity) and Riptide/Oceanstore [11] (a similar system optimized for reliable and highly-available data storage), and those designed for temporarily caching data, such as CoralCDN [12], a peer-to-peer system designed to distribute the costs of hosting a website across a network of peer-based caches.

1.1.4 CoralCDN
CoralCDN is a relatively new project that comes closest to the goals we have in mind. Coral is a fully distributed peer-to-peer content distribution network. Coral is intended to allow users
with almost no hosting capacity to host a very high profile website. One of the benefits of Coral is that it is very easy to use. Because Coral uses a DNS-based system for handling queries directed at the CDN, one only needs to append to the domain of any URL and the result will be delivered from the Coral network. This makes Coral ideal for handling the so-called “Slashdot effect”, in which very small web hosting systems can be completely overloaded when content they distribute is referenced by popular websites. Coral is currently hosted on PlanetLab and available for use, but not open to donated computer resources due to security concerns.

While Coral stands as a major contribution to volunteer-based CDN technology, we believe that due to its current design it may be inapplicable to the problem space in which we are interested. Generally, Coral is ideal for websites with virtually no hosting capacity that could be quickly overloaded through a Slashdot effect. Coral is less suited to websites like Wikipedia which already have substantial hosting capacity and have very high expectations for their users’ experience. For such websites, we believe that Coral has the following problems:

• Security: Coral currently has no security system. We are most concerned with an internal attack on the system, whereby volunteer nodes replace requested content with content of their own choosing. The authors of Coral recognize this attack as a problem, but do not solve it. If Coral is to remain completely distributed in its operation, doing so might require the implementation of a complex “web of trust” on top of the existing system.

• Openness: That any website can use Coral to dramatically reduce its hosting costs is both a curse and a blessing. On the plus side, it means that users can “register” a website for Coral without that website even knowing of Coral’s existence by surfing to the “Coralized” URL corresponding to the site.² On the other hand, this openness allows the network to be used for whatever purpose its decentralized userbase desires.

In short, the entirely decentralized nature of Coral provides no guarantees to users that their donated resources will be spent in the manner which they support. Given that the content of a mature CoralCDN will match that of the Internet, donated resources are just as likely to aid the publication of low-budget, commercial pornography sites as they are free, publicly editable encyclopedias. This possibility limits the draw which the application has to users, in turn throttling the effectiveness of the entire network.

• Performance: In their description of Coral, Freedman et al. describe median client latencies of 0.19 seconds, ostensibly a value low enough to make Coral quite usable for web content delivery. In our experience, however, the actual latencies for Coralized webpages were much higher, usually at least several seconds. Many factors could cause this disparity, such as overloading of the Coral network (likely, since it is only running on PlanetLab), or the fact that the test mentioned in the paper requests a single file, whereas a typical website consists of many, and thus requires the initiation of a number of separate requests. We feel that the overlay structure of Coral, which can cause requests to travel through many volunteer nodes before ultimately terminating in an origin server, will likely lead to higher latencies than Jellyfish’s hierarchical approach, which bounds the maximum number of hops for a cache hit at 2.

• Optimization: As a distributed caching system grows, the content creators will likely want some guarantee that it is making efficient use of the possible hosting capacity of its volunteered machines. As we discuss later in the paper, the goal of efficient use of resources leads to mathematically difficult object placement and request routing problems. There is also the issue of the communications overhead of keep-alives and other maintenance messages involved in implementing a particular solution, which further complicates analysis. We have found it difficult to compare the optimality of Coral with that of Jellyfish, but given our work with the request routing problem, we suspect that it may be easier to build a smart heuristic algorithm when knowledge of nodes is more centralized.

1.2 Jellyfish
In this paper, we present a design for Jellyfish,³ a cooperative object caching system for the web. We implement Jellyfish in simulation and discuss the results of our simulated tests. While we do not present a working prototype of Jellyfish, we have taken steps towards an implementation, and additionally propose a plan for adapting the Squid caching engine as the base of Jellyfish. By way of an analysis of literature and a simulation of a set of caching protocols in a similar (but not identical) problem space, we relate some heuristic methods which might be used in a deployed system. We present some methods for securing a distributed caching system and a different and possibly more efficient overall system design.

2 Related work

Generally speaking, Jellyfish aims to be a peer-based, decentralized caching system for web content. This problem space is relatively unique in that virtually all previous research in distributed caching of which the authors are aware assumed that the nodes in a given caching system were reliable in terms of their performance, trusted to not carry out malicious behavior, and (generally) homogeneous in their performance characteristics.

Research in peer-to-peer systems has tended to relax these assumptions most readily. However, related work in peer-to-peer systems often imposed arbitrarily high restrictions on the elements of systems which could remain centralized, thus producing systems which are most applicable to situations in which systems must truly be distributed, but less applicable to cases in which some degree of relative centralization exists and can be used to improve system performance.

² This is happening already as some links on Slashdot are now posted in Coralized form.
³ So named in keeping with the naming tradition of caching systems: Coral, CARP, Squid, and so forth.
In most cases this decentralization implies the existence of a shared routing or object placement protocol that provides the system with important characteristics like scalability and robustness. These structures (most notably distributed hash tables, or DHTs) provide deterministic, non-hierarchical routing for object lookup and retrieval. For DHTs like Tapestry [14] and Chord [22], this routing is in O(log n), where n is the population of nodes in the overlay. Such structures thus excel in the case of a system which cannot have an explicit central authority, or where that authority has extremely limited resources, and have thus been deployed in support of applications like BitTorrent (where they help to spread responsibility for the legal ramifications of copyright-infringing file transfer). In such cases, the RTT engendered by the overlay protocol matters little relative to the reliability and decentralized qualities of the system.

By contrast, the needs of a system used for the distribution of web content require stronger guarantees of responsiveness than such overlay networks have yet demonstrated. Studies have shown that users treat the web as a kind of interactive computing environment, and tolerate wait times of little more than 8 seconds before giving up their attempt to access content [18]. Some of the problems associated with the maintenance of DHT-based peer-to-peer systems, such as overlay partitions resulting from periodic, distributed routing anomalies [19], can cause unanticipated and undesirable system behavior which could cripple interactive systems.

Although they may be capable of adjusting to such issues, purely distributed overlays which exhibit no centralization lack methods to guarantee the real-time robustness required to maintain an acceptable quality of service. In contrast, a hierarchical overlay structure provides routing in the order of a constant proportional to the depth of the hierarchical overlay. Additionally, when a system is geared for deployment by a single institution or entity (or alternatively a “friend-to-friend” based system established among such entities), it is sensible that the system utilize the centralized properties of the human networks which it serves to the benefit of its users.

An overview of prior work in hierarchical caching systems will clarify the design principles we utilize in our system. Of principal concern are the trade-offs between various object placement and retrieval algorithms, and a variety of security systems which might be used in a network of untrusted peers.

2.1 Distributed object placement algorithms
In the most general sense, web caching is a subset of a class of computationally difficult problems known as object placement problems. Generally, an answer to an object placement problem describes a method of distributing resources in an efficient manner across a graph topology. In some limited cases in which placement is constrained by a small number of factors, the object placement problem is soluble [23]. However, we are unaware of any wide-area use of precise object placement algorithms. We suspect that network operators are unlikely to employ these algorithms given the high cost of updating placement schemas after alterations to the network are made. While benefits can be derived from a perfectly-tuned object placement algorithm, in practical situations a distributed, adaptive solution can be engineered to distribute data with little detriment to quality of service. Additionally, we believe that the additional constraints of unreliable and highly heterogeneous nodes imply the need for an adaptive solution to the problem.

Of these adaptive solutions to the object placement problem, the most commonly used is the so-called greedy, or “least recently used” (LRU) caching algorithm. The algorithm is conceptually simple and computationally efficient, and has been employed by a wide variety of content distribution systems in ISPs, commercial CDNs, and caching systems at the site of content creation and at the site of consumption (e.g. in web browsers). In the LRU algorithm, the cache maintains a queue of all objects it currently has in store. On receipt of a request, it returns the object, stores it (if it has not already) and places a reference to the object on the top of its queue. When the cache is full, it simply replaces the least-recently-used objects (i.e. those on the bottom of the queue) with the new objects.

Other adaptive solutions seek to optimize the method by which content is pushed to distributed caches. While uncoordinated pull-based caching is an effective method to reduce bandwidth consumption, uncoordinated local caches provide an unnecessarily high degree of redundancy in a networked environment, and can be supplemented with proxy caches placed at the junctions between stub-networks and the transit links which connect them. These caches can store the most popular objects accessed by the users of the stub, thus saving all users storage. These systems have been proposed since the early history of the web [7, 21], and have in practice been used by ISPs to reduce bandwidth costs and improve perceived latencies.

In an early paper on the topic, Gwertzman and Seltzer noted that push-caching could be more efficient than proxy-based, request-driven caching in a distributed cache hierarchy when the full network topology was known [13]. Otherwise, request-driven caching provided a far greater reduction in network usage (around 65%), and a hybrid method only improved performance by an additional 2%. More recent work has demonstrated that adaptive clustering solutions can be found which achieve greater gains relative to un-cached resource consumption [9].

The problems of organizational coordination associated with establishing a global push-caching scheme would probably negate the slight gain in efficiency which it would provide over the highly optimal pull-based caching systems currently deployed. However, in a relatively closed content distribution environment, where changes to content can be tracked and information about participating caches is available, a hybrid system could easily be implemented to reduce stress on central resources. Thus we have considered pushing in the case of Jellyfish.

2.2 Cache coordination systems and routing protocols
Caches which are tightly coupled in a clustered environment can divide their labor in a variety of ways. One of the oldest systems for coordinating caches is the Internet Cache Protocol (ICP). On receipt of a request which it cannot fulfill (a cache miss), a cache in an ICP-based caching cluster iteratively sends messages to every other cache in the group. If one of these caches can fulfill the request, it forwards the object to the first cache (a near miss). If not, the sibling cache forwards the request to its destination. Because the number of messages required for the coordination
of an ICP-based cache cluster grows in O(n^2), the system is not recommended for deployment in clusters larger than four or five, and thus serves merely as a counterexample to our approach.

Other intercommunication protocols are implicit in the object placement system employed by the cache cluster. One widely used system is the Cache Array Routing Protocol (CARP), which simply splits the hash space of the URI namespace into ranges configured to be commensurate to the capacity of the individual servers in the cluster. When a request for an object is received by the cluster, the result of the hash of the URI of the object is used to direct the request to one of the coordinated caches, which obtains the object and caches it. In this way, the namespace is split into a set of distinct buckets, and coordination occurs not through gossip between the servers, but through a preconfigured algorithm. While CARP and similar algorithms can be used to evenly spread load across a large cluster of caches, the hash functions require updating when servers in the array fail [15]. As we show in our simulation, this approach has serious problems adapting to high rates of churn which might be found in a distributed cluster of volunteer nodes. Furthermore, heuristics based on known system factors (such as memory, bandwidth, and storage capacity) which can be used to obtain hash range values for caches in a closed environment prove problematic when system capacities must be inferred from records of their behavior.

Other approaches have attempted, with some success, to solve the problem of request routing more adaptively. Kawai et al. approach the problem of fault tolerance by allowing hash ranges to be duplicated between nodes in the system, and optimize their caching algorithm to allow caches to store certain objects requested by their local clients [16]. Kaiser et al. construct an adaptive distributed algorithm for object placement that has the same goals as CARP, but relies on the convergence of routing tables among a set of caches [15]. The authors demonstrate that, over time, it outperforms a CARP-like protocol in terms of the number of hops required for request resolution and the average cache hit ratio.

3 Design

Jellyfish is designed to use volunteered computer resources to greatly amplify the hosting capacity of a website’s primary servers while maintaining extremely high performance, security, and an excellent overall user experience. This high-level overview describes the utilization of the system in terms of content providers, volunteers, and users:

Content providers: To use the software, content providers (e.g., an organization or individual hosting some website) supply at least one trusted and reliable machine to run the Jellyfish server-side software. They will need to set up this server-side software and reconfigure their nameservers to make use of Jellyfish, but do not need to change their actual code. As a trial, they can create a separate Jellyfished subdomain (e.g., and make changes to their primary domain only when the setup has been tested.

Volunteers: Anyone with a broadband connection and an interest in helping downloads the Jellyfish client, thus becoming a volunteer. A volunteer selects which websites he or she would like to help from a list of centrally approved choices. When the volunteer’s computer is idle, it will connect to the main Jellyfish servers and register itself as an available node. Requests will immediately begin to be directed to the volunteer’s computer.

Users: Users, or browsers of a website, should see no apparent difference. In contrast to Coral, there is no way for users to opt in or out of the caching system; it is under the control of the content provider. However, content providers may elect to use Jellyfish to make an independently accessible cached site, affording users the ability to choose the distribution system they use.

3.1 Request Routing
Jellyfish is based on a node/supernode structure. We chose this structure over the aforementioned DHT-based overlay structures because we felt that for low latency applications open to arbitrary computers this would ultimately give the best performance. A key difficulty with distributed caching systems is that a very large portion of personal computers are behind restrictive NATs and firewalls and thus require the assistance of unrestricted peers to initiate communication with normal users. Another difficulty is the immense range of hardware and in particular bandwidth capacities observed in real life networks. Successful low latency applications like Skype have chosen a node/supernode structure to exploit this immense spread of capacities instead of fighting it.

In Jellyfish, volunteer nodes which are not behind NATs or firewalls and have reasonably high capacities and uptimes are promoted to being supernodes. All other volunteers are ordinary nodes.

Each ordinary node is assigned to a nearby supernode using existing network location techniques. In our current Jellyfish design, there is a sharp delineation of roles: supernodes act as DNS servers, whereas ordinary nodes act only as HTTP servers. However, there is no reason that a computer acting as a supernode cannot simultaneously act as an ordinary node, running both the DNS server and the HTTP server, or alternatively, change roles over time. For the purposes of this discussion it is probably helpful to think of them as distinct machines.

Much like both Akamai and Coral, request routing in Jellyfish is done using DNS. Generally, the entire cached website will be assigned its own subdomain, generally ’cached’. Each conceptual webpage, which includes the main HTML file and associated image, JavaScript and CSS files, is assigned a unique subdomain which is a simple translation of its URL. For example, the current Page would be rewritten as wiki_Main_Page.cached., and its associated images would follow the same subdomain. These links can be rewritten permanently or using on-the-fly URL rewriting like Akamai.

In this example, the URL is resolved by primary DNS servers owned by Wikipedia to point to a local supernode. The supernode chosen is based on the IP address of the request and the round trip times to various geographically close supernodes, in addition to the current loads on the various supernodes (avoiding overloaded servers is the first priority, ensuring good geographic location is sec-
ond). When the user decides to go to the Wikipedia main page, it sends a request to resolve wiki_Main_Page.cached.en. to its assigned supernode. The supernode resolves this address to a node that it wants to serve the page request. The algorithm the supernodes use for doing this is the subject of the next section. The supernode will also, if necessary, coordinate the necessary firewall/NAT traversal to establish bidirectional communication between the user and the ordinary node at this time. The user then requests the page from the assigned node, which acts as an ordinary HTTP server.

If the node that the supernode resolves the DNS request to has the document, it simply responds with it. If it does not, then a cache miss has occurred. In this case, the node will need to retrieve the documents from the central server before it can forward them to the client. Clearly, if no node that is assigned to a supernode has the requested document, the supernode will not be able to forward to a child node without creating a cache miss. On the one hand, if this happens infrequently enough, it might be fine. On the other hand, if a nearby node that is assigned to some other supernode is caching the data, then it seems like it might make sense to try to get it from that node, avoiding the additional request to the central servers. CoralCDN takes this idea to an extreme, compared to Jellyfish, by guaranteeing that cached documents will always be found. But if the document is cached on the other side of the world, it may be very time consuming to retrieve it, and due to the somewhat different design goals of Jellyfish, we would prefer such documents to simply be cache misses. Still, a compromise between the two seems possible.

The compromise we use makes use of cache digests, which are a Bloom filter-based method of efficiently reporting the content of a cache. Each supernode is aware of a few other nearby supernodes. Every few minutes, it runs a Bloom filter on its state representation of the cache content of its sub-nodes to obtain a cache digest, and sends copies of the results to its neighbor supernodes. When a supernode receives a request that none of its own children can handle, it checks the cache digests of its neighbors to see if they might be storing the file. If one is, it forwards the client’s DNS request to the supernode whose child node stores the file. The client will therefore have its request served by a node belonging to a different supernode. We should note that using a simple Bloom filter can result in false positives, but that these can be limited by changing the parameters of the Bloom filter, or ignored, as they merely result in a cache miss at the falsely identified node and thus cause no greater damage than that which would have occurred if they were not employed for inter-cluster communication. New methods, such as the “bloomier” filter, can alternatively be used to eliminate the possibility of false positives, but they come at greater computational cost [8].

To prevent the problem from recurring, the supernode will send a message to one of its own nodes telling it to cache the document by retrieving it from the node to which the near miss was directed. This provision ensures that the central server pushes a minimum number of document replicas to the Jellyfish array, utilizing volunteer resources for the replication of data whenever possible. By employing cache digests we are able to share state between cache clusters without employing costly gossip-based notification systems.

3.2 Object Placement
When a supernode gets a request for a particular subdomain, it must decide which of its ordinary nodes (or a neighbor supernode’s nodes) to forward the request to. More generally, the supernode needs to decide what nodes should store what objects. These are variants of the ’object placement problem’ and the ’request routing problem’, which are, unfortunately, difficult problems. From a qualitative point of view, the supernode has at least the following considerations.

1. Ordinary nodes must be load balanced - no nodes can be overloaded.

2. However, ordinary nodes may have vastly different capacities, and these capacities may change suddenly.

3. Popular files cannot generally be served by a single node - they will need to be replicated to share the load.

4. However, too much replication is bad too. Disk capacities are finite, and it is important for caches to store different files in order to maximize hit ratios.

Object placement is a difficult problem in general. Given the difficulty of finding closed form solutions for even much simpler cases, we feel it is very unlikely that a closed form optimal solution exists for our case. Instead, we compared three heuristic-based algorithms: a very simple one for comparison purposes only, the commonly used CARP algorithm, and a new heuristic algorithm designed specifically for Jellyfish.

3.2.1 Load balanced CARP
This algorithm is a variant of the well-known cache array routing protocol algorithm. CARP is a very simple system, but it is used quite frequently in practice. Each node is assigned a weight between 0 and 1, and the weight vector is normalized to sum to 1. Given these values, each node is assigned a unique interval on [0,1] with size equal to its weight. CARP uses a hash function which hashes document URLs to values that fall uniformly between 0 and 1. To route a document, the hash value of the document URL is computed, and the document is routed to the node whose assigned interval contains the hash value.

Ordinarily, CARP is used in situations where the caches involved are quite static. In this case, the weights for the caches are pre-set to reasonable values and remain constant. For Jellyfish, however, the weights must clearly be set dynamically to reflect changing conditions on the volunteer nodes. To determine the weights, we first measure the response time of the peer to a simple TCP request which is handled by the Jellyfish software. We assume that the response time to this request is highly indicative of the total load the node is experiencing. This assumption will hold if the node is bandwidth or CPU limited, as we expect it to be. If it were somehow limited by disk I/O or memory access, this might not hold. If this appears to be a problem, a more complex timing test could be run periodically. We use the following algorithms for calculating response times and weights.
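Before turning to the weight calculation, the interval routing step of load balanced CARP described in Section 3.2.1 can be illustrated with a short Python sketch. It normalizes a set of node weights, gives each node a contiguous interval of the unit range, and routes a URL to the node whose interval contains the URL's hash. The function names, the choice of SHA-256 as the hash, and the example weights are assumptions made for this illustration only; they are not taken from the Jellyfish simulator or from any CARP implementation.

```python
from __future__ import annotations

import hashlib
from bisect import bisect_right
from itertools import accumulate


def url_hash_unit(url: str) -> float:
    """Hash a URL to a float in [0, 1). SHA-256 is an assumption of this sketch."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Keep 53 bits so the quotient is exactly representable and stays below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def build_intervals(weights: dict[str, float]) -> tuple[list[str], list[float]]:
    """Normalize weights to sum to 1 and return (nodes, interval upper bounds)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one node needs a positive weight")
    nodes = sorted(weights)  # fixed order so the intervals are reproducible
    bounds = list(accumulate(weights[n] / total for n in nodes))
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound
    return nodes, bounds


def route(url: str, nodes: list[str], bounds: list[float]) -> str:
    """Return the node whose interval [lower, upper) contains the URL's hash."""
    return nodes[bisect_right(bounds, url_hash_unit(url))]


# Hypothetical weights: node "c" is given twice the share of "a" or "b".
nodes, bounds = build_intervals({"a": 1.0, "b": 1.0, "c": 2.0})
print(route("/wiki/Main_Page", nodes, bounds))
```

Because the interval endpoints are derived from the current weights, recomputing the weights shifts the endpoints, and URLs whose hashes lie near a boundary can be routed to a different node afterwards.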
3.2.2 Calculating response times

Response times are measured for each node every time interval $t_1$. However, the stored response times are not simply the last observed response times. Instead, we recognize that there is some random variation in response times, and we want to get a value which reflects the ’typical’ response time observed recently for each node. A simple way to do this is to use a weighted average of the most recent time with the previous stored average time. For node $i$, let the most recent response time be $r_i$. Let the stored typical response time be $\bar{r}_i$. To update the typical response times, we do

$$\bar{r}_i' = \frac{w \, \bar{r}_i + r_i}{w + 1}$$

for some constant weight $w$. Generally, $w \approx 4$ seems to give reasonable results.

3.2.3 Calculating weights

Every time interval $t_2 \ge t_1$, the weights for the nodes are recomputed. The weight calculation is based on a pre-specified maximum threshold $\mathit{max}_t$ for the response time of a node. This number derives ultimately from the time we find acceptable for a user to wait for a webpage. Intuitively, if a node is going slower than this, we want to reduce the weight on that node. A real life value for $\mathit{max}_t$ might be something like 200ms. In our simulations, we generally use $t_1 = t_2 \approx 30$ms. The algorithm to recompute the weights is as follows.

For each node,

1. If the node’s response time $r_i > \mathit{max}_t$, its weight is reduced by a percentage $p_1$.

2. Otherwise
   (a) If its weight is less than the average weight of all the nodes, its weight is increased by a percentage $p_2$, where generally, $p_2 < p_1$.

3. Finally, all the weights are renormalized to sum to 1.

The motivation for this algorithm is the following. In step 1, we want to make sure that nodes which are currently overloaded or are going slow for some other reason have their weights reduced quickly. In step 2, we want to be constantly trying to increase weights of non-overloaded nodes. It is possible that these nodes have excess capacity, and the only way to find out is to increase their weights until we see signs of a capacity shortage.

The restriction we place on step 2 to not increase weights which are already above average is not important if the demand is very close to capacity. However, if there is a great deal of excess capacity, then there are many possible weights which could be used that would still cause no overloading. In this circumstance, we prefer the most equal weight distribution possible without overloading, because in the event of node failure, this will have the least impact on the overall system. The restriction in step 2 guarantees that if demand is well below total capacity, the weight distribution will be exactly equal. Nodes will only get a higher than average share if other nodes are being overloaded.

3.2.4 Weighted Random

To test how well our weight calculation worked, we created a very simple object placement algorithm. To route a file, the weighted random algorithm simply chooses a peer randomly, with each peer chosen with a probability equal to its weight. Weighted random should load balance between the peers very well, since it considers nothing but response times when choosing a peer. However, weighted random does not remember which nodes have which files, causing it to create an unreasonable number of cache misses. For this reason, it is not a serious competitor to the other two algorithms, but it provides a good benchmark for comparing the load balancing performance.

3.2.5 Weighted Replication

In the load balanced CARP algorithm, we run the ordinary CARP algorithm and simply use the dynamically adjusted weights. Intuitively, there are some problems with this algorithm. First, like CARP, it does no replication - as far as the supernode is concerned, each file is stored on only one node. Second, because the weights are dynamically changing, the intervals that nodes are assigned to also change. Intuitively, files which are on the ’edges’ near the endpoints of intervals will tend to get moved between nodes. When this move happens, however, the supernode does not remember that the old node still has the file; this is simply a limitation due to CARP being a hash-based algorithm. Instead, the file is uselessly cached on the old node, whereas requests for that file will be cache misses sent to the new node.

To solve these two problems, we created our own algorithm, weighted replication. Instead of hashing URLs and considering only the hashes, weighted replication stores a table of all URLs the supernode has seen. Each such URL is associated with a set of nodes, which is the set of nodes that are currently storing that web page. We reuse the same weight-based system that we used in the previous two algorithms. We use the same response times, and we define an overloaded node to be a node which has $r_i > \mathit{max}_t$. The routing algorithm, however, now follows these steps:

1. Given a URL, look up the nodes currently storing the URL.

2. If there aren’t any nodes currently storing this file, then:
   (a) Look in cache digests to see if a neighbor supernode is storing the file.
   (b) If so, send the DNS request to the supernode that has control of the file, tell a local node to replicate the file from the node the DNS request resolves to, and send the user back the IP of the node that belongs to the other supernode.
   (c) If not, choose a node from the probability distribution consisting of the node weights, send the request to this node and add the node-file pair to the table.

3. If some nodes are storing this file, then:
   (a) Look to see if these nodes are overloaded. Specifically, rank the nodes in order of their response times.
       Take the 80th percentile response time out of these (or the closest to it). If this response time is greater than $\mathit{max}_t$, the nodes are overloaded. If the nodes are overloaded, find a node that isn’t overloaded, and tell that new node to cache this file also, and tell it to download its copy from a peer currently caching this file.
   (b) Each of the nodes caching this file has an associated weight. These weights will not in general sum to 1, because the weights for all the nodes in the cluster are set to sum to 1. Nevertheless, they can still be used as a probability distribution. Using temporary re-normalized versions of these weights, choose one of the nodes currently caching the file according to this probability distribution.

Here is the motivation behind weighted replication. We want to reuse the successful weighting system from the past two algorithms. However, we also want to guarantee that no document will be a cache miss if a known node is currently caching the document. Rather than using hash intervals, which are fundamentally approximate, we simply use a table to store all cached URLs. When a request comes in, if the requested file is currently cached by some node, we clearly want a node which is caching it to respond. But since multiple nodes may be caching it, we need to pick which one. To do this, we simply use the adaptive weights system used in the first two algorithms.

We also want to avoid hotspots by replicating files which are heavily used. If most of the caches which are currently storing a file are overloaded, then it may well be the case that that file is heavily requested, and so we want to replicate it. It might also be the case that that file is not heavily requested, but the caches that are storing it are overloaded because they are storing other heavily requested files. Even in this case, though, it makes sense to replicate the file to an unloaded cache in order to reduce response times on this file while the problems with the nodes currently storing it are worked out. When replicating files, we obviously do not want to go back to the origin servers, but instead to send the file directly from node to node.

Each node clearly has a finite storage space for data. Generally, Jellyfish users will choose a maximum size for the cache to grow to on their hard disk. When that cache is full, Jellyfish will free up space for new files using an LRU algorithm. The node will of course no longer be caching the files it has gotten rid of. In the algorithm as specified above, the supernode does not keep track of this. Therefore, the supernode may create a cache miss by sending a file request to a node that has evicted the requested file. Our simulation uses the algorithm as specified above, and these cache misses do occur. One could imagine resolving this by having the supernode try to calculate what files have been dumped. Alternatively, each node could periodically send the supernode a list of all the files it has evicted.

3.3 Security

3.3.1 The Security Problem

The key security issue with Jellyfish is how to ensure that volunteer nodes serve unaltered data to users. Given our experience with email and the Internet, one can only expect that, if volunteer nodes were simply trusted not to alter the cached data, certain organizations would jump at the chance to redirect visitors to websites peddling prescription drugs and pornography. To ensure this does not happen, Jellyfish uses a security system which ultimately relies on trusted servers run by the content provider, and also on the difficulty an attacker faces in acquiring a large number of machines and human labor.

3.3.2 Security System Design

The idea behind Jellyfish’s security system is to get the volunteer nodes to check each other. When a node gets a request for a document, in addition to handling the request, with a random probability p, it also requests the document from another peer, pretending to be an ordinary user. It then requests the signature of that document, which is cached as an ordinary document by the Jellyfish system. The signature is generated by a private key known only to the content provider. If a node consistently returns documents that do not match their signatures, it is reported to the central server as possibly suspect, and it may be banned from the system.

Jellyfish’s security system begins when volunteers download the Jellyfish client software. Each Jellyfish volunteer has a Jellyfish ID and password, which cannot be automatically generated, requiring a unique email address and perhaps the solution of a captcha. The Jellyfish ID is used to identify volunteer nodes and is required to sign on to the system. Use of the Jellyfish ID prevents attackers from automatically creating an unlimited number of nodes and allows us to ban malicious users. Users are generally banned by Jellyfish ID, but may in addition be banned by IP address, making it more difficult for an attacker to get a large number of nodes.

In order for nodes to check on other nodes, each node needs to know the IP address of at least one supernode. A simple implementation would be to just tell each node the IP addresses of all the nodes. However, this would present a critical security flaw, because a malicious node could return correct replies to all IP addresses in the node list, and false replies otherwise. Instead, each node should only know the IP address of some of the supernodes, and none of the ordinary nodes. Furthermore, the request traffic from an ordinary node must be indistinguishable from the request traffic from an actual user, otherwise malicious nodes could play the same trick. At a minimum, this means that the request frequency and distribution should match that of an actual user. The easiest way to handle this is for nodes to send requests to any given supernode not too frequently.

An attack we worry about is a kind of denial of service attack in which the attacker runs false nodes which accuse innocent nodes of being malicious. To prevent this attack, we don’t immediately ban a Jellyfish ID just because a single node said it was malicious. Instead, we require that a number of different Jellyfish nodes find the same node to be serving incorrect replies. Furthermore, each Jellyfish node is only allowed to blacklist nodes at a certain rate. Nodes which attempt many more blacklistings than average may reasonably be suspected of being malicious, and can have this rate limited further.

One detail is that both nodes and supernodes could potentially be malicious, and if a node finds a request to be incorrect,
the node now needs to know if it was the fault of the node or the supernode. The difficult way to handle this is to try to do statistical analysis on the bad requests and determine overall if the supernode seems to be at fault or if the problem lies only with the node. This would work, but an easier way is to take advantage of the fact that the central server knows the IP addresses of all the legitimate nodes. The node that found the invalid document can simply send both the node and supernode IP addresses to the central server. If the node IP address does not match any node currently or very recently signed in, then the supernode is the problem; else the problem is the node.

3.3.3 Overhead of Security System

This security system clearly adds overhead to the system. It requires nodes to make fake requests which use capacity without directly helping users. However, a simple analysis can show that it does not require much overhead to be quite secure. Consider a Jellyfish network with 1000 nodes and 100 supernodes, which is receiving 10,000 requests per second. Assume, for simplicity, that just one of these nodes is run by a spammer, and responds to all requests with advertisements. If we make the security overhead five percent of the total traffic to the site, and require five ’bad node’ messages before we block a node, then it will take, on average, approximately ten seconds to block the bad node. (Five percent of 10,000 requests per second is 500 check requests per second; spread evenly over 1000 nodes, any given node is checked about once every two seconds, so five reports accumulate in about ten seconds.) If instead we had a bad supernode, it would take only 1 second on average to find and block the supernode, since the same 500 checks per second are spread over only 100 supernodes, but admittedly, it would have served ten times as many users bad pages. Also, the overhead for this system, unlike for SSL splitting, is added only to the peer network, not to the origin server.

3.3.4 Encryption Layer

In addition to the above security system, we will add one small layer of additional security. Instead of storing the HTML and image files on the nodes in plain text, like ordinary webservers and caching proxies do, we will lightly encrypt the information. Since the Jellyfish software must contain the decryption key, this will not stop a serious attack - it is merely security through obfuscation. Nevertheless, we feel it will help greatly to keep ordinary users from casually modifying pages for fun.

3.3.5 SSL Splitting - An Alternative

An alternative possibility for Jellyfish’s security system is a technique called SSL splitting. SSL splitting is a clever idea which is mentioned briefly in the Coral paper as a possible way to solve the security problem [12]. The idea behind SSL splitting is that the SSL authentication traffic and data traffic can be split. With SSL splitting, the user makes an SSL connection to the central server with the caches as an intermediary. SSL authentication traffic must still be handled by the central server, but SSL data traffic is handled by the proxy cache. The savings on central server load this caching system can produce are clearly dependent on the proportion of authentication traffic to data traffic. Since SSL authentication traffic size is approximately constant per file, the proportional overhead of the authentication traffic is closely tied to the file size. For files of 1MB, the authentication traffic is about 0.05% of the data traffic, making SSL splitting highly efficient [17]. However, for files of 100 bytes, the authentication traffic is about twice the size of the data traffic, making SSL splitting highly inefficient.

In a real-world test on a particular website, the overhead of the authentication was such that even with no cache misses, the caching system reduced the load on the main servers by at best 90% [17]. This might be appropriate for some systems, but we feel that this is too much overhead to take as a starting point for Jellyfish. SSL splitting is certainly worthy of a detailed investigation, however, and could be very useful as a backup system used to survive a massive attack on the main security system.

3.4 File Aggregation

In Jellyfish, each conceptual webpage is assigned its own subdomain. This guarantees that all the component files of that webpage will be served from the same ordinary node. Prior work has shown that this causes tremendous performance improvements compared to many tiny HTTP requests which go to scattered servers. If requests are not aggregated in this way, a single node which is experiencing transient delay can stop a webpage from loading, even if all the other files are complete. In addition, aggregation allows the system to take advantage of persistent connections and HTTP pipelining, which can significantly reduce download times by eliminating TCP connection creations and tear-downs. The fact that Coral does not do this yet may currently be detrimental to Coral’s performance. A downside to this is a repetition of files - nearly every client will have to cache certain standard files like logos and CSS files. Fortunately, these files are usually small and heavily accessed anyway.

4 Simulation

4.1 Design

4.1.1 Motivation

To test our system design, we built a simulation of many aspects of Jellyfish. So far, we have used the simulation primarily to test and compare object placement algorithms. However, we hope to ultimately use it to test update strategies, supernode placement and promotion, security, and system reliability. Because we wanted to first focus on object placement, however, we simulate only the subset of Jellyfish which is relevant to the choice of object placement algorithms.

4.1.2 Implementation

Object placement decisions are made by a single supernode about its child ordinary nodes. Therefore, our simulation looks at only a single cluster - one supernode and a cluster of ordinary nodes. Simulating multiple clusters is critical for testing update strategies and security schemes, but it will not affect object placement results.

Our simulation is implemented in about 1500 lines of Erlang. Erlang is a language designed for distributed computing, and it features very convenient message passing between processes and user-level threading, which is important for simulating many entities. Erlang is based on the notion of a process. Each process is essentially a user-level thread, but processes cannot directly share data and must instead pass messages to communicate. This worked nicely with our simulation, in which every computer simulated - every node, supernode, webserver, and user - was a separate process.
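
For readers who want something concrete, the adaptive weight update that the simulated supernode runs (Sections 3.2.2 and 3.2.3) can be sketched as follows. This is a minimal Python illustration, not the Erlang code of the simulation. The names `Node`, `update_typical`, and `recompute_weights` are hypothetical, the values chosen for `p1` and `p2` are placeholders (the text only requires p2 < p1), and computing the average weight once before any adjustment is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Node:
    weight: float      # share of requests currently routed to this node
    typical_rt: float  # smoothed ("typical") response time, in seconds


def update_typical(node, latest_rt, w=4.0):
    """Blend the newest measurement into the stored typical response time.

    The stored value counts w times as much as the new sample (w = 4 here).
    """
    node.typical_rt = (w * node.typical_rt + latest_rt) / (w + 1.0)


def recompute_weights(nodes, max_t=0.2, p1=0.10, p2=0.02):
    """One pass of the weight update; p1 and p2 are placeholder values."""
    # Assumption: the average is taken before any weight is changed.
    avg = sum(n.weight for n in nodes) / len(nodes)
    for n in nodes:
        if n.typical_rt > max_t:
            # Step 1: overloaded or slow node, cut its weight quickly.
            n.weight *= 1.0 - p1
        elif n.weight < avg:
            # Step 2(a): probe below-average nodes for spare capacity.
            n.weight *= 1.0 + p2
    # Step 3: renormalize so the weights sum to 1.
    total = sum(n.weight for n in nodes)
    for n in nodes:
        n.weight /= total
```

Calling `update_typical` once per measurement interval and `recompute_weights` once per weight interval reproduces the two-timer structure described in the text; with equal starting weights and no overloaded node, the weights stay equal, which matches the stated behavior under light load.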
4.1.3 Simplifying Assumptions
To make the problem tractable, our simulation makes a number
of simplifications. Simulated nodes have an associated band-
width and hard drive space, which is generated randomly and
is different for each node. However, we do not simulate nodes’
apparent bandwidths changing, as might occur due to network
congestion or other programs running on the node. We assume
that requests for web pages are zipf distributed, as much research
has suggested that this distribution closely approximates the ac-
tual access frequency of many real-world web sites. Currently
we use an extremely simple Internet topology, in which Internet
latencies are simply randomly generated every time a message is sent. Using a more accurate Internet topology will be crucial when exploring multi-cluster designs which are intended to be spread out over the globe. Within a single cluster, however, the nodes are supposed to be close together, and the simulation of nodes which have slow connections is already handled by assigning each node a bandwidth. In these simulations we also assume that there are no malicious or malfunctioning nodes and we do not look at document updates.

4.2 Results

We first used the simple weighted random algorithm to test our adaptive weight finding system. In our simulation, we do not give the supernode process access to actual capacities or loads of the ordinary nodes. Instead, we use our adaptive weight finding system to try to deduce these capacities using response times. To make sure that this was working correctly, we compared the weights assigned by the supernode with the actual simulated capacities of the nodes. Ideally, under heavy load, the weight for node i should equal the bandwidth of node i divided by the sum of the bandwidths.

When the system is under light load, however, this relation should not hold at all. When there is much excess capacity, the weights will tend to be more nearly equal than proportional to bandwidth. But to test whether the weight finding system worked under heavy load, we ran a simulation with the demand nearly equal to the total capacity. We found that the weights cluster around the correct values without bias (Figure 1). We also found that the algorithm ran stably, accurately finding the weights, until the system was running at about 90% of theoretical capacity. After that, the weight finding algorithm basically broke down and no longer gave accurate results. 90% may be sufficient, but further work should investigate how to increase this threshold.

Figure 1: Adaptive weight-finding algorithm causes observed weights to cluster around actual weights without bias

The main purpose of our simulation, however, was to compare the load balancing CARP algorithm to the weighted replication algorithm. We compared the two algorithms using three performance metrics: the frequency of ’bad misses’, the frequency of client timeouts, and the average response time.

Bad misses are a subset of cache misses. The idea behind bad misses is that some cache misses are unavoidable. Specifically, the first time in the simulation that a document is requested, it will inevitably be a cache miss. The number of bad misses is calculated by taking all cache misses and removing the ones that were the first time requesting a document. The original cache miss number can be used without any change in the results; however, since many cache misses are not bad misses, it is easier to see the change in the data when subtracting the noise of the non-bad misses.

In our simulation, we assume that each client has a limited patience for waiting for responses. Specifically, we say that any request that takes longer than 1s from initiation to completion took inappropriately long, and it is labeled a user timeout. The response times of all requests that did not time out are recorded, and we use these values to calculate an average response time.

We primarily ran the simulation at fairly high load - between 40 and 80% of the system’s theoretical capacity. We found that weighted replication gave a considerably lower bad miss rate at both 40% and 80% load (Figure 2). However, the difference was greater at 80%. CARP causes more bad misses because files which are near the boundaries of caches wind up being swapped between them, causing cache misses. At higher loads, the weights tend to change more rapidly, causing more of this swapping to occur. Currently, bad misses only occur with weighted replication once the node caches have filled up and begun evicting items; this is why no bad misses for weighted replication can be seen at the beginning of the graph.

We also found that client timeouts happened less frequently with weighted replication (Figures 3, 4). Client timeouts occur when load-balancing fails and some peers become overloaded. Because CARP does not replicate files, it tends to load balance less well. Also, CARP’s hashing scheme implicitly assumes a uniform distribution of file access. In real life, file access frequencies are far from uniform. In our simulation, we use a zipf distribution to try to approximate realistic file accesses. Because the access pattern is not uniform, CARP load balancing does not work very well - a small adjustment of weights can lead to a large change in traffic. In all cases, there is an initial burst of failed requests at the start. This occurs while the supernode learns the correct weights of the nodes. It is to a certain extent an artifact of our simulation beginning with a bang instead of having nodes gradually join the supernode, as would occur in real life.
Figure 2: CARP causes more bad misses than weighted repli-
cation because documents near the edges of hash intervals get
switched between nodes.

Figure 3: Failed requests occur less frequently with weighted replication. Because CARP does not replicate documents, and because request frequencies are not even approximately uniformly distributed, as hash range based approaches assume, CARP does not load balance as effectively as weighted replication. This leads to overloaded nodes and user timeouts.

At a 40% load, we found that the response times for weighted replication were significantly better than the response times for CARP (Figure 5). At 80%, weighted replication still did better, but only slightly (Figure 6). Note however that these response times do not include responses which timed out. Including these response times might have affected this result. Weighted replication appears to do better at 40% load because it tends to prefer high bandwidth, low latency nodes by replicating files onto them. At higher loads, it is forced to use all nodes at close to their actual capacities, and no longer prefers these nodes.

Overall, our results indicate that under the parameters we tried, the weighted replication algorithm seems to correctly load balance between nodes of widely varying capacities, avoid overloaded nodes, and replicate popular files to avoid hotspots. It appears to load balance better and give fewer cache misses than load balanced CARP. However, benchmarking is admittedly a difficult exercise, and a different simulation design or parameter choices might have led to a different result. We did not try to examine parameter combinations exhaustively, but looking at the sensitivity of our results to parameter and design decisions would be worthwhile future work.
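
For reference, the weighted replication routing rule evaluated in this section (Section 3.2.5) can be sketched as follows. This is a minimal Python illustration, not the Erlang simulation code. The names `route`, `table`, `weights`, and `typical_rt` are hypothetical. The cache-digest lookup of steps 2(a) and 2(b) is omitted, the index used for "the 80th percentile (or the closest to it)" is one possible reading, and picking the new replica by weight is an assumption of this sketch, since the text only says to find a node that is not overloaded.

```python
import random


def route(url, table, weights, typical_rt, max_t=0.2, rng=random):
    """Pick the node that should serve `url`, updating `table` in place.

    table:      dict mapping URL -> list of nodes believed to cache it
    weights:    dict mapping node -> adaptive weight (positive, sums to 1)
    typical_rt: dict mapping node -> smoothed response time in seconds
    """
    nodes = list(weights)
    holders = table.setdefault(url, [])

    if not holders:
        # Step 2(c): no known copy, so choose by the cluster-wide weights
        # and remember the node-file pair.
        chosen = rng.choices(nodes, weights=[weights[n] for n in nodes])[0]
        holders.append(chosen)
        return chosen

    # Step 3(a): the holders count as overloaded if the response time near
    # the 80th percentile of their ranked response times exceeds max_t.
    ranked = sorted(typical_rt[n] for n in holders)
    idx = min(len(ranked) - 1, int(0.8 * len(ranked)))
    if ranked[idx] > max_t:
        spare = [n for n in nodes
                 if n not in holders and typical_rt[n] <= max_t]
        if spare:
            # Assumption: choose the new replica by weight. In the real
            # system it would fetch its copy from a current holder.
            new = rng.choices(spare, weights=[weights[n] for n in spare])[0]
            holders.append(new)

    # Step 3(b): random.choices renormalizes the holders' weights itself.
    return rng.choices(holders, weights=[weights[n] for n in holders])[0]
```

The sketch lets a freshly added replica serve the very next request, which glosses over the time the node-to-node copy would take; a fuller model would mark the replica as pending until the transfer completes.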

Figure 4: Failed requests occur less frequently with weighted replication than with CARP. However, at 40% load, relatively few failed requests occur with either algorithm.

5 Future work

5.1 Squid

The deployable version of Jellyfish would contain many components that have been built before, such as an HTTP request handler, a highly efficient caching and retrieval engine, and an Edge Side Includes implementation. We looked for a project that we could build our system on top of, saving the replication of these components. We found that the Squid caching engine seemed to be our best option. Squid is a popular open-source caching system derived from the Harvest project. Squid is normally run by a web host on a single server or on a small hierarchy of servers that have been carefully configured to talk to each other.
Squid contains a highly efficient web caching system with
respect to data storage and retrieval, but none of the features
required to build a large network of untrusted caches. It does
contain some support for inter-cache communication. However,
this is basically designed for organizations that want to distribute
their Squid caches across a handful of trusted, identical comput-
ers. We think that building Jellyfish on top of Squid will save
duplication of some difficult technical effort. We also think that
the project will benefit from an attachment to an existing and
active open source community with closely aligned interests.

Figure 5: Weighted replication gives considerably lower response times at 40% load. This occurs because weighted replication tends to replicate documents onto high bandwidth, low latency nodes, ultimately preferring them more than CARP does. It also occurs because CARP occasionally overloads nodes due to poor load balancing, leading to higher response times, and especially higher variability in response times.

Figure 6: Weighted replication gives lower response times at 80% load, but the difference is less pronounced than at 40% load. However, the variability of the response times is still considerably reduced.

References

[1] Berkeley open infrastructure for network computing, http://
[2] Folding@home
[3] Google compute
[4] Seti@home
[5] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. On the implications of zipf’s law for web caching. Technical Report CS-TR-1998-1371, 1998.
[6] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and zipf-like distributions: Evidence and implications. In INFOCOM (1), pages 126–134, 1999.
[7] A. Chankhunthod, P. B. Danzig, C. Neerdaels, M. F. Schwartz, and K. J. Worrell. A hierarchical internet object cache. In USENIX Annual Technical Conference, pages 153–164, 1996.
[8] B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. The Bloomier filter: An efficient data structure for static support lookup tables.
[9] Y. Chen, L. Qiu, W. Chen, L. Nguyen, and R. Katz. Efficient and adaptive web replication using content clustering, 2003.
[10] I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong. Freenet: A distributed anonymous information storage and retrieval system. Lecture Notes in Computer Science, 2001.
[11] P. R. Eaton. Caching the web with oceanstore.
[12] M. Freedman, E. Freudenthal, and D. Mazières. Democratizing content publication with coral, 2004.
[13] J. Gwertzman and M. Seltzer. An analysis of geographical push-caching, 1997.
[14] K. Hildrum, J. Kubiatowicz, S. Rao, and B. Zhao. Distributed object location in a dynamic network, August 2002.
[15] M. Kaiser, K. Tsui, and J. Liu. Adaptive distributed caching, 2002.
[16] E. Kawai, K. Osuga, K. ichi Chinen, and S. Yamaguchi. Duplicated hash routing: A robust algorithm for a distributed www cache system.
[17] C. Lesniewski-Laas and M. F. Kaashoek. SSL splitting: Securely serving data from untrusted caches. In Proceedings of the 12th USENIX Security Symposium, 2003.
[18] C. Loosley. When is your web site fast enough? E-Commerce Times, November 12 2005.
[19] A. Mislove, A. Post, A. Haeberlen, and P. Druschel. Experiences in building and operating epost, a reliable peer-to-peer application. Proceedings of EuroSys2006, 2006.
[20] G. Pallis and A. Vakali. Insight and perspectives for content delivery networks. Communications of the ACM, 49:101–106, January
[21] P. Rodriguez, C. Spanner, and E. W. Biersack. Web caching architectures: Hierarchical and distributed caching. In Proceedings of the 4th International Web Caching Workshop, 1999.
[22] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable Peer-To-Peer lookup service for internet applications. In Proceedings of the 2001 ACM SIGCOMM Conference, pages 149–160, 2001.
[23] J.-F. C. E. E. Tolga Bektaş and G. Laporte. Exact algorithms for the joint object placement and request routing problem in content distribution networks, 2006.