
High Performance

P2P Web Caching

Erik Garrison
Jared Friedman

CS264 Presentation
May 2, 2006
SETI@Home

Basic Idea: people donate computer time to look for
aliens

Delivered more than 9 million CPU-years

Guinness Book of World Records – largest computation ever

Many other successful projects (BOINC, Google
Compute)

The point: many people are willing to donate
computer resources for a good cause
Wikipedia

About 200 servers required to keep the site
live

Hosting & hardware costs over $1M per year

All revenue from donations

Hard to make ends meet

Other not-for-profit websites in similar
situation
HelpWikipedia@Home

What if people could donate idle computer
resources to help host not-for-profit
websites?

They probably would!

This is the goal of our project
Prior Work

This doesn't exist

But some things are similar
 Content Distribution Networks (Akamai)

Distributed web hosting for big companies
 CoralCDN/CoDeeN

P2P web caching, like our idea,

But a very different design

Both have some problems
Akamai, the opportunity

Internet traffic is 'bursty'

Expensive to build infrastructure to handle
flash crowds

International audience, local servers
 Sites run slowly in other countries
Akamai, how it works

Akamai has deployed >10,000 servers around
the globe

Companies subscribe as Akamai clients

Client content (mostly images, other media)
is cached on Akamai's servers

Tricks with DNS make viewers download
content from nearby Akamai servers

Result: Website runs fast everywhere, no
worries about flash crowds

But VERY expensive!
CoralCDN

P2P web caching

Probably the closest system to our goal

Currently in late-stage testing on PlanetLab

Uses an overlay and a 'distributed sloppy
hash table'

Very easy to use – just append '.nyud.net' to
a URL and Coral handles it

Unfortunately ...
Coral: Problems

Currently very slow
 This might improve in later versions
 Or it might be due to the overlay structure

Security: volunteer nodes can respond with
fake data

Any site can use Coral to help reduce load
 Just append .nyud.net to their internal links

Decentralization makes optimization hard
 more on this later
Our Design Goals

Fast: Akamai level performance

Secure: Pages served are always genuine

Fast updates possible

Must greatly reduce demands on main site
 But this cannot compromise first 3
Our Design

Node/Supernode structure
 Take advantage of extremely heterogeneous
performance characteristics

Custom DNS server redirects incoming
requests to nearby super node

Super node forwards request to nearby
ordinary node

Node replies to user
Our Design
User goes to wikipedia.org

DNS server resolves wikipedia.org to a super node

Super node forwards request to an ordinary node that has the
requested document

Node retrieves document and sends it to the user
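The supernode's routing step can be sketched in a few lines. This is a toy Python model (all names are hypothetical; the system itself is unimplemented): the supernode tracks which nodes cache which documents and forwards each request to the least-loaded holder, falling back to the origin on a miss.

```python
class Supernode:
    """Toy model of a supernode's request routing (names hypothetical)."""

    def __init__(self):
        self.holders = {}   # url -> set of node ids caching that document
        self.load = {}      # node id -> outstanding requests

    def add_node(self, node_id):
        self.load[node_id] = 0

    def place(self, url, node_id):
        # Record that node_id now caches url.
        self.holders.setdefault(url, set()).add(node_id)

    def route(self, url):
        """Pick the least-loaded node holding url; None means a cache miss
        that must fall through to the origin (main site) server."""
        candidates = self.holders.get(url, set())
        if not candidates:
            return None
        best = min(candidates, key=lambda n: self.load[n])
        self.load[best] += 1
        return best
```

A real supernode would also weigh geographic proximity and measured node bandwidth, not just outstanding-request counts.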
Performance

Requests are answered in only 2 hops

DNS server resolves to a geographically
close supernode

Supernode avoids sending requests to slow
or overloaded nodes

All parts of a page (e.g., html and images)
should be served by a single node
Security

Have to check nodes' accuracy

First line of defense: encrypt local content

May delay attacks, but won't stop them
Security

More serious defense: let users check the
volunteer nodes!

Add a javascript wrapper to the website that
requests the pages using AJAX

With some probability, the AJAX script will
compute the MD5 of the page it got and send
it to a trusted central node

Central node kicks out nodes that frequently
serve pages with invalid MD5 sums

Offload processing not just to nodes, but to
users, with zero-install
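The central node's side of this check can be sketched as follows. This is an illustrative Python model, not the actual implementation: in practice the browser-side AJAX wrapper would compute the MD5 of the page it received and report it; here we model the comparison and the eviction threshold directly (the threshold value is an assumption).

```python
import hashlib
from collections import defaultdict

KICK_THRESHOLD = 3            # strikes before a volunteer node is evicted (assumed)
failures = defaultdict(int)   # node id -> count of bad-hash reports
kicked = set()

def report(node_id, served_body, trusted_md5):
    """Central node handling one client report: compare the MD5 of the page
    the node served against the trusted copy's MD5; evict repeat offenders.
    Returns False once the node has been kicked out."""
    if hashlib.md5(served_body).hexdigest() != trusted_md5:
        failures[node_id] += 1
        if failures[node_id] >= KICK_THRESHOLD:
            kicked.add(node_id)
    return node_id not in kicked
```

Because only a random fraction of page loads trigger a report, the per-user cost is tiny, yet a node serving fake pages is caught quickly once it handles real traffic.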
A Tricky Part

Supernodes get requests, have to decide
what node should answer what requests

Have to load-balance nodes – no
overloading

Popular documents should be replicated
across many nodes

But don't want to replicate unpopular
documents much – conserve storage space

Lots of conflicting goals!
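One simple way to reconcile these goals is popularity-proportional replication. The sketch below is a plausible heuristic of this kind, not the algorithm from the slides: replicate each document just enough that no single copy must serve more than one node's capacity, capped by the node count, with a floor of one replica.

```python
import math

def target_replicas(request_rate, node_capacity, total_nodes, min_replicas=1):
    """Heuristic replica count for one document (illustrative, not the
    authors' algorithm).

    request_rate  -- requests/sec for this document
    node_capacity -- requests/sec one volunteer node can serve
    total_nodes   -- nodes available to this supernode
    """
    needed = math.ceil(request_rate / node_capacity)
    return max(min_replicas, min(needed, total_nodes))
```

This keeps popular documents spread across many nodes (avoiding hotspots) while unpopular documents consume only one node's storage.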
On the plus side...

Unlike Coral & CoDeeN, supernodes know a
lot of nodes (maybe 100-1000?)

They can track performance characteristics of
each node

Make object placement decisions from a
central point

Lots of opportunity to make really intelligent
decisions
 Better use of resources
 Higher total system capacity
 Faster response times
Object Placement Problem

This kind of problem is known as an object
placement problem
 “What nodes do we put what files on?”

Also related to the request routing problem
 “Given the files currently on the nodes, what
node do we send this particular request to?”

These problems are basically unsolved for
our scenario

Analytical solutions have been done for very
simplified, somewhat different cases

We suspect a useful analytic solution is
impossible here
Simulation

Too hard to solve analytically, so do a
simulation

Goal is to explore different object placement
algorithms under realistic scenarios

Also want to model the performance of the
whole system
 What cache hit ratios can we get?
 How does number/quality of peers affect cache
hit ratios?
 How is user latency affected?

Built a pretty involved simulation in Erlang
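To illustrate the kind of question the simulation answers, here is a toy cache-hit-ratio experiment in Python (the real simulation is in Erlang; the Zipf-like popularity distribution and all parameters below are illustrative assumptions): Zipf-weighted requests over an LRU cache of bounded size.

```python
import random
from collections import OrderedDict
from itertools import accumulate

def simulate_hit_ratio(n_docs=1000, cache_size=100, n_requests=20000,
                       zipf_s=1.1, seed=0):
    """Toy stand-in for the Erlang simulation: draw requests from a
    Zipf-like popularity distribution, serve them through an LRU cache,
    and return the fraction of requests that hit the cache."""
    rng = random.Random(seed)
    weights = [1 / (rank + 1) ** zipf_s for rank in range(n_docs)]
    cum = list(accumulate(weights))          # cumulative weights for sampling
    cache = OrderedDict()                    # doc id -> True, LRU order
    hits = 0
    for _ in range(n_requests):
        doc = rng.choices(range(n_docs), cum_weights=cum)[0]
        if doc in cache:
            hits += 1
            cache.move_to_end(doc)           # mark as most recently used
        else:
            cache[doc] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)    # evict least recently used
    return hits / n_requests
```

Varying `cache_size` and the popularity skew in a model like this shows how peer count and quality drive hit ratio, which is exactly what the full simulation measures at scale.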
Simulation Results

So far, encouraging!

Main results using a heuristic object
placement algorithm

Can load-balance without creating hotspots
up to about 90% of theoretical capacity

Documents rarely requested more than once
from central server

Close to theoretical optimum
Next Steps

Add more detail to simulation
 Node churn
 Better internet topology

Explore update strategies

Obviously, an actual implementation would
be nice, but not likely to happen this week

What do you think?