High Performance P2P Web Caching
Erik Garrison
Jared Friedman
CS264 Presentation
May 2, 2006

SETI@Home

- Basic idea: people donate computer time to look for aliens
- Delivered more than 9 million CPU-years
- Guinness Book of World Records – largest computation ever
- Many other successful projects (BOINC, Google Compute)
- The point: many people are willing to donate computer resources for a good cause

Wikipedia

- About 200 servers required to keep the site live
- Hosting & hardware costs over $1M per year
- All revenue comes from donations
- Hard to make ends meet
- Other not-for-profit websites are in a similar situation

HelpWikipedia@Home

- What if people could donate idle computer resources to help host not-for-profit websites?
- They probably would!
- This is the goal of our project

Prior Work

- This doesn't exist
- But some things are similar:
  - Content Distribution Networks (Akamai): distributed web hosting for big companies
  - CoralCDN/CoDeeN: P2P web caching, like our idea, but with a very different design
- Both have some problems

Akamai, the opportunity

- Internet traffic is 'bursty'
- Expensive to build infrastructure to handle flash crowds
- International audience, local servers: sites run slowly in other countries

Akamai, how it works

- Akamai put >10,000 servers around the globe
- Companies subscribe as Akamai clients
- Client content (mostly images and other media) is cached on Akamai's servers
- Tricks with DNS make viewers download content from nearby Akamai servers
- Result: the website runs fast everywhere, with no worries about flash crowds
- But VERY expensive!

CoralCDN

- P2P web caching
- Probably the closest system to our goal
- Currently in late-stage testing on PlanetLab
- Uses an overlay and a 'distributed sloppy hash table'
- Very easy to use – just append '.nyud.net' to a URL and Coral handles it
- Unfortunately ...

Coral: Problems

- Currently very slow
  - This might improve in later versions
  - Or it might be due to the overlay structure
- Security: volunteer nodes can respond with fake data
- Any site can use Coral to help reduce load
  - Sites just append .nyud.net to their internal links
- Decentralization makes optimization hard
  - More on this later

Our Design Goals

- Fast: Akamai-level performance
- Secure: pages served are always genuine
- Fast updates possible
- Must greatly reduce demands on the main site
  - But this cannot compromise the first three goals

Our Design

- Node/supernode structure
  - Takes advantage of extremely heterogeneous performance characteristics
- Custom DNS server redirects incoming requests to a nearby supernode
- Supernode forwards the request to a nearby ordinary node
- Node replies to the user

Our Design
1. User goes to wikipedia.org
2. DNS server resolves wikipedia.org to a supernode
3. Supernode forwards the request to an ordinary node that has the requested document
4. Node retrieves the document and sends it to the user

Performance

Requests are answered in only 2 hops
DNS server resolves to a geographically
close supernode
Supernode avoids sending requests to slow
or overloaded nodes
All parts of a page (e.g., html and images)
should be served by a single node
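
To make the two-hop path concrete, here is a minimal Python sketch of the selection step a supernode might run before forwarding a request. The `Node` fields, the 90% load cutoff, and the latency tie-break are illustrative assumptions, not a specification of the real system.

```python
# Hypothetical view the supernode keeps of each volunteer node.
class Node:
    def __init__(self, node_id, latency_ms, load, capacity, documents):
        self.node_id = node_id
        self.latency_ms = latency_ms   # measured round-trip time from the supernode
        self.load = load               # requests currently being served
        self.capacity = capacity       # rough concurrent-request limit
        self.documents = documents     # set of URLs cached on this node

def pick_node(nodes, url):
    """Pick a node to serve `url`: it must hold the document and not be
    overloaded; among candidates, prefer the lowest-latency one so the whole
    page (HTML plus images) keeps coming from the same nearby node."""
    candidates = [n for n in nodes
                  if url in n.documents and n.load < 0.9 * n.capacity]
    if not candidates:
        return None  # fall back to the central (origin) server
    return min(candidates, key=lambda n: n.latency_ms)
```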

Security

- Have to check the nodes' accuracy
- First line of defense: encrypt local content
  - May delay attacks, but won't stop them

Security

- More serious defense: let users check the volunteer nodes!
- Add a JavaScript wrapper to the website that requests the pages using AJAX
- With some probability, the AJAX script will compute the MD5 of the page it got and send it to a trusted central node
- Central node kicks out nodes that frequently return invalid MD5 sums (see the sketch below)
- Offload processing not just to nodes, but to users, with zero install
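
A minimal sketch, in Python, of what the trusted central node could do with the hashes users report; the client side would be the JavaScript/AJAX wrapper described above. The strike counter, the threshold of 3, and the in-memory tables are assumptions for illustration.

```python
import hashlib
from collections import defaultdict

canonical = {}                 # url -> genuine page bytes held by the main site
strikes = defaultdict(int)     # node_id -> number of bad hashes reported so far
banned = set()                 # volunteer nodes kicked out of the system

MAX_STRIKES = 3                # illustrative threshold, not from the actual design

def report_hash(node_id, url, reported_md5):
    """Handle a report from a user's AJAX wrapper: compare the MD5 the user
    computed against the MD5 of the genuine page and kick out nodes that
    repeatedly serve content that does not match."""
    genuine_md5 = hashlib.md5(canonical[url]).hexdigest()
    if reported_md5 != genuine_md5:
        strikes[node_id] += 1
        if strikes[node_id] >= MAX_STRIKES:
            banned.add(node_id)   # supernodes stop routing requests to this node
```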

A Tricky Part

- Supernodes get requests and have to decide which node should answer which request
- Have to load-balance nodes – no overloading
- Popular documents should be replicated across many nodes
- But don't want to replicate unpopular documents much – conserve storage space
- Lots of conflicting goals!

On the plus side...

Unlike Coral & CoDeeN, supernodes know a
lot of nodes (maybe 100-1000?)
They can track performance characteristics of
each node
Make object placement decisions from a
central point
Lots of opportunity to make really intelligent
decisions


Better use of resources
Higher total system capacity
Faster response times
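
Putting the last two slides together, here is one plausible heuristic, sketched in Python, that a supernode could use for object placement: replicate each document roughly in proportion to its request rate, and put new copies on lightly loaded nodes with spare storage. `has_space()` and `fetch_and_store()` are hypothetical node methods, and the proportional rule is our assumption, not the heuristic actually evaluated in the simulation.

```python
from collections import Counter

request_counts = Counter()     # url -> requests recently seen by this supernode

def target_replicas(url, num_nodes):
    """Heuristic: replicate roughly in proportion to popularity, with at
    least one copy and at most one copy per node."""
    total = sum(request_counts.values()) or 1
    share = request_counts[url] / total
    return max(1, min(num_nodes, round(share * num_nodes)))

def place(url, nodes):
    """Push `url` onto the least-loaded nodes with spare storage until it has
    the target number of replicas; unpopular documents stay on few nodes."""
    want = target_replicas(url, len(nodes))
    holders = sum(1 for n in nodes if url in n.documents)
    spare = sorted((n for n in nodes if url not in n.documents and n.has_space()),
                   key=lambda n: n.load)
    for n in spare[:max(0, want - holders)]:
        n.fetch_and_store(url)   # hypothetical: node pulls the document from the origin
```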

Object Placement Problem

- This kind of problem is known as an object placement problem: "What nodes do we put what files on?"
- Also related to the request routing problem: "Given the files currently on the nodes, what node do we send this particular request to?"
- These problems are basically unsolved for our scenario
- Analytical solutions have been done for very simplified, somewhat different cases
- We suspect a useful analytic solution is impossible here

Simulation

- Too hard to solve analytically, so do a simulation
- Goal is to explore different object placement algorithms under realistic scenarios
- Also want to model the performance of the whole system:
  - What cache hit ratios can we get?
  - How does the number/quality of peers affect cache hit ratios?
  - How is user latency affected?
- Built a pretty involved simulation in Erlang (a toy sketch of the idea follows)
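
For flavor, a toy Python sketch of the kind of experiment the simulator runs; the real simulation was written in Erlang and models node quality, load, and latency in far more detail. The popularity weights and the simple hit/miss rule here are simplifying assumptions.

```python
import random

def simulate(nodes, urls, popularity, num_requests=100_000):
    """Toy request-driven simulation (not the actual Erlang code): draw URLs
    from a skewed popularity distribution, count a cache hit whenever some
    volunteer node holds the document, and report the overall hit ratio."""
    hits = 0
    for _ in range(num_requests):
        url = random.choices(urls, weights=popularity, k=1)[0]
        if any(url in n.documents for n in nodes):
            hits += 1        # served by a volunteer node
        # otherwise the request falls through to the central server (a miss)
    return hits / num_requests
```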

Simulation Results

- So far, encouraging!
- Main results use a heuristic object placement algorithm
- Can load-balance without creating hotspots up to about 90% of theoretical capacity
- Documents are rarely requested more than once from the central server
  - Close to the theoretical optimum

Next Steps

- Add more detail to the simulation:
  - Node churn
  - Better internet topology
- Explore update strategies
- Obviously, an actual implementation would be nice, but not likely to happen this week
- What do you think?