Shared state is what makes an application interesting. Modern web applications
are accumulating more and more shared state: all the personalization and user data
associated with web 2.0 is a whole lot of state.
Relational databases are the gold standard of state management and have been for
the last 30 years. The relational
data model is one of the better contributions of computer science. Relational
databases are here to stay, but they are not
appropriate for every storage situation (we still have files and hashmaps, after
all). In particular, relational databases, and especially the current SQL products
available, present problems for large-scale software.
But first, let's review what is good about the relational model:
1. It separates data from look-up strategy: queries are declarative.
2. It is incredibly flexible: most desirable data operations can be expressed in
relations.
To begin to understand the problems with relational databases, let's first review
some basic facts about a modern computer:
[Table: round-trip latency and throughput for Java HashMap, BDB Btree, MySQL local,
and MySQL remote.]
1. Disk IO is stagnant
2. Moore's law is going gangbusters for parallel applications (e.g. any application
that has shared state)
So from this you can get some feel for the cost of various operations. It is clear
that disk operations are our greatest enemy, followed by the network. We can
iterate over XXX items of an array in the time required to look up a single item
via a log(n) b-tree index, and over YYY items in the time required to complete
a MySQL request.
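
To get a concrete feel for the in-memory end of this spectrum, here is a minimal,
illustrative benchmark sketch (not from the original notes) comparing sequential array
iteration with log(n) lookups in an in-memory ordered map, with java.util.TreeMap
standing in for a b-tree; a disk-backed b-tree or a networked MySQL round trip would
be slower again by orders of magnitude.

```java
import java.util.Random;
import java.util.TreeMap;

public class IterationVsLookup {
    public static void main(String[] args) {
        int n = 1_000_000;
        long[] values = new long[n];
        TreeMap<Long, Long> tree = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            values[i] = i;
            tree.put((long) i, (long) i);
        }

        // Sequential scan over the array.
        long t0 = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += values[i];
        }
        long scanNs = System.nanoTime() - t0;

        // Random log(n) lookups in an in-memory ordered map.
        Random rnd = new Random(42);
        int lookups = 100_000;
        t0 = System.nanoTime();
        long hits = 0;
        for (int i = 0; i < lookups; i++) {
            hits += tree.get((long) rnd.nextInt(n));
        }
        long lookupNs = System.nanoTime() - t0;

        System.out.printf("scan: %.1f ns/item, lookup: %.1f ns/lookup (sum=%d, hits=%d)%n",
                (double) scanNs / n, (double) lookupNs / lookups, sum, hits);
    }
}
```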
The relational model is great when accessed programmatically, but it is at its best
in little applications (of which there are many).
In a high-scalability scenario with a shared db, the advantages of the relational
model disappear. You no longer have flexibility; what you have is a system in which
most operations will bring the database to its knees, along with everyone depending
on it. Think about the average table: of the set of possible queries, only a few can
be issued without difficulty. With this being the case, the abstraction of the lookup
structure becomes a major problem. It is impossible to tell by looking at code whether
it will run quickly or bring down the system; that depends entirely on data sizes and
indexing structure.
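
As an illustration of how little the code reveals, here is a hedged JDBC sketch; the
table, columns, and connection details are hypothetical. Assuming `users` is large and
indexed on `id` but not on `last_name`, the two statements look interchangeable in
code, yet one is a single b-tree probe and the other is a full table scan.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HiddenCost {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection and schema: a large `users` table that is
        // indexed on `id` but has no index on `last_name`.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://db.internal/app", "app", "secret");

        // Indexed lookup: a single b-tree probe, fast at any table size.
        PreparedStatement byId = conn.prepareStatement(
                "SELECT name FROM users WHERE id = ?");
        byId.setLong(1, 42L);

        // Identical-looking code, but with no index on last_name this is a
        // full table scan; on a big shared table it can take the database
        // down for everyone.
        PreparedStatement byLastName = conn.prepareStatement(
                "SELECT name FROM users WHERE last_name = ?");
        byLastName.setString(1, "Smith");

        try (ResultSet rs = byId.executeQuery()) {
            while (rs.next()) { /* consume */ }
        }
        try (ResultSet rs = byLastName.executeQuery()) {
            while (rs.next()) { /* consume */ }
        }
        conn.close();
    }
}
```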
Caching is the most popular approach because it is the simplest to implement.
Basically you replace all your fancy queries with simple puts and gets, and each
operation goes first to the cache and then to the db. The cache may not actually be
much faster than the db, but it can be distributed, whereas the db cannot.
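
A minimal cache-aside sketch of that pattern, assuming a simple string key/value
model; the Database interface is a hypothetical stand-in for whatever backs the real
store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Cache-aside: reads try the cache first and fall back to the database;
// writes go to the database and then update the cache.
public class CacheAside {
    interface Database {
        String read(String key);
        void write(String key, String value);
    }

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Database db;

    public CacheAside(Database db) {
        this.db = db;
    }

    public String get(String key) {
        String value = cache.get(key);
        if (value == null) {
            value = db.read(key);       // miss: go to the database
            if (value != null) {
                cache.put(key, value);  // populate for the next reader
            }
        }
        return value;
    }

    public void put(String key, String value) {
        db.write(key, value);           // the database stays the source of truth
        cache.put(key, value);          // keep the cached copy in step
    }
}
```

The interesting question is how such a cache is distributed across servers, which is
what the strategies below address.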
In the first strategy we try to keep each object on each server. We read and write
to the local cache, and in the background
the cache replicates our changes out to the other servers. This is the ehcache
strategy.
The problem with this is that we must store the whole cache locally, which means
either our entire dataset must fit in memory or we have to use the local disk. Of
course, the local disk may well be slower than the network, so our cache may end up
slower than the database itself (though, importantly, it is distributed). In addition,
each write must now go to the db as well as to every application server.
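
A rough sketch of that replicated-cache write path, under the assumption of a full
local copy on every server; the Peer and Database interfaces are hypothetical
stand-ins for the replication transport and the backing store.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Replicated cache in the ehcache style: reads are purely local, but every
// write must reach the database plus every peer application server.
public class ReplicatedCache {
    interface Peer {
        void replicate(String key, String value);
    }
    interface Database {
        void write(String key, String value);
    }

    private final Map<String, String> local = new ConcurrentHashMap<>();
    private final List<Peer> peers;
    private final Database db;

    public ReplicatedCache(Database db, List<Peer> peers) {
        this.db = db;
        this.peers = peers;
    }

    public String get(String key) {
        return local.get(key);              // reads never leave the machine
    }

    public void put(String key, String value) {
        local.put(key, value);
        db.write(key, value);               // one write to the database...
        for (Peer p : peers) {
            p.replicate(key, value);        // ...plus one per application server
        }
    }
}
```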
How it works
A simpler approach is much faster--use the key to calculate the location. This
means coming up with a function
f(key)->(n1,n2,n3)
where n1, n2, n3 are the nodes containing the data.
In this model only the key and the cluster topology (which servers are where) are
needed to locate and retrieve data.
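
Here is a minimal sketch of such a function, assuming the client knows the ordered
list of nodes; the node names and replica count are illustrative. Real systems
typically use consistent hashing so that adding or removing a node moves only a small
fraction of keys, but the idea is the same: the key alone determines where the data
lives.

```java
import java.util.ArrayList;
import java.util.List;

// f(key) -> (n1, n2, n3): hash the key, pick a position in the known cluster
// topology, and take the next `replicas` nodes as the owners of that key.
public class KeyLocator {
    private final List<String> nodes;   // cluster topology known to every client
    private final int replicas;

    public KeyLocator(List<String> nodes, int replicas) {
        this.nodes = nodes;
        this.replicas = replicas;
    }

    public List<String> locate(String key) {
        int start = Math.floorMod(key.hashCode(), nodes.size());
        List<String> owners = new ArrayList<>();
        for (int i = 0; i < replicas; i++) {
            owners.add(nodes.get((start + i) % nodes.size()));
        }
        return owners;
    }

    public static void main(String[] args) {
        KeyLocator locator = new KeyLocator(
                List.of("n1", "n2", "n3", "n4", "n5"), 3);
        System.out.println(locator.locate("user:1234"));
    }
}
```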
References
- Google BigTable
- Amazon Dynamo
- One-size fits all