The result of this simplicity is the key innovation of this paper: the ability to
use commodity hardware to provide production-grade distributed systems. If
disks and machines are expected to fail, it is far better (and cheaper) to replace
them with inexpensive commodity machines, allowing for horizontal scaling:
the more machines, the more capacity. There is no need for RAID or other
expensive hardware because the filesystem itself replicates data. And, crucially,
the data lives on the same nodes where the computation will run, so moving
data from disk to the computation incurs no network overhead.
It turns out that modern data applications, especially those working with
terabytes of data, benefit from this distributed storage model, particularly when
a distributed programming framework is designed around that storage model.
Combined with another paper, MapReduce: Simplified Data Processing on
Large Clusters, GFS became the foundation for Big Data as we know it, and
this design was eventually implemented as HDFS in Hadoop. Chunks in
GFS are ideal inputs to map processes, since a mapper can run on
every node in the cluster. A functional mapper takes a list of inputs as an
argument and applies a function to every input value. In append-optimized
systems, chunks of data are therefore natural lists of inputs.
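The chunk-as-list-of-inputs idea can be sketched in a few lines. This is a minimal, hypothetical illustration (not the actual MapReduce API): it assumes a chunk holds newline-delimited text records and uses the word-count mapper from the MapReduce paper as the function applied to each record.

```python
# Minimal sketch of a functional mapper over one GFS-style chunk.
# Assumption: the chunk contains newline-delimited text records.
# The mapper applies a function to every record and emits (key, value)
# pairs, as in the classic word-count example.

def map_chunk(chunk: str):
    """Apply the map function to each record in a chunk."""
    pairs = []
    for record in chunk.splitlines():    # each line is one input value
        for word in record.split():      # map step: emit (word, 1)
            pairs.append((word, 1))
    return pairs

# In a real cluster, this mapper would run on the node that already
# stores the chunk, so no chunk data crosses the network before the
# shuffle phase.
pairs = map_chunk("the quick fox\nthe lazy dog")
```

Because every chunk is self-contained and append-only, the framework can schedule one such mapper per chunk replica with no coordination beyond picking which replica to use.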
To the critique: there is a clear bottleneck in GFS, the master server. This
server not only has to be smart about chunk allocation; it also handles
heartbeats, read and write requests, and stores all metadata in memory. Although it
checkpoints itself to disk and has read-only shadow replicas, it is a
single point of failure for the cluster. Worse, the master cannot handle many
small files: it is optimized for large files divided into 64 MB chunks. Storing the
metadata for billions of small files would exhaust the memory of the server, which
can only scale vertically. Additionally, because of the 64 MB chunks, there
will be some storage loss for files that do not completely fill their last chunk.
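The memory pressure from small files is easy to estimate. As a back-of-envelope sketch, assume on the order of 64 bytes of chunk metadata per chunk plus roughly 64 bytes of namespace metadata per file (the GFS paper reports figures in this range; the exact constants here are assumptions):

```python
# Back-of-envelope estimate of master memory consumed by metadata.
# Assumed constants (order-of-magnitude, not exact figures from the paper):
BYTES_PER_CHUNK_META = 64   # per-chunk record held in master RAM
BYTES_PER_FILE_META = 64    # per-file namespace entry held in master RAM

def master_memory_bytes(num_files: int, chunks_per_file: int = 1) -> int:
    """Metadata footprint on the master for a given file population."""
    return num_files * (BYTES_PER_FILE_META +
                        chunks_per_file * BYTES_PER_CHUNK_META)

# One billion small files, each under 64 MB and so occupying one chunk:
gib = master_memory_bytes(1_000_000_000) / 2**30
# On these assumptions, roughly 119 GiB of master RAM for metadata
# alone -- far beyond what a single vertically scaled server of the
# era could hold.
```

The same billion files stored as a handful of large, multi-chunk files would need only a tiny fraction of that, which is exactly why GFS optimizes for large files.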