Megastore: Providing Scalable, Highly Available Storage For Interactive Services
Google, Inc.
{jasonbaker,chrisbond,jcorbett,jfurman,akhorlin,jimlarson,jm,yaweili,alloyd,vadimy}@google.com
ABSTRACT
Megastore is a storage system developed to meet the requirements of today's interactive online services. Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability. We provide fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters. This paper describes Megastore's semantics and replication algorithm. It also describes our experience supporting a wide range of Google production services built with Megastore.
General Terms
Algorithms, Design, Performance, Reliability
Keywords
Large databases, Distributed transactions, Bigtable, Paxos
1. INTRODUCTION
Interactive online services are forcing the storage community to meet new demands as desktop applications migrate to the cloud. Services like email, collaborative documents, and social networking have been growing exponentially and are testing the limits of existing infrastructure. Meeting these services' storage demands is challenging due to a number of conflicting requirements.
First, the Internet brings a huge audience of potential users, so the applications must be highly scalable.
2. TOWARD AVAILABILITY AND SCALE
2.1 Replication
Replicating data across hosts within a single datacenter
improves availability by overcoming host-specific failures,
but with diminishing returns. We still must confront the
networks that connect them to the outside world and the
infrastructure that powers, cools, and houses them. Economically constructed sites risk some level of facility-wide
outages [25] and are vulnerable to regional disasters. For
cloud storage to meet availability demands, service providers
must replicate data over a wide geographic area.
2.1.1 Strategies
We evaluated common strategies for wide-area replication:
Asynchronous Master/Slave A master node replicates write-ahead log entries to at least one slave. Log appends are acknowledged at the master in parallel with transmission to slaves.
2.1.2 Enter Paxos
2.2 Partitioning and Locality
2.2.1 Entity Groups
Nearly all applications built on Megastore have found natural ways to draw entity group boundaries.
2.2.3 Physical Layout
3. A TOUR OF MEGASTORE
Megastore maps this architecture onto a feature set carefully chosen to encourage rapid development of scalable applications. This section motivates the tradeoffs and describes the developer-facing features that result.
3.1 API Design Philosophy
3.2.2 Indexes
Sample data layout in Bigtable:

Row key  | User.name | Photo.time | Photo.tag     | Photo.url
101      | John      |            |               |
101,500  |           | 12:30:01   | Dinner, Paris | ...
101,502  |           | 12:15:22   | Betty, Paris  | ...
102      | Mary      |            |               |
3.3.1 Queues
Queues provide transactional messaging between entity groups. They can be used for cross-group operations, to batch multiple updates into a single transaction, or to defer work. A transaction on an entity group can atomically send or receive multiple messages in addition to updating its entities. Each message has a single sending and receiving entity group; if they differ, delivery is asynchronous. (See Figure 2.)
Queues offer a way to perform operations that affect many
entity groups. For example, consider a calendar application
in which each calendar has a distinct entity group, and we
want to send an invitation to a group of calendars. A single transaction can atomically send invitation queue messages to many distinct calendars. Each calendar receiving the message processes the invitation in its own transaction, which updates the invitee's state and deletes the message.
There is a long history of message queues in full-featured
RDBMSs. Our support is notable for its scale: declaring a
queue automatically creates an inbox on each entity group,
giving us millions of endpoints.
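To make these semantics concrete, the sketch below is a minimal, self-contained in-memory toy (all names such as Store, transact, and drain_inbox are invented for illustration; this is not Megastore's client API). It mirrors the two properties described above: a send is atomic with the sender's own updates, and cross-group delivery is asynchronous, consumed later in the receiver's own transaction.

```python
# In-memory toy of transactional queue messaging between entity groups.
# All names are invented for illustration; this is not Megastore's API.
from collections import defaultdict

class Store:
    def __init__(self):
        self.groups = defaultdict(dict)   # entity group -> entities
        self.inboxes = defaultdict(list)  # entity group -> queued messages

    def transact(self, group, puts=(), sends=()):
        # One atomic step on `group`: update its entities and enqueue
        # messages. Messages whose receiving group differs from the
        # sending group are delivered asynchronously (queued here).
        self.groups[group].update(puts)
        for dest, payload in sends:
            self.inboxes[dest].append(payload)

    def drain_inbox(self, group):
        # The receiver processes each message in its own transaction:
        # apply the payload and delete the message atomically.
        while self.inboxes[group]:
            key, value = self.inboxes[group].pop(0)
            self.groups[group][key] = value

# Calendar example: one transaction on the organizer's group sends
# invitation messages to many distinct calendars.
store = Store()
invitees = ["calendar-2", "calendar-3"]
store.transact(
    "calendar-1",
    puts={"event-42": "team lunch"},
    sends=[(c, ("invite-42", "team lunch")) for c in invitees],
)
for cal in invitees:
    store.drain_inbox(cal)  # each invitee updates its own state
assert store.groups["calendar-2"]["invite-42"] == "team lunch"
```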
4. REPLICATION
This section details the heart of our synchronous replication scheme: a low-latency implementation of Paxos. We
discuss operational details and present some measurements
of our production service.
4.1 Overview
Megastore's replication system provides a single, consistent view of the data stored in its underlying replicas. Reads and writes can be initiated from any replica, and ACID semantics are preserved regardless of which replica a client starts from. Replication is done per entity group by synchronously replicating the group's transaction log to a quorum of replicas. Writes typically require one round of inter-datacenter communication, and healthy-case reads run locally. Current reads have the following guarantees:
- A read always observes the last-acknowledged write.
- After a write has been observed, all future reads observe that write. (A write might be observed before it is acknowledged.)
4.2 Brief Summary of Paxos
4.3 Master-Based Approaches
To minimize latency, many systems use a dedicated master to which all reads and writes are directed. The master
participates in all writes, so its state is always up-to-date.
It can serve reads of the current consensus state without
any network communication. Writes are reduced to a single
round of communication by piggybacking a prepare for the
next write on each accept [14]. The master can batch writes
together to improve throughput.
Reliance on a master limits flexibility for reading and writing. Transaction processing must be done near the master
replica to avoid accumulating latency from sequential reads.
Any potential master replica must have adequate resources for the system's full workload; slave replicas waste resources until the moment they become master.
until the moment they become master. Master failover can
require a complicated state machine, and a series of timers
must elapse before service is restored. It is difficult to avoid
user-visible outages.
4.4 Megastore's Approach
In this section we discuss the optimizations and innovations that make Paxos practical for our system.
4.4.1 Fast Reads
4.5 Architecture
Figure 5 shows the key components of Megastore for an
instance with two full replicas and one witness replica.
Megastore is deployed through a client library and auxiliary servers. Applications link to the client library, which
implements Paxos and other algorithms: selecting a replica
for read, catching up a lagging replica, and so on.
Each application server has a designated local replica.
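To illustrate the deployment shape described above, here is a hypothetical instance description in Python for two full replicas plus one witness. The type names and fields are invented for illustration; in Megastore, witness replicas vote in Paxos and store the write-ahead log but do not apply the data or serve reads, while all voting replicas count toward a quorum.

```python
# Hypothetical description of the Figure 5 deployment: two full
# replicas and one witness. Names and fields are invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class ReplicaConfig:
    name: str
    datacenter: str
    kind: str  # "full" (log + data, serves reads) or "witness" (log only)

INSTANCE = [
    ReplicaConfig("A", "dc-east", "full"),
    ReplicaConfig("B", "dc-west", "full"),
    ReplicaConfig("C", "dc-central", "witness"),
]

def quorum_size(replicas):
    # Full and witness replicas alike participate in Paxos voting.
    return len(replicas) // 2 + 1

assert quorum_size(INSTANCE) == 2
```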
4.6 Data Structures and Algorithms
This section details data structures and algorithms required to make the leap from consensus on a single value to
a functioning replicated log.
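As a minimal sketch of that leap (using invented names and the standard Paxos formulation rather than Megastore's exact structures), a replicated log can be modeled as one independent consensus instance per log position, with gaps for positions whose outcome is not yet known locally:

```python
# Toy model: a replicated log as an array of independent Paxos
# instances, one per log position. Names are invented for illustration.

class LogCell:
    """Consensus state for a single log position."""
    def __init__(self):
        self.promised = -1     # highest proposal number promised
        self.accepted = None   # (proposal, value) once a value is accepted

class ReplicatedLog:
    def __init__(self):
        self.cells = {}        # position -> LogCell; missing keys are holes

    def cell(self, pos):
        return self.cells.setdefault(pos, LogCell())

    def next_unused(self):
        # The position a new write would propose into.
        return max(self.cells, default=-1) + 1
```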
4.6.1 Replicated Logs
4.6.2 Reads
[Figure: timeline for a current read at local replica A, involving the Client, Coordinator A, and Replicas A, B, and C: Check Coordinator, Find Pos, Catchup (Get Logs, Apply Logs), Validate.]
A current read proceeds as follows:
1. Query Local: Query the local replica's coordinator to determine whether the entity group is up-to-date locally.
2. Find Position: Determine the highest possibly-committed log position, and select a replica that has applied through that position.
(a) (Local read) If step 1 indicates that the local replica is up-to-date, read the highest accepted log position and timestamp from the local replica.
(b) (Majority read) If the local replica is not up-to-date (or if step 1 or step 2a times out), read from a majority of replicas to find the maximum log position that any replica has seen, and pick a replica to read from. We select the most responsive or up-to-date replica, not always the local replica.
3. Catchup: As soon as a replica is selected, catch it up
to the maximum known log position as follows:
(a) For each log position in which the selected replica does not know the consensus value, read the
value from another replica. For any log positions
without a known-committed value available, invoke Paxos to propose a no-op write. Paxos will
drive a majority of replicas to converge on a single value: either the no-op or a previously proposed write.
(b) Sequentially apply the consensus value of all unapplied log positions to advance the replica's state to the distributed consensus state.
In the event of failure, retry on another replica.
4. Validate: If the local replica was selected and was not
previously up-to-date, send the coordinator a validate
message asserting that the (entity group, replica) pair
reflects all committed writes. Do not wait for a reply; if the request fails, the next read will retry.
5. Query Data: Read the selected replica using the
timestamp of the selected log position. If the selected
replica becomes unavailable, pick an alternate replica,
perform catchup, and read from it instead. The results
of a single large query may be assembled transparently
from multiple replicas.
Note that in practice steps 1 and 2a are executed in parallel.
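The toy below restates these five steps in Python against a deliberately simplified model: each replica holds an ordered list of (value, timestamp) log entries, and a coordinator is just the set of entity groups it believes are up-to-date locally. All names are invented, and catchup simply copies known values rather than running the Paxos no-op path.

```python
# Schematic toy of the current-read algorithm above. Not Megastore's
# code; invented names throughout.

class Replica:
    def __init__(self, log=()):
        self.log = list(log)   # ordered (value, timestamp) entries

def current_read(group, local, replicas, coordinator):
    # Steps 1 + 2a: if the coordinator says the group is up-to-date,
    # read the highest applied position from the local replica.
    if group in coordinator:
        replica = local
    else:
        # Step 2b: majority read for the maximum log position seen,
        # then pick a replica (here, simply the most up-to-date one).
        target = max(len(r.log) for r in replicas)
        source = max(replicas, key=lambda r: len(r.log))
        replica = source
        # Step 3: catch up any missing positions. (A real system would
        # propose Paxos no-ops for positions with no known value.)
        replica.log.extend(source.log[len(replica.log):target])
        # Step 4: if the local replica was caught up, revalidate it;
        # failures here are ignored and the next read retries.
        if replica is local:
            coordinator.add(group)
    # Step 5: read at the timestamp of the selected log position.
    return replica.log[-1] if replica.log else None

a = Replica([("w1", 1)])
b = Replica([("w1", 1), ("w2", 2)])
c = Replica([("w1", 1), ("w2", 2)])
print(current_read("g", a, [a, b, c], coordinator=set()))  # ('w2', 2)
```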
4.6.3 Writes
Having completed the read algorithm, Megastore observes
the next unused log position, the timestamp of the last write,
and the next leader replica. At commit time all pending
changes to the state are packaged and proposed, with a
timestamp and next leader nominee, as the consensus value
for the next log position. If this value wins the distributed
consensus, it is applied to the state at all full replicas; otherwise the entire transaction is aborted and must be retried
from the beginning of the read phase.
As described above, coordinators keep track of the entity groups that are up-to-date in their replica. If a write is not accepted on a replica, we must remove the entity group's key from that replica's coordinator. This process is called invalidation. Before a write is considered committed and ready to apply, all full replicas must have accepted or had their coordinator invalidated for that entity group.
The write algorithm (shown in Figure 8) is as follows:
1. Accept Leader: Ask the leader to accept the value
as proposal number zero. If successful, skip to step 3.
2. Prepare: Run the Paxos Prepare phase at all replicas
with a higher proposal number than any seen so far at
this log position. Replace the value being written with
the highest-numbered proposal discovered, if any.
3. Accept: Ask remaining replicas to accept the value.
If this fails on a majority of replicas, return to step 2
after a randomized backoff.
4. Invalidate: Invalidate the coordinator at all full replicas that did not accept the value. Fault handling at
this step is described in Section 4.7 below.
5. Apply: Apply the value's mutations at as many replicas as possible. If the chosen value differs from that originally proposed, return a conflict error.
Step 1 implements the fast writes of Section 4.4.2. Writers using single-phase Paxos skip Prepare messages by sending an Accept command at proposal number zero. The next
leader replica selected at log position n arbitrates the value
used for proposal zero at n + 1. Since multiple proposers
may submit values with proposal number zero, serializing
at this replica ensures only one value corresponds with that
proposal number for a particular log position.
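The following self-contained toy (invented names; a single log position; synchronous calls standing in for RPCs) sketches the five steps, including the proposal-number-zero fast path and the fallback to a full Prepare round after a randomized backoff:

```python
# Toy of the write algorithm for one log position. Invented names; real
# Megastore runs this over RPCs to remote replicas.
import random
import time

class Replica:
    """Toy Paxos acceptor for a single log position."""
    def __init__(self):
        self.promised = -1       # highest proposal number promised
        self.accepted = None     # (proposal, value) last accepted
        self.up_to_date = True   # stands in for the coordinator state

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def write(value, replicas, leader):
    n = 0
    ok = leader.accept(n, value)                  # Step 1: Accept Leader
    while True:
        if not ok:
            # Step 2: Prepare with a higher proposal number than any
            # seen; adopt the highest-numbered accepted value, if any.
            n = max(r.promised for r in replicas) + 1
            found = [acc for done, acc in (r.prepare(n) for r in replicas)
                     if done and acc is not None]
            if found:
                value = max(found)[1]
        # Step 3: ask the remaining replicas to accept the value.
        if sum(r.accept(n, value) for r in replicas) > len(replicas) // 2:
            break
        ok = False
        time.sleep(random.uniform(0.0, 0.05))     # randomized backoff
    # Step 4: invalidate the coordinator at full replicas that did not
    # accept, so they cannot serve stale local reads.
    for r in replicas:
        if r.accepted != (n, value):
            r.up_to_date = False
    # Step 5: apply the winning value's mutations at all full replicas.
    return value

rs = [Replica(), Replica(), Replica()]
print(write("w1", rs, leader=rs[0]))  # 'w1'
```

In this sketch the majority accept in step 3 is the commit point, but acknowledgment must wait for the invalidations of step 4, matching the commit/visibility distinction discussed next.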
In a traditional database system, the commit point (when
the change is durable) is the same as the visibility point
(when reads can see a change and when a writer can be notified of success). In our write algorithm, the commit point is
after step 3 when the write has won the Paxos round, but the
visibility point is after step 4. Only after all full replicas have
accepted or had their coordinators invalidated can the write
be acknowledged and the changes applied. Acknowledging
before step 4 could violate our consistency guarantees: a current read at a replica whose invalidation was skipped might fail to observe the acknowledged write.
[Figure 8: timeline for the write algorithm, showing the Client, Replicas A, B, and C, and Coordinator C: Accept Leader, optional Prepare messages, Accept, an optional Invalidate message, and Apply.]
4.7.2 Validation Races
4.8 Write Throughput
4.9 Operational Issues
[Figure 9: distribution of operation success rate (>99.999% down to <99%) for read/query and write operations, as a percentage of production applications. Figure 10: distribution of average latency (10 to 1000 ms, log scale) for read and write operations, as a percentage of production applications.]
Most users see average write latencies of 100–400 milliseconds, depending on the distance between datacenters, the size of the data being written, and the number of full replicas. Figure 10 shows the distribution of average latency for read and commit operations.
5. EXPERIENCE
Development of the system was aided by a strong emphasis on testability. The code is instrumented with numerous
(but cheap) assertions and logging, and has thorough unit
test coverage. But the most effective bug-finding tool was
our network simulator: the pseudo-random test framework.
It is capable of exploring the space of all possible orderings
and delays of communications between simulated nodes or
threads, and deterministically reproducing the same behavior given the same seed. Bugs were exposed by finding a
problematic sequence of events triggering an assertion failure (or incorrect result), often with enough log and trace
information to diagnose the problem, which was then added
to the suite of unit tests. While an exhaustive search of
the scheduling state space is impossible, the pseudo-random
simulation explores more than is practical by other means.
Through running thousands of simulated hours of operation
each night, the tests have found many surprising problems.
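A generic sketch of that idea follows (not Google's simulator; all names are invented): a scheduler delivers pending events in an order drawn from a seeded random number generator and checks invariants after each step, so any failing schedule replays exactly from its seed.

```python
# Generic seeded pseudo-random scheduler sketch; invented names, not
# Google's network simulator.
import random
from dataclasses import dataclass

@dataclass
class Event:
    target: str
    payload: int

class Counter:
    """Toy node: accumulates payloads; its invariant is non-negativity."""
    def __init__(self):
        self.total = 0

    def handle(self, event):
        self.total += event.payload
        return []                         # follow-up events, if any

    def check_invariants(self):
        assert self.total >= 0

def run_simulation(seed, nodes, events):
    rng = random.Random(seed)             # same seed => same schedule
    pending = list(events)
    delivered = []
    while pending:
        # Deliver a pseudo-randomly chosen pending event, exploring
        # orderings and delays a fixed schedule would never exercise.
        event = pending.pop(rng.randrange(len(pending)))
        delivered.append(event.payload)
        pending.extend(nodes[event.target].handle(event))
        for node in nodes.values():
            node.check_invariants()       # cheap assertions expose bugs
    return delivered

events = [Event("n", i) for i in range(5)]
# Deterministic replay: the same seed reproduces the same schedule,
# which is what makes a diagnosed failure repeatable.
assert (run_simulation(42, {"n": Counter()}, events)
        == run_simulation(42, {"n": Counter()}, events))
```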
In real-world deployments we have observed the expected
performance: our replication protocol optimizations indeed
provide local reads most of the time, and writes with about
the overhead of a single WAN roundtrip. Most applications have found the latency tolerable. Some applications
are designed to hide write latency from users, and a few
must choose entity group boundaries carefully to maximize
their write throughput. This effort yields major operational
advantages: Megastore's latency tail is significantly shorter
than that of the underlying layers, and most applications
can withstand planned and unplanned outages with little or
no manual intervention.
Most applications use the Megastore schema language to model their data. Some have implemented their own entity-attribute-value model within the Megastore schema language, then used their own application logic to model their data (most notably, Google App Engine [8]). Some use a hybrid of the two approaches. Having the dynamic schema built on top of the static schema, rather than the other way around, allows most applications to enjoy the performance, usability, and integrity benefits of the static schema, while still giving the option of a dynamic schema to those who need it.
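As an illustration of this layering (a guess at the general pattern, not App Engine's actual implementation), a dynamic entity-attribute-value model can sit on a static three-column schema, with application logic interpreting the triples:

```python
# Sketch of an entity-attribute-value model layered on a static schema.
# The static table has three fixed columns; the dynamic model lives in
# application logic. Names are invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class EavRow:            # the static schema: three typed columns
    entity: str
    attr: str
    value: str

def load_entity(entity_id, rows):
    """Application-level dynamic view assembled from static EAV rows."""
    return {r.attr: r.value for r in rows if r.entity == entity_id}

rows = [EavRow("user:1", "name", "Mary"), EavRow("user:1", "plan", "pro")]
assert load_entity("user:1", rows) == {"name": "Mary", "plan": "pro"}
```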
6. RELATED WORK
Recently, there has been increasing interest in NoSQL data storage systems to meet the demands of large web applications. Representative work includes Bigtable [15], Cassandra [6], and Yahoo PNUTS [16]. In these systems, scalability is achieved by sacrificing one or more properties of traditional RDBMS systems, e.g., transactions, schema support, or query capability [12, 33]. These systems often reduce the scope of transactions to the granularity of single-key access and thus place a significant hurdle in the way of building applications [18, 32]. Some systems extend the scope of transactions to multiple rows within a single table; for example, Amazon SimpleDB [5] uses the concept of a domain as the transactional unit. Yet such efforts are still limited because transactions cannot cross tables or scale arbitrarily. Moreover, most current scalable data storage systems lack the rich data model of an RDBMS, which increases the burden on developers. Combining the merits of both databases and scalable data stores, Megastore provides transactional ACID guarantees within an entity group and provides a flexible data model with user-defined schema, database-style and full-text indexes, and queues.
Data replication across geographically distributed datacenters is an indispensable means of improving availability in state-of-the-art storage systems. Most prevailing data storage systems use asynchronous replication schemes with a weaker consistency model. For example, Cassandra [6], HBase [1], CouchDB [7], and Dynamo [19] use an eventual consistency model.
7. CONCLUSION
8. ACKNOWLEDGMENTS
Steve Newman, Jonas Karlsson, Philip Zeyliger, Alex Dingle, and Peter Stout all made substantial contributions to
Megastore. We also thank Tushar Chandra, Mike Burrows,
and the Bigtable team for technical advice, and Hector Gonzales, Jayant Madhavan, Ruth Wang, and Kavita Guliani for
assistance with the paper. Special thanks to Adi Ofer for
providing the spark to make this paper happen.
9. REFERENCES
[1] Apache HBase. http://hbase.apache.org/, 2008.
[2] Keyspace: A consistently replicated, highly-available
key-value store. http://scalien.com/whitepapers/.
[3] MySQL Cluster. http://dev.mysql.com/tech-resources/articles/mysql_clustering_ch5.html, 2010.
[4] Oracle Database. http://www.oracle.com/us/products/database/index.html, 2007.
[5] Amazon SimpleDB.
http://aws.amazon.com/simpledb/, 2007.
[6] Apache Cassandra.
http://incubator.apache.org/cassandra/, 2008.
[7] Apache CouchDB. http://couchdb.apache.org/, 2008.
[8] Google App Engine.
http://code.google.com/appengine/, 2008.
[9] Google Protocol Buffers: Google's data interchange format. http://code.google.com/p/protobuf/, 2008.
[10] MySQL. http://www.mysql.com, 2009.
[11] Y. Amir, C. Danilov, M. Miskin-Amir, J. Stanton, and
C. Tutu. On the performance of wide-area
synchronous database replication. Technical Report
CNDS-2002-4, Johns Hopkins University, 2002.
[12] M. Armbrust, A. Fox, D. A. Patterson, N. Lanham, B. Trushkowsky, J. Trutna, and H. Oh. SCADS: scale-independent storage for social computing applications. In CIDR, 2009.
[13] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI '06: Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pages 335–350, Berkeley, CA, USA, 2006. USENIX Association.
[14] T. D. Chandra, R. Griesemer, and J. Redstone. Paxos made live: an engineering perspective. In PODC '07: Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, pages 398–407, New York, NY, USA, 2007. ACM.
[15] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):1–26, 2008.
[16] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. Proc. VLDB Endow., 1(2):1277–1288, 2008.
[17] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In SoCC '10: Proceedings of the 1st ACM Symposium on Cloud Computing, pages 143–154, New York, NY, USA, 2010. ACM.
[18] S. Das, D. Agrawal, and A. El Abbadi. G-Store: a scalable data store for transactional multi key access in the cloud. In SoCC '10: Proceedings of the 1st ACM Symposium on Cloud Computing, New York, NY, USA, 2010. ACM.
[19] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In SOSP '07: Proceedings of the 21st ACM Symposium on Operating Systems Principles, pages 205–220, New York, NY, USA, 2007. ACM.
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
[33]