A Practical Scalable Distributed B-Tree
Marcos K. Aguilera
Microsoft Research Silicon Valley, Mountain View, CA, USA
University of Toronto, Toronto, ON, Canada
Mehul A. Shah
HP Laboratories, Palo Alto, CA, USA
Internet applications increasingly rely on scalable data structures that must support high throughput and store huge amounts of data. These data structures can be hard to implement efficiently. Recent proposals have overcome this problem by giving up on generality and implementing specialized interfaces and functionality (e.g., Dynamo). We present the design of a more general and flexible solution: a fault-tolerant and scalable distributed B-tree. In addition to the usual B-tree operations, our B-tree provides some important practical features: transactions for atomically executing several operations in one or more B-trees, online migration of B-tree nodes between servers for load-balancing, and dynamic addition and removal of servers for supporting incremental growth of the system.
Our design is conceptually simple. Rather than using complex concurrency and locking protocols, we use distributed transactions to make changes to B-tree nodes. We show how to extend the B-tree and keep additional information so that these transactions execute quickly and efficiently. Our design relies on an underlying distributed data sharing service, Sinfonia, which provides fault tolerance and a light-weight distributed atomic primitive. We use this primitive to commit our transactions. We implemented our B-tree and show that it performs comparably to an existing open-source B-tree and that it scales to hundreds of machines. We believe that our approach is general and can be used to implement other distributed data structures easily.
Work developed while author was at HP Laboratories.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
August 24-30, 2008, Auckland, New Zealand. Copyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.
1 Introduction
Internet applications increasingly rely on scalable data structures that must support high throughput and store huge amounts of data. Examples of such data structures include Amazon’s Dynamo and Google’s BigTable. They support applications that manage customer shopping carts, analyze website traffic patterns, personalize search results, and serve photos and videos. They span a large number of machines (hundreds or thousands), and store an unprecedented amount of data (tens of petabytes) for a huge customer base (hundreds of millions of users) that can generate high retrieve and update rates.
Besides massive scalability, three other requirements drive the design of these data structures: low cost, fault tolerance, and manageability. Low cost calls for cheap commodity hardware and precludes the use of expensive business software. Fault tolerance is important for continuous business operation. And manageability is necessary because human time is precious and unmanageable systems can lead to catastrophic human errors. These requirements are not easy to meet, and as a result recent proposals compromise on generality and opt for approaches tailored for a given use. In particular, most deployed solutions are limited to simple, hash-table-like, lookup and update interfaces with specialized semantics.
In this paper, we present a more general and flexible data structure, a distributed B-tree that is highly scalable, low cost, fault-tolerant, and manageable. We focus on a B-tree whose nodes are spread over multiple servers in a local-area network. Our B-tree is a B+tree, where leaf nodes hold key-value pairs and inner nodes only hold key-pointer pairs. A B-tree supports the usual dictionary operations (insert, lookup, update, delete), as well as ordered traversal (get next, get previous). In addition, our distributed B-tree provides some practical features that are absent in previous designs (such as [9, 14]):
Transactional access. An application can execute several operations on one or more B-trees, and do so atomically. Transactional access greatly simplifies the development of higher-level applications.
Online migration of tree nodes. We can move tree nodes transparently from one server to another existing or newly-added server, while the B-tree continues to service requests. This feature helps in performing online management activities necessary for continuous business operation. It is useful for replacing, adding, or maintaining servers, and thus it enables smooth incremental growth and periodic technology refresh. It also allows for load balancing among servers to accommodate changing workloads or imbalances that arise as servers come and go.
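The B+tree layout described above, with leaf nodes holding key-value pairs and inner nodes holding only key-pointer pairs, can be illustrated with a minimal single-machine sketch in Python. It is not the paper's implementation: the names Leaf, Inner, find_leaf, lookup, and get_next are hypothetical, the convention that a separator key equals the smallest key of the subtree to its right is an assumption, and distribution across servers, transactions, node splitting, and migration are all omitted.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple, Union


@dataclass
class Leaf:
    """Leaf node: holds key-value pairs, sorted by key."""
    keys: List[Any] = field(default_factory=list)
    values: List[Any] = field(default_factory=list)


@dataclass
class Inner:
    """Inner node: holds only separator keys and child pointers.

    Assumed convention: children[i] covers keys below keys[i], and
    children[i + 1] covers keys greater than or equal to keys[i].
    """
    keys: List[Any] = field(default_factory=list)
    children: List[Union["Inner", Leaf]] = field(default_factory=list)


def find_leaf(root: Union[Inner, Leaf], key: Any) -> Leaf:
    """Follow key-pointer pairs from the root down to the responsible leaf."""
    node = root
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, key)]
    return node


def lookup(root: Union[Inner, Leaf], key: Any) -> Optional[Any]:
    """Dictionary lookup: return the value stored under key, or None."""
    leaf = find_leaf(root, key)
    i = bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.values[i]
    return None


def get_next(node: Union[Inner, Leaf], key: Any) -> Optional[Tuple[Any, Any]]:
    """Ordered traversal: return the (key, value) pair with the smallest
    key strictly greater than the given key, or None if there is none."""
    if isinstance(node, Leaf):
        i = bisect_right(node.keys, key)
        if i < len(node.keys):
            return node.keys[i], node.values[i]
        return None
    # Start at the child that could contain the key, then move right
    # if that subtree has no larger key.
    for child in node.children[bisect_right(node.keys, key):]:
        hit = get_next(child, key)
        if hit is not None:
            return hit
    return None


# A small hand-built tree: one inner node over three leaves.
root = Inner(
    keys=[5, 9],
    children=[
        Leaf([1, 3], ["a", "c"]),
        Leaf([5, 7], ["e", "g"]),
        Leaf([9], ["i"]),
    ],
)
print(lookup(root, 7))    # value stored under key 7
print(get_next(root, 3))  # first pair after key 3, found in the next leaf
```

Because only leaves carry values, a lookup always descends to a leaf, and an ordered scan may cross from one leaf to the next. In a distributed setting each of those leaves could live on a different server.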
1.1 Motivating use cases
Here are some concrete use cases and examples of how applications might benefit from our B-tree:
Throughout the paper, the term node refers to nodes of the B-tree, unless explicitly qualified, as in “memory nodes.”