A Practical Scalable Distributed B-Tree
Marcos K. Aguilera
Microsoft Research Silicon Valley, Mountain View, CA, USA

Wojciech Golab
University of Toronto, Toronto, ON, Canada

Mehul A. Shah
HP Laboratories, Palo Alto, CA, USA
Abstract

Internet applications increasingly rely on scalable data structures that must support high throughput and store huge amounts of data. These data structures can be hard to implement efficiently. Recent proposals have overcome this problem by giving up on generality and implementing specialized interfaces and functionality (e.g., Dynamo [4]). We present the design of a more general and flexible solution: a fault-tolerant and scalable distributed B-tree. In addition to the usual B-tree operations, our B-tree provides some important practical features: transactions for atomically executing several operations in one or more B-trees, online migration of B-tree nodes between servers for load-balancing, and dynamic addition and removal of servers for supporting incremental growth of the system.

Our design is conceptually simple. Rather than using complex concurrency and locking protocols, we use distributed transactions to make changes to B-tree nodes. We show how to extend the B-tree and keep additional information so that these transactions execute quickly and efficiently. Our design relies on an underlying distributed data sharing service, Sinfonia [1], which provides fault tolerance and a light-weight distributed atomic primitive. We use this primitive to commit our transactions. We implemented our B-tree and show that it performs comparably to an existing open-source B-tree and that it scales to hundreds of machines. We believe that our approach is general and can be used to implement other distributed data structures easily.
1. Introduction

Internet applications increasingly rely on scalable data structures that must support high throughput and store huge amounts of data. Examples of such data structures include Amazon's Dynamo [4] and Google's BigTable [3]. They support applications that manage customer shopping carts, analyze website traffic patterns, personalize search results, and serve photos and videos. They span a large number of machines (hundreds or thousands), and store an unprecedented amount of data (tens of Petabytes) for a huge customer base (hundreds of millions of users) that can generate high retrieve and update rates.

Work developed while author was at HP Laboratories.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.

VLDB '08, August 24-30, 2008, Auckland, New Zealand
Copyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

Besides massive scalability, three other requirements drive the design of these data structures: low cost, fault tolerance, and manageability. Low cost calls for cheap commodity hardware and precludes the use of expensive business software. Fault tolerance is important for continuous business operation. And manageability is necessary because human time is precious and unmanageable systems can lead to catastrophic human errors. These requirements are not easy to meet, and as a result recent proposals compromise on generality and opt for approaches tailored for a given use. In particular, most deployed solutions are limited to simple, hash-table-like, lookup and update interfaces with specialized semantics [4].

In this paper, we present a more general and flexible data structure, a
distributed B-tree
that is highly scalable, low cost, fault-tolerant, and manageable. We focus on a B-tree whose nodes are spread over multiple servers in a local-area network. Our B-tree is a B+tree, where leaf nodes hold key-value pairs and inner nodes only hold key-pointer pairs. A B-tree supports the usual dictionary operations (lookup, insert, update, delete), as well as ordered traversal (moving to the next or previous key). In addition, our distributed B-tree provides some practical features that are absent in previous designs (such as [9, 14]):
Transactional access. An application can execute several operations on one or more B-trees, and do so atomically. Transactional access greatly simplifies the development of higher-level applications.

Online migration of tree nodes. We can move tree nodes transparently from one server to another existing or newly-added server, while the B-tree continues to service requests. This feature helps in performing online management activities necessary for continuous business operation. It is useful for replacing, adding, or maintaining servers, and thus it enables smooth incremental growth and periodic technology refresh. It also allows for load balancing among servers to accommodate changing workloads or imbalances that arise as servers come and go.
1.1 Motivating use cases
Here are some concrete use cases and examples of how applications might benefit from our B-tree:
Throughout the paper, the term node refers to nodes of the B-tree, unless explicitly qualified, as in "memory nodes."
Figure 1: Our distributed B-tree. Nodes are divided among servers (grey indicates absence of node). A version table stores version numbers for inner nodes. Leaf nodes have versions, but these are not stored in the version table. Two types of replication are done for performance: (a) lazy replication of inner nodes at clients, and (b) eager replication of the version table at servers. Note that a realistic B-tree will have a much greater fan-out than shown. With a fan-out of 200, inner nodes represent about 0.5% of all nodes.
The back-end of a multi-player game. Multi-player games have thousands of players who generate high aggregate request rates, and latency can be critical. These systems keep persistent state for the players, such as their inventory and statistics. Our B-tree could keep this persistent state: transactional access can implement atomic multi-object updates that ensure state consistency, while range queries are useful for searching. For instance, transactional access ensures that a player's item does not appear in two places simultaneously, and range queries can be used to search for items in a player's inventory.
Keeping metadata in a cluster file system. In a file system, metadata refers to the attributes of files, the list of free blocks, and the contents of directories. Metadata access is often a bottleneck in cluster file systems such as Lustre or Hadoop's HDFS [21, 22]. Our B-tree could hold the metadata and alleviate this bottleneck. Transactional access is useful for implementing atomic operations, like rename, which involves atomically inserting and deleting a key-value pair. Ordered traversal is useful for enumerating files in a directory. And distribution provides scalable performance.
Secondary indices. Many applications need to keep more than one index on data. For example, an e-auction site may support search of auction data by time, bid price, and item name, where each of these attributes has an index. We can keep each index in a separate B-tree, and use transactional access to keep indices mutually consistent.

One might be able to use database systems instead of distributed B-trees for these applications, but distributed B-trees are easier to scale, are more streamlined, have a smaller footprint, and are easier to integrate inside an application. In general, a distributed B-tree is a more basic building block than a database system. In fact, one could imagine using the former to build the latter.
1.2 Challenges and contribution
Our B-tree is implemented on top of Sinfonia, a low-level fault-tolerant distributed data sharing service. As shown in Figure 1, our B-tree comprises two main components: a B-tree client library that is linked in with the application, and a set of servers that store the B-tree nodes. Sinfonia transparently makes servers fault-tolerant.

The main difficulty in building a scalable distributed B-tree is to perform consistent concurrent updates to its nodes while allowing high concurrency. Unlike previous schemes that use subtle (and error-prone) concurrency and locking protocols [9, 14], we take a simpler approach. We update B-tree nodes spread across the servers using distributed transactions. For example, an insert operation may have to split a B-tree node, which requires modifying the node (stored on one server) and its parent (stored possibly on a different server); clients use transactions to perform such modifications atomically, without having to worry about concurrency or locking protocols. Sinfonia provides a light-weight distributed atomic primitive, called a minitransaction, which we use to implement our transactions (see Section 2).

A key challenge we address is how to execute such transactions efficiently. A poor design can incur many network round-trips or limit concurrent access. Our solution relies on a combination of three techniques. (1) We combine optimistic concurrency control and minitransactions to implement our transactions. (2) Our transactions use eagerly replicated version numbers associated with each B-tree node to check if the node has been updated. We replicate these version numbers across all servers and keep them consistent. (3) We lazily replicate B-tree inner nodes at clients, so that clients can speculatively navigate the inner B-tree without incurring network delays.

With these techniques, a client executes B-tree operations in one or two network round-trips most of the time, and no server is a performance bottleneck. We have implemented our scheme and evaluated it using experiments. The B-tree performs well compared to an open-source B-tree implementation for small configurations. Moreover, we show that it can scale almost linearly to hundreds of machines for read and update workloads.

This paper is organized as follows. We describe the basic approach in Section 2. In Section 3, we explain the assumptions upon which we rely. We then explain the features of our B-tree in Section 4. In Section 5, we describe the transactions we use and techniques to make them fast. The B-tree algorithm is presented in Section 6, followed by its experimental evaluation in Section 7. We discuss some extensions in Section 8. Related work is explained in Section 9. Section 10 concludes the paper.
Figure 2: Sinfonia service, on top of which we build our distributed B-tree.
2. Basic approach

In this section, we give an overview of our B-tree design. We give further technical details in later sections. Our B-tree is built on top of the Sinfonia service, and so we first describe Sinfonia and its features in Section 2.1. Then, in Section 2.2, we outline our design and justify the decisions we made.
2.1 Sinfonia overview
Sinfonia is a data sharing service that, like our B-tree, is composed of a client library and a set of servers called memory nodes (see Figure 2). Each memory node exports a linear address space without any structure imposed by Sinfonia. Sinfonia ensures the memory nodes are fault-tolerant, offering several levels of protection. Sinfonia also offers a powerful primitive, a minitransaction, that can perform conditional atomic updates to many locations at many servers. A minitransaction is a generalization of an atomic compare-and-swap operation. While the compare-and-swap operation performs one comparison and one conditional update on one address, a minitransaction can perform many comparisons and updates, where each update is conditioned on all comparisons (i.e., all updates occur atomically only if all comparisons succeed). Comparisons and updates can be on different memory locations and memory nodes. In addition, minitransactions can read data from memory nodes and return it to the application.

More precisely, a minitransaction comprises a set of compare items, a set of read items, and a set of write items. Each item specifies an address range on a memory node; compare items and write items also have data. All items must be specified before a minitransaction starts executing. When executed, a minitransaction compares the locations specified by the compare items against the specified data. If all comparisons succeed, the minitransaction returns the contents of the locations specified by the read items and updates the locations specified by the write items with the specified data.

Sinfonia uses a variant of two-phase commit to execute and commit minitransactions in two phases. Memory nodes lock the minitransaction items only for the duration of the two-phase commit. Sinfonia immediately aborts the minitransaction if it cannot acquire the locks, and the client must retry. Thus, unlike general transactions, a minitransaction is short-lived, lasting one or two network round-trips. More details on minitransactions and commit protocols are in the Sinfonia paper [1].

Minitransactions are not optimistic, but they can be used to implement general optimistic transactions, which we use for our B-tree operations. Briefly, each transaction (not minitransaction) maintains a read and write set of objects that it touches. Each object has a version number that is incremented on updates. We commit a transaction using a minitransaction to (1) verify that (version numbers of) objects in the read set are unchanged and, (2) if so, update the objects in the write set. In other words, a minitransaction performs the final commit of optimistic transactions.

It is worth noting that, even though we built optimistic transactions using Sinfonia, we could have implemented them from scratch. Thus, our B-tree design is not limited to running on top of Sinfonia. However, the performance and scalability numbers reported here depend on the lightweight minitransactions of Sinfonia.
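To make the compare/read/write-item semantics concrete, here is a minimal single-process sketch. All names are ours (Sinfonia's actual API and two-phase commit protocol differ), and memory nodes are modeled as a single dict keyed by (memory node id, address):

```python
class Minitransaction:
    """Sketch of minitransaction semantics: a generalized compare-and-swap."""

    def __init__(self):
        self.compare_items = []  # (memnode, addr, expected_data)
        self.read_items = []     # (memnode, addr)
        self.write_items = []    # (memnode, addr, new_data)

    def cmp(self, memnode, addr, expected):
        self.compare_items.append((memnode, addr, expected))

    def read(self, memnode, addr):
        self.read_items.append((memnode, addr))

    def write(self, memnode, addr, data):
        self.write_items.append((memnode, addr, data))

    def exec_and_commit(self, memory):
        # All comparisons are checked first; reads and writes happen only
        # if every comparison succeeds, and all writes apply atomically.
        for node, addr, expected in self.compare_items:
            if memory.get((node, addr)) != expected:
                return False, None  # abort: the client may retry
        results = {(n, a): memory.get((n, a)) for n, a in self.read_items}
        for node, addr, data in self.write_items:
            memory[(node, addr)] = data
        return True, results
```

A classic compare-and-swap is then the special case of one compare item and one write item on the same address; the generalization is that items may name many addresses on many memory nodes.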
2.2 Design outline
Figure 1 shows how our B-tree is distributed across servers (which are Sinfonia memory nodes). Each B-tree node is stored on a single server, though clients may keep replicas of these nodes (possibly old versions thereof). The true state of the B-tree nodes is on the servers and the clients store no permanent state. Nodes are partitioned across servers according to some data placement policy. For example, larger servers may get more nodes than others. Our current implementation allocates nodes randomly among servers.

Our B-tree operations are implemented as natural extensions of centralized B-tree algorithms wrapped in optimistic transactions. As a client traverses the tree, it retrieves nodes from the servers as needed, and adds those nodes to the transaction's read set. If the client wants to change a node, say due to a key-value insertion or a node split, the client locally buffers the change and adds the changed node to the transaction's write set. To commit a transaction, the client executes a Sinfonia minitransaction, which (a) validates that the nodes in the read set are unchanged, by checking that their version numbers match what is stored at the servers, and (b) if so, atomically performs the updates in the write set. As we explain, a consequence of wrapping B-tree operations in transactions is that we can easily provide features such as online node migration and multi-operation transactions.

Since clients use optimistic concurrency control, they do not lock the nodes during the transaction, thereby allowing high concurrency. Instead, nodes are only locked during commit (and this is done by Sinfonia as part of minitransaction execution), which is short-lived. If a minitransaction aborts, a client retries its operation. Optimistic concurrency control works well because, unless the B-tree is growing or shrinking dramatically, there is typically little update contention on B-tree nodes.

Using optimistic concurrency control alone, however, is not enough.
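The read-set/write-set scheme just described can be sketched as follows. This is a single-process model with names of our choosing; in the real system the commit step is a Sinfonia minitransaction executed over the network, not a local loop:

```python
class Txn:
    """Sketch of an optimistic transaction over versioned B-tree nodes."""

    def __init__(self, store):
        self.store = store    # node_id -> (version, contents), the "servers"
        self.read_set = {}    # node_id -> version observed at read time
        self.write_set = {}   # node_id -> new contents, buffered locally

    def read_node(self, node_id):
        if node_id in self.write_set:          # read our own buffered write
            return self.write_set[node_id]
        version, contents = self.store[node_id]
        self.read_set[node_id] = version       # remember version for validation
        return contents

    def write_node(self, node_id, contents):
        self.write_set[node_id] = contents     # no locks taken; change is local

    def commit(self):
        # Stand-in for the commit minitransaction: (a) validate that every
        # node in the read set still has the version we saw, and (b) if so,
        # apply the write set and bump versions, all as one atomic step.
        for node_id, seen in self.read_set.items():
            if self.store[node_id][0] != seen:
                return False                   # conflict: caller retries
        for node_id, contents in self.write_set.items():
            old_version = self.store.get(node_id, (0, None))[0]
            self.store[node_id] = (old_version + 1, contents)
        return True
```

A B-tree operation such as insert would traverse the tree through `read_node`, buffer a leaf update (or split) through `write_node`, and loop on `commit` until it succeeds.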
A transaction frequently needs to check the version number of the root node and other upper-level nodes, because tree traversals will frequently involve these nodes. This can create a performance bottleneck at the servers holding these nodes. To avoid such hot-spots, we replicate the node version numbers across all servers, so that they can be validated at any server (see Figure 1). Only version numbers are replicated, not entire tree nodes, and only for inner nodes, not leaf nodes. We keep the version number replicas consistent by updating the version number at all servers when an inner node changes, and this is done as part of the transaction. Because inner nodes change infre-
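The eager replication of the version table described above can be sketched as follows. The structure is assumed for illustration; in the real system the version bump at all servers happens atomically inside the committing minitransaction rather than in a plain loop:

```python
class Server:
    """Sketch of a server: it owns some nodes, but replicates the full
    inner-node version table so any server can validate any inner node."""

    def __init__(self):
        self.nodes = {}          # node_id -> contents (nodes stored here)
        self.version_table = {}  # inner node_id -> version (full replica)

def update_inner_node(servers, owner, node_id, contents):
    # Done as part of the committing transaction: write the node at its
    # owning server and bump its version in every server's replica.
    servers[owner].nodes[node_id] = contents
    new_version = servers[owner].version_table.get(node_id, 0) + 1
    for s in servers:
        s.version_table[node_id] = new_version

def validate_at_any_server(servers, which, node_id, seen_version):
    # Validation of an inner node no longer needs the server that owns it,
    # so the root's owner stops being a hot-spot.
    return servers[which].version_table.get(node_id, 0) == seen_version
```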