The Chubby lock service for loosely-coupled distributed systems

Mike Burrows, Google Inc.

Abstract
We describe our experiences with the Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for a loosely-coupled distributed system. Chubby provides an interface much like a distributed file system with advisory locks, but the design emphasis is on availability and reliability, as opposed to high performance. Many instances of the service have been used for over a year, with several of them each handling a few tens of thousands of clients concurrently. The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences.

1 Introduction

This paper describes a lock service called Chubby. It is intended for use within a loosely-coupled distributed system consisting of moderately large numbers of small machines connected by a high-speed network. For example, a Chubby instance (also known as a Chubby cell) might serve ten thousand 4-processor machines connected by 1Gbit/s Ethernet. Most Chubby cells are confined to a single data centre or machine room, though we do run at least one Chubby cell whose replicas are separated by thousands of kilometres.

The purpose of the lock service is to allow its clients to synchronize their activities and to agree on basic information about their environment. The primary goals included reliability, availability to a moderately large set of clients, and easy-to-understand semantics; throughput and storage capacity were considered secondary. Chubby's client interface is similar to that of a simple file system that performs whole-file reads and writes, augmented with advisory locks and with notification of various events such as file modification.

We expected Chubby to help developers deal with coarse-grained synchronization within their systems, and in particular to deal with the problem of electing a leader from among a set of otherwise equivalent servers. For example, the Google File System uses a Chubby lock to appoint a GFS master server, and Bigtable uses Chubby in several ways: to elect a master, to allow the master to discover the servers it controls, and to permit clients to find the master. In addition, both GFS and Bigtable use Chubby as a well-known and available location to store a small amount of meta-data; in effect they use Chubby as the root of their distributed data structures. Some services use locks to partition work (at a coarse grain) between several servers.

Before Chubby was deployed, most distributed systems at Google used ad hoc methods for primary election (when work could be duplicated without harm), or required operator intervention (when correctness was essential). In the former case, Chubby allowed a small saving in computing effort. In the latter case, it achieved a significant improvement in availability in systems that no longer required human intervention on failure.

Readers familiar with distributed computing will recognize the election of a primary among peers as an instance of the distributed consensus problem, and realize we require a solution using asynchronous communication; this term describes the behaviour of the vast majority of real networks, such as Ethernet or the Internet, which allow packets to be lost, delayed, and reordered. (Practitioners should normally beware of protocols based on models that make stronger assumptions on the environment.) Asynchronous consensus is solved by the Paxos protocol [12, 13]. The same protocol was used by Oki and Liskov (see their paper on viewstamped replication [19, §4]), an equivalence noted by others [14, §6]. Indeed, all working protocols for asynchronous consensus we have so far encountered have Paxos at their core. Paxos maintains safety without timing assumptions, but clocks must be introduced to ensure liveness; this overcomes the impossibility result of Fischer et al. [5, §1].

Building Chubby was an engineering effort required to fill the needs mentioned above; it was not research. We claim no new algorithms or techniques. The purpose of this paper is to describe what we did and why, rather than to advocate it. In the sections that follow, we describe Chubby's design and implementation, and how it has changed in the light of experience. We describe unexpected ways in which Chubby has been used, and features that proved to be mistakes. We omit details that are covered elsewhere in the literature, such as the details of a consensus protocol or an RPC system.
2 Design

2.1 Rationale

One might argue that we should have built a library embodying Paxos, rather than a library that accesses a centralized lock service, even a highly reliable one. A client Paxos library would depend on no other servers (besides the name service), and would provide a standard framework for programmers, assuming their services can be implemented as state machines. Indeed, we provide such a client library that is independent of Chubby.

Nevertheless, a lock service has some advantages over a client library. First, our developers sometimes do not plan for high availability in the way one would wish. Often their systems start as prototypes with little load and loose availability guarantees; invariably the code has not been specially structured for use with a consensus protocol. As the service matures and gains clients, availability becomes more important; replication and primary election are then added to an existing design. While this could be done with a library that provides distributed consensus, a lock server makes it easier to maintain existing program structure and communication patterns. For example, to elect a master which then writes to an existing file server requires adding just two statements and one RPC parameter to an existing system: one would acquire a lock to become master, pass an additional integer (the lock acquisition count) with the write RPC, and add an if-statement to the file server to reject the write if the acquisition count is lower than the current value (to guard against delayed packets). We have found this technique easier than making existing servers participate in a consensus protocol, and especially so if compatibility must be maintained during a transition period.
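To make the two-statement change concrete, here is a minimal C++ sketch of the file-server side of this technique. The class and method names are hypothetical; the paper gives no code.

    #include <cstdint>
    #include <mutex>
    #include <string>

    // Hypothetical sketch of the guard described above: the master-elect
    // passes the lock acquisition count as an extra RPC parameter, and the
    // file server rejects writes from deposed masters.
    class FileServer {
     public:
      // Returns true if the write is accepted.
      bool Write(uint64_t acquisition_count, const std::string& data) {
        std::lock_guard<std::mutex> g(mu_);
        // The added if-statement: guard against delayed packets from an
        // old primary whose lock has since been re-acquired.
        if (acquisition_count < highest_seen_) return false;
        highest_seen_ = acquisition_count;
        Append(data);  // pre-existing write path, unchanged
        return true;
      }
     private:
      void Append(const std::string& data) { (void)data; /* write to disk */ }
      std::mutex mu_;
      uint64_t highest_seen_ = 0;
    };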
Second, many of our services that elect a primary or that partition data between their components need a mechanism for advertising the results. This suggests that we should allow clients to store and fetch small quantities of data—that is, to read and write small files. This could be done with a name service, but our experience has been that the lock service itself is well-suited for this task, both because this reduces the number of servers on which a client depends, and because the consistency features of the protocol are shared. Chubby's success as a name server owes much to its use of consistent client caching, rather than time-based caching. In particular, we found that developers greatly appreciated not having to choose a cache timeout such as the DNS time-to-live value, which if chosen poorly can lead to high DNS load, or long client fail-over times.

Third, a lock-based interface is more familiar to our programmers. Both the replicated state machine of Paxos and the critical sections associated with exclusive locks can provide the programmer with the illusion of sequential programming. However, many programmers have come across locks before, and think they know how to use them. Ironically, such programmers are usually wrong, especially when they use locks in a distributed system; few consider the effects of independent machine failures on locks in a system with asynchronous communications. Nevertheless, the apparent familiarity of locks overcomes a hurdle in persuading programmers to use a reliable mechanism for distributed decision making.

Last, distributed-consensus algorithms use quorums to make decisions, so they use several replicas to achieve high availability. For example, Chubby itself usually has five replicas in each cell, of which three must be running for the cell to be up. In contrast, if a client system uses a lock service, even a single client can obtain a lock and make progress safely. Thus, a lock service reduces the number of servers needed for a reliable client system to make progress. In a loose sense, one can view the lock service as a way of providing a generic electorate that allows a client system to make decisions correctly when less than a majority of its own members are up. One might imagine solving this last problem in a different way: by providing a "consensus service", using a number of servers to provide the "acceptors" in the Paxos protocol. Like a lock service, a consensus service would allow clients to make progress safely even with only one active client process; a similar technique has been used to reduce the number of state machines needed for Byzantine fault tolerance. However, assuming a consensus service is not used exclusively to provide locks (which reduces it to a lock service), this approach solves none of the other problems described above.

These arguments suggest two key design decisions:
• We chose a lock service, as opposed to a library or service for consensus, and
• we chose to serve small files to permit elected primaries to advertise themselves and their parameters, rather than build and maintain a second service.
Some decisions follow from our expected use and from our environment:
• A service advertising its primary via a Chubby file may have thousands of clients. Therefore, we must allow thousands of clients to observe this file, preferably without needing many servers.
• Clients and replicas of a replicated service may wish to know when the service's primary changes. This suggests that an event notification mechanism would be useful to avoid polling.
• Even if clients need not poll files periodically, many will; this is a consequence of supporting many developers. Thus, caching of files is desirable.
• Our developers are confused by non-intuitive caching semantics, so we prefer consistent caching.
• To avoid both financial loss and jail time, we provide security mechanisms, including access control.

A choice that may surprise some readers is that we do not expect lock use to be fine-grained, in which locks might be held only for a short duration (seconds or less); instead, we expect coarse-grained use. For example, an application might use a lock to elect a primary, which would then handle all access to that data for a considerable time, perhaps hours or days. These two styles of use suggest different requirements from a lock server.

Coarse-grained locks impose far less load on the lock server. In particular, the lock-acquisition rate is usually only weakly related to the transaction rate of the client applications. Coarse-grained locks are acquired only rarely, so temporary lock server unavailability delays clients less. On the other hand, the transfer of a lock from client to client may require costly recovery procedures, so one would not wish a fail-over of a lock server to cause locks to be lost. Thus, it is good for coarse-grained locks to survive lock server failures, there is little concern about the overhead of doing so, and such locks allow many clients to be adequately served by a modest number of lock servers with somewhat lower availability.

Fine-grained locks lead to different conclusions. Even brief unavailability of the lock server may cause many clients to stall. Performance and the ability to add new servers at will are of great concern because the transaction rate at the lock service grows with the combined transaction rate of clients. It can be advantageous to reduce the overhead of locking by not maintaining locks across lock server failure, and the time penalty for dropping locks every so often is not severe because locks are held for short periods. (Clients must be prepared to lose locks during network partitions, so the loss of locks on lock server fail-over introduces no new recovery paths.)

Chubby is intended to provide only coarse-grained locking. Fortunately, it is straightforward for clients to implement their own fine-grained locks tailored to their application. An application might partition its locks into groups and use Chubby's coarse-grained locks to allocate these lock groups to application-specific lock servers, as sketched below. Little state is needed to maintain these fine-grain locks; the servers need only keep a non-volatile, monotonically-increasing acquisition counter that is rarely updated. Clients can learn of lost locks at unlock time, and if a simple fixed-length lease is used, the protocol can be simple and efficient. The most important benefits of this scheme are that our client developers become responsible for the provisioning of the servers needed to support their load, yet are relieved of the complexity of implementing consensus themselves.
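As an illustration of this partitioning of locks into groups, the following hypothetical C++ sketch maps an application lock name to one of a fixed number of groups, each guarded by a single coarse-grained Chubby lock; the constants and path layout are invented for the example.

    #include <functional>
    #include <string>

    // Hypothetical sketch: application locks are hashed into a small number
    // of groups; each group is allocated to an application-specific lock
    // server by holding one coarse-grained Chubby lock. Only a non-volatile
    // acquisition counter per group need survive server failure.
    constexpr int kNumGroups = 256;  // invented for the example

    int LockGroup(const std::string& lock_name) {
      return static_cast<int>(std::hash<std::string>{}(lock_name) % kNumGroups);
    }

    // The Chubby file whose coarse-grained lock guards an entire group.
    std::string GroupLockPath(int group) {
      return "/ls/cell/app/lock-group-" + std::to_string(group);
    }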
[Figure 1: System structure. Client processes, each linking the Chubby client library, communicate by RPC with the master of a five-server Chubby cell.]

2.2 System structure

Chubby has two main components that communicate via RPC: a server, and a library that client applications link against; see Figure 1. All communication between Chubby clients and the servers is mediated by the client library. An optional third component, a proxy server, is discussed in Section 3.1.

A Chubby cell consists of a small set of servers (typically five) known as replicas, placed so as to reduce the likelihood of correlated failure (for example, in different racks). The replicas use a distributed consensus protocol to elect a master; the master must obtain votes from a majority of the replicas, plus promises that those replicas will not elect a different master for an interval of a few seconds known as the master lease. The master lease is periodically renewed by the replicas provided the master continues to win a majority of the vote.

The replicas maintain copies of a simple database, but only the master initiates reads and writes of this database. All other replicas simply copy updates from the master, sent using the consensus protocol.

Clients find the master by sending master location requests to the replicas listed in the DNS. Non-master replicas respond to such requests by returning the identity of the master. Once a client has located the master, the client directs all requests to it either until it ceases to respond, or until it indicates that it is no longer the master. Write requests are propagated via the consensus protocol to all replicas; such requests are acknowledged when the write has reached a majority of the replicas in the cell. Read requests are satisfied by the master alone; this is safe provided the master lease has not expired, as no other master can possibly exist. If a master fails, the other replicas run the election protocol when their master leases expire; a new master will typically be elected in a few seconds. For example, two recent elections took 6s and 4s, but we see values as high as 30s (§4.1).
If a replica fails and does not recover for a few hours, a simple replacement system selects a fresh machine from a free pool and starts the lock server binary on it. It then updates the DNS tables, replacing the IP address of the failed replica with that of the new one. The current master polls the DNS periodically and eventually notices the change. It then updates the list of the cell's members in the cell's database; this list is kept consistent across all the members via the normal replication protocol. In the meantime, the new replica obtains a recent copy of the database from a combination of backups stored on file servers and updates from active replicas. Once the new replica has processed a request that the current master is waiting to commit, the replica is permitted to vote in the elections for new master.

2.3 Files, directories, and handles

Chubby exports a file system interface similar to, but simpler than, that of UNIX. It consists of a strict tree of files and directories in the usual way, with name components separated by slashes. A typical name is:

/ls/foo/wombat/pouch

The ls prefix is common to all Chubby names, and stands for lock service. The second component (foo) is the name of a Chubby cell; it is resolved to one or more Chubby servers via DNS lookup. A special cell name local indicates that the client's local Chubby cell should be used; this is usually one in the same building and thus the one most likely to be accessible. The remainder of the name, /wombat/pouch, is interpreted within the named Chubby cell. Again following UNIX, each directory contains a list of child files and directories, while each file contains a sequence of uninterpreted bytes.

Because Chubby's naming structure resembles a file system, we were able to make it available to applications both with its own specialized API, and via interfaces used by our other file systems, such as the Google File System. This significantly reduced the effort needed to write basic browsing and name space manipulation tools, and reduced the need to educate casual Chubby users.

The design differs from UNIX in ways that ease distribution. To allow the files in different directories to be served from different Chubby masters, we do not expose operations that can move files from one directory to another, we do not maintain directory modified times, and we avoid path-dependent permission semantics (that is, access to a file is controlled by the permissions on the file itself rather than on directories on the path leading to the file). To make it easier to cache file meta-data, the system does not reveal last-access times.

The name space contains only files and directories, collectively called nodes. Every such node has only one name within its cell; there are no symbolic or hard links.

Nodes may be either permanent or ephemeral. Any node may be deleted explicitly, but ephemeral nodes are also deleted if no client has them open (and, for directories, they are empty). Ephemeral files are used as temporary files, and as indicators to others that a client is alive. Any node can act as an advisory reader/writer lock; these locks are described in more detail in Section 2.4.

Each node has various meta-data, including three names of access control lists (ACLs) used to control reading, writing, and changing the ACL names for the node. Unless overridden, a node inherits the ACL names of its parent directory on creation. ACLs are themselves files located in an ACL directory, which is a well-known part of the cell's local name space. These ACL files consist of simple lists of names of principals; readers may be reminded of Plan 9's groups. Thus, if file F's write ACL name is foo, and the ACL directory contains a file foo that contains an entry bar, then user bar is permitted to write F. Users are authenticated by a mechanism built into the RPC system. Because Chubby's ACLs are simply files, they are automatically available to other services that wish to use similar access control mechanisms.

The per-node meta-data includes four monotonically-increasing 64-bit numbers that allow clients to detect changes easily:
• an instance number: greater than the instance number of any previous node with the same name.
• a content generation number (files only): this increases when the file's contents are written.
• a lock generation number: this increases when the node's lock transitions from free to held.
• an ACL generation number: this increases when the node's ACL names are written.
Chubby also exposes a 64-bit file-content checksum so clients may tell whether files differ.

Clients open nodes to obtain handles that are analogous to UNIX file descriptors. Handles include:
• check digits that prevent clients from creating or guessing handles, so that full access control checks need be performed only when handles are created (compare with UNIX, which checks its permission bits at open time, but not at each read/write, because file descriptors cannot be forged).
• a sequence number that allows a master to tell whether a handle was generated by it or by a previous master.
• mode information provided at open time to allow the master to recreate its state if an old handle is presented to a newly restarted master.
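The per-node meta-data described above might be represented as follows; this struct is a hypothetical sketch (the field names are not from the paper), shown only to summarize the four counters and the checksum in one place.

    #include <cstdint>

    // Hypothetical representation of the per-node meta-data listed above.
    struct NodeMetadata {
      uint64_t instance_number;     // exceeds that of any earlier node with this name
      uint64_t content_generation;  // files only; increases on each content write
      uint64_t lock_generation;     // increases when the lock goes free -> held
      uint64_t acl_generation;      // increases when the ACL names are written
      uint64_t content_checksum;    // lets clients tell whether files differ
    };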
2.4 Locks and sequencers

Each Chubby file and directory can act as a reader-writer lock: either one client handle may hold the lock in exclusive (writer) mode, or any number of client handles may hold the lock in shared (reader) mode. Like the mutexes known to most programmers, locks are advisory. That is, they conflict only with other attempts to acquire the same lock: holding a lock called F neither is necessary to access the file F, nor prevents other clients from doing so. We rejected mandatory locks, which make locked objects inaccessible to clients not holding their locks:
• Chubby locks often protect resources implemented by other services, rather than just the file associated with the lock. To enforce mandatory locking in a meaningful way would have required us to make more extensive modifications to these services.
• We did not wish to force users to shut down applications when they needed to access locked files for debugging or administrative purposes. In a complex system, it is harder to use the approach employed on most personal computers, where administrative software can break mandatory locks simply by instructing the user to shut down his applications or to reboot.
• Our developers perform error checking in the conventional way, by writing assertions such as "lock X is held", so they benefit little from mandatory checks. Buggy or malicious processes have many opportunities to corrupt data when locks are not held, so we find the extra guards provided by mandatory locking to be of no significant value.

In Chubby, acquiring a lock in either mode requires write permission, so that an unprivileged reader cannot prevent a writer from making progress.

Locking is complex in distributed systems because communication is typically uncertain, and processes may fail independently. Thus, a process holding a lock L may issue a request R, but then fail. Another process may acquire L and perform some action before R arrives at its destination. If R later arrives, it may be acted on without the protection of L, and potentially on inconsistent data. The problem of receiving messages out of order has been well studied; solutions include virtual time, and virtual synchrony, which avoids the problem by ensuring that messages are processed in an order consistent with the observations of every participant.

It is costly to introduce sequence numbers into all the interactions in an existing complex system. Instead, Chubby provides a means by which sequence numbers can be introduced into only those interactions that make use of locks. At any time, a lock holder may request a sequencer, an opaque byte-string that describes the state of the lock immediately after acquisition. It contains the name of the lock, the mode in which it was acquired (exclusive or shared), and the lock generation number. The client passes the sequencer to servers (such as file servers) if it expects the operation to be protected by the lock. The recipient server is expected to test whether the sequencer is still valid and has the appropriate mode; if not, it should reject the request. The validity of a sequencer can be checked against the server's Chubby cache or, if the server does not wish to maintain a session with Chubby, against the most recent sequencer that the server has observed. The sequencer mechanism requires only the addition of a string to affected messages, and is easily explained to our developers.

Although we find sequencers simple to use, important protocols evolve slowly. Chubby therefore provides an imperfect but easier mechanism to reduce the risk of delayed or re-ordered requests to servers that do not support sequencers. If a client releases a lock in the normal way, it is immediately available for other clients to claim, as one would expect. However, if a lock becomes free because the holder has failed or become inaccessible, the lock server will prevent other clients from claiming the lock for a period called the lock-delay. Clients may specify any lock-delay up to some bound, currently one minute; this limit prevents a faulty client from making a lock (and thus some resource) unavailable for an arbitrarily long time. While imperfect, the lock-delay protects unmodified servers and clients from everyday problems caused by message delays and restarts.
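A server that receives sequencers might validate them roughly as in this hypothetical C++ sketch; a real sequencer is an opaque byte-string, decoded here into the three fields the paper says it contains.

    #include <cstdint>
    #include <string>

    // Hypothetical decoded form of a sequencer (really an opaque byte-string).
    struct Sequencer {
      std::string lock_name;
      bool exclusive;            // mode: exclusive (writer) vs. shared (reader)
      uint64_t lock_generation;
    };

    // Fallback check for a server with no Chubby session: compare against the
    // most recent sequencer this server has observed for the same lock.
    bool SequencerStillValid(const Sequencer& incoming,
                             const Sequencer& latest_observed) {
      return incoming.lock_name == latest_observed.lock_name &&
             incoming.exclusive &&  // a mutating request needs writer mode
             incoming.lock_generation >= latest_observed.lock_generation;
    }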
2.5 Events

Chubby clients may subscribe to a range of events when they create a handle. These events are delivered to the client asynchronously via an up-call from the Chubby library. Events include:
• file contents modified—often used to monitor the location of a service advertised via the file.
• child node added, removed, or modified—used to implement mirroring (§2.12). (In addition to allowing new files to be discovered, returning events for child nodes makes it possible to monitor ephemeral files without affecting their reference counts.)
• Chubby master failed over—warns clients that other events may have been lost, so data must be rescanned.
• a handle (and its lock) has become invalid—this typically suggests a communications problem.
• lock acquired—can be used to determine when a primary has been elected.
• conflicting lock request from another client—allows the caching of locks.

Events are delivered after the corresponding action has taken place. Thus, if a client is informed that file contents have changed, it is guaranteed to see the new data (or data that is yet more recent) if it subsequently reads the file.

The last two events mentioned are rarely used, and with hindsight could have been omitted. After primary election, for example, clients typically need to communicate with the new primary, rather than simply know that a primary exists; thus, they wait for a file-modification event indicating that the new primary has written its address in a file. The conflicting lock event in theory permits clients to cache data held on other servers, using Chubby locks to maintain cache consistency. A notification of a conflicting lock request would tell a client to finish using data associated with the lock: it would finish pending operations, flush modifications to a home location, discard cached data, and release the lock. So far, no one has adopted this style of use.
2.6 API

Clients see a Chubby handle as a pointer to an opaque structure that supports various operations. Handles are created only by Open(), and destroyed with Close().

Open() opens a named file or directory to produce a handle, analogous to a UNIX file descriptor. Only this call takes a node name; all others operate on handles. The name is evaluated relative to an existing directory handle; the library provides a handle on "/" that is always valid. Directory handles avoid the difficulties of using a program-wide current directory in a multi-threaded program that contains many layers of abstraction.

The client indicates various options:
• how the handle will be used (reading; writing and locking; changing the ACL); the handle is created only if the client has the appropriate permissions.
• events that should be delivered (see §2.5).
• the lock-delay (§2.4).
• whether a new file or directory should (or must) be created. If a file is created, the caller may supply initial contents and initial ACL names. The return value indicates whether the file was in fact created.

Close() closes an open handle. Further use of the handle is not permitted. This call never fails. A related call Poison() causes outstanding and subsequent operations on the handle to fail without closing it; this allows a client to cancel Chubby calls made by other threads without fear of deallocating the memory being accessed by them.

The main calls that act on a handle are:

GetContentsAndStat() returns both the contents and meta-data of a file. The contents of a file are read atomically and in their entirety. We avoided partial reads and writes to discourage large files. A related call GetStat() returns just the meta-data, while ReadDir() returns the names and meta-data for the children of a directory.

SetContents() writes the contents of a file. Optionally, the client may provide a content generation number to allow the client to simulate compare-and-swap on a file; the contents are changed only if the generation number is current. The contents of a file are always written atomically and in their entirety. A related call SetACL() performs a similar operation on the ACL names associated with the node.

Delete() deletes the node if it has no children.

Acquire(), TryAcquire(), Release() acquire and release locks.

GetSequencer() returns a sequencer (§2.4) that describes any lock held by this handle.

SetSequencer() associates a sequencer with a handle. Subsequent operations on the handle fail if the sequencer is no longer valid.

CheckSequencer() checks whether a sequencer is valid (see §2.4).

Calls fail if the node has been deleted since the handle was created, even if the file has been subsequently recreated. That is, a handle is associated with an instance of a file, rather than with a file name. Chubby may apply access control checks on any call, but always checks Open() calls (see §2.3).

All the calls above take an operation parameter in addition to any others needed by the call itself. The operation parameter holds data and control information that may be associated with any call. In particular, via the operation parameter the client may:
• supply a callback to make the call asynchronous,
• wait for the completion of such a call, and/or
• obtain extended error and diagnostic information.

Clients can use this API to perform primary election as follows: all potential primaries open the lock file and attempt to acquire the lock. One succeeds and becomes the primary, while the others act as replicas. The primary writes its identity into the lock file with SetContents() so that it can be found by clients and replicas, which read the file with GetContentsAndStat(), perhaps in response to a file-modification event (§2.5). Ideally, the primary obtains a sequencer with GetSequencer(), which it then passes to the servers it communicates with; they should confirm with CheckSequencer() that it is still the primary. A lock-delay may be used with services that cannot check sequencers (§2.4).
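The election pattern just described might look roughly like the following sketch; the call names are those of Chubby's API, but the Handle type, stub bodies, and file name are hypothetical stand-ins for the real client library.

    #include <string>

    struct Sequencer { std::string bytes; };   // opaque byte-string

    struct Handle {
      bool TryAcquire() { return true; }       // stub: RPC to the Chubby master
      void SetContents(const std::string&) {}  // stub: whole-file atomic write
      Sequencer GetSequencer() { return {}; }  // stub
    };

    Handle Open(const std::string& /*name*/) { return {}; }  // stub

    int main() {
      Handle lock = Open("/ls/cell/service/leader");
      if (lock.TryAcquire()) {
        // We won: write our identity so clients and replicas can find us,
        // perhaps via a file-modification event (§2.5).
        lock.SetContents("primary-host:port");
        // Ideally, pass the sequencer to each server we communicate with;
        // they confirm it with CheckSequencer().
        Sequencer seq = lock.GetSequencer();
        (void)seq;
      } else {
        // We lost: act as a replica and watch the lock file for changes.
      }
    }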
2.7 Caching

To reduce read traffic, Chubby clients cache file data and node meta-data (including file absence) in a consistent, write-through cache held in memory. The cache is maintained by a lease mechanism described below, and kept consistent by invalidations sent by the master, which keeps a list of what each client may be caching. The protocol ensures that clients see either a consistent view of Chubby state, or an error.

When file data or meta-data is to be changed, the modification is blocked while the master sends invalidations for the data to every client that may have cached it; this mechanism sits on top of the KeepAlive RPCs discussed more fully in the next section. On receipt of an invalidation, a client flushes the invalidated state and acknowledges by making its next KeepAlive call. The modification proceeds only after the server knows that each client has invalidated its cache, either because the client acknowledged the invalidation, or because the client allowed its cache lease to expire.

Only one round of invalidations is needed because the master treats the node as uncachable while cache invalidations remain unacknowledged. This approach allows reads always to be processed without delay; this is useful because reads greatly outnumber writes. An alternative would be to block calls that access the node during invalidation; this would make it less likely that over-eager clients will bombard the master with uncached accesses during invalidation, at the cost of occasional delays. If this were a problem, one could imagine adopting a hybrid scheme that switched tactics if overload were detected.
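A minimal sketch of the master's write path follows, under the assumption that invalidation delivery and acknowledgement are handled by the KeepAlive machinery of the next section; all names here are hypothetical.

    #include <set>
    #include <string>

    // Stubs standing in for the KeepAlive-based invalidation machinery.
    void SendInvalidations(const std::set<int>& sessions) { (void)sessions; }
    void WaitForAcksOrLeaseExpiry(const std::set<int>& sessions) { (void)sessions; }

    struct Node {
      std::string data;
      bool uncachable = false;
      std::set<int> caching_clients;  // sessions that may cache this node
    };

    // Only this modification blocks; reads proceed without delay because the
    // node is treated as uncachable until every invalidation is acknowledged
    // or the laggard's lease expires.
    void WriteNode(Node& node, const std::string& new_data) {
      node.uncachable = true;
      SendInvalidations(node.caching_clients);
      WaitForAcksOrLeaseExpiry(node.caching_clients);
      node.data = new_data;
      node.caching_clients.clear();
      node.uncachable = false;
    }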
The caching protocol is simple: it invalidates cached data on a change, and never updates it. It would be just as simple to update rather than to invalidate, but update-only protocols can be arbitrarily inefficient; a client that accessed a file might receive updates indefinitely, causing an unbounded number of unnecessary updates.

Despite the overheads of providing strict consistency, we rejected weaker models because we felt that programmers would find them harder to use. Similarly, mechanisms such as virtual synchrony that require clients to exchange sequence numbers in all messages were considered inappropriate in an environment with diverse pre-existing communication protocols.

In addition to caching data and meta-data, Chubby clients cache open handles. This caching is restricted in minor ways so that it never affects the semantics observed by the client: handles on ephemeral files cannot be held open if the application has closed them, and handles that permit locking can be reused, but cannot be used concurrently by multiple application handles. This last restriction exists because the client may use Close() or Poison() for their side-effect of cancelling outstanding Acquire() calls to the master.

Chubby's protocol permits clients to cache locks—that is, to hold locks longer than strictly necessary in the hope that they can be used again by the same client. An event informs a lock holder if another client has requested a conflicting lock, allowing the holder to release the lock just when it is needed elsewhere (see §2.5).

2.8 Sessions and KeepAlives

A Chubby session is a relationship between a Chubby cell and a Chubby client; it exists for some interval of time, and is maintained by periodic handshakes called KeepAlives. Unless a Chubby client informs the master otherwise, the client's handles, locks, and cached data all remain valid provided its session remains valid. (However, the protocol for session maintenance may require the client to acknowledge a cache invalidation in order to maintain its session; see below.)

A client requests a new session on first contacting the master of a Chubby cell. It ends the session explicitly either when it terminates, or if the session has been idle (with no open handles and no calls for a minute).

Each session has an associated lease—an interval of time extending into the future during which the master guarantees not to terminate the session unilaterally. The end of this interval is called the session lease timeout. The master is free to advance this timeout further into the future, but may not move it backwards in time.

The master advances the lease timeout in three circumstances: on creation of the session, when a master fail-over occurs (see below), and when it responds to a KeepAlive RPC from the client. On receiving a KeepAlive, the master typically blocks the RPC (does not allow it to return) until the client's previous lease interval is close to expiring. The master later allows the RPC to return to the client, and thus informs the client of the new lease timeout. The master may extend the timeout by any amount. The default extension is 12s, but an overloaded master may use higher values to reduce the number of KeepAlive calls it must process. The client initiates a new KeepAlive immediately after receiving the previous reply. Thus, the client ensures that there is almost always a KeepAlive call blocked at the master.

As well as extending the client's lease, the KeepAlive reply is used to transmit events and cache invalidations back to the client. The master allows a KeepAlive to return early when an event or invalidation is to be delivered. Piggybacking events on KeepAlive replies ensures that clients cannot maintain a session without acknowledging cache invalidations, and causes all Chubby RPCs to flow from client to master. This simplifies the client, and allows the protocol to operate through firewalls that allow initiation of connections in only one direction.

The client maintains a local lease timeout that is a conservative approximation of the master's lease timeout. It differs from the master's lease timeout because the client must make conservative assumptions both of the time its KeepAlive reply spent in flight, and of the rate at which the master's clock is advancing; to maintain consistency, we require that the server's clock advance no faster than a known constant factor faster than the client's.

If a client's local lease timeout expires, it becomes unsure whether the master has terminated its session. The client empties and disables its cache, and we say that its session is in jeopardy. The client waits a further interval called the grace period, 45s by default. If the client and master manage to exchange a successful KeepAlive before the end of the client's grace period, the client enables its cache once more. Otherwise, the client assumes that the session has expired. This is done so that Chubby API calls do not block indefinitely when a Chubby cell becomes inaccessible; calls return with an error if the grace period ends before communication is re-established.

The Chubby library can inform the application when the grace period begins via a jeopardy event. When the session is known to have survived the communications problem, a safe event tells the client to proceed; if the session times out instead, an expired event is sent. This information allows the application to quiesce itself when it is unsure of the status of its session, and to recover without restarting if the problem proves to be transient. This can be important in avoiding outages in services with large startup overhead.

If a client holds a handle H on a node and any operation on H fails because the associated session has expired, all subsequent operations on H (except Close() and Poison()) will fail in the same way. Clients can use this to guarantee that network and server outages cause only a suffix of a sequence of operations to be lost, rather than an arbitrary subsequence; this allows complex changes to be marked as committed with a final write.
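The client-side lease logic might be structured roughly as follows; this is a hypothetical sketch with stubbed RPCs, not the real library code.

    #include <chrono>

    using Clock = std::chrono::steady_clock;

    enum SessionEvent { kJeopardy, kSafe, kExpired };
    struct KeepAliveReply { bool ok = true; };  // ok: lease was extended in time

    // Stubs standing in for RPCs and library plumbing.
    KeepAliveReply SendKeepAlive() { return {}; }  // blocks at the master
    bool ExchangeKeepAliveWithin(Clock::duration) { return true; }
    void DisableCache() {}
    void EnableCache() {}
    void NotifyApplication(SessionEvent) {}

    void SessionLoop() {
      for (;;) {
        // The master holds this RPC until the previous lease interval is close
        // to expiring, or returns early to deliver events and invalidations.
        KeepAliveReply reply = SendKeepAlive();
        if (reply.ok) continue;  // lease extended; issue the next KeepAlive at once
        // The local (conservative) lease timed out: the session is in jeopardy.
        DisableCache();
        NotifyApplication(kJeopardy);
        if (ExchangeKeepAliveWithin(std::chrono::seconds(45))) {  // grace period
          EnableCache();
          NotifyApplication(kSafe);  // session survived, perhaps across a fail-over
        } else {
          NotifyApplication(kExpired);  // API calls now fail with an error
          return;
        }
      }
    }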
2.9 Fail-overs

When a master fails or otherwise loses mastership, it discards its in-memory state about sessions, handles, and locks. The authoritative timer for session leases runs at the master, so until a new master is elected the session lease timer is stopped; this is legal because it is equivalent to extending the client's lease. If a master election occurs quickly, clients can contact the new master before their local (approximate) lease timers expire. If the election takes a long time, clients flush their caches and wait for the grace period while trying to find the new master. Thus the grace period allows sessions to be maintained across fail-overs that exceed the normal lease timeout.

[Figure 2: The role of the grace period in master fail-over.]

Figure 2 shows the sequence of events in a lengthy master fail-over event in which the client must use its grace period to preserve its session. Time increases from left to right, but times are not to scale. Client session leases are shown as thick arrows, as viewed both by the old and new masters (M1-3, above) and by the client (C1-3, below). Upward angled arrows indicate KeepAlive requests, and downward angled arrows their replies. The original master has session lease M1 for the client, while the client has a conservative approximation C1. The master commits to lease M2 before informing the client via KeepAlive reply 2; the client is then able to extend its view of the lease (C2). The master dies before replying to the next KeepAlive, and some time elapses before another master is elected. Eventually the client's approximation of its lease (C2) expires. The client then flushes its cache and starts a timer for the grace period.

During this period, the client cannot be sure whether its lease has expired at the master. It does not tear down its session, but it blocks all application calls on its API to prevent the application from observing inconsistent data. At the start of the grace period, the Chubby library sends a jeopardy event to the application to allow it to quiesce itself until it can be sure of the status of its session.

Eventually a new master election succeeds. The master initially uses a conservative approximation M3 of the session lease that its predecessor may have had for the client. The first KeepAlive request (4) from the client to the new master is rejected because it has the wrong master epoch number (described in detail below). The retried request (6) succeeds, but typically does not extend the master lease further because M3 was conservative. However, the reply (7) allows the client to extend its lease (C3) once more, and optionally inform the application that its session is no longer in jeopardy. Because the grace period was long enough to cover the interval between the end of lease C2 and the beginning of lease C3, the client saw nothing but a delay. Had the grace period been less than that interval, the client would have abandoned the session and reported the failure to the application.

Once a client has contacted the new master, the client library and master co-operate to provide the illusion to the application that no failure has occurred. To achieve this, the new master must reconstruct a conservative approximation of the in-memory state that the previous master had. It does this partly by reading data stored stably on disc (replicated via the normal database replication protocol), partly by obtaining state from clients, and partly by conservative assumptions.
The database records each session, held lock, and ephemeral file. A newly elected master proceeds:
1. It first picks a new client epoch number, which clients are required to present on every call. The master rejects calls from clients using older epoch numbers, and provides the new epoch number. This ensures that the new master will not respond to a very old packet that was sent to a previous master, even one running on the same machine.
2. The new master may respond to master-location requests, but does not at first process incoming session-related operations.
3. It builds in-memory data structures for sessions and locks that are recorded in the database. Session leases are extended to the maximum that the previous master may have been using.
4. The master now lets clients perform KeepAlives, but no other session-related operations.
5. It emits a fail-over event to each session; this causes clients to flush their caches (because they may have missed invalidations), and to warn applications that other events may have been lost.
6. The master waits until each session acknowledges the fail-over event or lets its session expire.
7. The master allows all operations to proceed.
8. If a client uses a handle created prior to the fail-over (determined from the value of a sequence number in the handle), the master recreates the in-memory representation of the handle and honours the call. If such a recreated handle is closed, the master records it in memory so that it cannot be recreated in this master epoch; this ensures that a delayed or duplicated network packet cannot accidentally recreate a closed handle. A faulty client can recreate a closed handle in a future epoch, but this is harmless given that the client is already faulty.
9. After some interval (a minute, say), the master deletes ephemeral files that have no open file handles. Clients should refresh handles on ephemeral files during this interval after a fail-over. This mechanism has the unfortunate effect that ephemeral files may not disappear promptly if the last client on such a file loses its session during a fail-over.

Readers will be unsurprised to learn that the fail-over code, which is exercised far less often than other parts of the system, has been a rich source of interesting bugs.

2.10 Database implementation

The first version of Chubby used the replicated version of Berkeley DB as its database. Berkeley DB provides B-trees that map byte-string keys to arbitrary byte-string values. We installed a key comparison function that sorts first by the number of components in a path name; this allows nodes to be keyed by their path names, while keeping sibling nodes adjacent in the sort order. Because Chubby does not use path-based permissions, a single lookup in the database suffices for each file access.

Berkeley DB uses a distributed consensus protocol to replicate its database logs over a set of servers. Once master leases were added, this matched the design of Chubby, which made implementation straightforward.

While Berkeley DB's B-tree code is widely-used and mature, the replication code was added recently, and has fewer users. Software maintainers must give priority to maintaining and improving their most popular product features. While Berkeley DB's maintainers solved the problems we had, we felt that use of the replication code exposed us to more risk than we wished to take. As a result, we have written a simple database using write-ahead logging and snapshotting, similar to the design of Birrell et al. As before, the database log is distributed among the replicas using a distributed consensus protocol. Chubby used few of the features of Berkeley DB, and so this rewrite allowed significant simplification of the system as a whole; for example, while we needed atomic operations, we did not need general transactions.

2.11 Backup

Every few hours, the master of each Chubby cell writes a snapshot of its database to a GFS file server in a different building. The use of a separate building ensures both that the backup will survive building damage, and that the backups introduce no cyclic dependencies in the system; a GFS cell in the same building potentially might rely on the Chubby cell for electing its master.

Backups provide both disaster recovery and a means for initializing the database of a newly replaced replica without placing load on replicas that are in service.

2.12 Mirroring

Chubby allows a collection of files to be mirrored from one cell to another. Mirroring is fast because the files are small and the event mechanism (§2.5) informs the mirroring code immediately if a file is added, deleted, or modified. Provided there are no network problems, changes are reflected in dozens of mirrors world-wide in well under a second. If a mirror is unreachable, it remains unchanged until connectivity is restored; updated files are then identified by comparing their checksums.

Mirroring is used most commonly to copy configuration files to various computing clusters distributed around the world. A special cell, named global, contains a subtree /ls/global/master that is mirrored to the subtree /ls/cell/slave in every other Chubby cell.
The global cell is special because its five replicas are located in widely-separated parts of the world, so it is almost always accessible from most of the organization.

Among the files mirrored from the global cell are Chubby's own access control lists, various files in which Chubby cells and other systems advertise their presence to our monitoring services, pointers to allow clients to locate large data sets such as Bigtable cells, and many configuration files for other systems.

3 Mechanisms for scaling

Chubby's clients are individual processes, so Chubby must handle more clients than one might expect; we have seen 90,000 clients communicating directly with a Chubby master—far more than the number of machines involved. Because there is just one master per cell, and its machine is identical to those of the clients, the clients can overwhelm the master by a huge margin. Thus, the most effective scaling techniques reduce communication with the master by a significant factor. Assuming the master has no serious performance bug, minor improvements in request processing at the master have little effect. We use several approaches:
• We can create an arbitrary number of Chubby cells; clients almost always use a nearby cell (found with DNS) to avoid reliance on remote machines. Our typical deployment uses one Chubby cell for a data centre of several thousand machines.
• The master may increase lease times from the default 12s up to around 60s when it is under heavy load, so it need process fewer KeepAlive RPCs. (KeepAlives are by far the dominant type of request (see §4.1), and failure to process them in time is the typical failure mode of an overloaded server; clients are largely insensitive to latency variation in other calls.)
• Chubby clients cache file data, meta-data, the absence of files, and open handles to reduce the number of calls they make on the server.
• We use protocol-conversion servers that translate the Chubby protocol into less-complex protocols such as DNS and others. We discuss some of these below.

Here we describe two familiar mechanisms, proxies and partitioning, that we expect will allow Chubby to scale further. We do not yet use them in production, but they are designed, and may be used soon. We have no present need to consider scaling beyond a factor of five: first, there are limits on the number of machines one would wish to put in a data centre or make reliant on a single instance of a service. Second, because we use similar machines for Chubby clients and servers, hardware improvements that increase the number of clients per machine also increase the capacity of each server.

3.1 Proxies

Chubby's protocol can be proxied (using the same protocol on both sides) by trusted processes that pass requests from other clients to a Chubby cell. A proxy can reduce server load by handling both KeepAlive and read requests; it cannot reduce write traffic, which passes through the proxy's cache. But even with aggressive client caching, write traffic constitutes much less than one percent of Chubby's normal workload (see §4.1), so proxies allow a significant increase in the number of clients. If a proxy handles Nproxy clients, KeepAlive traffic is reduced by a factor of Nproxy, which might be 10 thousand or more. A proxy cache can reduce read traffic by at most the mean amount of read-sharing—a factor of around 10 (§4.1). But because reads constitute under 10% of Chubby's load at present, the saving in KeepAlive traffic is by far the more important effect.

Proxies add an additional RPC to writes and first-time reads. One might expect proxies to make the cell temporarily unavailable at least twice as often as before, because each proxied client depends on two machines that may fail: its proxy and the Chubby master.

Alert readers will notice that the fail-over strategy described in Section 2.9 is not ideal for proxies. We discuss this problem in Section 4.4.

3.2 Partitioning

As mentioned in Section 2.3, Chubby's interface was chosen so that the name space of a cell could be partitioned between servers. Although we have not yet needed it, the code can partition the name space by directory. If enabled, a Chubby cell would be composed of N partitions, each of which has a set of replicas and a master. Every node D/C in directory D would be stored on the partition P(D/C) = hash(D) mod N. Note that the meta-data for D may be stored on a different partition P(D) = hash(D') mod N, where D' is the parent of D (see the sketch after this section).

Partitioning is intended to enable large Chubby cells with little communication between the partitions. Although Chubby lacks hard links, directory modified-times, and cross-directory rename operations, a few operations still require cross-partition communication:
• ACLs are themselves files, so one partition may use another for permissions checks. However, ACL files are readily cached; only Open() and Delete() calls require ACL checks; and most clients read publicly accessible files that require no ACL.
• When a directory is deleted, a cross-partition call may be needed to ensure that the directory is empty.
Because each partition handles most calls independently of the others, we expect this communication to have only a modest impact on performance or availability.

Unless the number of partitions N is large, one would expect that each client would contact the majority of the partitions. Thus, partitioning reduces read and write traffic on any given partition by a factor of N but does not necessarily reduce KeepAlive traffic. Should it be necessary for Chubby to handle more clients, our strategy involves a combination of proxies and partitioning.
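The placement rule above can be stated in a few lines of C++; std::hash here is an arbitrary stand-in for whichever hash function the real code would use.

    #include <cstdint>
    #include <functional>
    #include <string>

    uint64_t HashDir(const std::string& dir) {
      return std::hash<std::string>{}(dir);
    }

    // Node D/C lives on partition P(D/C) = hash(D) mod N.
    uint64_t PartitionForNode(const std::string& parent_dir, uint64_t n) {
      return HashDir(parent_dir) % n;
    }

    // The meta-data for directory D itself hashes via D's parent D', so it
    // may live on a different partition: P(D) = hash(D') mod N.
    uint64_t PartitionForDirMetadata(const std::string& grandparent_dir, uint64_t n) {
      return HashDir(grandparent_dir) % n;
    }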
4 Use, surprises and design errors

4.1 Use and behaviour

The following table gives statistics taken as a snapshot of a Chubby cell; the RPC rate was seen over a ten-minute period. The numbers are typical of cells in Google.

  time since last fail-over           18 days
  fail-over duration                  14s
  active clients (direct)             22k
  additional proxied clients          32k
  files open                          12k
    naming-related                    60%
  client-is-caching-file entries      230k
  distinct files cached               24k
  names negatively cached             32k
  exclusive locks                     1k
  shared locks                        0
  stored directories                  8k
    ephemeral                         0.1%
  stored files                        22k
    0-1k bytes                        90%
    1k-10k bytes                      10%
    > 10k bytes                       0.2%
    naming-related                    46%
    mirrored ACLs & config info       27%
    GFS and Bigtable meta-data        11%
    ephemeral                         3%
  RPC rate                            1-2k/s
    KeepAlive                         93%
    GetStat                           2%
    Open                              1%
    CreateSession                     1%
    GetContentsAndStat                0.4%
    SetContents                       680ppm
    Acquire                           31ppm

Several things can be seen:
• Many files are used for naming; see §4.3.
• Configuration, access control, and meta-data files (analogous to file system super-blocks) are common.
• Negative caching is significant.
• 230k/24k ≈ 10 clients use each cached file, on average.
• Few clients hold locks, and shared locks are rare; this is consistent with locking being used for primary election and for partitioning data among replicas.
• RPC traffic is dominated by session KeepAlives; there are a few reads (which are cache misses), but very few writes or lock acquisitions.

Now we briefly describe the typical causes of outages in our cells. If we assume (optimistically) that a cell is "up" if it has a master that is willing to serve, then on a sample of our cells we recorded 61 outages over a period of a few weeks, amounting to 700 cell-days of data in total. We excluded outages due to maintenance that shut down the data centre. All other causes are included: network congestion, maintenance, overload, and errors due to operators, software, and hardware. Most outages were 15s or less, and 52 were under 30s; most of our applications are not affected significantly by Chubby outages under 30s. The remaining nine outages were caused by network maintenance (4), suspected network connectivity problems (2), software errors (2), and overload (1).

In a few dozen cell-years of operation, we have lost data on six occasions, due to database software errors (4) and operator error (2); none involved hardware error. Ironically, the operational errors involved upgrades to avoid the software errors. We have twice corrected corruptions caused by software in non-master replicas.

Chubby's data fits in RAM, so most operations are cheap. Mean request latency at our production servers is consistently a small fraction of a millisecond regardless of cell load until the cell approaches overload, when latency increases dramatically and sessions are dropped. Overload typically occurs when many sessions (> 90,000) are active, but can result from exceptional conditions: when clients made millions of read requests simultaneously (described in Section 4.3), and when a mistake in the client library disabled caching for some reads, resulting in tens of thousands of requests per second. Because most RPCs are KeepAlives, the server can maintain a low mean request latency with many active clients by increasing the session lease period (see §3). Group commit reduces the effective work done per request when bursts of writes arrive, but this is rare.

RPC read latencies measured at the client are limited by the RPC system and network; they are under 1ms for a local cell, but 250ms between antipodes. Writes (which include lock operations) are delayed a further 5-10ms by the database log update, but by up to tens of seconds if a recently-failed client cached the file. Even this variability in write latency has little effect on the mean request latency at the server because writes are so infrequent.

Clients are fairly insensitive to latency variation provided sessions are not dropped. At one point, we added artificial delays in Open() to curb abusive clients (see §4.5); developers noticed only when delays exceeded ten seconds and were applied repeatedly. We have found that the key to scaling Chubby is not server performance; reducing communication to the server can have far greater impact. No significant effort has been applied to tuning read/write server code paths; we checked that no egregious bugs were present, then focused on the scaling mechanisms that could be more effective. On the other hand, developers do notice if a performance bug affects the local Chubby cache, which a client may read thousands of times per second.
4.2 Java clients

Google's infrastructure is mostly in C++, but a growing number of systems are being written in Java [8]. This trend presented an unanticipated problem for Chubby, which has a complex client protocol and a non-trivial client-side library. Java encourages portability of entire applications at the expense of incremental adoption by making it somewhat irksome to link against other languages. The usual Java mechanism for accessing non-native libraries is JNI [15], but it is regarded as slow and cumbersome. Our Java programmers so dislike JNI that to avoid its use they prefer to translate large libraries into Java, and commit to supporting them.

Chubby's C++ client library is 7000 lines (comparable with the server), and the client protocol is delicate. To maintain the library in Java would require care and expense, while an implementation without caching would burden the Chubby servers. Thus our Java users run copies of a protocol-conversion server that exports a simple RPC protocol that corresponds closely to Chubby's client API. Even with hindsight, it is not obvious how we might have avoided the cost of writing, running and maintaining this additional server.

4.3 Use as a name service

Even though Chubby was designed as a lock service, we found that its most popular use was as a name server.

Caching within the normal Internet naming system, the DNS, is based on time. DNS entries have a time-to-live (TTL), and DNS data are discarded when they have not been refreshed within that period. Usually it is straightforward to pick a suitable TTL value, but if prompt replacement of failed services is desired, the TTL can become small enough to overload the DNS servers.

For example, it is common for our developers to run jobs involving thousands of processes, and for each process to communicate with every other, leading to a quadratic number of DNS lookups. We might wish to use a TTL of 60s; this would allow misbehaving clients to be replaced without excessive delay, and is not considered an unreasonably short replacement time in our environment. In that case, to maintain the DNS caches of a single job as small as 3 thousand clients would require 150 thousand lookups per second. (For comparison, a 2-CPU 2.6GHz Xeon DNS server might handle 50 thousand requests per second.) Larger jobs create worse problems, and several jobs may be running at once. The variability in our DNS load had been a serious problem for Google before Chubby was introduced.
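The arithmetic behind these figures is worth making explicit; the sketch below merely re-derives the 150 thousand lookups per second from the job size and TTL quoted above.

    // DNS load arithmetic from the example above: each of J processes
    // refreshes J name entries once per TTL, so lookups grow
    // quadratically with job size.
    #include <cstdio>

    int main() {
      const double kJobSize = 3000;   // processes in one job
      const double kTtlSeconds = 60;  // DNS time-to-live
      const double lookups = kJobSize * kJobSize;    // 9 million per TTL
      std::printf("%.0f lookups per TTL => %.0f lookups/s\n",
                  lookups, lookups / kTtlSeconds);   // 150,000/s
      return 0;
    }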
In contrast, Chubby's caching uses explicit invalidations, so a constant rate of session KeepAlive requests can maintain an arbitrary number of cache entries indefinitely at a client, in the absence of changes. A 2-CPU 2.6GHz Xeon Chubby master has been seen to handle 90 thousand clients communicating directly with it (no proxies); the clients included large jobs with communication patterns of the kind described above. The ability to provide swift name updates without polling each name individually is so appealing that Chubby now provides name service for most of the company's systems.

Although Chubby's caching allows a single cell to sustain a large number of clients, load spikes can still be a problem. When we first deployed the Chubby-based name service, starting a 3 thousand process job (thus generating 9 million requests) could bring the Chubby master to its knees. To resolve this problem, we chose to group name entries into batches so that a single lookup would return and cache the name mappings for a large number (typically 100) of related processes within a job.

The caching semantics provided by Chubby are more precise than those needed by a name service; name resolution requires only timely notification rather than full consistency. As a result, there was an opportunity for reducing the load on Chubby by introducing a simple protocol-conversion server designed specifically for name lookups. Had we foreseen the use of Chubby as a name service, we might have chosen to implement full proxies sooner than we did, in order to avoid the need for this simple, but nevertheless additional, server.

One further protocol-conversion server exists: the Chubby DNS server, which makes the naming data stored within Chubby available to DNS clients. This server is important both for easing the transition from DNS names to Chubby names, and to accommodate existing applications that cannot be converted easily, such as browsers.

4.4 Problems with fail-over

The original design for master fail-over (§2.9) requires the master to write new sessions to the database as they are created. In the Berkeley DB version of the lock server, the overhead of creating sessions became a problem when many processes were started at once. To avoid overload, the server was modified to store a session in the database not when it was first created, but instead when it attempted its first modification, lock acquisition, or open of an ephemeral file. In addition, active sessions were recorded in the database with some probability on each KeepAlive. Thus, the writes for read-only sessions were spread out in time.
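A minimal sketch of this deferred recording follows; the class and method names are illustrative, and the recording probability is our own choice, since the text does not give the constant.

    // Sketch of deferred session recording (names and probability are
    // illustrative, not Chubby's internals): a session is written to
    // the database on its first write-like operation, and otherwise
    // only with some probability on each KeepAlive, spreading the
    // writes for read-only sessions out in time.
    #include <random>

    struct Session {
      bool recorded = false;  // has this session been written to the DB?
    };

    class SessionRecorder {
     public:
      // First modification, lock acquisition, or ephemeral-file open.
      void OnFirstWriteLikeOp(Session& s) { Record(s); }

      void OnKeepAlive(Session& s) {
        // Record lazily: roughly one unrecorded session in fifty
        // is written out per KeepAlive (illustrative constant).
        if (!s.recorded && coin_(rng_)) Record(s);
      }

     private:
      void Record(Session& s) {
        if (s.recorded) return;
        s.recorded = true;
        // ... append the session to the replicated database here ...
      }
      std::mt19937 rng_{std::random_device{}()};
      std::bernoulli_distribution coin_{0.02};
    };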
Though it was necessary to avoid overload, this optimization has the undesirable effect that young read-only sessions may not be recorded in the database, and so may be discarded if a fail-over occurs. Although such sessions hold no locks, this is unsafe: if all the recorded sessions were to check in with the new master before the leases of discarded sessions expired, the discarded sessions could then read stale data for a while. This is rare in practice; in a large system it is almost certain that some session will fail to check in, and thus force the new master to await the maximum lease time anyway. Nevertheless, we have modified the fail-over design both to avoid this effect, and to avoid a complication that the current scheme introduces to proxies.

Under the new design, we avoid recording sessions in the database at all, and instead recreate them in the same way that the master currently recreates handles (§2.9,¶8). A new master must now wait a full worst-case lease time-out before allowing operations to proceed, since it cannot know whether all sessions have checked in (§2.9,¶6). Again, this has little effect in practice because it is likely that not all sessions will check in.

Once sessions can be recreated without on-disc state, proxy servers can manage sessions that the master is not aware of. An extra operation available only to proxies allows them to change the session that locks are associated with. This permits one proxy to take over a client from another when a proxy fails. The only further change needed at the master is a guarantee not to relinquish locks or ephemeral file handles associated with proxy sessions until a new proxy has had a chance to claim them.

4.5 Abusive clients

Google's project teams are free to set up their own Chubby cells, but doing so adds to their maintenance burden, and consumes additional hardware resources. Many services therefore use shared Chubby cells, which makes it important to isolate clients from the misbehaviour of others. Chubby is intended to operate within a single company, and so malicious denial-of-service attacks against it are rare. However, mistakes, misunderstandings, and the differing expectations of our developers lead to effects that are similar to attacks.

Some of our remedies are heavy-handed. For example, we review the ways project teams plan to use Chubby, and deny access to the shared Chubby name space until review is satisfactory. A problem with this approach is that developers are often unable to predict how their services will be used in the future, and how use will grow. Readers will note the irony of our own failure to predict how Chubby itself would be used.

The most important aspect of our review is to determine whether use of any of Chubby's resources (RPC rate, disc space, number of files) grows linearly (or worse) with the number of users or the amount of data processed by the project. Any linear growth must be mitigated by a compensating parameter that can be adjusted to reduce the load on Chubby to reasonable bounds. Nevertheless, our early reviews were not thorough enough.

A related problem is the lack of performance advice in most software documentation. A module written by one team may be reused a year later by another team with disastrous results. It is sometimes hard to explain to interface designers that they must change their interfaces not because they are bad, but because other developers may be less aware of the cost of an RPC.

Below we list some problem cases we encountered.

Lack of aggressive caching Originally, we did not appreciate the critical need to cache the absence of files, nor to reuse open file handles. Despite attempts at education, our developers regularly write loops that retry indefinitely when a file is not present, or poll a file by opening it and closing it repeatedly when one might expect they would open the file just once.

At first we countered these retry-loops by introducing exponentially-increasing delays when an application made many attempts to Open() the same file over a short period. In some cases this exposed bugs that developers acknowledged, but often it required us to spend yet more time on education. In the end it was easier to make repeated Open() calls cheap.
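A sketch of such a throttle appears below; the class name and constants are illustrative, not Chubby's actual implementation.

    // Sketch of the server-side counter-measure described above: the
    // artificial delay before serving Open() doubles when the same
    // client re-opens the same file many times in a short period.
    #include <algorithm>
    #include <chrono>
    #include <map>
    #include <string>
    #include <thread>

    class OpenThrottler {
     public:
      // Call before serving an Open() from `client` for `file`.
      void MaybeDelay(const std::string& client, const std::string& file) {
        auto& st = state_[client + ":" + file];
        const auto now = std::chrono::steady_clock::now();
        if (now - st.last > kQuietPeriod) st.attempts = 0;  // forgive
        st.last = now;
        if (++st.attempts > kFreeAttempts) {
          // Exponentially-increasing delay, capped to stay bounded.
          const int shift = std::min(st.attempts - kFreeAttempts, 10);
          std::this_thread::sleep_for(kBaseDelay * (1 << shift));
        }
      }

     private:
      struct State {
        std::chrono::steady_clock::time_point last;
        int attempts = 0;
      };
      static constexpr std::chrono::seconds kQuietPeriod{10};
      static constexpr std::chrono::milliseconds kBaseDelay{10};
      static constexpr int kFreeAttempts = 5;
      std::map<std::string, State> state_;
    };

As noted in §4.1, developers observed these delays only when they exceeded ten seconds and were applied repeatedly.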
Lack of quotas Chubby was never intended to be used as a storage system for large amounts of data, and so it has no storage quotas. In hindsight, this was naïve.

One of Google's projects wrote a module to keep track of data uploads, storing some meta-data in Chubby. Such uploads occurred rarely and were limited to a small set of people, so the space was bounded. However, two other services started using the same module as a means for tracking uploads from a wider population of users. Inevitably, these services grew until the use of Chubby was extreme: a single 1.5MByte file was being rewritten in its entirety on each user action, and the overall space used by the service exceeded the space needs of all other Chubby clients combined.

We introduced a limit on file size (256kBytes), and encouraged the services to migrate to more appropriate storage systems. But it is difficult to make significant changes to production systems maintained by busy people—it took approximately a year for the data to be migrated elsewhere.
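The corresponding check is simple; this sketch uses the 256kByte figure from the text, with an illustrative function and status type rather than Chubby's real API.

    // Sketch of the remedial file-size quota: reject writes that would
    // exceed the per-file limit (256 kBytes, the figure from the text).
    #include <cstddef>
    #include <string>

    constexpr std::size_t kMaxFileBytes = 256 * 1024;

    enum class Status { kOk, kFileTooLarge };

    Status CheckSetContents(const std::string& new_contents) {
      // Oversized data belongs in a proper storage system, not Chubby.
      return new_contents.size() <= kMaxFileBytes ? Status::kOk
                                                  : Status::kFileTooLarge;
    }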
Publish/subscribe There have been several attempts to use Chubby's event mechanism as a publish/subscribe system in the style of Zephyr [6]. Chubby's heavyweight guarantees and its use of invalidation rather than update in maintaining cache consistency make it slow and inefficient for all but the most trivial publish/subscribe examples. Fortunately, all such uses have been caught before the cost of redesigning the application was too large.

4.6 Lessons learned

Here we list lessons, and miscellaneous design changes we might make if we have the opportunity:

Developers rarely consider availability We find that our developers rarely think about failure probabilities, and are inclined to treat a service like Chubby as though it were always available. For example, our developers once built a system employing hundreds of machines that initiated recovery procedures taking tens of minutes when Chubby elected a new master. This magnified the consequences of a single failure by a factor of a hundred, in both time and the number of machines affected. We would prefer developers to plan for short Chubby outages, so that such an event has little or no effect on their applications. This is one of the arguments for coarse-grained locking.

Developers also fail to appreciate the difference between a service being up, and that service being available to their applications. For example, the global Chubby cell (see §2.12) is almost always up, because it is rare for more than two geographically distant data centres to be down simultaneously. However, its observed availability for a given client is usually lower than the observed availability of the client's local Chubby cell. First, the local cell is less likely to be partitioned from the client; and second, although the local cell may be down often due to maintenance, the same maintenance affects the client directly, so Chubby's unavailability is not observed by the client.

At present we use three mechanisms to prevent developers from being over-optimistic about Chubby availability, especially that of the global cell. First, as previously mentioned (§4.5), we review how project teams plan to use Chubby, and advise them against techniques that would tie their availability too closely to Chubby's. Second, we now supply libraries that perform some high-level tasks so that developers are automatically isolated from Chubby outages. Third, we use the post-mortem of each Chubby outage as a means not only of eliminating bugs in Chubby and our operational procedures, but of reducing the sensitivity of applications to Chubby's availability—both can lead to better availability of our systems overall.

Our API choices can also affect the way developers chose to handle Chubby outages. For example, Chubby provides an event that allows clients to detect when a master fail-over has taken place. The intent was for clients to check for possible changes, as other events may have been lost. Unfortunately, many developers chose to crash their applications on receiving this event, thus decreasing the availability of their systems substantially. We might have done better to send redundant "file change" events instead, or even to ensure that no events were lost during a fail-over.
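The intended reaction looks roughly like the following sketch; the hooks are hypothetical stand-ins for the real client library, shown only to illustrate the pattern of re-checking state rather than exiting.

    // On the fail-over event, events may have been lost, so re-check
    // every file the application depends on, as if a "file change"
    // event had arrived for each; do not crash the application.
    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical stand-ins for the client library (illustrative only).
    static std::map<std::string, std::string> g_state;
    std::string ReadFile(const std::string& name) {
      return "contents of " + name;  // placeholder for a Chubby read
    }
    void OnContentsChanged(const std::string& name,
                           const std::string& contents) {
      g_state[name] = contents;  // reconcile application state
    }

    void HandleMasterFailOver(const std::vector<std::string>& watched) {
      // A fail-over is an expected, short outage: re-verify, don't exit.
      for (const auto& name : watched) {
        OnContentsChanged(name, ReadFile(name));
      }
    }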
Fine-grained locking could be ignored At the end of Section 2.1 we sketched a design for a server that clients could run to provide fine-grained locking. It is perhaps a surprise that so far we have not needed to write such a server; our developers typically find that to optimize their applications, they must remove unnecessary communication, and that often means finding a way to use coarse-grained locking.

Poor API choices have unexpected effects For the most part, our API has evolved well, but one mistake stands out. Our means for cancelling long-running calls are the Close() and Poison() RPCs, which also discard the server state for the handle. This prevents handles that can acquire locks from being shared, for example, by multiple threads. We may add a Cancel() RPC to allow more sharing of open handles.

RPC use affects transport protocols KeepAlives are used both for refreshing the client's session lease, and for passing events and cache invalidations from the master to the client. This design has the automatic and desirable consequence that a client cannot refresh its session lease without acknowledging cache invalidations.

This would seem ideal, except that it introduced a tension in our choice of protocol. TCP's back-off policies pay no attention to higher-level timeouts such as Chubby leases, so TCP-based KeepAlives led to many lost sessions at times of high network congestion. We were forced to send KeepAlive RPCs via UDP rather than TCP; UDP has no congestion-avoidance mechanisms, so we would prefer to use UDP only when high-level time-bounds must be met.

We may augment the protocol with an additional TCP-based GetEvent() RPC which would be used to communicate events and invalidations in the normal case, used in the same way KeepAlives are. The KeepAlive reply would still contain a list of unacknowledged events, so that events must eventually be acknowledged.

5 Comparison with related work

Chubby is based on well-established ideas. Chubby's cache design is derived from work on distributed file systems [10]. Its sessions and cache tokens are similar in behaviour to those in Echo [17]; sessions reduce the overhead of leases [9] in the V system. The idea of exposing a general-purpose lock service is found in VMS [23], though that system initially used a special-purpose high-speed interconnect that permitted low-latency interactions. Like its caching model, Chubby's API is based on a file-system model, including the idea that a file-system-like name space is convenient for more than just files [18, 21, 22].
Chubby differs from a distributed file system such as Echo [17] or AFS [10] in its performance and storage aspirations: clients do not read, write, or store large amounts of data, and they do not expect high throughput or even low latency unless the data is cached. They do expect consistency, availability, and reliability, but these attributes are easier to achieve when performance is less important. Because Chubby's database is small, we are able to store many copies of it on-line (typically five replicas and a few backups). We take full backups multiple times per day, and via checksums of the database state we compare replicas with one another every few hours. The weakening of the normal file system performance and storage requirements allows us to serve tens of thousands of clients from a single Chubby master. By providing a central point where many clients can share information and co-ordinate activities, we have solved a class of problems faced by our system developers.

The large number of file systems and lock servers described in the literature prevents an exhaustive comparison, so we provide details on one: we chose to compare with Boxwood's lock server [16] because it was designed recently, it too is designed to run in a loosely-coupled environment, and yet its design differs in various ways from Chubby, some interesting and some incidental.

Chubby implements locks, a reliable small-file storage system, and a session/lease mechanism in a single service. In contrast, Boxwood separates these into three: a lock service, a Paxos service (a reliable repository for state), and a failure detection service, respectively. The Boxwood system itself uses these three components together, but another system could use these building blocks independently.

We suspect that this difference in design stems from a difference in target audience. Chubby was intended for a diverse audience and application mix; its users range from experts who create new distributed systems, to novices who write administration scripts. For our environment, a large-scale shared service with a familiar API seemed attractive. In contrast, Boxwood provides a toolkit that (to our eyes, at least) is appropriate for a smaller number of more sophisticated developers working on projects that may share code but need not be used together.

In many cases, Chubby provides a higher-level interface than Boxwood. For example, Chubby combines the lock and file name spaces, while Boxwood's lock names are simple byte sequences. Chubby clients cache file state by default; a client of Boxwood's Paxos service could implement caching via the lock service, but would probably use the caching provided by Boxwood itself.

The two systems have markedly different default parameters, chosen for different expectations: each Boxwood failure detector is contacted by each client every 200ms with a timeout of 1s; Chubby's default lease time is 12s and KeepAlives are exchanged every 7s. Boxwood's subcomponents use two or three replicas to achieve availability, while we typically use five replicas per cell. However, these choices alone do not suggest a deep design difference, but rather an indication of how parameters in such systems must be adjusted to accommodate more client machines, or the uncertainties of racks shared with other projects.
A more interesting difference is the introduction of Chubby's grace period, which Boxwood lacks. (Recall that the grace period allows clients to ride out long Chubby master outages without losing sessions or locks. Boxwood's "grace period" is the equivalent of Chubby's "session lease", a different concept.) Again, this difference is the result of differing expectations about scale and failure probability in the two systems. Although master fail-overs are rare, a lost Chubby lock is expensive for clients.

Finally, locks in the two systems are intended for different purposes. Chubby locks are heavier-weight, and need sequencers to allow external resources to be protected safely, while Boxwood locks are lighter-weight, and intended primarily for use within Boxwood.

6 Summary

Chubby is a distributed lock service intended for coarse-grained synchronization of activities within Google's distributed systems; it has found wider use as a name service and repository for configuration information.

Its design is based on well-known ideas that have meshed well: distributed consensus among a few replicas for fault tolerance, consistent client-side caching to reduce server load while retaining simple semantics, timely notification of updates, and a familiar file system interface. We use caching, protocol-conversion servers, and simple load adaptation to allow it to scale to tens of thousands of client processes per Chubby instance. We expect to scale it further via proxies and partitioning.

Chubby has become Google's primary internal name service; it is a common rendezvous mechanism for systems such as MapReduce [4]; the storage systems GFS and Bigtable use Chubby to elect a primary from redundant replicas; and it is a standard repository for files that require high availability, such as access control lists.
7 Acknowledgments

Many contributed to the Chubby system: Sharon Perl wrote the replication layer on Berkeley DB; Tushar Chandra and Robert Griesemer wrote the replicated database that replaced Berkeley DB; Ramsey Haddad connected the API to Google's file system interface; Dave Presotto, Sean Owen, Doug Zongker and Praveen Tamara wrote the Chubby DNS, Java, and naming protocol-converters, and the full Chubby proxy respectively; Vadim Furman added the caching of open handles and file-absence; Rob Pike, Sean Quinlan and Sanjay Ghemawat gave valuable design advice; and many Google developers uncovered early weaknesses.

References

[1] BIRMAN, K. P., AND JOSEPH, T. A. Exploiting virtual synchrony in distributed systems. In 11th SOSP (1987), pp. 123–138.
[2] BIRRELL, A., JONES, M. B., AND WOBBER, E. A simple and efficient implementation for small databases. In 11th SOSP (1987), pp. 149–154.
[3] CHANG, F., DEAN, J., GHEMAWAT, S., HSIEH, W. C., WALLACH, D. A., BURROWS, M., CHANDRA, T., FIKES, A., AND GRUBER, R. Bigtable: A distributed structured data storage system. In 7th OSDI (2006).
[4] DEAN, J., AND GHEMAWAT, S. MapReduce: Simplified data processing on large clusters. In 6th OSDI (2004), pp. 137–150.
[5] FISCHER, M. J., LYNCH, N. A., AND PATERSON, M. S. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (April 1985), 374–382.
[6] FRENCH, R. S., AND KOHL, J. T. The Zephyr Programmer's Manual. MIT Project Athena, Apr. 1989.
[7] GHEMAWAT, S., GOBIOFF, H., AND LEUNG, S.-T. The Google file system. In 19th SOSP (Dec. 2003), pp. 29–43.
[8] GOSLING, J., JOY, B., STEELE, G., AND BRACHA, G. Java Language Spec. (2nd Ed.). Addison-Wesley, 2000.
[9] GRAY, C. G., AND CHERITON, D. R. Leases: An efficient fault-tolerant mechanism for distributed file cache consistency. In 12th SOSP (1989), pp. 202–210.
[10] HOWARD, J., KAZAR, M., MENEES, S., NICHOLS, D., SATYANARAYANAN, M., SIDEBOTHAM, R., AND WEST, M. Scale and performance in a distributed file system. ACM TOCS 6, 1 (Feb. 1988), 51–81.
[11] JEFFERSON, D. Virtual time. ACM TOPLAS 7, 3 (1985), 404–425.
[12] LAMPORT, L. The part-time parliament. ACM TOCS 16, 2 (1998), 133–169.
[13] LAMPORT, L. Paxos made simple. ACM SIGACT News 32, 4 (2001), 18–25.
[14] LAMPSON, B. W. How to build a highly available system using consensus. In Distributed Algorithms, vol. 1151 of LNCS. Springer–Verlag, 1996, pp. 1–17.
[15] LIANG, S. Java Native Interface: Programmer's Guide and Reference. Addison-Wesley, 1999.
[16] MACCORMICK, J., MURPHY, N., NAJORK, M., THEKKATH, C. A., AND ZHOU, L. Boxwood: Abstractions as the foundation for storage infrastructure. In 6th OSDI (2004), pp. 105–120.
[17] MANN, T., BIRRELL, A., HISGEN, A., JERIAN, C., AND SWART, G. A coherent distributed file cache with directory write-behind. TOCS 12, 2 (1994), 123–164.
[18] MCJONES, P., AND SWART, G. Evolving the UNIX system interface to support multithreaded programs. Tech. Rep. 21, DEC SRC, 1987.
[19] OKI, B., AND LISKOV, B. Viewstamped replication: A general primary copy method to support highly-available distributed systems. In ACM PODC (1988).
[20] OLSON, M. A., BOSTIC, K., AND SELTZER, M. Berkeley DB. In USENIX (June 1999), pp. 183–192.
[21] PIKE, R., PRESOTTO, D., DORWARD, S., FLANDRENA, B., THOMPSON, K., TRICKEY, H., AND WINTERBOTTOM, P. Plan 9 from Bell Labs. Computing Systems 8, 3 (1995), 221–254.
[22] RITCHIE, D. M., AND THOMPSON, K. The UNIX time-sharing system. CACM 17, 7 (1974), 365–375.
[23] SNAMAN, JR., W. E., AND THIEL, D. W. The VAX/VMS distributed lock manager. Digital Technical Journal 1, 5 (Sept. 1987), 29–44.
[24] YIN, J., MARTIN, J.-P., VENKATARAMANI, A., ALVISI, L., AND DAHLIN, M. Separating agreement from execution for Byzantine fault tolerant services. In 19th SOSP (2003), pp. 253–267.