June 5, 2014
1 Introduction
For our final project we wanted to create a key-value store that would be consistent,
fast, and easy to understand and implement. Given these requirements we chose to build our
store on Raft, the brainchild of the RAMCloud group at Stanford. This algorithm
guarantees global linearizability across the space of keys as long as a quorum of writers
is maintained.
1.1 Motivation
Existing key-value stores, like Chubby, are immensely valuable in large systems, both
for their ability to manage locks and for their strong consistency guarantees. By
providing simple, flexible consistency, these applications give developers a useful
framework upon which to build larger and more complex applications.
Generally, consistent key-value stores are based on a replicated multi-Paxos implementation. Paxos is notorious for being difficult to understand and implement, which
means these applications are more prone to programmer errors, even if the underlying
algorithm is sound. Since our goal for this project was a well-tested and cleanly written
implementation, we ruled out Paxos as a candidate algorithm.
We also considered implementing a distributed hash table, such as Chord, for its scalable
performance and relatively simple implementation. However, the consistency model of
DHTs is significantly weaker than we desired (see below for a more thorough
explanation of our consistency model), so we were forced to rule out the use of
a DHT as well.
Having ruled out both DHTs and multi-Paxos we eventually concluded that Raft would
provide all of the guarantees that we desired without overly complicating the implementation.
1.2 Consistency requirements
The general goal for our project was to create a system that is strongly consistent
even in the face of dropped or delayed packets or a network partition. More specifically,
under our consistency model we guarantee that as soon as a setResponse has been received
by the client, any subsequent getResponse from the cluster returns the latest value of
that key, regardless of the underlying state of the cluster. We do not make any
guarantees about response time in the case of arbitrary failure. Moreover, we do not
protect against Byzantine failures, only fail-stop failures and network partitions.
2 Implementation
We chose to write our implementation of Raft in Python, owing to the language's
widespread popularity and general ease of use. Our code is written as a single class,
Node, that processes all requests in sequence. Built on the example code, it uses the
native ZeroMQ support for timeouts and blocking requests, and thereby avoids much of the
complexity of locking behavior. Because we wanted to match the variable names and
indexing scheme (starting at one instead of at zero) used in the Raft paper, we included
an additional class that manages a consistent log for each Node instance. This log
ensures that request IDs are never repeated, and that appends to the log only proceed
when the log is consistent with the state given in the incoming message.
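The report does not give this log class's interface, so the following is only a minimal sketch with hypothetical names, illustrating the two invariants described above: duplicate request IDs are acknowledged but never re-appended, and an append only proceeds when the log agrees with the state claimed in the incoming message.

```python
class Log:
    """One-indexed log, matching the Raft paper's indexing convention."""

    def __init__(self):
        self.entries = [None]          # index 0 is unused padding
        self.seen_request_ids = set()  # reject duplicate client requests

    def term_at(self, index):
        # Term 0 stands for the empty slot before the first real entry.
        if 1 <= index < len(self.entries):
            return self.entries[index]["term"]
        return 0

    def append(self, prev_index, prev_term, entry):
        """Append only if our log matches the sender's state; return success."""
        if entry["request_id"] in self.seen_request_ids:
            return True                # duplicate: already stored, just ack
        if prev_index >= len(self.entries) or self.term_at(prev_index) != prev_term:
            return False               # logs disagree; sender must back up
        del self.entries[prev_index + 1:]  # discard any conflicting suffix
        self.entries.append(entry)
        self.seen_request_ids.add(entry["request_id"])
        return True
```

A failed append tells the sender to retry with an earlier prev_index, which is how followers whose logs have fallen behind are repaired.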
2.1 Overview
For building our consensus we used the standard algorithm documented in the Raft
paper. Here we present the general approach of the algorithm without diving into
too many specifics. Raft has a strong leader property, so only leaders can serve get or
set requests. The leader pushes all changes out to all followers and only considers a
change committed once it is known to have been saved on a majority of the cluster. Most of
the difficulty of the algorithm comes into play when electing a leader; how we approach
this problem is detailed in the next section.
In addition to the basic Raft algorithm we wrote several chunks of code to make sure
that our implementation matched the expectations of the broker. More specifically, the
broker expects gets and sets to be synchronous, expects to receive a response only
from the node to which it sent the request, and expects never to receive stray or
duplicate responses.
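The majority rule used throughout can be written down in a couple of lines; this helper is our own illustration rather than code from the implementation:

```python
def quorum_size(cluster_size):
    # Smallest strict majority of the cluster, including the leader.
    return cluster_size // 2 + 1

def is_committed(ack_count, cluster_size):
    """An entry is committed once a majority of the cluster has stored it."""
    return ack_count >= quorum_size(cluster_size)
```

For example, a three-node cluster commits with two acknowledgements (the leader plus one follower), while a five-node cluster needs three.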
2.2 Algorithm
Each node in the cluster maintains a persistent randomized timer, so elections will be
attempted indefinitely while no leader is known. As soon as this timeout expires the node
begins an election by setting its state to candidate and sending a series of RequestVote
RPCs to all peers. If a candidate gets a majority of votes for a specific term it
declares itself the leader and sends an AppendEntries RPC to all peers in order to
notify them that it is now the leader. It also sends an AppendEntries RPC every
second to maintain its leadership.
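This election logic can be sketched as a small state machine (the names and timeout bounds here are our own; the actual implementation drives this from ZeroMQ socket timeouts rather than a loop):

```python
import random

class ElectionState:
    def __init__(self, name, peers):
        self.name = name
        self.peers = peers           # the other nodes in the cluster
        self.state = "follower"
        self.term = 0
        self.votes = set()

    def election_timeout(self):
        # Randomized so that two nodes rarely time out simultaneously.
        return random.uniform(0.15, 0.30)

    def on_timeout(self):
        """No heartbeat arrived before the timer expired: run for leader."""
        self.state = "candidate"
        self.term += 1
        self.votes = {self.name}     # vote for ourselves
        return [("RequestVote", peer, self.term) for peer in self.peers]

    def on_vote(self, voter):
        """Count a granted vote; become leader on a majority."""
        self.votes.add(voter)
        majority = (len(self.peers) + 1) // 2 + 1
        if self.state == "candidate" and len(self.votes) >= majority:
            self.state = "leader"
            # Announce leadership; repeated every second as a heartbeat.
            return [("AppendEntries", peer, self.term) for peer in self.peers]
        return []
```

The randomized timeout is what breaks ties: if two candidates split the vote, whichever one's timer fires first gets a head start in the next term.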
Once a leader has been elected (and the cluster is unavailable until this is the case),
gets and sets are processed by forwarding. When a cluster node receives a get request,
the node forwards it to the leader, which then reads the latest value of the key from
its internal store. The leader then sends a response back to the node that relayed the
message so that that node can respond to the client.
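A sketch of this relay path for gets, with hypothetical function and field names (the real code dispatches on message types rather than returning tuples):

```python
def handle_get(node, msg):
    """Route a get: followers relay to the leader, the leader answers."""
    if node["name"] != node["leader"]:
        # Record who relayed so the response can travel back the same way.
        forwarded = dict(msg, relay=node["name"])
        return ("forward", node["leader"], forwarded)
    # We are the leader: read the latest value from the internal store.
    value = node["store"].get(msg["key"])
    reply_to = msg.get("relay", node["name"])
    return ("reply", reply_to,
            {"type": "getResponse", "key": msg["key"], "value": value})
```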
Writes in Raft are slightly more complicated. If a non-leader node receives a set request
it simply forwards that request on to the leader. Upon receiving a set RPC, the leader
adds the entry to its log and sends an AppendEntries RPC to each of the nodes in
the cluster instructing them to add the entry to their logs. The leader will not respond
to the request until at least half of the cluster has responded saying that they have
added the item to their log. At this point the leader considers the operation committed,
but the other nodes are not yet aware of this fact. However, the leader includes the latest
committed index on every AppendEntries RPC, so eventually the entire cluster learns
of the new commit. When the node that originally handled the request learns that it
was committed, it responds to the client with a setResponse.
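The leader's side of this write path can be sketched as follows. This is a simplification with hypothetical names: retries and follower log repair are omitted, and messages are modeled as returned tuples.

```python
class LeaderState:
    def __init__(self, name, peers):
        self.name = name
        self.peers = peers
        self.log = []                # (key, value) entries, leader's copy
        self.acks = {}               # log index -> nodes that stored the entry
        self.commit_index = -1       # highest committed index
        self.store = {}              # the key-value state machine

    def on_set(self, key, value):
        """Append locally, then ask every follower to append too."""
        self.log.append((key, value))
        index = len(self.log) - 1
        self.acks[index] = {self.name}
        # commit_index piggybacks on every AppendEntries, which is how
        # followers eventually learn which entries are committed.
        return [("AppendEntries", peer, index, (key, value), self.commit_index)
                for peer in self.peers]

    def on_append_ack(self, peer, index):
        """Commit once a majority (including the leader) has the entry."""
        self.acks[index].add(peer)
        majority = (len(self.peers) + 1) // 2 + 1
        if len(self.acks[index]) >= majority and index > self.commit_index:
            self.commit_index = index
            key, value = self.log[index]
            self.store[key] = value  # apply; a setResponse may now be sent
            return True
        return False
```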
Some of the procedure outlined above is slightly complicated by the requirements of
the broker. We've already outlined how we route messages internally to the leader
and how the relay node is eventually notified of the success. To handle the case where
these messages are dropped, the relay node retries its request every second until it gets
a response. To facilitate this we make sure that we ignore duplicate responses, and the
log guarantees that each operation is appended only once.
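The retry-and-deduplicate behavior on the relay node can be sketched like this (send and recv are placeholders for the messaging layer; the real code uses ZeroMQ receive timeouts):

```python
def relay_with_retry(send, recv, request, max_tries=10):
    """Resend a forwarded request every second until its response arrives.

    Stray or duplicate responses (left over from earlier tries) are
    dropped by matching on request_id; the log layer separately
    guarantees each operation is appended at most once.
    """
    for _ in range(max_tries):
        send(request)
        response = recv(timeout=1.0)          # None if nothing arrives in time
        while response is not None:
            if response.get("request_id") == request["request_id"]:
                return response               # the answer we were waiting for
            response = recv(timeout=1.0)      # stray response: ignore it
    raise TimeoutError("no response from leader")
```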
2.3 Verification
To verify our implementation we ran a series of scripts against the broker. The first
script is as follows:
# Start all the nodes.
start node1 peer_names=node2,node3
start node2 peer_names=node1,node3
start node3 peer_names=node1,node2 leader
# Ensure that all nodes are connected:
set node3 bar 40
get node3 bar
# Simulate a network partition on node3.
delay 15 from node3 by 30
delay 15 to node3 from 30
get node1 bar
set node2 foo 5
get node1 foo
get node2 bar
When run, our implementation successfully processes the gets and sets in a consistent
manner, demonstrating the replication of data. This test ensured that our algorithm
was able to run a new leader election in the face of an extended network delay.
Next, we demonstrated that our algorithm requires a quorum in order to propagate
requests:
# Start all the nodes.
start node1 peer_names=node2,node3
start node2 peer_names=node1,node3
start node3 peer_names=node1,node2 leader
# Ensure that all nodes are connected:
set node3 bar 40
get node3 bar
# Deactivate network link around node3.
drop 15 from node3
drop 15 to node3
get node1 bar
set node2 foo 5
get node1 foo
get node2 bar
This test assured us that our algorithm halts writes until the network link is restored.
Though this does not on its own guarantee that our algorithm maintains consistency,
it puts our system squarely on the CP side of the CAP triangle.
After that, we ran a script to ensure that our implementation is resistant to multi node
partitions:
# Start nodes in this cluster.
start node1 peer_names=node2,node3,node4,node5
Number of operations    Run time (s)
100                     3.617 +/- 0.771
500                     6.115 +/- 0.789
1000                    9.865 +/- 0.120
Table 3.1: Run time with no failures

Number of operations    Run time (s)
100                     7.952 +/- 1.742
500                     37.634 +/- 6.124
1000                    69.693 +/- 3.512
Table 3.2: Run time with moderate fail rate (10 nodes, 10 keys, 10:1 op:fail ratio,
failure bound: 10)
Our testing uncovered a number of subtle bugs. Some arose as a result of a coincidence
of a leader election with a network partition. Appends to the log occasionally resulted
in extraneous entries being added to the set of acknowledged responses, triggering
erroneous get responses in certain cases. Relayed messages would occasionally be sent
through the wrong node in the event of a leader failover. AppendEntries retries
occasionally sidestepped leader status verification, triggering spurious commits.
As a result, we've continued to write more tests that target both specific types of
failures (partitions, among others) and generalized failure domains that can only be
explored by randomized search.
3 Performance
In designing this system our primary goals were consistency and fault tolerance. However,
we still ran the code through a suite of performance tests to understand its behavior
under various circumstances. For each data point we ran our fuzz script 5 times with
the appropriate parameters; we then averaged these values to produce the final data
point (we also give standard deviations).
In our first performance test we wanted to see how our system would perform in the
absence of failures. Thus, we ran 100, 500, and 1000 operations with no failures to see
how long each run took. Looking at the data, it seems that after paying a start-up
cost of around 2 seconds we run at around 100 operations per second. Please see Table
3.1.
Next we wanted to test how our implementation performed in the presence of modest
failures. To test this we chose to have one failure (either a drop or a delay) for every
10 operations. This introduces a significant hit to the run time, causing the operations
per second to drop to around 14. See Table 3.2.
Finally, we wanted to test our code in the face of extreme failures. Thus, we increased
the failure rate to one failure for every two operations and increased the maximum length
4 Conclusions
4.1 Lessons learned
Despite the ease of implementation claimed by the Raft paper, we nevertheless had
a fair amount of difficulty with the various subtle bugs that can manifest themselves
in a concurrent algorithm. Even with three experienced programmers on the team,
the debugging of a one-hour program took almost sixteen person-hours of automated
testing, patching, and discussion. Indeed, many of the problems were uncovered only
with systematic use of assert statements throughout the code, as well as abnormally
high failure rates. The use of fuzz testing with large numbers of operations and failures
proved invaluable in tracking down these bugs.
On the note of testing, we also learned the importance of continually testing the code.
At one point, after having fixed all of our bugs, we wanted to clarify and restructure
the code. We did so in one large push, which resulted in the reappearance of a number
of bugs that we had previously removed. We concluded that we would have been better
served by running tests after changing every function instead of waiting until we had
rewritten most of the code.
4.2 Future work
Although the current implementation meets all of the requirements that we laid out for
ourselves, there are still several improvements we would make if we had more time. First
of all, we would add log compaction, which is detailed in the Raft paper and which would
reduce memory overhead and start-up times. Another possible improvement is to add
dynamic node joining and departure. As it stands, nodes must know about every other
node in the network; Raft supports dynamic membership, although the details get a
little bit hairy. Finally, we might try to add Byzantine fault tolerance. Raft doesn't
handle Byzantine failures at all, so it would require some extra algorithmic legwork to
update our implementation so that it was tolerant of Byzantine faults; however, we feel
that a system resilient to both Byzantine and fail-stop faults would prove useful in
numerous real-world applications.
4.3 Contributions
Tristan did the lion's share of the actual implementation and helped write the design
document.
Alex helped track down a number of particularly pernicious bugs.
Jeremy was a pointy-haired manager and wrote the design document and fuzz test.