
Advanced Distributed Systems

Final Project: Raft Implementation

Tristan Rasmussen, Jeremy Archer, and Alex Kolchinski

June 5, 2014

1 Introduction
For our final project we wanted to create a key-value store that would be consistent,
fast, and easy to understand and implement. Given these requirements we chose to build our
store on Raft, the brainchild of the RAMCloud group at Stanford. This algorithm
guarantees global linearizability across the space of keys as long as a quorum of nodes
is maintained.
1.1 Motivation

Existing key-value stores, like Chubby, are immensely valuable in large systems, both
for their ability to manage locks and for their strong consistency guarantees. By
providing simple, flexible consistency, these services give developers a useful framework
upon which to build larger and more complex applications.
Generally, consistent key-value stores are based on a replicated multi-Paxos implementation. Paxos is notorious for being difficult to understand and implement, which
means these applications are more prone to programmer errors even when the underlying
algorithm is sound. Since our goal for this project was a well-tested and cleanly written
implementation, we ruled out Paxos as a candidate algorithm.
We also considered implementing a distributed hash table, such as Chord, for its scalable
performance and relatively simple implementation. However, the consistency model of
DHTs is significantly weaker than we would have desired (see below for a more thorough
explanation of our consistency model) and as such we were forced to rule out the use of
a DHT.

Having ruled out both DHTs and multi-Paxos, we eventually concluded that Raft would
provide all of the guarantees we desired without overly complicating the implementation.
1.2 Consistency requirements

The general goal for our project was to create a system that was strongly consistent
even in the face of dropped or delayed packets or a network partition. More specifically,
under our consistency model we guarantee that, as soon as a setResponse has been received
by the client, any subsequent getResponse from the cluster carries the latest value of that key,
regardless of the underlying state of the cluster. We do not make any guarantees about
response time in the case of arbitrary failure. Moreover, we do not protect against
Byzantine failures, only fail-stop failures and network partitions.

2 Implementation
We chose to write our implementation of Raft in Python, owing to the language's
widespread popularity and general ease of use. Our code is written as a single class, Node,
that processes all requests in sequence. Built on the example code, it uses the native
MQ support for timeouts and blocking requests and avoids much of the complexity of
locking behavior. Because we wanted to match the variable names and indexing scheme
(starting at one instead of at zero) used in the Raft paper, we included an additional
class that maintains a consistent log for each Node instance. This log ensures that request
IDs are never repeated, and that appends to the log only proceed when the log
is consistent with the state given in the incoming message.
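As a rough illustration of that consistency check, the following sketch shows the shape of such a log class. The names (ConsistentLog, LogEntry) and the exact fields are illustrative rather than taken from our actual code.

    from collections import namedtuple

    # Illustrative log entry: term, a unique request id, and the command to apply.
    LogEntry = namedtuple("LogEntry", ["term", "request_id", "command"])

    class ConsistentLog:
        """One-indexed log, matching the Raft paper's indexing scheme."""

        def __init__(self):
            self._entries = []              # zero-based internal storage
            self._seen_request_ids = set()  # guards against repeated request IDs

        def entry(self, index):
            # One-based lookup, as in the paper.
            return self._entries[index - 1]

        def last_index(self):
            return len(self._entries)

        def append(self, prev_index, prev_term, entry):
            """Append only if our log agrees with the sender's (prev_index, prev_term)."""
            if prev_index > self.last_index():
                return False                # we are missing earlier entries
            if prev_index > 0 and self.entry(prev_index).term != prev_term:
                return False                # conflicting history; reject
            if entry.request_id in self._seen_request_ids:
                return True                 # duplicate request: already logged
            del self._entries[prev_index:]  # drop any conflicting suffix
            self._entries.append(entry)
            self._seen_request_ids.add(entry.request_id)
            return True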
2.1 Overview

For building our consensus we used the standard algorithm documented in the Raft
paper. Here we will present the general approach of the algorithm without diving into
too many specifics. Raft has a strong leader property, so only the leader can serve get or
set requests. The leader pushes out all changes to all followers and only considers a
change committed once it is known to have been saved on a majority of the cluster. Most of
the difficulty of the algorithm comes into play when electing a leader; how we approach
this problem is detailed in the next section.
In addition to the basic Raft algorithm, we wrote several chunks of code to make sure
that our implementation matched the expectations of the broker. More specifically, the
broker expects gets and sets to be synchronous, expects to receive a response only
from the node to which it sent the request, and expects never to receive stray or duplicate
responses.

2.2 Algorithm

Each node in the cluster maintains a persistent, randomized timer that will attempt to elect
a new leader indefinitely. As soon as this timeout expires, the node begins an election
by setting its state to candidate and sending out a series of RequestVote RPCs to all
peers. If a candidate gets a majority of votes for a specific term, it declares itself the
leader and sends out an AppendEntries RPC to all peers in order to notify them that
it is now the leader. It also sends out an AppendEntries RPC every
second to maintain its leadership.
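A minimal sketch of this election timer and voting logic follows; the method and field names (and the timeout constants) are illustrative rather than taken from our actual Node class.

    import random
    import time

    ELECTION_TIMEOUT_RANGE = (1.5, 3.0)   # seconds, randomized per node (assumed values)
    HEARTBEAT_INTERVAL = 1.0              # the leader sends AppendEntries every second

    class NodeSketch(object):
        def __init__(self, name, peers):
            self.name, self.peers = name, peers
            self.state = "follower"
            self.current_term = 0
            self.votes = set()
            self.reset_election_timer()

        def reset_election_timer(self):
            self.election_deadline = time.time() + random.uniform(*ELECTION_TIMEOUT_RANGE)

        def tick(self):
            # Called on every pass through the main message loop.
            if self.state != "leader" and time.time() >= self.election_deadline:
                self.start_election()

        def start_election(self):
            self.state = "candidate"
            self.current_term += 1
            self.votes = set([self.name])                  # vote for ourselves
            for peer in self.peers:
                self.send(peer, {"type": "requestVote", "term": self.current_term})
            self.reset_election_timer()                    # retry if this election stalls

        def on_vote_granted(self, peer, term):
            if self.state == "candidate" and term == self.current_term:
                self.votes.add(peer)
                if 2 * len(self.votes) > len(self.peers) + 1:   # strict majority of the cluster
                    self.state = "leader"
                    self.broadcast_append_entries()        # announce leadership, then heartbeat

        def send(self, peer, message):
            pass                                           # placeholder for the messaging layer

        def broadcast_append_entries(self):
            pass                                           # placeholder heartbeat/replication call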
Once a leader has been elected (the cluster is unavailable until this is the case),
gets and sets are processed by forwarding. When a cluster node receives a get RPC,
the node forwards it to the leader, which then reads the latest value of the key from
its internal store. The leader then sends a response back to the node that relayed the
message so that that node can respond to the client.
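A rough sketch of this read path is shown below; the method is shown outside its class for brevity, and the message fields and helper names are illustrative.

    def on_get(self, message):
        # Any node accepts a get, but only the leader answers it.
        if self.state != "leader":
            self.relay_to_leader(message)           # forward and wait for the leader's answer
            return
        value = self.store.get(message["key"])      # latest committed value on the leader
        self.send(message["relay"], {"type": "getResponse",
                                     "id": message["id"],
                                     "key": message["key"],
                                     "value": value})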
Writes in Raft are slightly more complicated. If a non-leader node receives a set request,
it simply forwards that request on to the leader. Upon receiving a set RPC, the leader
adds the entry to its log and sends out an AppendEntries RPC to each of the nodes in
the cluster, instructing them to add the entry to their logs. The leader will not respond
to the request until at least half of the cluster has responded saying that they have
added the item to their logs. At this point the leader considers the operation committed,
but other nodes will not yet be aware of this fact. However, the leader includes the latest
committed index on every AppendEntries RPC, so eventually the entire cluster learns
of the new commit. When the node that originally handled the request learns that it
was committed, it responds to the client with a setResponse.
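The leader-side half of this write path can be sketched as follows; the helper names (acks, pending, apply_committed_entries) are our own for illustration and the methods are shown outside their class.

    def handle_set(self, key, value, request_id, relay_node):
        # Leader: append locally, replicate, and answer once a majority acknowledges.
        entry = LogEntry(term=self.current_term, request_id=request_id,
                         command=("set", key, value))
        self.log.append(self.log.last_index(), self.last_log_term(), entry)
        index = self.log.last_index()
        self.acks[index] = set([self.name])     # the leader's own copy counts
        self.pending[index] = relay_node        # remember who to notify on commit
        self.broadcast_append_entries()

    def on_append_entries_ack(self, peer, index):
        # Advance the commit index once more than half of the cluster has the entry.
        self.acks.setdefault(index, set()).add(peer)
        if 2 * len(self.acks[index]) > len(self.peers) + 1 and index > self.commit_index:
            self.commit_index = index
            self.apply_committed_entries()      # apply ("set", key, value) to the store
            relay = self.pending.pop(index, None)
            if relay is not None:
                self.send(relay, {"type": "setCommitted", "id": index})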
Some of the procedure outlined above is slightly complicated by the requirements of
the broker. We've already outlined how we route messages internally to the leader
and how the relay node is eventually notified of success. To handle the case where these
messages are dropped, the relay node retries its request every second until it gets a
response. To facilitate this we make sure that we ignore duplicate responses, and the log
guarantees that each operation will be appended only once.
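A rough sketch of this retry-and-deduplicate behavior on the relay node looks like the following; the scheduling helper and message fields are illustrative, not our exact code.

    RETRY_INTERVAL = 1.0    # seconds

    def relay_to_leader(self, request):
        # Forward a client request to the leader and retry until a response arrives.
        self.outstanding[request["id"]] = request
        self.send(self.leader, request)
        self.schedule(RETRY_INTERVAL, self.maybe_retry, request["id"])

    def maybe_retry(self, request_id):
        request = self.outstanding.get(request_id)
        if request is not None:                 # still unanswered: resend and re-arm the timer
            self.send(self.leader, request)
            self.schedule(RETRY_INTERVAL, self.maybe_retry, request_id)

    def on_leader_response(self, response):
        request = self.outstanding.pop(response["id"], None)
        if request is None:
            return                              # duplicate or stray response: drop it
        self.reply_to_client(request, response)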
2.3 Verification

Although we attempted to follow Raft as closely as possible, we understand that going
from a paper to an implementation is an imperfect process. Thus, to show that we really
do provide our guarantees, we made extensive use of assert statements and automated
testing. The assertions scattered throughout the code ensure that the system is always
in the expected state when running specific functions. These are generally easy to
understand while looking through the code, so we won't say anything more about them in this
writeup. However, we will discuss our various automated and synthetic tests and the
failure conditions they check against.
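A representative (purely illustrative) example of the kind of assertion used throughout the code:

    def become_leader(self):
        assert self.state == "candidate", "only a candidate may win an election"
        assert 2 * len(self.votes) > len(self.peers) + 1, "no majority of votes for this term"
        self.state = "leader"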
Our first test was a simple availability test. To ensure that the implementation could
survive network partitions, we wrote a test script that dropped the network access of
one node from its peers and attempted to read data from a different node. This test
script is as follows:
# Start all the nodes.
start node1 peer names=node2,node3
start node2 peer names=node1,node3
start node3 peer names=node1,node2 leader
# Ensure that all nodes are connected:
set node3 bar 40
get node3 bar
# Simulate a network partition on node3.
delay 15 from node3 by 30
delay 15 to node3 by 30
get node1 bar
set node2 foo 5
get node1 foo
get node2 bar
When run, our implementation successfully processes the gets and sets in a consistent
manner, demonstrating the replication of data. This test ensured that our algorithm
was able to run a new leader election in the face of an extended network delay.
Next, we demonstrated that our algorithm requires a quorum in order to propagate
requests:
# Start all the nodes.
start node1 peer names=node2,node3
start node2 peer names=node1,node3
start node3 peer names=node1,node2 leader
# Ensure that all nodes are connected:
set node3 bar 40
get node3 bar
# Deactivate network link around node3.
drop 15 from node3
drop 15 to node3
get node1 bar
set node2 foo 5
get node1 foo
get node2 bar
This test assured us that our algorithm halts, refusing writes until the packet link is restored. Though this does not by itself guarantee that our algorithm maintains consistency,
it puts our system squarely on the CP side of the CAP trade-off.
After that, we ran a script to ensure that our implementation is resistant to multi-node
partitions:
# Start nodes in this cluster.
start node1 peer names=node2,node3,node4,node5
start node2 peer names=node1,node3,node4,node5
start node3 peer names=node1,node2,node4,node5
start node4 peer names=node1,node2,node3,node5
start node5 peer names=node1,node2,node3,node4
#
split partition1 node1,node2
drop 10 from node3
drop 10 to node3
set node4 key1 value1
get node5 key1
# expect result "value1"
join partition1
# rejigger partitions
split partition2 node4,node5
get node1 key1
# expect result "value1"
# flush buffers
set node1 key2 value2
get node1 key2
# expect result "value2"
For this test, we start by partitioning off node1 and node2, isolating node3, and then triggering
a commit. Despite the initial lack of a quorum, Raft eventually succeeds with the set operation
after the temporary partition around node3 heals, resulting in stable data being
committed to a majority of the nodes in the cluster.
We then rejoin the original two nodes and partition off node4 and node5, producing a new quorum
that excludes them. Because Raft waits for consensus before returning a commit, the set cannot return
until node3 has acknowledged it, and so when
node4 and node5 are partitioned the data remains available. Finally, we flush the buffers
(ensuring that the result is actually printed) and then print the result of the get to the
user.
Finally, we measured the stability of the algorithm with a randomized test,
somewhat similar to the "GoCrazy" operation specified in the problem statement. This
test generated a set of 1000 reads and writes and interspersed 700 random failures
(each of which affects from 1 to 50 RPCs, with delays of up to 50 messages or total packet
drops) uniformly distributed across the space of nodes.
Crucially, the script also generates an "expected output" file that assumes a completely
sequential datastore; this ensures that the order of sets and gets is, at least from the
perspective of an independent synchronous client, totally consistent. While this is not a proof of robustness
under all circumstances, we feel that the large number of situations considered, as well
as the number of bugs found, justifies the effort spent creating the script.
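The essence of that expected-output generation can be sketched as follows: the same operations are replayed against a plain dict, which stands in for a single sequential datastore. The operation mix, failure syntax, and function name here are simplified illustrations rather than the actual script.

    import random

    def generate_trace(num_ops=1000, num_failures=700, nodes=10, keys=10):
        ops, expected, model = [], [], {}
        for i in range(num_ops):
            node = "node%d" % random.randint(1, nodes)
            key = "key%d" % random.randrange(keys)
            if random.random() < 0.5:
                value = "value%d" % i
                ops.append("set %s %s %s" % (node, key, value))
                model[key] = value
                expected.append("setResponse %s" % key)
            else:
                ops.append("get %s %s" % (node, key))
                expected.append("getResponse %s %s" % (key, model.get(key, "")))
        # Intersperse random failures, each hitting 1-50 RPCs around a random node.
        for _ in range(num_failures):
            target = "node%d" % random.randint(1, nodes)
            count = random.randint(1, 50)
            if random.random() < 0.5:
                failure = "drop %d from %s" % (count, target)
            else:
                failure = "delay %d from %s by %d" % (count, target, random.randint(1, 50))
            ops.insert(random.randrange(len(ops) + 1), failure)
        return ops, expected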
Number of operations    Run time (s)
100                     3.617 +/- 0.771
500                     6.115 +/- 0.789
1000                    9.865 +/- 0.120

Table 3.1: Run time with no failures (10 nodes, 10 keys)

Number of operations (Number of failures)    Run time (s)
100 (10)                                     7.952 +/- 1.742
500 (50)                                     37.634 +/- 6.124
1000 (100)                                   69.693 +/- 3.512

Table 3.2: Run time with moderate failure rate (10 nodes, 10 keys, 10:1 op:fail ratio, failure bound: 10)

As a result of this procedure, we uncovered a number of bugs in our implementation.
The AppendEntries RPC did not correctly return error responses when there were conflicting
log entries. The failover AppendEntries call occasionally lost committed data
as a result of a leader election coinciding with a network partition. Appends
to the log occasionally resulted in extraneous entries being added to the set of acknowledged
responses, triggering erroneous get responses in certain cases. Relayed messages would
occasionally be sent through the wrong node in the event of a leader failover. AppendEntries
retries occasionally sidestepped leader status verification, triggering spurious
commits.
As a result, we have continued to write more tests that target both specific types of failures
(partitions and the like) as well as generalized failure domains that can only be evaluated by
randomized search.

3 Performance
In designing this system our primary goal was consistency and fault tolerance. However,
we still ran the code through a suite of performance tests to understand its behavior
under various circumstances. For each data point we ran our fuzz script 5 times with
the appropriate parameters; we then averaged these values to create the nal data point
(we also give standard deviations)
In our first performance test we wanted to see how our system would perform in the
absence of failures. Thus, we ran 100, 500, and 1000 operations with no failures and measured
how long each run took. Looking at the data, it seems that after paying a start-up
cost of around 2 seconds we run at around 100 operations per second. See Table
3.1.
Next, we wanted to test how our implementation performed in the presence of modest
failures. To test this we chose to inject one failure (either a drop or a delay) for every 10
operations. This introduces a significant hit to the run time, causing the operations per
second to drop to around 14. See Table 3.2.
Finally, we wanted to test our code in the face of extreme failures. Thus, we increased
the failure rate to one failure for every two operations and increased the maximum length
of the failures to 50 messages. In this case even small tests took a long time to run, so
we only gathered data points for 100 and 200 operations. In both cases we process a
little more than 1 operation per second, and we require around n/3 total terms (where n
is the number of operations) to finish processing the data. See Table 3.3.

Number of operations (Number of failures)    Run time (s)         Terms required
100 (50)                                     82.915 +/- 8.004     31.8 +/- 7.414
200 (100)                                    172.212 +/- 9.137    64.6 +/- 8.663

Table 3.3: Run time with high failure rate (10 nodes, 10 keys, 2:1 op:fail ratio, failure bound: 50)

4 Conclusions and Reflections


4.1 Lessons learned

Despite the ease of implementation claimed by the Raft paper, we nevertheless had
a fair amount of difficulty with the various subtle bugs that can manifest themselves
in a concurrent algorithm. Even with three experienced programmers on the team,
the debugging of a one-hour program took almost sixteen person-hours of automated
testing, patching, and discussion. Indeed, many of the problems were uncovered only
through the systematic use of assert statements throughout the code, as well as abnormally
high failure rates. Fuzz testing with large numbers of operations and failures
proved invaluable in tracking down these bugs.
On the note of testing, we also learned the importance of continually testing the code.
At one point, after having fixed all of our known bugs, we wanted to clarify and restructure
the code. We did so in one large push, which resulted in a number of previously fixed
bugs reappearing. We concluded that we would have been better served
by running tests after changing each function instead of waiting until we had rewritten
most of the code.
4.2 Future work

Although the current implementation meets all of the requirements that we laid out for
ourselves, there are still several improvements we would make given more time. First,
we would add log compaction, which is detailed in the Raft paper and would
reduce memory overhead and start-up times. Another possible improvement is to add
dynamic node joining and departure; as it stands, nodes must know about every other
node in the network. Raft supports dynamic membership, although the details get a
little hairy. Finally, we might try to add Byzantine fault tolerance. Raft doesn't
handle Byzantine failures at all, so it would require some extra algorithmic legwork to
make our implementation tolerant of Byzantine faults; however, we feel
that a system resilient to both Byzantine and fail-stop faults would
prove useful in numerous real-world applications.
4.3 Contributions

Tristan did the lion's share of the actual implementation and helped write the design
document.
Alex helped track down a number of particularly pernicious bugs.
Jeremy was a pointy-haired manager and wrote the design document and fuzz test.
