
© The British Computer Society 2017. All rights reserved.

For Permissions, please email: journals.permissions@oup.com


Advance Access publication on 1 March 2017 doi:10.1093/comjnl/bxx016

Ring Paxos: High-Throughput Atomic Broadcast†

PARISA JALILI MARANDI1*, MARCO PRIMI2, NICOLAS SCHIPER3 AND FERNANDO PEDONE4

1 Microsoft Research, Cambridge, UK
2 Apple, Cupertino, USA
3 Midokura, Minato, Japan
4 University of Lugano, Lugano, Switzerland
* Corresponding author: pajalili@microsoft.com

Atomic broadcast is an important communication primitive often used to implement state machine replication. Despite the large number of atomic broadcast algorithms proposed in the literature, few papers have discussed how to turn these algorithms into efficient executable protocols. This paper focuses on a class of atomic broadcast algorithms based on Paxos, with its corresponding desirable properties: safety under asynchrony assumptions, liveness under weak synchrony assumptions and resiliency optimality. The paper presents two protocols, M-Ring Paxos and U-Ring Paxos, derived from Paxos. The protocols inherit the properties of Paxos and can be implemented very efficiently. We report a detailed performance analysis of M-Ring Paxos and U-Ring Paxos and compare them to other atomic broadcast protocols.

Keywords: software fault-tolerance; replication; total order broadcast; cluster computing

Received 31 March 2016; revised 9 November 2016; editorial decision 30 January 2017
Handling editor: George Loukas

1. INTRODUCTION

State machine replication is a fundamental approach to building fault-tolerant distributed systems [1, 2]. The idea is to replicate a service so that faulty replicas do not prevent operational replicas from executing service requests. State machine replication can be decomposed into two requirements regarding the dissemination of requests to replicas: (i) every non-faulty replica receives every request and (ii) no two replicas disagree on the order of delivered requests. These two requirements are often encapsulated in a group communication primitive known as atomic broadcast or total-order broadcast [3].

Being at the core of state machine replication, atomic broadcast has an important impact on the overall performance of a replicated service and a lot of effort has been put into designing efficient atomic broadcast algorithms [4]. Fewer papers, however, have discussed how to turn these algorithms into efficient systems. We focus on a class of atomic broadcast algorithms based on Paxos [5]. Paxos is a fundamental protocol, used to implement core infrastructure (e.g. storage systems [6], locking services [7] and distributed databases [8]). Paxos has important properties: it is safe under asynchrony assumptions, live under weak synchrony assumptions and resiliency-optimal, that is, it requires a majority of non-faulty processes to ensure progress. We are interested in Paxos implementations that aim at high throughput.

We define the efficiency of an atomic broadcast protocol as the ratio between its maximum achieved throughput per receiver and the nominal transmission capacity of the system per receiver. For example, a protocol that allows a receiver to deliver up to 500 Mbps of application data in a system equipped with a gigabit network has efficiency 0.5 or 50%. An efficient protocol would have high efficiency (e.g. 90% or more), ideally independent of the number of receivers. Due to inherent limitations of an algorithm, implementation details, and various overheads (e.g. added by the network layers), typical atomic broadcast protocols are not ideal according to this metric.

† This paper is an extended version of 'Ring Paxos: A High-Throughput Atomic Broadcast Protocol', published at DSN 2010 by the same authors. This paper extends our previous work in four directions: (i) it proposes a new protocol well suited for environments in which IP-multicast is not available, including its implementation and detailed performance results; (ii) it extends the design considerations for the algorithms we propose; (iii) it adds flow control to Ring Paxos and evaluates its effectiveness; (iv) it evaluates both protocols when data are persisted through synchronous disk writes.


This paper presents M-Ring Paxos and U-Ring Paxos, two efficient atomic broadcast protocols: they have efficiency above 90%, which in some cases does not depend on the number of receivers. The protocols are based on Paxos and inherit many of its characteristics: they are safe under asynchrony assumptions, live under weak synchrony assumptions and resiliency optimal. We revisit Paxos in light of a number of optimizations and from these we derive M-Ring Paxos and U-Ring Paxos. Although our optimizations mostly address atomic broadcast in a clustered environment, some of our design considerations are general enough to be useful in the development of other distributed protocols.

M-Ring Paxos uses IP-multicast and unicast (i.e. UDP). Network-level multicast is a powerful communication primitive to propagate messages to a set of processes in a cluster since it delegates to the interconnect (i.e. ethernet switch) most of the communication. However, IP-multicast is subject to message losses, mostly due to network congestion and buffer overflow (i.e. when the receiver is not able to consume messages at the rate they are transmitted). M-Ring Paxos can cope with message loss and therefore benefit from the throughput that IP-multicast can provide without falling prey to its shortcomings. To evenly balance the incoming and outgoing communication needed to totally order messages, M-Ring Paxos places f + 1 nodes in a logical ring, where f is the number of tolerated faulty processes.

U-Ring Paxos uses unicast communication only (i.e. TCP) and disposes all processes in a ring. U-Ring Paxos is not the first atomic broadcast protocol to place nodes in a logical ring (e.g. Totem [9], LCR [10] and the protocol in [11] have done it before), but it is the first to achieve high throughput, almost constant with the number of receivers, with the resilience and synchrony assumptions of Paxos.

We implemented M-Ring Paxos and U-Ring Paxos and compared their performance to other protocols. In particular, our Paxos-derived protocols can reach an efficiency of >90% in a gigabit network. M-Ring Paxos has low delivery latency, below 5 msec, which remains approximately constant with an increasing number of receivers (up to 25 receivers in our experiments). Previous implementations of the Paxos protocol, based either on IP-multicast only or on unicast only, have efficiency below 90%. The only other atomic broadcast protocol that achieves high efficiency we know of is LCR [10], a pure ring-based protocol. But LCR relies on stronger synchrony assumptions than Paxos. U-Ring Paxos and LCR have latency that depends on the number of processes in the ring.

This paper makes three contributions: First, it proposes novel atomic broadcast algorithms for clustered networks derived from Paxos. Second, it describes implementations of these algorithms. Third, it analyses their performance and compares them to other atomic broadcast protocols. The remainder of the paper is structured as follows. Section 2 describes our system model. Section 3 reviews Paxos. Section 4 presents the observations our protocols are based on and introduces M-Ring Paxos and U-Ring Paxos. Section 5 evaluates the performance of the protocols and compares them quantitatively to a number of other protocols. Section 6 comments on related work. Section 7 concludes the paper. The Appendix contains a correctness argument for our protocols.

2. MODEL AND DEFINITIONS

2.1. System model

We assume a distributed system model in which processes communicate by exchanging messages. Processes can fail by crashing but never perform incorrect actions (i.e. no Byzantine failures). Communication can be based on unicast, through the primitives send (p, m) and receive (m), and multicast, through the primitives ip-multicast (g, m) and ip-deliver (m), where m is a message, p is a process and g is the group of processes m is addressed to. Messages can be lost but not corrupted. In the text, we sometimes refer to multicast messages as packets.

Our protocols, like Paxos, ensure safety under both asynchronous and synchronous execution periods. The FLP impossibility result [12] states that in an asynchronous system consensus and atomic broadcast cannot be both safe and live. We thus assume that the system is partially synchronous [13], that is, it is initially asynchronous and eventually becomes synchronous. The time when the system becomes synchronous is called the Global Stabilization Time (GST) [13], and it is unknown to the processes.

Before GST, there are no bounds on the time it takes for messages to be transmitted and actions to be executed. After GST, such bounds exist but are unknown. Moreover, in order to prove liveness, we assume that after GST all remaining processes are correct—a process that is not correct is faulty. A correct process is operational 'forever' and can reliably exchange messages with other correct processes. Notice that in practice, 'forever' means long enough for consensus to terminate.

2.2. Consensus and atomic broadcast

Since consensus and atomic broadcast are two distributed agreement problems at the core of state machine replication, we briefly define them here. The problems are related: atomic broadcast can be implemented using a sequence of consensus executions [14]. Consensus is defined by the primitives propose (v) and decide (v), where v is an arbitrary value; atomic broadcast is defined by the primitives broadcast (m) and deliver (m), where m is a message.

Consensus guarantees that (i) if a process decides v then some process proposed v; (ii) no two processes decide different values; and (iii) if one or more correct processes propose a value then eventually some value is decided by all correct processes. Atomic broadcast guarantees that (i) if a process delivers m, then all correct processes deliver m; (ii) no two processes deliver messages in different orders; and (iii) if a correct process broadcasts m, then all correct processes deliver m.
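To make the relation between the two primitives concrete, the following is a minimal, single-process sketch (ours, not the paper's protocol) of how atomic broadcast can be layered on a sequence of consensus executions, as in [14]: instance i decides the i-th message, and learners deliver decided values in instance order. The names ConsensusInstance and AtomicBroadcast are introduced here only for illustration.

    class ConsensusInstance:
        """One consensus instance: decides at most one value (stubbed here)."""
        def __init__(self):
            self.decided = None

        def propose(self, value):
            # A real implementation would run Paxos; this stub decides the
            # first proposed value, which is enough to show the interface.
            if self.decided is None:
                self.decided = value
            return self.decided


    class AtomicBroadcast:
        """Atomic broadcast built from a sequence of consensus instances."""
        def __init__(self):
            self.instances = {}          # instance number -> ConsensusInstance
            self.next_to_deliver = 0

        def broadcast(self, message, instance):
            inst = self.instances.setdefault(instance, ConsensusInstance())
            return inst.propose(message)

        def deliver_ready(self):
            """Deliver decided messages in instance order (total order)."""
            delivered = []
            while True:
                inst = self.instances.get(self.next_to_deliver)
                if inst is None or inst.decided is None:
                    break
                delivered.append(inst.decided)
                self.next_to_deliver += 1
            return delivered


    ab = AtomicBroadcast()
    ab.broadcast("m1", instance=0)
    ab.broadcast("m2", instance=1)
    print(ab.deliver_ready())   # ['m1', 'm2'] -- same order at every learner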


3. BACKGROUND: PAXOS

Paxos is a fault-tolerant consensus algorithm intended for state machine replication [5]. We describe next how a value is decided in a single instance of consensus.

Paxos distinguishes three roles: proposers, acceptors and learners. A process can execute any of these roles, and multiple roles simultaneously. Proposers propose a value, acceptors choose a value and learners learn the decided value. Hereafter, Na denotes the set of acceptors, Nl the set of learners and Qa any majority quorum of acceptors (m-quorum), that is, a subset of Na of size ⌈(|Na| + 1)/2⌉. The execution of one consensus instance proceeds in a sequence of rounds, uniquely identified by a round number, a positive integer. For each round, one process, typically among the proposers or acceptors, plays the role of coordinator of the round. To propose a value, proposers send the value to the coordinator. The coordinator maintains two variables: (i) c-rnd is the highest numbered round that the coordinator has started; and (ii) c-val is the value that the coordinator has picked for round c-rnd. The first is initialized to 0 and the second to null.

Acceptors maintain three variables: (i) rnd is the highest numbered round in which the acceptor has participated, initially 0; (ii) v-rnd is the highest numbered round in which the acceptor has cast a vote, initially 0—it follows that v-rnd ≤ rnd always holds; and (iii) v-val is the value voted by the acceptor in round v-rnd, initially null.

Algorithm 1 Paxos (for process p).
 1: Task 1 (coordinator)
 2:   upon receiving value v from proposer P(v)
 3:     increase c-rnd to an arbitrary unique value
 4:     for all q ∈ Na do send (q, (PHASE 1A, c-rnd))
 5: Task 2 (acceptor)
 6:   upon receiving (PHASE 1A, c-rnd) from coordinator
 7:     if c-rnd > rnd then
 8:       let rnd be c-rnd
 9:       send (coordinator, (PHASE 1B, rnd, v-rnd, v-val))
10: Task 3 (coordinator)
11:   upon receiving (PHASE 1B, rnd, v-rnd, v-val) from Qa such that c-rnd = rnd
12:     let k be the largest v-rnd value received
13:     let V be the set of (v-rnd, v-val) received with v-rnd = k
14:     if k = 0 then let c-val be v
15:     else let c-val be the only v-val in V
16:     for all q ∈ Na do send (q, (PHASE 2A, c-rnd, c-val))
17: Task 4 (acceptor)
18:   upon receiving (PHASE 2A, c-rnd, c-val) from coordinator
19:     if c-rnd ≥ rnd then
20:       let rnd be c-rnd
21:       let v-rnd be c-rnd
22:       let v-val be c-val
23:       send (coordinator, (PHASE 2B, v-rnd, v-val))
24: Task 5 (coordinator)
25:   upon receiving (PHASE 2B, v-rnd, v-val) from Qa such that c-rnd = v-rnd
26:     for all q ∈ Nl do send (q, (Decision, v-val))

Algorithm 1 provides an overview of Paxos. The algorithm has two phases. To execute Phase 1, the coordinator picks a unique round number c-rnd greater than any round it has picked so far, and sends it to the acceptors (Task 1). Upon receiving such a message (Task 2), an acceptor checks whether the round proposed by the coordinator is greater than any round it has received so far; if so, the acceptor promises not to accept any future message with a round smaller than c-rnd. The acceptor then replies to the coordinator with the highest numbered round in which it has cast a vote, if any, and the value it voted for in that round. Notice that the coordinator does not send any proposal in Phase 1.

The coordinator starts Phase 2 after receiving a reply from an m-quorum (Task 3). Before proposing a value in Phase 2, the coordinator checks whether some acceptor has already cast a vote in a previous round. This mechanism guarantees that only one value can be chosen in an instance of consensus. If an acceptor has voted for a value in a previous round, then the coordinator will propose this value; otherwise, if no acceptor has cast a vote in a previous round, then the coordinator can propose the value received from the proposer. In some cases, it may happen that more than one acceptor has cast a vote in a previous round. In this case, the coordinator picks the value that was voted for in the highest numbered round. From the algorithm, two acceptors cannot cast votes for different values in the same round.

An acceptor will vote for a value c-val with corresponding round c-rnd in Phase 2 if the acceptor has not received any Phase 1 message for a higher round (Task 4). Voting for a value means setting the acceptor's variables v-rnd and v-val to the values sent by the coordinator. If the acceptor votes for the value received, it replies to the coordinator. When the coordinator receives replies from an m-quorum (Task 5), it knows that a value has been decided and sends the decision to the learners.

In order to know whether their values have been decided, proposers are typically also learners. If a proposer does not learn its proposed value after a certain time (e.g. because its message to the coordinator was lost), it proposes the value again. As long as a non-faulty coordinator is eventually selected, there is a majority quorum of non-faulty acceptors, and at least one non-faulty proposer, every consensus instance will eventually decide on a value.
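For readers who prefer running code to pseudocode, the following Python sketch mirrors the variables and message handlers of Algorithm 1 for a single consensus instance. It is a simplification we add for illustration: 'channels' are plain function calls, message loss and concurrent coordinators are ignored, and all names are ours.

    class Acceptor:
        def __init__(self):
            self.rnd, self.v_rnd, self.v_val = 0, 0, None

        def phase1a(self, c_rnd):                      # Task 2
            if c_rnd > self.rnd:
                self.rnd = c_rnd
                return ("1B", self.rnd, self.v_rnd, self.v_val)
            return None

        def phase2a(self, c_rnd, c_val):               # Task 4
            if c_rnd >= self.rnd:
                self.rnd = self.v_rnd = c_rnd
                self.v_val = c_val
                return ("2B", self.v_rnd, self.v_val)
            return None


    def run_instance(acceptors, value, c_rnd):
        """Coordinator side (Tasks 1, 3 and 5) for one consensus instance."""
        quorum = len(acceptors) // 2 + 1
        # Phase 1: collect promises from a majority quorum.
        promises = [r for a in acceptors if (r := a.phase1a(c_rnd))][:quorum]
        if len(promises) < quorum:
            return None
        # Pick the value voted in the highest round, or the proposer's value.
        k = max(p[2] for p in promises)
        c_val = value if k == 0 else next(p[3] for p in promises if p[2] == k)
        # Phase 2: collect votes from a majority quorum, then decide.
        votes = [r for a in acceptors if (r := a.phase2a(c_rnd, c_val))][:quorum]
        return c_val if len(votes) >= quorum else None


    acceptors = [Acceptor() for _ in range(3)]
    print(run_instance(acceptors, "request-42", c_rnd=1))   # request-42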


Algorithm 1 can be optimized in a number of ways [5]. The coordinator can execute Phase 1 before a value is received from a proposer. In doing so, once the coordinator receives a value from a proposer, consensus can be reached in four communication steps, as opposed to six. Moreover, if acceptors send Phase 2B messages directly to the learners, the number of communication steps for a decision is further reduced to three (see Fig. 3(a)).

4. RING PAXOS

Our Ring Paxos protocols are variations of Paxos, optimized for local area networks. In Section 4.1, we motivate the most important design decisions behind our protocols. In Sections 4.2 and 4.3, we detail the multicast- and unicast-based versions of Ring Paxos under normal conditions, that is, in the absence of process crashes and message losses. In Sections 4.4, 4.5 and 4.6, we discuss, respectively, abnormal conditions, flow control and garbage collection in Ring Paxos. We argue for the correctness of the protocols in the Appendix.

4.1. Design considerations

Ring Paxos was motivated by best practices in the design of networked systems where messages are subject to small delays (e.g. a local-area network).¹

We designed Ring Paxos around two main ideas: First, we separate message ordering from payload propagation, where payload is the value proposed by the application. Second, we use efficient techniques to order messages and propagate payload, as explained next.

Atomic broadcast protocols typically require several rounds of message exchange, thus separating message ordering from payload propagation saves bandwidth: the payload needs to be reliably propagated to the acceptors and learners only once. Care must be taken, however, to ensure that at least f + 1 acceptors store the message payload before it is ordered; otherwise we risk losing payloads. Ring Paxos solves this issue with a simple technique, as we explain next.

To efficiently order messages, Phase 2 of Ring Paxos relies on a communication pattern called pipelined communication. Acceptors are placed along a logical ring and when an acceptor receives a Phase 2 message (either a Phase 2A or 2B), it forwards this message along with its own Phase 2B message to the next acceptor in the ring. This process continues until the coordinator is reached. Pipelined communication allows extensive use of batching, which results in fewer protocol messages: Phase 2 messages tend to be small (recall that we only order message identifiers) and many of these messages can be batched together (i.e. instead of performing one send operation per each message, we perform one send operation for a batch of messages, thus improving performance). In addition, this communication scheme better balances the incoming and outgoing communication at the acceptors and coordinator.

We illustrate the advantage of pipelined communication by considering the standard many-to-one communication pattern, where upon receiving a Phase 2A message from the coordinator, an acceptor directly replies to the coordinator. In pipelined communication, upon receiving a Phase 2A message, the first acceptor at the head of the pipeline builds a Phase 2B message and forwards it to its successor. This Phase 2B message contains an integer field whose value is incremented by each acceptor along the ring as an indication of the acceptor's vote, without additional Phase 2B messages being initiated on the path to the coordinator. Thus, the size of a Phase 2B message that reaches the coordinator in the pipelined communication is equal to the size of the individual Phase 2B messages that acceptors send to the coordinator in the many-to-one communication pattern. Therefore, to make one decision the coordinator consumes n times more incoming bandwidth and CPU in the many-to-one communication pattern than in the pipelined communication pattern, where n is the number of acceptors other than the coordinator.

We compare the efficiency of pipelined and many-to-one communication in an experiment with five acceptors, where one acceptor plays the role of the coordinator. We ran the experiments in a cluster of Dell SC1435 nodes equipped with 2 dual-core AMD-Opteron 2.0 GHz CPUs and 4 GB of main memory. The servers were interconnected with an HP ProCurve 2900-48G Gigabit switch (0.1 msec of round-trip time). For more details about these experiments, we refer the reader to Section 5. The top left graph of Fig. 1 shows the incoming throughput at the coordinator from only one of the acceptors connected to it. In pipeline, there is only one acceptor directly connected to the coordinator whereas in many-to-one there are four. The bottom left graph shows the corresponding number of instances decided at the coordinator. The graphs on the right show the CPU usage at the acceptors (top) and the coordinator (bottom). For the pipelined communication, the CPU is shown for an acceptor at the middle of the path.

Pipeline is more CPU intensive at the acceptors as each acceptor (except the first one) both receives and forwards a Phase 2B message, whereas the coordinator only receives Phase 2B messages. However, as we will show later, acceptors are not the most CPU-loaded nodes in Ring Paxos, when the rest of the protocol is also considered. From the results, pipeline is preferable to many-to-one with respect to throughput. For small messages, this happens because pipeline is less CPU intensive at the coordinator. For large messages, the advantage stems from a balanced use of the incoming and outgoing bandwidth of the nodes.

¹ While Ring Paxos could be deployed in a wide-area environment, in such settings latency, not throughput, tends to be the parameter of optimization.
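The difference between the two Phase 2B patterns can be sketched in a few lines. In the fragment below (ours, with an in-memory 'inbox' standing in for the coordinator's incoming link), the pipelined variant forwards a single Phase 2B message whose vote counter is incremented at each acceptor, whereas many-to-one delivers one message per acceptor to the coordinator.

    def many_to_one(acceptors, coordinator_inbox, phase2a):
        """Each acceptor replies directly to the coordinator (n messages in)."""
        for acc in acceptors:
            coordinator_inbox.append(("2B", phase2a["round"], acc["id"]))


    def pipelined(acceptors, coordinator_inbox, phase2a):
        """Acceptors forward one Phase 2B message along the ring, incrementing
        a vote counter instead of creating new messages (1 message in)."""
        msg = {"round": phase2a["round"], "votes": 0}
        for acc in acceptors:                 # ring order, coordinator last
            msg["votes"] += 1                 # this acceptor's vote
        coordinator_inbox.append(("2B", msg["round"], msg["votes"]))


    acceptors = [{"id": i} for i in range(4)]
    inbox_a, inbox_b = [], []
    many_to_one(acceptors, inbox_a, {"round": 1})
    pipelined(acceptors, inbox_b, {"round": 1})
    print(len(inbox_a), len(inbox_b))   # 4 incoming messages vs. 1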


[Figure 1: four plots versus packet size (KB, 0.5-8), comparing the pipeline and many-to-one patterns: throughput (Mbps) at the coordinator, CPU (%) at an acceptor, throughput (instances/s) at the coordinator and CPU (%) at the coordinator.]

FIGURE 1. How to efficiently propagate Phase 2B messages to the coordinator with four acceptors and one coordinator. The throughput in the top left graph is the receiving throughput from only one of the incoming links of the coordinator. The number of incoming links at the coordinator for the pipeline and many-to-one patterns is one and four, respectively. As an example, in this graph, when the throughput shown for many-to-one is 200 Mbps the aggregate incoming bandwidth consumed is 800 Mbps. (The left-most value in all the graphs corresponds to packets with 32 bytes.)

The ring topology allows Ring Paxos to order message identifiers efficiently. To obtain a high-throughput atomic broadcast protocol, we need a way to efficiently propagate payloads. We do so by either relying on network-level multicast or the ring itself, in case multicast is not available. Network-level multicast enables high-throughput propagation of messages to the nodes of a cluster [15]. There are two reasons for this. First, multicast delegates to the interconnect (i.e. ethernet switch) the work of transferring messages to each one of the destinations. Second, to propagate a message to all destinations there is only one system call and one context switch from a user process to the operating system at the sender. If the sender contacts each destination individually using unicast communication, however, there is one system call and one context switch per destination. A throughput-efficient alternative to multicast is to have nodes communicate in a pipelined pattern such that each node sends the message only once, making better use of its CPU and bandwidth resources [10].

Figure 2 assesses the performance of the three communication strategies. Clearly, multicast and pipeline unicast perform much better than one-to-many unicast (i.e. when the sender contacts each destination individually). Moreover, the results depicted in the graph on the bottom of Fig. 2 show that this performance advantage does not come at the expense of increased CPU utilization, as all three strategies compare similarly with respect to this metric. This happens because in all cases the nodes sending messages communicate at maximum rate.

4.2. Multicast-based Ring Paxos (M-Ring Paxos)

We now present the Ring Paxos algorithm based on multicast communication. In Algorithm 2, statements in gray are the same for Paxos and M-Ring Paxos. As in Paxos, the execution is divided in two phases. Moreover, the mechanism to ensure that only one value can be decided in an instance of consensus is the same as in Paxos.

Differently from Paxos, M-Ring Paxos disposes a majority quorum of acceptors (m-quorum) in a logical directed ring (see Fig. 3b and c). The coordinator also plays the role of acceptor in M-Ring Paxos, and it is the last acceptor in the ring.


[Figure 2: two plots versus the number of receivers (0-25): maximum throughput (Mbps) per receiver and CPU (%) per sender, for unicast, multicast and pipeline.]

FIGURE 2. Performance comparison of unicast, multicast and pipeline to propagate payloads.

[Figure 3: message-flow diagrams involving a proposer, a coordinator, acceptors 1..n and learners, with Phase 2A, Phase 2B and Decision messages; in (b, c) Phase 2B messages travel by unicast along the ring of acceptors while values and decisions are ip-multicast to acceptors and learners.]

FIGURE 3. Optimized Paxos (a) and M-Ring Paxos (b,c).

Placing acceptors in a ring reduces the number of incoming messages at the coordinator and balances the communication among acceptors. Before the coordinator starts Phase 1, it proposes the ring on which acceptors will be located. By replying to the coordinator, the acceptors acknowledge that they abide by the proposed ring. Since Phase 1 can be executed before values are proposed, there is little gain in optimizing the sending of Phase 1B messages. Thus, we use the ring only to propagate Phase 2B messages. In addition to checking what value can be proposed in Phase 2 (Task 3), the coordinator also creates a unique identifier for the value to be proposed. Ring Paxos executes consensus on value ids [11, 16]; proposed values are disseminated to the m-quorum and to the learners in Phase 2A messages using IP-multicast.

Upon ip-delivering a Phase 2A message (Task 4), an acceptor checks that it can vote for the proposed value. If so, it updates its v-rnd and v-val variables, as in Paxos, and its v-vid variable. Variable v-vid contains the unique identifier of the proposed value; it is initialized with null. The first acceptor in the ring sends a Phase 2B message to its successor in the ring. Although learners also ip-deliver the proposed value, they do not learn it since it has not been accepted yet. The next acceptor in the ring to receive a Phase 2B message (Task 5) checks whether it has ip-delivered the value proposed by the coordinator in a Phase 2A message—the acceptor can only vote if it has received the value, not only its unique identifier. The check is done by comparing the acceptor's v-vid variable to the value's identifier calculated by the coordinator—this is to ensure that when consensus is reached, a majority of acceptors knows the chosen value and this value can be retrieved by learners at any time. If the condition holds, then there are two possibilities: either the acceptor is not the coordinator (i.e. the last process in the ring), in which case it sends a Phase 2B message to its successor, or it is the coordinator and then it IP-multicasts a decision message including the identifier of the chosen value. Once a learner ip-delivers this message, it can learn the value received previously from the coordinator in the Phase 2A message.
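As a rough illustration of Tasks 4 and 5 of Algorithm 2 (given below), the sketch captures the acceptor-side checks just described: an acceptor only votes, i.e. forwards a Phase 2B message, if it has ip-delivered the corresponding value (its v-vid matches), and the coordinator, as the last acceptor in the ring, multicasts the decision. The callbacks send_to_successor and multicast_decision are placeholders for the transport layer; this is our sketch, not the prototype's code.

    class RingAcceptor:
        """Phase 2 handling in M-Ring Paxos (Tasks 4 and 5 of Algorithm 2)."""
        def __init__(self, is_first, is_coordinator,
                     send_to_successor, multicast_decision):
            self.rnd = self.v_rnd = 0
            self.v_val = self.v_vid = None
            self.is_first = is_first                  # first acceptor in the ring
            self.is_coordinator = is_coordinator      # last acceptor in the ring
            self.send_to_successor = send_to_successor
            self.multicast_decision = multicast_decision

        def on_phase2a(self, c_rnd, c_val, c_vid):    # ip-delivered by all
            if c_rnd >= self.rnd:
                self.rnd = self.v_rnd = c_rnd
                self.v_val, self.v_vid = c_val, c_vid
                if self.is_first:
                    self.send_to_successor(("2B", c_rnd, c_vid))

        def on_phase2b(self, c_rnd, c_vid):
            # Vote only if the value itself was ip-delivered (v_vid matches).
            if self.v_vid != c_vid:
                return
            if self.is_coordinator:
                self.multicast_decision(("DECISION", c_vid))
            else:
                self.send_to_successor(("2B", c_rnd, c_vid))


    sent = []
    a = RingAcceptor(is_first=True, is_coordinator=False,
                     send_to_successor=sent.append, multicast_decision=sent.append)
    a.on_phase2a(c_rnd=1, c_val="cmd", c_vid=42)
    a.on_phase2b(c_rnd=1, c_vid=42)
    print(sent)   # two Phase 2B messages forwarded to the successor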


Algorithm 2 M-Ring Paxos (for process p).
 1: Task 1 (coordinator)
 2:   upon receiving value v from proposer
 3:     increase c-rnd to an arbitrary unique value
 4:     for all q ∈ Qa do send (q, (PHASE 1A, c-rnd))
 5: Task 2 (acceptor)
 6:   upon receiving (PHASE 1A, c-rnd) from coordinator
 7:     if c-rnd > rnd then
 8:       let rnd be c-rnd
 9:       send (coordinator, (PHASE 1B, rnd, v-rnd, v-val))
10: Task 3 (coordinator)
11:   upon receiving (PHASE 1B, rnd, v-rnd, v-val) from Qa such that rnd = c-rnd
12:     let k be the largest v-rnd value received
13:     let V be the set of (v-rnd, v-val) received with v-rnd = k
14:     if k = 0 then let c-val be v
15:     else let c-val be the only v-val in V
16:     let c-vid be a unique identifier for c-val
17:     IP-multicast (Qa ∪ Nl, (PHASE 2A, c-rnd, c-val, c-vid))
18: Task 4 (acceptor)
19:   upon ip-delivering (PHASE 2A, c-rnd, c-val, c-vid)
20:     if c-rnd ≥ rnd then
21:       let rnd be c-rnd
22:       let v-rnd be c-rnd
23:       let v-val be c-val
24:       let v-vid be c-vid
25:       if p = first (ring) then
26:         send (successor (p, ring), (PHASE 2B, c-rnd, c-vid))
27: Task 5 (coordinator and acceptors)
28:   upon receiving (PHASE 2B, c-rnd, c-vid)
29:     if v-vid = c-vid then
30:       if p ≠ last (ring) then
31:         send (successor (p, ring), (PHASE 2B, c-rnd, c-vid))
32:       else
33:         IP-multicast (Qa ∪ Nl, (DECISION, c-vid))
Note: first (ring): process that succeeds the coordinator in ring; last (ring): the coordinator process in ring; successor (p, ring): process that succeeds p in ring.

Ring Paxos can make use of a number of optimizations, most of which have been described previously in the literature: when a new coordinator is elected, it executes Phase 1 for several consensus instances [5]; Phase 2 is executed for a batch of proposed values, and not for a single value (e.g. [17]); one consensus instance can be started before the previous one has finished [5].

Placing an m-quorum in the ring, as opposed to placing all acceptors in the ring, reduces the number of communication steps to reach a decision. The remaining acceptors are spares, used only when an acceptor in the ring fails.² Finally, although IP-multicast is used by the coordinator in Tasks 3 and 5, this can be implemented more efficiently by overlapping consecutive consensus instances, such that the message sent by Task 5 of consensus instance i is IP-multicast together with the message sent by Task 3 of consensus instance i + 1.

² This idea is conceptually similar to Cheap Paxos [16], although Cheap Paxos uses a reduced set of acceptors in order to save hardware resources, and not to reduce latency.

4.3. Unicast-based Ring Paxos (U-Ring Paxos)

In the following, we present U-Ring Paxos. In Algorithm 3, statements in gray are the same for M-Ring Paxos and U-Ring Paxos. The execution is divided in two phases and the mechanism to ensure that only one value can be decided in an instance of consensus is the same as in Paxos.

U-Ring Paxos disposes proposers, learners and a majority quorum of acceptors in a logical directed ring (see Fig. 4). Pipelining all the processes in a ring is an alternative to multicast that can reach high throughput, a fact that was proved in [10]. Processes in the ring can assume multiple roles and there is no restriction on the relative position of these processes in the ring, regardless of their roles. However, for simplicity of discussion, hereafter it is assumed that acceptors are lined up one after the other in the ring (see Fig. 4b). To reduce latency, the coordinator is the first acceptor in the ring.

Once a proposer proposes a value, it is forwarded along the ring until it reaches the coordinator, which will proceed with Phase 1 as in Paxos. When the coordinator receives Phase 1B messages from an m-quorum (Task 3), it will check which value can be proposed and assign a unique identifier to the value to be proposed as in M-Ring Paxos. The coordinator then sends Phase 2A and Phase 2B messages to its successor in the ring (Task 3). Similarly to Paxos and M-Ring Paxos, the coordinator in U-Ring Paxos can execute Phase 1 before a value is proposed, reducing the latency of the protocol.

Upon receiving a Phase 2A/2B message (Task 4), an acceptor checks that it can vote for the proposed value. If so, it updates its v-rnd, v-val and v-vid variables. If the acceptor does not precede the last acceptor in the ring, it sends the Phase 2A/2B message to its successor. Differently from M-Ring Paxos, where the coordinator is the one who checks whether a decision has been made in the instance, in U-Ring Paxos this is delegated to the last acceptor in the ring. After deciding, the last acceptor sends the decision, possibly together with the value chosen, to its successor in the ring. Forwarding the chosen value ends at the predecessor of the process who proposed the chosen value, as at this point the value has been received by all the processes in the ring. The decision, i.e. the chosen-value identifier, should be forwarded along the ring until it reaches the predecessor of the last acceptor (Task 5).
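The stopping rules of Task 5 of Algorithm 3 (given below) can be illustrated with a small sketch. Under the simplifying assumptions that each process knows the ring, the proposer of the chosen value and the last acceptor, the function below (ours, not the prototype's code) decides what a process forwards after learning a decision.

    def forward_decision(p, ring, last_acceptor, proposer_of_value,
                         c_vid, c_val, send):
        """What process p sends after learning (DECISION, c_vid, c_val)."""
        succ = ring[(ring.index(p) + 1) % len(ring)]

        def predecessor(q):
            return ring[(ring.index(q) - 1) % len(ring)]

        # The decision id circulates until the predecessor of the last acceptor.
        if p == predecessor(last_acceptor):
            return
        # The value itself stops at the predecessor of its proposer: at that
        # point every process in the ring has already received it.
        if p == predecessor(proposer_of_value):
            send(succ, ("DECISION", c_vid, None))
        else:
            send(succ, ("DECISION", c_vid, c_val))


    ring = ["A1", "A2", "A3", "L1", "P1"]        # example ring (hypothetical)
    msgs = []
    forward_decision("A1", ring, last_acceptor="A3",
                     proposer_of_value="P1", c_vid=7, c_val="cmd",
                     send=lambda dst, m: msgs.append((dst, m)))
    print(msgs)   # [('A2', ('DECISION', 7, 'cmd'))]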


Algorithm 3 U-Ring Paxos (for process p).
 1: Task 1 (all)
 2:   upon receiving value v proposed by P(v) from predecessor (p, ring)
 3:     if p = coordinator then
 4:       increase c-rnd to an arbitrary unique value
 5:       for all q ∈ Qa do send (q, (PHASE 1A, c-rnd))
 6:     else
 7:       send v to successor (p, ring)
 8: Task 2 (acceptor)
 9:   upon receiving (PHASE 1A, c-rnd) from coordinator
10:     if c-rnd > rnd then
11:       let rnd be c-rnd
12:       send (coordinator, (PHASE 1B, rnd, v-rnd, v-val))
13: Task 3 (coordinator)
14:   upon receiving (PHASE 1B, rnd, v-rnd, v-val) from Qa such that rnd = c-rnd
15:     let k be the largest v-rnd value received
16:     let V be the set of (v-rnd, v-val) received with v-rnd = k
17:     if k = 0 then let c-val be v
18:     else let c-val be the only v-val in V
19:     let c-vid be a unique identifier for c-val
20:     send (successor (p, ring), (PHASE 2A/2B, c-rnd, c-val, c-vid))
21: Task 4 (acceptor)
22:   upon receiving (PHASE 2A/2B, c-rnd, c-val, c-vid)
23:     if c-rnd ≥ rnd then
24:       let rnd be c-rnd
25:       let v-rnd be c-rnd
26:       let v-val be c-val
27:       let v-vid be c-vid
28:       if p = last_acceptor (ring) then
29:         Send_Decision (c-vid, c-val)
30:       else
31:         send (successor (p, ring), (PHASE 2A/2B, c-rnd, c-val, c-vid))
32: Task 5 (all)
33:   upon receiving (DECISION, c-vid, c-val)
34:     if p ≠ predecessor (last_acceptor (ring))
35:       Send_Decision (c-vid, c-val)
36: Send_Decision (c-vid, c-val)
37:   if p ≠ predecessor (P(c-val), ring)
38:     send (successor (p, ring), (DECISION, c-vid, c-val))
39:   else
40:     send (successor (p, ring), (DECISION, c-vid, –))
Note: P(v): proposer of value v; predecessor (p, ring): process that precedes p in ring; successor (p, ring): process that succeeds p in ring; last_acceptor (ring): the f-th acceptor after the coordinator in ring.

4.4. Handling abnormal cases

We now discuss how Ring Paxos tolerates lost messages and process crashes. The simplest way to handle abnormal cases is to switch back to original Paxos. A coordinator that suspects the failure of one or more acceptors may simply try to contact all acceptors in order to gather an m-quorum. This solution would reduce throughput but allow progress despite failures.

Ring Paxos recovers from network and process failures as follows. Lost messages are handled with retransmissions. If the coordinator does not receive a response to its Phase 1A/2A messages, it resends them, possibly with a bigger round number. Eventually, the coordinator will receive a response or will suspect the failure of a process (this suspicion might be erroneous). In this situation, the coordinator lays out a new ring, excluding the suspected process, and re-executes Phase 1.

With M-Ring Paxos, message loss may cause learners to receive only the value proposed and not the notification that it was accepted, only the notification without the value, or none of them. Learners can recover lost messages by inquiring other processes. M-Ring Paxos assigns each learner to a preferential acceptor in the ring, which the learner can ask for lost messages.

With U-Ring Paxos, the value is received by a learner only after it was accepted. Thus, the problem described above cannot arise. However, in both protocols learners may not have sufficient time to handle the decisions, and this may lead learners to drop some decisions. In fact, both versions of Ring Paxos need flow control to mitigate this phenomenon. We discuss flow control in Section 4.5.

With both protocols, when an acceptor replies to a Phase 1A or to a Phase 2A message, it must not forget its state (i.e. variables rnd, ring, v-rnd, v-val and v-vid) despite failures. There are two ways to ensure this. First, by assuming that a majority of acceptors never fails. Second, by requiring acceptors to keep their state on stable storage before replying to Phase 1A and 2A messages.

Finally, a failed coordinator is detected by the other processes, which select a new coordinator. Before GST (see Section 2.1) it is possible that multiple coordinators coexist. However, like Paxos, Ring Paxos guarantees safety even when multiple coordinators execute at the same time, although it may not guarantee liveness. After GST, eventually a single correct coordinator is selected.
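The second option, persisting acceptor state before replying, can be sketched as follows. This is our illustration only: the file name and JSON layout are hypothetical, and a production implementation would additionally handle recovery and torn writes.

    import json, os

    STATE_FILE = "acceptor_state.json"    # hypothetical location

    def persist_then_reply(state, reply, send):
        """Write rnd/v-rnd/v-val/v-vid durably, then answer Phase 1A/2A."""
        tmp = STATE_FILE + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())          # force the write to stable storage
        os.replace(tmp, STATE_FILE)       # atomic rename
        send(reply)                       # only now is it safe to reply

    state = {"rnd": 3, "v_rnd": 3, "v_val": "cmd-17", "v_vid": 17,
             "ring": ["a1", "a2", "a3"]}
    persist_then_reply(state, ("PHASE 2B", 3, 17), send=print)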


[Figure 4: two diagrams of U-Ring Paxos: (a) the message flow of a proposed value v, Phase 2A,2B messages and Decision (with value) from a proposer through the coordinator and acceptors 1..n to the learners; (b) the logical ring of proposers, coordinator/acceptor and acceptors 2, 3, ... and learners, connected by unicast.]

FIGURE 4. U-Ring Paxos.

4.5. Flow control

In Ring Paxos, flow control helps regulate the speed at which consensus instances are executed. In doing so, we not only reduce the likelihood of message loss for both unicast and multicast communications but we also ensure that learners are given enough time to process decisions. For instance, when Ring Paxos is used to implement state machine replication, decisions represent commands that read or write the application state, and may need extra processing time.

To illustrate why flow control is important, consider a scenario where commands are ordered faster than they can be applied at the learners. Without flow control, buffers storing decisions to be processed will eventually overflow and learners will start dropping decisions, leading learners to do extra work to retrieve the lost decisions. As a consequence, the performance of the replicated system may decrease, to the point where clients time out and retransmit their commands, leading to further inefficiencies.

Flow control in U-Ring Paxos is easily implemented. Communication between two consecutive processes in the ring is done using TCP and, on each process, TCP buffers are made sufficiently large to take into account the processing time of Phase 1 and 2 messages. To ensure that learners have sufficient time to handle the decided values, U-Ring Paxos (i) lets learners apply a decision before forwarding it to the next process in the ring and (ii) limits the number of outstanding consensus instances.

In M-Ring Paxos, communication is based on UDP and flow control at the coordinator ensures that messages are sent at a rate the network can handle. With M-Ring Paxos, since learners are not part of the ring, we use the following mechanism for flow control. Learners constantly monitor their buffer for decisions that remain to be processed and when the used slots reach a threshold, they notify one of the acceptors. The notification informs the acceptors about the number of unprocessed requests at the learner. Acceptors forward this notification along the ring until it reaches the coordinator. At this point, the coordinator reduces the window of outstanding consensus instances accordingly, leading it to start fewer instances in parallel. Provided that the coordinator slows down sufficiently, learners will be able to apply decisions as fast as they are ordered, and they will stop sending notifications to the acceptors. This technique allows learners to slow down the coordinator before decisions are dropped. In case some decisions do get lost, for instance if the coordinator starts with a window that is too large, lost decisions are retrieved from the acceptors.

To allow M-Ring Paxos to recover from temporarily slow learners, the coordinator slowly increases the window size when no notifications to slow down have been received for some time.
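The coordinator-side window adjustment just described can be sketched as follows. The numbers (initial window, growth step) are illustrative defaults of ours, not the values used in the prototype.

    class FlowControlledCoordinator:
        """Window of outstanding consensus instances, shrunk on learner
        notifications and slowly grown back when none arrive (M-Ring Paxos)."""
        def __init__(self, window=64, min_window=1, max_window=64):
            self.window = window
            self.min_window, self.max_window = min_window, max_window
            self.outstanding = 0

        def can_start_instance(self):
            return self.outstanding < self.window

        def on_slow_down(self, unprocessed_at_learner):
            # Notification forwarded along the ring from a congested learner.
            self.window = max(self.min_window,
                              self.window - unprocessed_at_learner)

        def on_quiet_period(self):
            # No slow-down notifications for some time: grow the window slowly.
            self.window = min(self.max_window, self.window + 1)


    fc = FlowControlledCoordinator()
    fc.on_slow_down(unprocessed_at_learner=40)
    print(fc.window, fc.can_start_instance())   # 24 True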


4.6. Garbage collection

Acceptors store variables for each instance of Ring Paxos they participate in. The variables are the highest round rnd in which the acceptor executed a Phase 1 or 2, the highest round v-rnd in which it cast a vote in Phase 2, as well as the corresponding value v-val and value identifier v-vid the acceptor voted for. Variable rnd is shared across consensus instances and does not need to be garbage collected. The other variables are discarded when f + 1 learners have applied the corresponding decision to their application state. To do so, each learner maintains its version, the largest instance for which it applied the corresponding decision—learners apply decisions in instance order, so if a learner has applied the decision of instance x, it has also applied all decisions of instances lower than x. In M-Ring Paxos, each learner periodically communicates its version to one of the acceptors (learners are assigned different acceptors to balance the associated load), and acceptors propagate this information along the ring. Once an acceptor receives a version from f + 1 learners, it computes the smallest version received and garbage collects variables for instances up to the smallest version. In U-Ring Paxos, learners are part of the ring and directly forward their version to their successor.

The M-Ring Paxos and U-Ring Paxos coordinators also store state, namely the highest round started c-rnd and the value picked c-val for a particular instance and round. Variable c-rnd, similarly to rnd, is shared across instances and does not need to be garbage collected. The value picked for a given instance and round can be discarded as soon as the coordinator receives the corresponding Phase 2B messages from its predecessor.

Acceptors also store the decisions sent by the coordinator in order to let learners retrieve decisions that they may not have received. These decisions can be discarded similarly as with the acceptor variables. If a learner requests a decision that has been garbage collected, the learner can be brought up to date by communicating with a learner with a sufficiently recent version, i.e. one that is larger than the instance of the decision missing at the learner. Such a learner will always exist since we garbage collect a decision only after it is reflected in the state of f + 1 learners.

5. PERFORMANCE EVALUATION

In this section, we describe our Ring Paxos prototypes and the experimental evaluation we conducted. We consider the performance of Ring Paxos in the presence of message losses and in the absence of process failures. Process failures are hopefully rare events; message losses happen relatively often because of high network traffic.

We ran the experiments in a cluster of Dell SC1435 nodes equipped with 2 dual-core AMD-Opteron 2.0 GHz CPUs and 4 GB of main memory. The servers were interconnected with an HP ProCurve 2900-48G Gigabit switch (0.1 msec of round-trip time). For the experiments with disk writes we use OCZ-VERTEX3 SSDs. Each experiment (i.e. point in the graph) was repeated 3-10 times, with a few million messages broadcast in each execution. Every process is deployed on a dedicated node in the experiments.

In M-Ring Paxos, each process maintains a circular buffer of packets; each packet is 8 Kbytes long and the buffer is 160 Mbytes long. In U-Ring Paxos, processes maintain a circular buffer of packets, where each packet is 32 Kbytes long. U-Ring Paxos allocates 16 Mbytes of buffer space per proposer. For example, with five proposers the total space needed for buffers is 80 Mbytes.

5.1. Implementation

In M-Ring Paxos, the acceptors and the learners use the buffer to match proposal ids to proposal contents, as these are decomposed by the coordinator. Messages received out of sequence (e.g. because of transmission losses) are stored in the buffer until they can be delivered (i.e. learned) in order. Each packet sent by the coordinator is composed of two parts. In one part the coordinator stores the ids of decided values, and in the second part it stores new proposed values with their unique ids.

In U-Ring Paxos, as the last acceptor in the ring is the process that checks whether a decision has been reached, each message originated in the last acceptor includes the ids of decided values. This message is carried along the ring until everyone is informed about the decided values. The coordinator can piggyback new proposals on this message before forwarding it.

5.2. Ring Paxos versus other protocols

We experimentally compare Ring Paxos to four other atomic broadcast protocols: LCR [10], Libpaxos,³ S-Paxos [21], and the protocol presented in [17], which hereafter we refer to as PFSB. LCR is a ring-based protocol that achieves very high throughput (see also Section 6). Libpaxos, PFSB and S-Paxos are implementations of Paxos. The first is entirely based on IP-multicast; the second is based on unicast. S-Paxos is a unicast-based implementation of Paxos which disseminates the task of receiving and forwarding client requests among all the acceptors.

We implemented all protocols, except for PFSB and S-Paxos. The performance data of PFSB were taken from [17]. The setup reported in [17] has slightly more powerful processors than the ones used in our experiments, but both setups use a gigabit switch. The performance data of S-Paxos were obtained with its open-source distribution. Libpaxos is an open-source Paxos implementation developed by our research group.

Figure 6 shows the throughput in megabits per second (left graph) and the number of messages delivered per second (right graph) as the number of receivers increases. In both graphs, the y-axis is in log scale. For all protocols, with the exception of PFSB, we explored the space of message sizes and selected the value corresponding to the best throughput. Table 1 shows the message sizes used in our experiments. We assess the effect of different message sizes on the performance of Ring Paxos in Section 5.4.

As seen in the graph on the left of Fig. 6, protocols based on a ring only (LCR and U-Ring Paxos), on IP-multicast (Libpaxos), and on both (M-Ring Paxos) present throughput approximately constant with the number of receivers.

5.3. Impact of processes in the ring

We now consider how the number of processes affects the throughput and latency of the Ring Paxos protocols, LCR and S-Paxos.

In Fig. 5, the x-axis shows the number of acceptors in M-Ring Paxos, U-Ring Paxos and S-Paxos. In U-Ring Paxos, every acceptor is also a proposer and a learner. LCR does not distinguish process roles and requires all processes to be in the ring.

³ http://libpaxos.sourceforge.net


[Figure 5: two plots versus the number of processes (0-30): throughput (Mbps) and latency (msec) for U-Ring Paxos, M-Ring Paxos, LCR and S-Paxos.]

FIGURE 5. Throughput and latency when varying the number of processes in the ring.

[Figure 6: two plots versus the number of receivers (0-30), y-axes in log scale: throughput (Mbps) and messages per second for U-Ring Paxos, M-Ring Paxos, LCR, Libpaxos, PFSB and S-Paxos.]

FIGURE 6. Ring Paxos and other atomic broadcast protocols (message sizes c.f. Table 1). For Ring Paxos protocols f is equal to two. In PFSB, U-Ring Paxos and LCR, the number of receivers is equal to the total number of processes. In Libpaxos and M-Ring Paxos it is equal to the number of learners.

TABLE 1. Message sizes used in experiments and protocol efficiency (for 10 nodes).

  Protocol        Message size   Efficiency (%)
  LCR             32 kbytes      91
  U-Ring Paxos    32 kbytes      90.4
  M-Ring Paxos    8 kbytes       90
  S-Paxos         32 kbytes      31.2
  PFSB            200 bytes      4
  Libpaxos        4 kbytes       3

(The bold entries in the original table correspond to the protocols proposed by this work.)

M-Ring Paxos has constant throughput with the number of processes in the ring. Throughput of LCR and U-Ring Paxos slightly decreases as processes are added. With few processes, LCR and U-Ring Paxos can achieve efficiency greater than one, which may look counterintuitive. This happens because in a ring with n processes, 1/n of the messages delivered by a process are created by the process itself. Thus, the process can use its available incoming bandwidth to receive messages broadcast by the other processes [10]. In order for LCR and U-Ring Paxos to achieve high throughput, every process in their ring must broadcast messages. M-Ring Paxos does not have this constraint.

Figure 5 shows the latency measured by the message's proposer. In U-Ring Paxos and LCR, latencies vary according to the location of the proposer in the ring. The values reported for these two protocols are for the best-located proposer, that is, for the proposer with the lowest latency.

Latency in LCR and U-Ring Paxos degrades with the number of processes; M-Ring Paxos presents a less-pronounced increase in latency as more acceptors are placed in the ring (top right graph in Fig. 5). Notice that there is less information circulating in the ring of M-Ring Paxos than in the rings of LCR and U-Ring Paxos. In LCR and U-Ring Paxos, the content of each message is sent n − 1 times, where n is the number of processes in the ring. Message content is propagated only once in M-Ring Paxos (using IP-multicast).
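The 'efficiency greater than one' observation can be checked with a few lines of arithmetic. Under the simplifying assumption (ours) that all n ring processes broadcast at the same rate over links of capacity C, a process's incoming link only carries the (n − 1)/n fraction of delivered traffic produced by other processes, so the deliverable rate can exceed C.

    def max_ring_delivery_rate(capacity_mbps, n):
        """In a ring of n broadcasting processes, 1/n of delivered traffic is
        produced locally, so the incoming link carries only (n-1)/n of it."""
        return capacity_mbps * n / (n - 1)

    C = 1000  # gigabit link, in Mbps
    for n in (3, 5, 10):
        rate = max_ring_delivery_rate(C, n)
        print(n, round(rate), round(rate / C, 2))   # efficiency above 1.0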


[Figure 7: latency (msec) versus the number of processes (2-10) for U-Ring Paxos, M-Ring Paxos and LCR, and the CDF of latency (ms) with 9 processes in the ring.]

FIGURE 7. Impact of synchronous disk writes on the latency when varying the number of processes in the ring.

[Figure 8: four plots versus message size (0.2-8 KB): throughput (Mbps), latency (msec), messages/sec (x1000) and batches/sec (x1000), with a batch size of 8 kbytes.]

FIGURE 8. Impact of application message size on M-Ring Paxos.

LCR and U-Ring Paxos present similar latency in Fig. 5, despite the expected difference, as presented in Table 4. The reason is that for a given setup with n processes, LCR can tolerate f = n − 1 failures, while U-Ring Paxos can tolerate f = (n − 1)/2 failures, for an odd n. Thus, for the same n, the number of communication steps for each protocol is, respectively, 2(n − 1) and 2.5(n − 1).

In the S-Paxos experiments we observed substantial variability in the results, due to Java's garbage collection mechanism; average values for all experiments were above 35 ms.

Figure 7 shows the performance of M-Ring Paxos, U-Ring Paxos and LCR when processes store accepted values on disk. All the techniques are essentially disk bound, with constant throughput of almost 270 Mbps regardless of the number of processes. However, latency increases as nodes are added to the ring. The right-most graph shows the CDF for latency when there are nine processes in the ring. LCR and U-Ring Paxos have comparable latency. M-Ring Paxos has lower latency than LCR and U-Ring Paxos as processes write their values on disk in parallel; in LCR and U-Ring Paxos disk writes across processes happen sequentially. In all protocols, data are written on disk in units of 32 Kbytes.

5.4. Impact of message size

Figures 8 and 9 quantify the effects of application message size (payload) on the performance of M-Ring Paxos and U-Ring Paxos, respectively. In both figures throughput (top left graphs) increases with the size of application messages, up to 8 Kbytes and 32 Kbytes in M-Ring Paxos and U-Ring Paxos, respectively, after which it decreases. Notice that in our prototype IP-multicast packets are 8 Kbytes long, but datagrams are fragmented since the maximum transmission unit in our network is 1500 bytes. In U-Ring Paxos communication is based on TCP. Latency is less sensitive to application message size (top right graphs).

Figures 8 and 9 also show the number of application messages delivered as a function of their size (bottom left graphs). Many small application messages can fit a single Paxos message and Phase 2 is executed for a batch of proposed values.

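Batching many small application messages into one Paxos-sized proposal, as both prototypes do, can be sketched as follows. The 8 Kbyte limit mirrors the M-Ring Paxos packet size; the packing policy itself is our illustration, not the prototype's code.

    def batch_messages(messages, max_batch_bytes=8 * 1024):
        """Greedily pack small application messages into Paxos-sized batches."""
        batches, current, size = [], [], 0
        for m in messages:
            if current and size + len(m) > max_batch_bytes:
                batches.append(current)
                current, size = [], 0
            current.append(m)
            size += len(m)
        if current:
            batches.append(current)
        return batches

    msgs = [b"x" * 200 for _ in range(100)]       # one hundred 200-byte messages
    batches = batch_messages(msgs)
    print(len(batches), len(batches[0]))          # 3 batches, 40 messages in the first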

[Figure 9: four plots versus message size (0.2-32 KB): throughput (Mbps), latency (msec), messages/sec (x1000) and batches/sec (x1000), with a batch size of 32 kbytes.]

FIGURE 9. Impact of application message size on U-Ring Paxos.

5.5. Flow control

Figure 10 illustrates the flow control mechanism with three learners and two proposers using M-Ring Paxos. The aggregate proposing rate of proposers is 850 Mbps (right graphs). After 20 seconds one of the learners (left bottom graph) slows down. As soon as the number of pending instances reaches a predefined threshold, the slow learner sends a notification to one of the acceptors in the ring. This acceptor propagates the notification along the ring until it reaches the coordinator. Having received this, the coordinator reduces its proposing rate and, as a consequence, the delivery rate drops. As proposers keep submitting requests at a constant rate, eventually the receiving buffer of the coordinator overflows and requests are dropped. Proposers continue submitting new requests and also re-submitting pending requests (proposers reduce their proposing rate if they detect buffer overflows). After 40 seconds, the slow learner restores its original rate. The coordinator does not receive any slow-down requests and restores its original proposing rate.
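The flow control loop just described can be condensed into a small rule: a learner requests a slow-down when its backlog of undelivered instances crosses a threshold, and the coordinator throttles for as long as such notifications keep arriving. The sketch below is illustrative only; the threshold value, the multiplicative rate adjustment and all names are assumptions, not the code of our prototype.

    from dataclasses import dataclass

    PENDING_THRESHOLD = 1000   # assumed limit on undelivered (pending) instances
    RATE_FACTOR = 0.5          # assumed multiplicative decrease of the proposing rate

    @dataclass
    class Coordinator:
        nominal_rate: float      # proposing rate (Mbps) when not throttled
        proposing_rate: float    # current, possibly throttled, proposing rate

    def learner_needs_slowdown(num_pending_instances: int) -> bool:
        """Learner side: ask for a slow-down when too many instances are pending.
        The notification is sent to one acceptor and forwarded along the ring
        until it reaches the coordinator."""
        return num_pending_instances >= PENDING_THRESHOLD

    def coordinator_adjust(c: Coordinator, slowdown_received: bool) -> None:
        """Coordinator side: throttle while slow-down notifications arrive,
        restore the nominal rate once they stop."""
        if slowdown_received:
            c.proposing_rate *= RATE_FACTOR
        else:
            c.proposing_rate = c.nominal_rate

    coord = Coordinator(nominal_rate=850.0, proposing_rate=850.0)
    coordinator_adjust(coord, learner_needs_slowdown(1500))  # slow learner: throttle to 425
    coordinator_adjust(coord, learner_needs_slowdown(10))    # backlog cleared: back to 850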
5.6. CPU and memory usage

Table 2 shows the CPU and memory usage of M-Ring Paxos for processes with different roles under peak throughput. Not surprisingly, the coordinator is the process with the maximum load, since it must both receive a large stream of values from the proposers and IP-multicast these values.
Table 3 shows the results for U-Ring Paxos. In this case, all the processes play the roles of proposer, acceptor and learner. Therefore, the CPU and memory usage of all of them are similar. In both tables, memory consumption at the coordinator, acceptors and learners is mostly used by the circular buffer of proposed values. For efficiency, in our prototype the buffer is statically allocated. As a reference, the average CPU usage per process in LCR is in the range of 65-70% and for S-Paxos it is ~270% (i.e. S-Paxos is multithreaded).

5.7. Discussion

In the following, we summarize the main conclusions from the experiments:

• Judicious use of IP-multicast, as in M-Ring Paxos, and a topology entirely based on a ring, as in U-Ring Paxos and LCR, can achieve throughput near the limits of the network (Section 5.2).
• Latency increases with the size of the ring in the Ring Paxos protocols, although M-Ring Paxos is less prone to the effects of ring size on latency (Section 5.3).
• Both M-Ring Paxos and U-Ring Paxos are sensitive to message size. M-Ring Paxos achieves high throughput with messages of 4 Kbytes or bigger; U-Ring Paxos reaches maximum performance with 8-Kbyte messages. If application messages are small, batching can improve performance (Section 5.4).
• In-memory deployments of M-Ring Paxos and U-Ring Paxos are network-bound; the performance of on-disk deployments is determined by the throughput sustained by the storage device (Section 5.4).
• Our simple flow control mechanism in M-Ring Paxos proved effective in avoiding message losses by slowing down the rate of proposers (Section 5.5).


FIGURE 10. Flow control in M-Ring Paxos. The two learners on the right graphs are also proposers. The learner on the left bottom graph slows down after 20 seconds and returns to its original rate after 40 seconds. [Four panels of throughput (Mbps) versus time (seconds): at the coordinator, at the slow learner and at the two learner-proposers, showing propose, delivery and drop rates.]

TABLE 2. CPU and memory use for M-Ring Paxos.

Role          CPU (%)   Memory (Mbytes)
Proposer      37.2      90
Coordinator   88.0      168
Acceptor      24.0      168
Learner       21.3      168

TABLE 3. CPU and memory use for U-Ring Paxos.

Role                         CPU     Memory
Proposer-acceptor-learner    48.0%   80 Mbytes

TABLE 4. Comparison of atomic broadcast algorithms (f: number of tolerated failures).

Algorithm       Communication steps   Number of processes   Synchrony assumption
LCR [10]        2f                    f + 1                 Strong
Ring+FD [11]    f² + 2f               f(f + 1) + 1          Weak
S-Paxos [21]    5                     2f + 1                Weak
M-Ring Paxos    f + 3                 2f + 1                Weak
U-Ring Paxos    5f                    2f + 1                Weak

6. RELATED WORK

Several papers argued that Paxos is not an easy algorithm to implement [17-19]. Essentially, this is because Paxos is a subtle algorithm that leaves many non-trivial design decisions open. Besides providing insight into these matters, two of these papers present performance results of their Paxos implementations. In contrast to our Ring Paxos protocols, none of these papers presents algorithmic modifications to Paxos. In [20], an analytical analysis of the impact of several optimizations on the performance of the original Paxos algorithm is presented. The authors also present extensive experimental results of their implementation in both LAN and WAN environments.
Several papers have built on Paxos to provide various abstractions such as a storage system [6], a locking service [7] and a distributed database [8]. Some of the optimizations presented in these papers, such as executing read-only operations at a single learner while ensuring linearizability or using a log to speed up the execution of write requests, are orthogonal to the design of Ring Paxos and could be used here as well.
Paxos is not the only algorithm to implement atomic broadcast. Some protocols implement atomic broadcast through the virtual synchrony model introduced by the Isis system [22]. With virtual synchrony, processes are part of a group. They may join and leave the group at any time. When processes are suspected of crashing they are evicted from the group; virtual synchrony ensures that processes observe the same sequence of group memberships, or views, and that non-faulty members deliver the same set of messages in each view.


Implementing such properties requires solving consensus.
The literature on atomic broadcast is abundant. In [4], five classes of broadcast algorithms have been identified: fixed sequencer, moving sequencer, destination agreement, communication history-based and privilege-based. Below, we review the five classes of atomic broadcast protocols.
In fixed sequencer algorithms (e.g. [23, 24]), broadcast messages are sent to a distinguished process, called the sequencer, which is responsible for ordering these messages. The role of sequencer is unique and only transferred to another process in case of failure of the current sequencer. In this class of algorithms, the sequencer may eventually become the system bottleneck.
Moving sequencer protocols are based on the observation that rotating the role of the sequencer distributes the load associated with ordering messages among processes. The ability to order messages is passed from process to process using a token. The majority of moving sequencer algorithms are optimizations of [25]. These protocols differ in the way the token circulates in the system: in some protocols the token is propagated along a ring [25, 26]; in others, the token is passed to the least loaded process [27]. All the moving sequencer protocols we are aware of are based on the broadcast-broadcast communication pattern. According to this pattern, to atomically broadcast a message m, m is broadcast to all processes in the system; the token holder process then replies by broadcasting a unique global sequence number for m. High throughput can be obtained by resorting to network-level broadcast. Mencius is another moving sequencer-based protocol that implements state machine replication and is derived from Paxos [28]. Mencius is designed for wide-area networks, in which optimizing for latency is the main objective, in contrast to throughput, the focus of Ring Paxos.
Protocols falling in the destination agreement class compute the message order in a distributed fashion (e.g. [14, 29]). These protocols typically exchange a quadratic number of messages for each message broadcast, and thus are not good candidates for high throughput.
In communication history-based algorithms, the message ordering is determined by the message sender, that is, the process that broadcasts the message (e.g. [30, 31]). Message ordering is usually provided using logical or physical time. Of special interest is LCR, which arranges processes along a ring and uses vector clocks for message ordering [10]. This protocol has similar throughput to our Ring Paxos protocols but requires perfect failure detection: erroneously suspecting a process to have crashed is not tolerated. Perfect failure detection implies strong synchrony assumptions about processing and message transmission times.
The last class of atomic broadcast algorithms, denoted privilege-based, allows a single process to broadcast messages at a time; the message order is thus defined by the broadcaster. Similarly to moving sequencer algorithms, the privilege to order messages circulates among broadcasters in the form of a token. Differently from moving sequencer algorithms, message ordering is provided by the broadcasters and not by the sequencers. In [9], the authors propose Totem, a protocol based on the virtual synchrony model. In the case of process or network failures, the ring is reconstructed and the token regenerated using the new group membership. In [11], fault-tolerance is provided by relying on a failure detector; tolerating f process failures requires a quadratic number of processes. A general drawback of privilege-based protocols is their high latency: before a process p can totally order a message m, p must receive the token, which delays m's delivery.
M-Ring Paxos and U-Ring Paxos combine ideas from several broadcast protocols to provide high throughput and low latency. In this sense, they fit multiple classes, as defined above. To ensure high throughput, our protocols decouple message dissemination from ordering. The former is accomplished using IP-multicast or pipelined unicast; the latter is done using consensus on message identifiers. To use the network efficiently, processes executing consensus communicate using a ring, similarly to the majority of privilege-based protocols.
In Table 4, we compare the algorithms that are closest to our Ring Paxos protocols in terms of throughput efficiency. All these protocols use a logical ring for process communication, which is a good communication pattern when optimizing for throughput. For each algorithm, we report the minimum number of communication steps required by the last process to deliver a message, the number of processes required as a function of f, and the synchrony assumption needed for correctness. For the Ring Paxos protocols, we assume that each process plays the roles of proposer, acceptor and learner. There are f + 1 processes in the ring of M-Ring Paxos and 2f + 1 processes in the ring of U-Ring Paxos (i.e. all processes are in the ring of U-Ring Paxos).
With M-Ring Paxos, delivery occurs as soon as messages make one revolution around the ring. Its latency is f + 3 message delays since each message is first sent to the coordinator, circulates around the ring of f + 1 processes, and is delivered after the final IP-multicast is received. With U-Ring Paxos, the worst-case latency is 5f. This happens when the process that broadcasts the message follows the coordinator in the ring. It takes 2f steps to reach the coordinator, and another f steps for the decision. The decision must then circulate around the ring in order to reach all processes, taking another 2f steps.
LCR requires two revolutions and thus has a latency in between the two Ring Paxos algorithms. In Totem, each message must also rotate twice along the ring to guarantee safe-delivery, a property equivalent to uniform agreement: if a process (correct or not) delivers a message m, then all correct processes eventually deliver m. The atomic broadcast protocol in [11] has a latency that is quadratic in f since a ring requires more than f² nodes.
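To make the comparison concrete, the snippet below simply evaluates the latency and process-count expressions of Table 4 for one example value of f; the choice f = 2 is arbitrary.

    # Delivery latency (communication steps) and number of processes from Table 4.
    f = 2

    latency = {
        'LCR':          2 * f,         # two revolutions of the ring
        'Ring+FD':      f**2 + 2 * f,  # quadratic in f
        'S-Paxos':      5,             # constant number of steps
        'M-Ring Paxos': f + 3,         # one revolution plus the final IP-multicast
        'U-Ring Paxos': 5 * f,         # worst case: broadcaster follows the coordinator
    }
    processes = {
        'LCR':          f + 1,
        'Ring+FD':      f * (f + 1) + 1,
        'S-Paxos':      2 * f + 1,
        'M-Ring Paxos': 2 * f + 1,
        'U-Ring Paxos': 2 * f + 1,
    }

    for name in latency:
        print(f'{name:13s}: {latency[name]:2d} steps, {processes[name]} processes')
    # With f = 2: LCR 4 steps, Ring+FD 8, S-Paxos 5, M-Ring Paxos 5, U-Ring Paxos 10.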


An efficient implementation of the Paxos protocol is S-Paxos [21]. The key idea in S-Paxos is to distribute the tasks of request reception and dissemination among all replicas. A client selects a replica arbitrarily and submits its requests to it. After receiving a request, a replica forwards it to all the other replicas. A replica receiving a forwarded request sends an acknowledgement to all other replicas. When a replica receives f + 1 acknowledgements, it declares the request as stable. As in classical Paxos, the leader is responsible for ordering requests; differently from Paxos, ordering is performed on request ids. S-Paxos makes a balanced use of CPU and network resources; on the negative side, many messages must be exchanged before a request can be ordered. Due to the number of messages exchanged, this protocol is CPU-intensive.
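The dissemination and stability-tracking part of S-Paxos described above can be restated in a few lines. The following is a simplified illustration under invented names; it ignores ordering by the leader, batching and retransmissions, and is not the actual S-Paxos implementation.

    class Replica:
        """Any replica accepts client requests, forwards them to all replicas,
        and a request becomes stable once f + 1 acknowledgements are seen."""

        def __init__(self, my_id, f):
            self.my_id = my_id
            self.f = f
            self.all_replicas = []   # filled in once every replica exists
            self.requests = {}       # request id -> payload
            self.acks = {}           # request id -> set of acknowledging replicas
            self.stable = set()      # ids of stable requests, ready to be ordered

        def on_client_request(self, req_id, payload):
            for r in self.all_replicas:              # forward to every replica
                r.on_forwarded_request(self.my_id, req_id, payload)

        def on_forwarded_request(self, sender, req_id, payload):
            self.requests[req_id] = payload
            for r in self.all_replicas:              # acknowledge to every replica
                r.on_ack(self.my_id, req_id)

        def on_ack(self, sender, req_id):
            self.acks.setdefault(req_id, set()).add(sender)
            if len(self.acks[req_id]) >= self.f + 1:
                self.stable.add(req_id)              # the leader can now order req_id

    f = 1
    replicas = [Replica(i, f) for i in range(2 * f + 1)]
    for r in replicas:
        r.all_replicas = replicas
    replicas[0].on_client_request('req-1', b'state machine command')
    assert all('req-1' in r.stable for r in replicas)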
7. CONCLUSION

This paper presents M-Ring Paxos and U-Ring Paxos, two Paxos-like algorithms designed for high throughput. The protocols are optimized for modern interconnects. In order to show that the techniques used are effective, we implemented both protocols and compared them to other atomic broadcast protocols. Our selected protocols include a variety of techniques typically used to implement atomic broadcast. The experiments revealed that both Ring Paxos protocols provide almost-constant throughput as the number of receivers grows, an important feature in clustered environments. The experiments also pointed out the tradeoffs with pure ring-based protocols. Protocols based on unicast only or on IP-multicast only have low latency, but poor throughput. The study suggests that a combination of techniques can lead to the best of both: high throughput and low latency, under weak synchrony assumptions.
M-Ring Paxos and U-Ring Paxos are available for download from SourceForge (http://libpaxos.sourceforge.net) and GitHub (https://github.com/sambenz/URingPaxos), respectively.

FUNDING

Swiss National Science Foundation (SNSF) [200021-121931]; the Hasler Foundation (2316).

ACKNOWLEDGEMENTS

The authors would like to thank Antonio Carzaniga, Leslie Lamport and Robbert van Renesse for the valuable discussions about Ring Paxos.

REFERENCES

[1] Lamport, L. (1978) Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21, 558-565.
[2] Schneider, F. B. (1993) What good are models and what models are good? In Mullender, S. (ed.), Distributed Systems, Addison-Wesley, chapter 2, pp. 17-26.
[3] Hadzilacos, V. and Toueg, S. (1993) Fault-tolerant broadcasts and related problems. In Distributed Systems (2nd edn). ACM Press/Addison-Wesley, chapter 5, pp. 97-145.
[4] Défago, X., Schiper, A. and Urbán, P. (2004) Total order broadcast and multicast algorithms: taxonomy and survey. ACM Comput. Surv., 36, 372-421.
[5] Lamport, L. (1998) The part-time parliament. ACM Trans. Comput. Syst., 16, 133-169.
[6] Bolosky, W. J., Bradshaw, D., Haagens, R. B., Kusters, N. P. and Li, P. (2011) Paxos Replicated State Machines as the Basis of a High-Performance Data Store. Proc. 8th USENIX Conf. Networked Systems Design and Implementation (NSDI'11), Berkeley, CA, USA, pp. 11-11. USENIX Association.
[7] Burrows, M. (2006) The Chubby Lock Service for Loosely-Coupled Distributed Systems. 7th Symp. Operating System Design and Implementation (OSDI), Seattle, WA, November.
[8] Baker, J., Bond, C., Corbett, J., Furman, J. J., Khorlin, A., Larson, J., Leon, J.-M., Li, Y., Lloyd, A. and Yushprakh, V. (2011) Megastore: Providing Scalable, Highly Available Storage for Interactive Services. CIDR'11, pp. 223-234.
[9] Amir, Y., Moser, L., Melliar-Smith, P., Agarwal, D. and Ciarfella, P. (1995) The Totem single-ring ordering and membership protocol. ACM Trans. Comput. Syst., 13, 311-342.
[10] Guerraoui, R., Levy, R. R., Pochon, B. and Quéma, V. (2010) Throughput optimal total order broadcast for cluster environments. ACM Trans. Comput. Syst., 28, 5:1-5:32.
[11] Ekwall, R., Schiper, A. and Urbán, P. (2004) Token-Based Atomic Broadcast using Unreliable Failure Detectors. Proc. Int. Symp. Reliable Distributed Systems (SRDS), pp. 52-65.
[12] Fischer, M. J., Lynch, N. A. and Paterson, M. S. (1985) Impossibility of distributed consensus with one faulty process. J. ACM, 32, 374-382.
[13] Dwork, C., Lynch, N. and Stockmeyer, L. (1988) Consensus in the presence of partial synchrony. J. ACM, 35, 288-323.
[14] Chandra, T. D. and Toueg, S. (1996) Unreliable failure detectors for reliable distributed systems. J. ACM, 43, 225-267.
[15] Vigfusson, Y., Abu-Libdeh, H., Balakrishnan, M., Birman, K., Burgess, R., Chockler, G., Li, H. and Tock, Y. (2010) Dr. Multicast: Rx for Data Center Communication Scalability. Proc. 5th Eur. Conf. Computer Systems (EuroSys '10), New York, NY, USA, pp. 349-362. ACM.
[16] Lamport, L. and Massa, M. (2004) Cheap Paxos. Int. Conf. Dependable Systems and Networks (DSN), pp. 307-314.
[17] Kirsch, J. and Amir, Y. (2008) Paxos for System Builders: An Overview. Proc. 2nd Workshop on Large-Scale Distributed Systems and Middleware (LADIS), pp. 1-6.
[18] Chandra, T., Griesemer, R. and Redstone, J. (2007) Paxos Made Live: An Engineering Perspective. Proc. 26th Ann. ACM Symp. Principles of Distributed Computing (PODC), pp. 398-407.
[19] van Renesse, R. (2011) Paxos Made Moderately Complex. Technical Report, Cornell University.
[20] Santos, N. and Schiper, A. (2012) Tuning Paxos for High-Throughput with Batching and Pipelining. ICDCN, pp. 153-167.


[21] Biely, M., Milosevic, Z., Santos, N. and Schiper, A. (2012) S-Paxos: Offloading the Leader for High Throughput State Machine Replication. IEEE Symp. Reliable Distributed Systems (SRDS), pp. 111-120.
[22] Birman, K. P. and Joseph, T. A. (1987) Reliable communication in the presence of failures. ACM Trans. Comput. Syst., 5, 47-76.
[23] Birman, K. P., Schiper, A. and Stephenson, P. (1991) Lightweight causal and atomic group multicast. ACM Trans. Comput. Syst., 9, 272-314.
[24] Kaashoek, M. F. and Tanenbaum, A. S. (1991) Group Communication in the Amoeba Distributed Operating System. 11th Int. Conf. Distributed Computing Systems (ICDCS), Washington, USA, pp. 222-230.
[25] Chang, J.-M. and Maxemchuk, N. (1984) Reliable broadcast protocols. ACM Trans. Comput. Syst., 2, 251-273.
[26] Cristian, F. and Mishra, S. (1995) The Pinwheel Asynchronous Atomic Broadcast Protocols. Int. Symp. Autonomous Decentralized Systems (ISADS), Phoenix, Arizona, USA.
[27] Kim, J. and Kim, C. (1997) A total ordering protocol using a dynamic token-passing scheme. Distrib. Syst. Eng., 4, 87-95.
[28] Mao, Y., Junqueira, F. P. and Marzullo, K. (2008) Mencius: Building Efficient Replicated State Machines for WANs. Proc. 8th USENIX Conf. Operating Systems Design and Implementation (OSDI), pp. 369-384. USENIX Association.
[29] Fritzke, U., Ingels, P., Mostéfaoui, A. and Raynal, M. (1998) Fault-Tolerant Total Order Multicast to Asynchronous Groups. Proc. Int. Symp. Reliable Distributed Systems (SRDS), pp. 578-585.
[30] Lamport, L. (1978) The implementation of reliable distributed multiprocess systems. Comput. Netw., 2, 95-114.
[31] Ng, T. (1991) Ordered Broadcasts for Large Applications. Symp. Reliable Distributed Systems (SRDS), pp. 188-197.

APPENDIX

CORRECTNESS PROOF

We provide a proof sketch of the correctness of M-Ring Paxos and U-Ring Paxos. We focus on properties (ii) and (iii) of consensus. Property (i) holds trivially from the algorithms.

PROPOSITION 1. (ii) No two processes decide different values.

Proof sketch. Let v and v′ be two decided values, and v-id and v′-id their unique identifiers. We prove that v-id = v′-id.
M-Ring Paxos: Let r (r′) be the round in which some coordinator c (c′) IP-multicasts a decision message with v-id (v′-id). In M-Ring Paxos, c IP-multicasts a decision message with v-id after: (i) c receives f + 1 messages of the form (Phase 1B, r, *, *); (ii) c selects the value vval = v with the highest round number vrnd among the set M1B of Phase 1B messages received, or picks a value v if vrnd = 0; (iii) c IP-multicasts (Phase 2A, r, v, v-id); and (iv) c receives (Phase 2B, r, v-id) from the second-to-last process in the ring, say q. When c receives this message from q, it is equivalent to c receiving f + 1 (Phase 2B, r, v-id) messages directly, because the ring is composed of f + 1 acceptors. Let M2B be the set of f + 1 Phase 2B messages. Now, consider that coordinator c received the same set of messages M1B and M2B in a system where all processes ran Paxos on value identifiers. In this case, c would send a decide message with v-id as well. Since the same reasoning can be applied to coordinator c′, and Paxos implements consensus, v-id = v′-id.
U-Ring Paxos: Let r (r′) be the round in which the last acceptor al (al′) sends a decision message mD with v-id (v′-id) along the ring. The proof for U-Ring Paxos is similar to the proof for M-Ring Paxos since U-Ring Paxos only differs in the way mD is propagated to the learners and in the identity of the process that first sends mD. In contrast to M-Ring Paxos, where it is the coordinator that sends mD, in U-Ring Paxos it is the last acceptor in the ring, al, that initiates the propagation of mD along the ring. Despite this difference, al sends mD when al has received, counting itself, f + 1 (Phase 2B, r, v, v-id) messages, just as in M-Ring Paxos. Since M-Ring Paxos guarantees that v-id = v′-id, the same holds in U-Ring Paxos. □

PROPOSITION 2. (iii) If one (or more) process proposes a value and does not crash, then eventually some value is decided by all correct processes.

Proof sketch. The proof is almost identical for M-Ring Paxos and U-Ring Paxos; we note the differences when necessary. After GST, processes eventually select a correct coordinator c. c considers a ring c-ring composed entirely of correct acceptors, for M-Ring Paxos, and a ring c-ring composed entirely of correct proposers, acceptors and learners, for U-Ring Paxos. Coordinator c sends a message of the form (Phase 1A, *, c-ring) to the acceptors in c-ring. Because after GST all processes in c-ring are correct and all messages exchanged between correct processes are received, all correct processes eventually decide some value. □
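To make the ring-accumulation step used in the proof of Proposition 1 concrete, the sketch below simulates a (Phase 2B, r, v-id) message gathering acceptances as it travels along a ring of f + 1 acceptors; receiving it from the acceptor preceding the coordinator is then equivalent to receiving f + 1 Phase 2B messages directly. Names and data types are invented for illustration only.

    f = 2
    ring = [f'acceptor-{i}' for i in range(f + 1)]   # ring order, followed by the coordinator

    def phase2b_along_ring(round_nbr: int, v_id: str) -> bool:
        accepted_by = []              # carried inside the single (Phase 2B, r, v-id) message
        for acceptor in ring:         # each acceptor accepts and forwards, adding itself
            accepted_by.append(acceptor)
        # The coordinator receives the message from the acceptor preceding it in the ring
        # and may treat it as f + 1 individual Phase 2B messages: Paxos's decision
        # condition on v-id is met.
        return len(accepted_by) >= f + 1

    assert phase2b_along_ring(round_nbr=1, v_id='m42')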
