Contents

A. Summary
B. Background and motivation
C. Proposed design
    When bootstrapping
    When modifying the consenter set via Fabric configuration update transactions
    Type A
    Type B
A. Summary
For those who interact with Fabric networks (i.e. those who operate clients or
peers), the switch to Raft carries one notable change in semantics. So far, a
transaction that was successfully broadcast was guaranteed to be ordered (i.e.
to make it past the consensus/ordering stage). This will no longer be the case:
the Raft cluster leader may crash before committing the entry into a block.
See Fig. 1 for the obligatory diagram. The cloud in the middle shows one such
Raft-based ordering service, and the arrows capture how clients and peers
interact with it.

Figure 1 — Left: a client that uses the Fabric client SDK to invoke the Broadcast RPC on
the ordering service and have the proposal submitted for ordering. Center: the
Raft-based ordering service, consisting of 3 OSNs. Each of these nodes implements a Raft
finite state machine (Raft FSM). Collectively, these nodes form a Raft cluster,
exchanging protocol-specific messages with each other. OSN-1 is the current leader of
that cluster. Right: a Fabric Peer invokes the Deliver RPC on the ordering service in
order to receive an ordered stream of blocks. Notice how neither the client, nor the
peer need to know who the leader is in order to have their requests served.
B. Background and motivation
Going straight to a BFT ordering service would have us debugging two
moving parts at the same time: (i) writing the BFT library, and (ii) fixing
the Fabric interfaces that wrap around it. This seemed like an unwise
endeavor. Raft, by contrast, is a simpler and well-understood class of
protocols. In that sense, the Raft experiment helps us lay the groundwork for
BFT: it lets us introduce interfaces and changes to the
implementation that only make sense in a BFT context, even though Raft is a
crash fault tolerant protocol. Finally, etcd/raft is a
polished, and actively maintained Raft implementation, which has been tested[1]
extensively.

[1] See, for instance: fault injections and integration testing with etcd as part of GKE.
C. Proposed design
Everything described from this point on constitutes proposed changes.
Past this point, the reader is expected to have a certain level of familiarity
with the Raft protocol and with the existing Fabric ordering service codebase.

When we talk about the "consenter plugin", we refer to the package that
implements the proposed consensus mechanism, and that we will need to develop.
This should (a) use the etcd/raft library and (b) satisfy the Consenter and
Chain interfaces defined in the orderer/consensus package of the Fabric
project tree.
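
For reference, here is an abridged sketch of those two interfaces, roughly as
they appear in the orderer/consensus package (consult the project tree for the
authoritative definitions; the ConsenterSupport type, which exposes the
blockcutter, the block writer, and the shared configuration, is elided):

// Abridged sketch of Fabric's orderer/consensus interfaces.
type Consenter interface {
    // HandleChain returns a Chain for the given channel. For the Raft
    // plugin, this is where the per-channel Raft machinery is wired up.
    HandleChain(support ConsenterSupport, metadata *cb.Metadata) (Chain, error)
}

type Chain interface {
    Order(env *cb.Envelope, configSeq uint64) error         // normal transactions
    Configure(config *cb.Envelope, configSeq uint64) error  // config transactions
    Errored() <-chan struct{} // closed when the chain becomes unavailable
    Start()                   // allocates resources; see "The Raft FSM" below
    Halt()                    // releases them
}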
A Raft cluster per channel

Every channel will run on a separate instance of the Raft protocol. Put
another way: a network that serves N channels means we have N Raft clusters,
each with their own leader. When the clusters are spread across different
sets of nodes, this further
decentralizes the service. Even when these clusters are spread across the same
nodes however (as we expect the current, default case to be)[2], this gives us
a separate leader per channel. And as noted elsewhere in
the document, we are laying the groundwork for BFT, and wish to move as few
parts as possible when that transition happens.

This might be obvious, but since it goes against the way the existing
consenter plugins work, it is worth spelling out: we use
the Raft protocol to order blocks, not individual transactions. Incoming
transactions are automatically routed by the ingress OSNs to the OSN that
serves as the current leader for the channel. (If the ingress OSN is the
current leader, no forwarding takes place.)
[2] This is the problem that led CockroachDB to multiraft. multiraft was unfortunately dismantled a
few months after its introduction, due to tight coupling with the project's storage engine.
Then, as is the case with existing consenter plugins:
1. The transaction is relayed to the consenter plugin via the chain's Order
   (or Configure) call.
2. The leader checks that the configuration sequence number at which this
   transaction was validated matches the current configuration sequence
   number. (It revalidates the transaction if that is not the case, and
   discards it if the revalidation fails — see the sketch after this list.)
3. The leader packs[3] the validated transactions into a block, and hands it
   to its local Raft FSM for ordering and replication to the
   receiving replicas.
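
For illustration, the revalidation step might look as follows inside the
leader's main loop (a sketch modeled on what the existing solo and kafka
chains do; support is the chain's ConsenterSupport):

// Sketch: revalidate a normal transaction if the channel configuration
// has advanced since the transaction was first validated.
if configSeq < support.Sequence() {
    if _, err := support.ProcessNormalMsg(env); err != nil {
        logger.Warningf("discarding invalid transaction: %s", err)
        continue // drop it and move on to the next message
    }
}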
The Raft FSM

At the heart of each consenter process run by an OSN, we need a Raft Finite
State Machine (FSM). This will most likely be initiated during the Start call
on the chain object.[4]
[3] Of course, as is the case with existing implementations, if the incoming transaction is of type
CONFIGURATION, the leader should cut a block with the batch of transactions (if any) it currently has
in buffer, then cut a block only with that CONFIGURATION transaction. Consult the existing
implementations for an example.
[4] This object is returned by the raft consenter plugin's HandleChain method.
We interact with this FSM in three ways:
1. Via a handling (infinite) loop, where we make sure to (a) increment its
   internal logical clock, and (b) read from it the data that we need to save,
   transmit, or apply.
2. Via its Propose call, through which we submit the data that
   the application wishes to have ordered (in our case —as noted above—
   blocks).
3. Via its Step call, through which we feed it the Raft messages received
   from the other nodes in the cluster.
To better understand the details of etcd/raft, the README file of the etcd/raft
package is a good starting point.
We will see how each of these three ways should be used or customized for our
purposes throughout this section.
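
As a rough sketch (assuming a raft.Node named n, a 100ms tick, and helper
functions for persistence, transport, and block delivery), the three
interaction points look like this:

ticker := time.NewTicker(100 * time.Millisecond)
for {
    select {
    case <-ticker.C:
        n.Tick() // (1a) advance the FSM's logical clock
    case rd := <-n.Ready():
        saveToStorage(rd.HardState, rd.Entries, rd.Snapshot) // (1b) persist first...
        send(rd.Messages)                                    // ...then transmit
        for _, entry := range rd.CommittedEntries {
            apply(entry) // committed entries are our ordered blocks
        }
        n.Advance()
    }
}

// (2) Submitting data for ordering (leader-only in our design):
//     n.Propose(ctx, blockBytes)
// (3) Feeding in messages received from other cluster nodes:
//     n.Step(ctx, raftMsg)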
The FSM needs a storage layer that can:
1. Continuously write Raft-related data (incl. uncommitted entries and the
   FSM's hard state) to persistent storage.
2. Persist the snapshots that the FSM produces.
3. Provide the latest snapshot and the most recently committed (ordered)
   entries back to the FSM on demand.

The etcd project ships packages that cover these needs, respectively:
1. coreos/etcd/wal
2. coreos/etcd/etcdserver/api/snap
3. MemoryStorage in coreos/etcd/raft[5]

These artifacts come into play as follows:
1. WAL files and snapshots are used when a node is recovering from a crash
   fault.
2. Snapshots and entries from the Storage implementation are used to sync
   up lagging replicas. The norm is for the storage to hold the entries
   recorded after the latest snapshot.
[5] We can develop the Raft-based ordering service using the default MemoryStorage, but we will need to
move to a disk-backed implementation for production. Entries in our case are blocks (which occupy a
non-trivial amount of space), and with an OSN serving multiple channels, we're looking at a very
limited number of entries per storage if all of them are to be held in memory.
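
A sketch of how these three packages might be wired together when a node
starts up (error handling and nil-snapshot checks elided; directory names are
illustrative; imports: coreos/etcd/{raft,wal,wal/walpb,etcdserver/api/snap}):

snapshotter := snap.New(snapDir)
snapshot, _ := snapshotter.Load() // most recent snapshot, if any

w, _ := wal.Open(walDir, walpb.Snapshot{
    Index: snapshot.Metadata.Index, // replay only entries past the snapshot
    Term:  snapshot.Metadata.Term,
})
_, hardState, entries, _ := w.ReadAll()

storage := raft.NewMemoryStorage() // the default, memory-backed Storage
storage.ApplySnapshot(*snapshot)
storage.SetHardState(hardState)
storage.Append(entries) // the FSM can now be (re)started from this state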
To garbage-collect, we need to:
1. Delete WAL files containing data older than the oldest snapshot (PR 1810
   has relevant references).
2. Compact the storage periodically (this needs to be done for the default
   memory-backed implementation); see the sketch below.
We will cover the first item in the "Snapshotting and garbage collection"
section below.
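
For the second item, a sketch of periodic compaction against the memory-backed
storage (snapIndex is the index of the latest snapshot; the catch-up margin is
an assumption, kept so that slightly-lagging replicas can still be served from
the log):

const catchUpEntries = 4 // illustrative safety margin for slow followers

if snapIndex > catchUpEntries {
    compactIndex := snapIndex - catchUpEntries
    if err := storage.Compact(compactIndex); err != nil && err != raft.ErrCompacted {
        logger.Panicf("unexpected compaction error: %s", err)
    }
}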
Encoding orderer endpoints in the configuration

CHANGE: Allow plugins to register their own configuration metadata translation
COMPONENT: configtxgen (encoder package)

A consensus protocol may require
additional metadata other than just the addresses and the TLS certificates of
the other nodes in the cluster. (For example: voting weights for each node.)
Additionally, down the road[7], we want to be able to have nodes that (a) simply
participate in consensus, (b) others that only receive and relay
transactions, and (c) others that do both. We lay the groundwork for this kind
of separation now.

[7] N.B. To begin with, all OSNs will both implement the AtomicBroadcast service and participate in
consensus, with no option to disable either function. Fabric clients/peers will be able to connect to
the OSNs the same way as before, by decoding the /Channel/Values[OrdererAddresses] list in a
channel's configuration. Down the road, enabling/disabling a role will be possible via a
configuration option, and the endpoints of all OSNs will be moved to their ordering service org
groups, as per FAB-7559.
The consenter set consists of those ordering service nodes that actively
participate in the consensus protocol for a channel. If we encode them as
plain endpoint strings (a host:port list, for
example), then we take a hit on extensibility. (Consider for instance the case
where we wish to assign a different weight to each node in the replica set, as
noted above.) We therefore opt for the richer message shown in Fig. 2:
message Consenter {
    string host = 1;
    uint32 port = 2;
    bytes client_tls_cert = 3;
    bytes server_tls_cert = 4;
}

Figure 2 — A spartan definition for a consenter node using the protocol buffers syntax.
We shall have this information stored in the channel configuration, which
means we need a way to
encode this data into the channel configuration and extract it from it.
To that effect, these are the changes we need to implement on the Fabric side
of things:
1. Extend the Orderer configuration interface in the channelconfig package
   with a ConsensusMetadata getter.
These are the changes that a consensus plugin author should implement: (We
describe them here in terms of the Raft
implementation.)
1. Define the
   metadata that the Raft cluster needs in order to operate. For Raft, with
   the Consenter message of Fig. 2 as the building block, this takes the form
   shown in Figs. 3-4, so
   that the configtxgen tool can generate a genesis block for a Raft-based
   ordering service.
2. Introduce a package (e.g. "etcd-raft") under protos/orderer, that (a)
   carries these message definitions, and (b) exposes the translation logic
   called by the encoder package when the Orderer config group is created.
14
// Metadata is serialized to a configuration block's
// consensus metadata value.
message Metadata {
    repeated Consenter consenters = 1;
}

Figure 3 — Each consensus implementation should define their own Metadata message.
message Consenter {
    string host = 1;
    uint32 port = 2;
    bytes client_tls_cert = 3;
    bytes server_tls_cert = 4;
    string msp_id = 5;
}

Figure 4 — For the Raft implementation, the metadata is a slice (array) containing the
connection info and the owning MSP info for each of the cluster's Raft nodes
(consenters).
etcd-raft:
    Consenters:
        - Host: consenter.org0
          Port: 7050
          ClientTLSCert: /path/to/TLS/clientCert0
          ServerTLSCert: /path/to/TLS/serverCert0
          MSPID: org0
        - Host: consenter.org1
          Port: 7050
          ClientTLSCert: /path/to/TLS/cert1
          ServerTLSCert: /path/to/TLS/serverCert1
          MSPID: org1
        - Host: consenter.org2
          Port: 7050
          ClientTLSCert: /path/to/TLS/cert2
          ServerTLSCert: /path/to/TLS/serverCert2
          MSPID: org2

Figure 5 — This section in configtx.yaml captures the initial state of the Raft cluster,
and is read by the configtxgen tool when creating the genesis block.
The only entities concerned with consenter nodes are the ordering service
nodes. Those that only respond to Broadcast/Deliver RPCs need it to relay the
incoming transactions to the node that can order them.
Following the proposed implementation above, all of the OSNs will have access
to this metadata through the Orderer configuration
interface via the SharedConfig getter, which is itself accessible via the
ConsenterSupport object.
How do we assign Raft node IDs to consenter nodes?

CHANGE: Extract Raft IDs from consenters slice of genesis block
COMPONENT: Raft plugin
Each OSN that's part of a Raft cluster needs to be given a unique ID. From the
etcd/raft README: "[...]
used only once even if the old node has been removed. [...] Node IDs must be
non-zero."
These IDs matter throughout the library. For example, the Raft FSM of an OSN
tags outgoing messages with node IDs, and it falls on the transport layer[8]
to map these IDs to endpoints.
There are two cases where we concern ourselves with node ID assignments:
[8] Which we will need to implement — see notes on transport layer in Section C.II.3 "How do cluster
nodes exchange messages with each other?"
When bootstrapping
When we bootstrap the ordering service, we are also bootstrapping the Raft
cluster[9]. It is essential that all Raft nodes parse the list of consenters in
the genesis block (see notes on consenter set in Section C.II.1 "Encoding
orderer endpoints in the configuration"), and assign to each of them the same
ID. Since we contained that list in a slice (see Fig. 3), every node can
iterate over it in the same order, and therefore derive the exact same ID
assignments.
Given the example in Fig. 5, each of the three nodes that are part of the
bootstrapping cluster should come up with the assignments shown in Fig. 6:
Figure 6 — Node ID assignments after parsing the genesis block generated by the data in
Fig. 5.
During this iteration, each consenter also compares its own TLS certificates
with the ones in the slice in order to detect its own node ID. Both its own
ID, and the list of IDs it collected during the slice iteration, should be
passed on to the Raft FSM constructor. (See Section C.I.3 "The Raft FSM".)

[9] In fact, until we introduce options that allow a node to implement just the AtomicBroadcast service
or just the consensus service, the ordering service is the Raft cluster, and we use the terms OSN,
consenter node, and Raft node interchangeably.
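
A sketch of this iteration (names are illustrative; the Consenter type is the
generated Go counterpart of the message in Fig. 4):

// assignIDs derives the same ID mapping on every node, and detects our own
// ID by comparing TLS certificates against the consenters slice.
func assignIDs(consenters []*Consenter, ownServerCert []byte) (self uint64, peers []uint64) {
    for i, c := range consenters {
        id := uint64(i + 1) // Raft node IDs must be non-zero
        peers = append(peers, id)
        if bytes.Equal(c.ServerTlsCert, ownServerCert) {
            self = id // this slice entry describes us
        }
    }
    return self, peers
}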
When modifying the consenter set via Fabric configuration update transactions
Going back to the example in Figs. 5-6, assume that we push a Fabric
configuration update transaction that removes one of the consenters[10],
followed later by one that adds a new consenter to the set.
Notice what happens here. If we were to keep inferring the ID of a Raft node
from its position in the consenters slice, the newcomer could be assigned an
ID that has already been used, violating the rule quoted above. Inferring IDs
from the slice is therefore only safe when
bootstrapping.

[10] (Set aside the fact that in this example the cluster would temporarily run with just 2 nodes,
which is a bad idea. This is just an example.)
To deal with this, we will:
1. Persist the current ID-to-consenter mapping — as well as the highest
   ID assigned so far — in the metadata of every configuration block (under
   the ORDERER metadata index). This is in line with how we use this field today to write/read
   the Kafka partition offsets. The value written in the metadata field
   also encodes whether the latest configuration update added a node to the
   consenter set, removed one, or left the consenter set unchanged. (A sketch
   follows this list.) This bookkeeping is kept simple because:
2. We reject Fabric configuration
   update transactions modifying the consenter set by more than one node.
   This restriction makes it easier for us to handle the case where the
   leader fails partway through a membership change (see the Type B
   discussion at the end of this document).
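
As an illustration, persisting this bookkeeping could look as follows (the
RaftMetadata message and its field names are hypothetical;
BlockMetadataIndex_ORDERER is the same slot the Kafka plugin uses for its
offsets):

// Sketch: stash the ID bookkeeping in the ORDERER slot of the block metadata.
raftMD := &RaftMetadata{ // hypothetical message defined by the plugin
    ConsenterIds: ids,         // current ID-to-consenter mapping
    NextRaftId:   highest + 1, // highest ID assigned so far, plus one
}
block.Metadata.Metadata[cb.BlockMetadataIndex_ORDERER] = utils.MarshalOrPanic(
    &cb.Metadata{Value: utils.MarshalOrPanic(raftMD)},
)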
How do cluster nodes exchange messages with each other?

CHANGE: Allow consensus plugins to bind their own gRPC services to the server
used by AtomicBroadcast
COMPONENT: orderer (server package)

Cluster nodes exchange two types of messages:
1. Fabric transactions. When these are
   received by an OSN that is not the leader for the channel, they need to
   be relayed to the leader (see the earlier notes on "we use the Raft
   protocol to order blocks").
2. Raft-specific messages (see the message types listed in etcd's raftpb
   package). The etcd/raft library does not
   implement network transport; the Raft FSM simply tags outgoing messages
   with the recipient's node ID, and leaves it to the application to deliver
   these messages.

To support both, we need to:
1. Define a new gRPC service — call it Cluster — for intra-cluster
   communication.
2. Create a handler for this service, i.e. an object that satisfies the
Cluster interface.
3. Write a transport package for the plugin that gives us access to a gRPC
client that can invoke these RPCs. (This will be used by the Raft FSM.)
On the Fabric side of things, we need to:
1. Allow consenter plugins to register their own handlers with the same gRPC
   server used by the AtomicBroadcast service (i.e. give plugins access to
   that grpcServer object).
The Cluster service will be defined under protos/orderer
in the Fabric tree. It consists of two RPCs, Propose and Step — see Fig. 7 for
a tentative definition.
service Cluster {
    rpc Propose(common.Envelope) returns (ProposeResponse); // response type is illustrative
    rpc Step(StepRequest) returns (StepResponse); // StepRequest wraps a raftpb.Message, see footnote 11
}

Figure 7 — The two RPCs that the new Cluster gRPC service should support.
● Propose: The clients are OSNs that (a) received a transaction from a Fabric
  client, and (b) are currently operating as replicas on the Raft cluster.
  They invoke this RPC to submit the incoming transactions to the cluster
  leader. On the leader's side:
  i. incoming transactions are placed into the channel's blockcutter, and
  then
  ii. When a block is cut locally on the leader, pass the proposed
  block to the local Raft FSM for ordering.
● Step: The clients are OSNs, asked by their local Raft FSM to send messages to
  other nodes in the cluster. They invoke this RPC in the handling loop,
  whenever the FSM hands them outgoing messages.
Since an OSN may belong to multiple Raft clusters (see Section C.I.1 "A Raft
cluster per channel"), on the client side each invocation of the Cluster RPCs
needs to identify the channel it refers to. We can pass
it as a gRPC metadata key-value pair during the RPC invocation. On the server
side, the handler for these RPCs should maintain a mapping between a
channel/cluster and the corresponding Raft FSM, so that we can pipe the Propose
and Step requests to the right instance.[11]
[11] In the Propose case, we can find out the cluster by extracting the envelope's header, similar
to how the broadcast handler does it. For the Step RPC, we will need to create a container proto
message for raftpb.Message that includes a cluster/channel field.
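
On the server side, the demultiplexing could be sketched like so (StepRequest
is the hypothetical container message from the footnote above, with a Channel
field and a Payload field carrying the raftpb.Message):

type clusterHandler struct {
    mu   sync.RWMutex
    fsms map[string]raft.Node // channel name -> that channel's Raft FSM
}

func (h *clusterHandler) Step(ctx context.Context, req *StepRequest) (*StepResponse, error) {
    h.mu.RLock()
    n, ok := h.fsms[req.Channel] // the container proto carries the channel
    h.mu.RUnlock()
    if !ok {
        return nil, fmt.Errorf("channel %s is not served here", req.Channel)
    }
    return &StepResponse{}, n.Step(ctx, *req.Payload) // pipe into the right FSM
}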
A transport package for the Raft plugin
Fabric's core/comm package, along with the client stub that is automatically
generated by gRPC when implementing the Cluster service, should get us going
here. For every other node in the cluster, the transport layer pairs a connection
with a gRPC client. When the cluster membership changes, this client
pool should be updated accordingly. Whenever the
Raft FSM asks to send a message to another node, we tap into the gRPC client
for that node and invoke the Step RPC. Two details deserve attention:
1. We need to report the status of transmissions back to the Raft FSM via
   its ReportUnreachable and ReportSnapshot methods (see the sketch after
   this list).
2. We need a check (building on the mapping from Section C.II.2 "How do
   we assign Raft node IDs to consenter nodes?") that checks whether the
   destination ID still maps to an active consenter before dialing it.[12]
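
A sketch of both concerns in the send path (all names are illustrative; wrap
stands in for the container-message construction discussed earlier):

func (t *transport) send(msgs []raftpb.Message) {
    for _, m := range msgs {
        client, ok := t.clients[m.To] // Raft node ID -> gRPC client (item 2's check)
        if !ok {
            t.node.ReportUnreachable(m.To) // not an active consenter anymore
            continue
        }
        if _, err := client.Step(context.TODO(), t.wrap(m)); err != nil {
            t.node.ReportUnreachable(m.To) // item 1: tell the FSM about failures
            if m.Type == raftpb.MsgSnap {
                t.node.ReportSnapshot(m.To, raft.SnapshotFailure)
            }
        }
    }
}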
Recall that every consenter plugin that is loaded by the ordering service
is initialized with a constructor call when the orderer process starts.
It is during this stage that we should pass a reference to the ordering
service's gRPC server to the Raft plugin. The plugin's
constructor should then use that reference to bind its ClusterServer handler
to it.

[12] Be mindful of this corner case: https://github.com/docker/swarmkit/issues/436
See Fig. 8 for an infinitely less verbose and better explanation of the above:

func Start(...) { // This gets executed when launching the orderer process
    ...
    initializeMultichannelRegistrar(..., grpcServer, ...)
}

func initializeMultichannelRegistrar(...) {
    consenters := make(map[string]consensus.Consenter)
    consenters["solo"] = solo.New()
    consenters[etcdraftprotos.PluginName] = etcdraftplugin.New(grpcServer)
    ...
}

Figure 8 — An example of how the orderer's server package should be modified so that the
Raft plugin gets a reference to the orderer's gRPC
server object. What's implied here is that the raftplugin's New constructor will invoke
RegisterClusterServer(grpcServer, customHandler).
Snapshotting and garbage collection

A Raft snapshot captures the application state, the
configuration state, and the index/term of the most recent Raft log
entry it covers. Snapshots serve two purposes:
1. They allow the Raft log (and the WAL files) to be garbage-collected:
   everything older than the snapshot can be deleted.
2. A node that is recovering from a crash failure (or is simply
   restarting), typically loads the most recent snapshot, and uses the
   committed entries that follow it to catch up.
Until Fabric gets support for state DB snapshots, let us define the
application snapshot as simply the most recently committed block[13]:

message Snapshot { // the name of the wrapping message is illustrative
    commonprotos.Block newest = 1;
}

Figure 9 — The application snapshot for the ordering service consists simply of the most
recently committed block. For the deletion of data made obsolete by a snapshot,
see the notes above on garbage collection.
Replicas that have fallen behind will use this snapshot information, as in the
following example:
1. Lagging replica R1 just got reconnected to the network. Its latest block
   is 100.
2. Leader L is at block 196, and is configured to snapshot every 20 blocks.
3. R1 will receive a snapshot with newest set to 180. It should then issue
   Deliver requests to fetch blocks 101 through 180 from the other OSNs, as
   shown in the sketch after this list.
4. Once R1 has received block 180, L should invoke the ReportSnapshot
   method on its local FSM, to mark the snapshot transfer as complete.
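
From R1's point of view, the catch-up could be sketched as follows (pullBlock
stands in for a Deliver-based fetch from another OSN; the Snapshot message is
the one from Fig. 9):

func (c *chain) catchUp(raftSnap raftpb.Snapshot) error {
    appSnap := &Snapshot{} // the application snapshot of Fig. 9
    if err := proto.Unmarshal(raftSnap.Data, appSnap); err != nil {
        return err
    }
    target := appSnap.Newest.Header.Number // 180 in the example above
    for seq := c.lastBlock().Header.Number + 1; seq <= target; seq++ {
        c.writeBlock(c.pullBlock(seq)) // fetch blocks 101..180 via Deliver
    }
    return nil
}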
CHANGE: Decouple the creation of new block proposals from the writing of
blocks to the ledger
COMPONENT: blockwriter
[13] Once the peer state database supports pruning, the snapshot message could be extended with
additional fields (oldest, oldest_config, newest_config) to support range syncing with arbitrary
snapshot and prune points.
Prevent Fabric configuration changes from conflicting with Raft configuration
messages

A Raft configuration change is encoded in a dedicated
message that should be passed to (and committed by) the Raft FSM to
take effect.
Fabric configuration updates fall into two types:
A. Those that leave the consenters slice (see Section C.II.2 "How do we
   assign Raft node IDs to consenter nodes?") untouched, for which the
   existing processing path suffices.
B. Those that modify the consenters slice. Per the restriction above, we
   reject any update that
   proposes to modify the consenter set by more than one node (i.e.
   adding or removing two or more consenters in a single transaction).
We are looking then at the following logic that needs to be implemented at the
consenter plugin:
1. Detection of whether a configuration update modifies the consenter set.
2. Rejection of those updates that modify the set by more than 1 node (this
   is the restriction introduced in Section C.II.2; see the sketch below).
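
A sketch of this validation (diffConsenters is a hypothetical helper that
computes the symmetric difference of the two consenter sets):

func validateConsenterUpdate(current, updated []*Consenter) error {
    added, removed := diffConsenters(current, updated)
    if len(added)+len(removed) > 1 {
        return errors.New("a config update may change the consenter set by at most one node")
    }
    return nil // Type A (unchanged set), or a legal Type B (single-node change)
}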
With that in mind, let's see how the two types of Fabric configuration updates
are handled.

Type A

On the Fabric side of things, the broadcast handling part remains unchanged.
On the consenter plugin side, we follow the standard path for processing
configuration transactions:
1. If the OSN receiving the transaction is the Raft cluster leader: it cuts
   a block with the batch of transactions (if any) it currently has in
   buffer, then cuts a block with just that CONFIGURATION transaction, and
   orders both via its Raft FSM.
2. If it is not the leader: it relays the
   CONFIGURATION transaction to the leader via the new Propose RPC (see
   Section C.II.3 "How do cluster nodes exchange messages with each
   other?").
Type B

When the proposed update reaches the leader (whether directly, or relayed to the
leader via a follower replica), the leader should act as in Type A above
(cut and order the current batch, then order the proposed Fabric
configuration)[14], and it should follow this up by ordering the
corresponding Raft configuration change message.

N.B. When a configuration change is pending ordering, the leader should not
attempt to order any other block. This implies that the leader has two queues:
one for configuration, and one for non-configuration transactions. The former
takes precedence.

Whenever a Fabric configuration block is created, the leader should add to that
block's metadata the Raft node IDs of the cluster, as well as the highest
assigned Raft node ID, per the guidelines given in Section C.II.2 "How do we
assign Raft node IDs to consenter nodes?". Note that a leader may crash after
ordering a Fabric configuration block, but before it has
actually ordered and committed the actual Raft configuration change. This can
mean, for example, that a newly added
node will actually never be referenced by any Raft FSM in the cluster. We
guard against this as follows: upon assuming
leadership, the leader checks the most recent committed entry in its log; if
it corresponds to a Fabric configuration block whose matching Raft
configuration change has not been committed, the new leader creates
and orders the corresponding Raft configuration change, before ordering any
other block.[15]
[14] For the leader to be able to order blocks before they're appended to the ledger, we need to change
the way the blockwriter works in Fabric today. Specifically, it is not possible to create a second
new block in a row (read: invoke CreateNextBlock twice), unless the first new block has been written
to the ledger (see: WriteBlock method). We should remove this tight coupling between the
CreateNextBlock and WriteBlock methods.
[15] This means that the leader actually maintains three queues locally, listed in ascending priority:
one for normal blocks, one for configuration blocks, and one for Raft configuration messages.