
Fabric Proposal: A Raft-Based Ordering Service

Author: Kostas Christidis <kostas AT christidis DOT io>

A. Summary

A.I. What is being proposed?

A.II. How does this affect Fabric users?

B. Background and motivation

B.I. Why not BFT directly?

B.II. How does Raft help in that regard?

B.III. Which Raft library? Why?

C. Proposed design

C.I. A few high-level notes to get us started

C.I.1. A Raft cluster per channel

C.I.2. What do we order?

C.I.3. The Raft FSM

C.I.4. Raft data and garbage collection

C.II. Issues we need to address

C.II.1. Encoding orderer endpoints in the configuration

Introduce the notion of a consenter set

How is the consenter set information consumed?

C.II.2. How do we assign Raft node IDs to consenter nodes?

When bootstrapping

When modifying the consenter set via Fabric configuration update transactions

C.II.3. How do consenter nodes exchange messages with each other?

A new gRPC service

A transport package for the Raft plugin

Allowing consenter plugins to register their own handler

C.II.4. What do we snapshot?

C.II.5. How do Fabric configuration updates flow throughout this system?

Type A

Type B

A. Summary

A.I. What is being proposed?

A Raft-based ordering service for Fabric.

A.II. How does this affect Fabric users?

For those who ​deploy​ Fabric networks —

1. They should be able to use the Raft-based service as an alternative to

the Kafka-based service, with the same trust assumptions.


2. Migrating an existing Kafka network to a new Raft-based one will ​not​ be

supported out of the gate.


3. Administrators should expect setup and operation to be easier; there are

no third-party dependencies to take care of; everything is managed by

the same Fabric process.

For those who ​interact​ with Fabric networks (i.e. those who operate clients or

peers) —

1. A SUCCESS status code from an invocation of the Broadcast RPC used to

indicate that the transaction was guaranteed to be committed to the

blockchain, ​if​ it were still valid come blockchain-appending time (i.e.

past the consensus/ordering stage). This will no longer be the case. The

Raft cluster leader may crash before committing the entry into a block.

It is up to the user to attempt to re-submit their transaction if it's

not committed in whatever amount of time they deem to be reasonable.

See ​Fig. 1​ for the obligatory diagram. The cloud in the middle shows one

possible configuration of the Raft-based ordering service. The number of

ordering service nodes (OSNs) may differ.

Figure 1 — Left: a client that uses the Fabric client SDK to invoke the Broadcast RPC on the ordering service and have the proposal submitted for ordering. Center: the Raft-based ordering service, consisting of 3 OSNs. Each of these nodes implements a Raft finite state machine (Raft FSM). Collectively, these nodes form a Raft cluster, exchanging protocol-specific messages with each other. OSN-1 is the current leader of that cluster. Right: a Fabric Peer invokes the Deliver RPC on the ordering service in order to receive an ordered stream of blocks. Notice how neither the client nor the peer needs to know who the leader is in order to have their requests served.

B. Background and motivation

B.I. Why not BFT directly?

If we were to go straight to a BFT solution, we would have to juggle two

moving parts at the same time: (i) writing the BFT library, and (ii) fixing

the Fabric interfaces that wrap around it. This seemed like an unwise

endeavor.

B.II. How does Raft help in that regard?

Raft is a leader-based protocol, as is BFT. Integrating a well-tested,

leader-based consensus protocol, means we get to focus on improving the

interfaces Fabric provides to consensus plugin authors for this family of

protocols. In that sense, the Raft experiment helps us lay the groundwork for

the PBFT-based ordering service.

In the same spirit, we make certain decisions in the proposed

implementation that only make sense in a BFT context, even though Raft is a

CFT protocol. We do that on purpose; when we move to the BFT-based service, we

want to keep the number of Fabric core changes to a minimum.

B.III. Which Raft library? Why?

CoreOS's implementation, the same one that is used in etcd (hereinafter: etcd/raft). It is written in the same language as Fabric; it is a popular, polished, and actively maintained Raft implementation that has been tested rigorously1. Finally, its license is compatible with Fabric.

1
See, for instance: fault injections and integration testing with etcd as part of GKE.

C. Proposed design


Everything described from this point on constitutes proposed changes.

At the beginning of each section, we group the relevant changes in a table,

along with a note of the affected component: "Raft plugin" if it belongs to

the Raft consensus plugin implementation, or "​Fabric​" if it's a change that

affects the Fabric orderer core codebase.

Past this point, the reader is expected to have a certain level of familiarity

with how consensus implementations generally work in Fabric — solo provides a

good barebones example.

When we talk about the "consenter plugin", we refer to the package that

implements the proposed consensus mechanism, and that we will need to develop.

This should (a) use the etcd/raft library and (b) satisfy the Consenter and

Chain​ interfaces. Following the existing conventions, it should be placed

under orderer/consensus — perhaps in a package named etcd-raft — in the Fabric

project tree.

C.I. A few high-level notes to get us started

C.I.1. A Raft cluster per channel

Every channel will run on a separate instance of the Raft protocol. Put

another way: a network that serves N channels means we have N Raft clusters,

each with their own leader. When the clusters are spread across different

nodes (as we would like it to be the case eventually), this further

decentralizes the service. Even when these clusters are spread across the same

nodes however (as we expect the current, default case to be), this gives us

the ability to have different leaders per channel.

The drawbacks of this decision are:

1. It is redundant for the trust domain that Raft operates in — it only

makes sense in the BFT context.


2. It incurs a non-negligible amount of overhead per channel, in the form

of redundant heartbeat messages and goroutines2.

Nonetheless, we decide to move forward with this now; as we noted earlier in

the document, we are laying the groundwork for BFT, and wish to move as few

pieces as possible when it's time to switch.

C.I.2. What do we order?

This might be obvious, but since it goes against the way the existing

production option (Kafka) works, it is worth spelling out explicitly: we use


the Raft protocol to order ​blocks.

In a departure from how we handle things today:

1. Transactions (i.e. proposals, or configuration updates) should be

automatically routed by the ingress OSNs to the OSN that serves as the

current leader for the channel. (If the ingress OSN ​is​ the current

leader, this step is omitted.)

2
This is the problem that led CockroachDB to multiraft. multiraft was unfortunately dismantled a
few months after its introduction​, due to tight coupling with the project's storage engine.

Then, as is the case with existing consenter plugins:

2. The leader checks that the configuration sequence number at which this

transaction was validated is equal to the current configuration sequence

number. (It revalidates the transaction if that is not the case, and

rejects it if it fails this validation check.)


3. The leader passes the incoming transaction to the ​blockcutter's Ordered

method3; if a block is not formed yet, it does nothing. Otherwise, it

creates a candidate block.

Here's where we diverge from existing implementations:

4. If a block is formed, the leader OSN ​proposes​ it to the local Raft

Finite State Machine (see Section C.I.3 "​The Raft FSM​").


5. The FSM will then attempt to replicate the block to enough OSNs so that

it can become committed.


6. It is only then that the block can be ​written to the local ledger​ of the

receiving replicas.
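Below is a minimal sketch of the leader-side path described in steps 2-6. It assumes the ConsenterSupport and blockcutter interfaces that the existing plugins (solo, Kafka) already consume, with cb aliasing Fabric's protos/common package; the chain type, the validatedSeq argument, and the proposeToRaft helper are hypothetical stand-ins for the Raft FSM interaction covered in Section C.I.3.

// ordered runs on the leader for every transaction that reaches it, either
// directly via Broadcast or relayed by a follower OSN.
func (c *chain) ordered(env *cb.Envelope, validatedSeq uint64) error {
    // Step 2: revalidate if the channel configuration has changed since the
    // transaction was first validated.
    if validatedSeq < c.support.Sequence() {
        if _, err := c.support.ProcessNormalMsg(env); err != nil {
            return err // reject: the transaction is no longer valid
        }
    }
    // Step 3: hand the transaction to the blockcutter.
    batches, _ := c.support.BlockCutter().Ordered(env)
    // Step 4: for every batch that is ready, cut a candidate block and
    // propose it to the local Raft FSM.
    for _, batch := range batches {
        block := c.support.CreateNextBlock(batch)
        if err := c.proposeToRaft(block); err != nil { // hypothetical helper
            return err
        }
    }
    // Steps 5-6 (replication and WriteBlock on the replicas) happen only
    // after Raft commits the corresponding entry.
    return nil
}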

C.I.3. The Raft FSM

At the heart of each consenter process run by an OSN, we need a Raft Finite

State Machine (FSM). This will most likely be initiated during ​the Start call

on the object that satisfies ​the Chain interface4.

3
Of course, as is the case with existing implementations, if the incoming transaction is of type
CONFIGURATION, the leader should cut a block with the batch of transactions (if any) it currently has
in buffer, then cut a block only with that CONFIGURATION transaction. Consult the existing
implementations ​for an example​.
4
This object is returned by the raft consenter plugin's ​HandleChain method​.

We interact with this FSM in three ways:

1. Via a handling (infinite) loop, where we make sure to (a) ​increment its

internal logical clock​, and (b) ​read from it data​ that we need to save

to disk and/or replicate to other nodes in the cluster.


2. Via the provided ​Propose​ / ​ProposeConfChange​ methods for messages that

the application wishes to have ordered (in our case —as noted above—

these are blocks)


3. Via ​the provided Step method​ for messages generated by other Raft nodes.

To better understand the details of etcd/raft, the README file in the raft package of the etcd repository is a good read.

We will see how each of these three ways should be used or customized for our

needs in the sections that follow.
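As a point of reference for interaction mode 1, a minimal sketch of the canonical etcd/raft handling loop follows. The raft.Node value is what the FSM constructor returns; the tick interval, the exitChan field, and the saveToWAL, send, and apply helpers are hypothetical stand-ins for the WAL, transport, and ledger plumbing discussed in Sections C.I.4 and C.II.3.

// run drives the Raft FSM for one channel.
func (c *chain) run(node raft.Node) {
    ticker := time.NewTicker(100 * time.Millisecond) // illustrative tick interval
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            node.Tick() // (1a) advance the FSM's internal logical clock
        case rd := <-node.Ready():
            // (1b) persist state and entries before transmitting messages.
            c.saveToWAL(rd.HardState, rd.Entries)
            c.send(rd.Messages) // hand outgoing messages to the transport layer
            for _, entry := range rd.CommittedEntries {
                c.apply(entry) // committed blocks can now be written to the ledger
            }
            node.Advance()
        case <-c.exitChan:
            node.Stop()
            return
        }
    }
}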

C.I.4. Raft data and garbage collection

CHANGE | COMPONENT
Configuration settings for snapshot frequency, snapshot count, WAL file size, Raft storage size | Raft plugin

Table 1 — Summary of proposed changes in this section

In broad strokes, an application that integrates etcd/raft is expected to:

1. Continuously write Raft-related data (incl. uncommitted entries and

state) to write-ahead logs (​WAL files​).


2. Periodically take ​snapshots​ of the log that is sequenced by Raft.

3. Provide the latest snapshot and the most recently committed (ordered)

entries via an implementation of ​the ​Storage​ interface​.

The etcd repository provides implementations for all of the above:

1. coreos/etcd/wal

2. coreos/etcd/etcdserver/api/snap

3. MemoryStorage​ in coreos/etcd/raft5

How these are used:

1. WAL files and snapshots are used when a node is recovering from a crash

fault.6
2. Snapshots and entries from the Storage implementation are used to sync

up lagging replicas. The norm is for the storage to hold entries after a

certain snapshot. If a replica has fallen behind significantly, it will

need to be sent a snapshot to catch up.

When it comes to garbage collection, we should proceed as follows —

5
We can develop the Raft-based ordering service using the default MemoryStorage, but we ​will need​ to
move to a disk-backed implementation for production. Entries in our case are blocks (which occupy a
non-trivial amount of space), and with an OSN serving multiple channels, we're looking at a very
limited number of entries per storage if all of them are to be held in memory.
6

1. Delete WAL files containing data older than the oldest snapshot. ​PR 1810

in coreos/etcd and ​PR 1327​ in docker/swarmkit should be handy

references.
2. Compact the storage​ periodically (this ​needs​ to be done for the default

memory-backed implementation).
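A hedged sketch of how the snapshotting and compaction steps above could look when built on etcd/raft's MemoryStorage; the snapshotInterval, lastSnapshotIndex, and saveSnapshot names are assumptions for this sketch.

// maybeSnapshot is called from the handling loop after committed entries have
// been applied. appliedIndex is the Raft index of the last applied entry,
// confState the cluster membership at that point, and data the serialized
// application snapshot (see Section C.II.4).
func (c *chain) maybeSnapshot(appliedIndex uint64, confState raftpb.ConfState, data []byte) error {
    if appliedIndex-c.lastSnapshotIndex < c.snapshotInterval {
        return nil // not enough new entries since the last snapshot
    }
    snap, err := c.memoryStorage.CreateSnapshot(appliedIndex, &confState, data)
    if err != nil {
        return err
    }
    if err := c.saveSnapshot(snap); err != nil { // persist via the snap package
        return err
    }
    // Garbage collection: drop in-memory entries that predate the snapshot;
    // WAL files older than it can be released separately (step 1 above).
    if err := c.memoryStorage.Compact(appliedIndex); err != nil && err != raft.ErrCompacted {
        return err
    }
    c.lastSnapshotIndex = appliedIndex
    return nil
}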

We will need to decide on:

1. The log data that our snapshot function captures

2. How often we create snapshots

3. How many snapshots we keep around

4. The filesize of each WAL file

5. The number of entries in Raft's storage

We will cover the first item in the snapshotting section below (see Section

C.II.4 "​What do we snapshot?​"). Items 2 to 5 will be exposed as configuration

options that the OSN administrator can adjust.

C.II. Issues we need to address

C.II.1. Encoding orderer endpoints in the configuration

CHANGE | COMPONENT
Add new configuration section for consenters | Fabric
Add setters and getters for consensus metadata | Fabric
Allow plugins to register their metadata messages for proto to JSON translation | Fabric
Create Raft metadata proto | Raft plugin
Write marshaling function to encode Raft metadata into configuration | Raft plugin
Register metadata message factory with Fabric orderer protos for proto to JSON translation | Raft plugin
Add consenter section to configtx.yaml | Raft plugin

Table 2 — Summary of proposed changes in this section

The problem: When bootstrapping, a consensus implementation might need

additional metadata other than just the addresses and the TLS certificates of

the other nodes in the cluster. (For example: voting weights for each node.)

Additionally, down the road7, we want to be able to have nodes that (a) simply respond to Broadcast/Deliver RPCs (i.e. they implement the AtomicBroadcast service), (b) others that just participate in the consensus/ordering of transactions, and (c) others that do both. We lay the groundwork for this kind of separation now.

7
N.B. To begin with, all OSNs will both implement the AtomicBroadcast service and participate in consensus, with no option to disable either function. Fabric clients/peers will be able to connect to the OSNs the same way as before, by decoding the /Channel/Values[OrdererAddresses] list in a channel's configuration. Down the road, enabling/disabling a role will be possible via a configuration option, and the endpoints of all OSNs will be moved to their ordering service org groups, as per FAB-7559.

We suggest the following changes.

Introduce the notion of a consenter set

The consenter set consists of those ordering service nodes that actively

participate in the consensus mechanism. How and where do we define this

consenter set in Fabric?

If we were to lock in a definition for the consenter (see Fig. 2 as an

example), then we take a hit on extensibility. (Consider for instance the case

where we wish to assign a different weight to each node in the replica set, as

might be the case in a BFT implementation.)

message Consenter {
    string host = 1;
    uint32 port = 2;
    bytes client_tls_cert = 3;
    bytes server_tls_cert = 4;
}

Figure 2 — A spartan definition for a consenter node using the protocol buffers syntax.

We will instead allow each consensus implementation to define their own consenter message, or more generally their own metadata. This should capture all the information necessary for the consenter cluster to function.

We shall have this information stored in

/Channel/Orderer/Values[ConsensusType]. We therefore need a means to both

encode this data into the channel configuration and extract it from it.

To that effect, these are the changes we need to implement on the Fabric side

of things:

1. Extend ​the existing ConsensusType message definition​ and ​its setter​ so

that they include a generic metadata byte slice.


2. Allow consensus implementations to ​register​ their metadata messages, so

that the protolator tool can work with them.


3. Expand ​the channelconfig.Orderer interface​ with an opaque

ConsensusMetadata getter.

These are the changes that a consensus plugin author should implement: (We

refer to "Raft" here, but this can be generalized to any custom

implementation.)

1. Add to ​the Orderer section in configtx.yaml​ a subsection containing the

metadata that the Raft cluster needs in order to operate. For Raft, with

a suggested Metadata proto definition as shown in ​Figs. 3-4​, the sample

"Raft" subsection for configtx.yaml is shown in ​Fig. 5​. We do this so

that the configtxgen tool can generate a genesis block for a Raft-based

ordering service.
2. Introduce a package (e.g. "etcd-raft") under protos/orderer, that (a)

contains the Metadata proto, (b) provides a marshaling function to be

called by the encoder package when the Orderer config group is created,

and (c) registers the package's ConsensusTypeMetadataFactory with

Fabric's orderer protos (for protolator to work).

// Metadata is serialized to a configuration block's
// ConsensusType metadata when the type is 'etcd-raft'.
message Metadata {
    repeated Consenter consenters = 1;
}

Figure 3 — Each consensus implementation should define their own Metadata message.

// Consenter represents a single consenting node (replica),
// as referenced by the 'etcd-raft' Metadata message.
message Consenter {
    string host = 1;
    uint32 port = 2;
    bytes client_tls_cert = 3;
    bytes server_tls_cert = 4;
    string msp_id = 5;
}

Figure 4 — For the Raft implementation, the metadata is a slice (array) containing the connection info and the owning MSP info for each of the cluster's Raft nodes (consenters).

# Defines configuration which must be set
# when the "etcd-raft" orderer type is chosen.
etcd-raft:
    # The set of Raft-enabled consenting endpoints for this network.
    Consenters:
        - Host: consenter.org0
          Port: 7050
          ClientTLSCert: /path/to/TLS/clientCert0
          ServerTLSCert: /path/to/TLS/serverCert0
          MSPID: org0
        - Host: consenter.org1
          Port: 7050
          ClientTLSCert: /path/to/TLS/cert1
          ServerTLSCert: /path/to/TLS/serverCert1
          MSPID: org1
        - Host: consenter.org2
          Port: 7050
          ClientTLSCert: /path/to/TLS/cert2
          ServerTLSCert: /path/to/TLS/serverCert2
          MSPID: org2

Figure 5 — This section in configtx.yaml captures the initial state of the Raft cluster, when the ordering service is bootstrapped.

How is the consenter set information consumed?

The only entities concerned with consenter nodes are the ordering service

nodes. Those that only respond to Broadcast/Deliver RPCs need it to relay the

incoming transactions to consenters for ordering, and the consenters

themselves need it to identify the other nodes in the cluster.

Following the proposed implementation above, all of the OSNs will have access

to a ConsensusMetadata getter that allows them to retrieve the consenter set;

an OSN has access to an object satisfying the channelconfig.Orderer

interface​ via ​the SharedConfig getter​ which is itself accessible via ​the

ConsenterSupport interface​ passed on during the HandleChain call.
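A minimal sketch of how the Raft plugin could consume this, assuming the ConsensusMetadata getter proposed above is added to channelconfig.Orderer and that the plugin's protos live in the etcd-raft package of Fig. 3 (imported here as etcdraftprotos); newChain is a hypothetical constructor.

// HandleChain is invoked by the multichannel Registrar for every channel this
// consenter serves; the consenter set is recovered from the channel config.
func (c *consenterImpl) HandleChain(support consensus.ConsenterSupport, _ *cb.Metadata) (consensus.Chain, error) {
    // SharedConfig() exposes the channelconfig.Orderer view of the channel;
    // ConsensusMetadata() is the opaque getter proposed in this section.
    raw := support.SharedConfig().ConsensusMetadata()
    metadata := &etcdraftprotos.Metadata{} // the Metadata message of Fig. 3
    if err := proto.Unmarshal(raw, metadata); err != nil {
        return nil, err
    }
    // metadata.Consenters now holds the host/port/TLS/MSP info of Figs. 4-5.
    return newChain(support, metadata.Consenters)
}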

C.II.2. How do we assign Raft node IDs to consenter nodes?

CHANGE | COMPONENT
Extract Raft IDs from consenters slice of genesis block | Raft plugin
Persist ID-to-consenter mapping to disk | Raft plugin

Table 3 — Summary of proposed changes in this section

Each OSN that's part of a Raft cluster needs to be given a unique ID. From ​the

etcd/raft README​:

An ID represents a unique node in a cluster for all time. A given ID MUST be

used only once even if the old node has been removed. [...] Node IDs must be

non-zero.

For example, the Raft FSM of an OSN tags outgoing messages with node IDs, and it is up to the transport layer8 to relay the messages to the right recipients.

These ID assignments should be unique on a per-channel basis.

There are two cases where we concern ourselves with node ID assignments:

1. When bootstrapping, and

2. When we update the cluster by adding/removing consenters via Fabric

configuration update transactions. (See Section C.II.5 "​How do Fabric

configuration updates flow throughout this system?​".)

8
Which we will need to implement — see notes on transport layer in Section C.II.3 "​How do cluster
nodes exchange messages with each other?​"

When bootstrapping

When we bootstrap the ordering service, we are also bootstrapping the Raft

cluster9. It is essential that all Raft nodes parse the list of consenters in

the genesis block (see notes on consenter set in Section C.II.1 "​Encoding

orderer endpoints in the configuration​"), and assign to each of them the same

ID. Since that list is contained in a slice (see Fig. 3), every node can

iterate over that slice deterministically, so we're good. (Put differently: we

can infer the ID of a consenter by its index in the consenters slice.)

Given the example in ​Fig. 5​, each of the three nodes that are part of the

bootstrapping cluster should come up with the assignments shown in ​Fig. 6​:

NODE ID | HOST:PORT | TLS CLIENT/SERVER CERT BYTES | MSP ID
1 | consenter.org0.org:7050 | foo_client/foo_server | foo
2 | consenter.org1.org:7050 | bar_client/bar_server | bar
3 | consenter.org2.org:7050 | baz_client/baz_server | baz

Figure 6 — Node ID assignments after parsing the genesis block generated by the data in Fig. 5.

During this iteration, each consenter also compares its own TLS certificates

with the ones in the slice in order to detect its own node ID. Both its own

9
In fact, until we introduce options that allow a node to implement just the AtomicBroadcast service
or just the consensus service, the ordering service is the Raft cluster, and we use the terms OSN,
consenter node, and Raft node interchangeably.

ID, and the list of IDs it collected during the slice iteration, should be

passed on to ​the Raft FSM constructor​. (See Section C.I.3 "​The Raft FSM​".)
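A sketch of this deterministic assignment, with hypothetical names throughout; serverCert would be the local OSN's own server TLS certificate, and etcdraftprotos refers to the proposed protos package.

// assignIDs derives the per-channel Raft node IDs from the consenters slice
// found in the genesis block, and detects the local node's own ID by
// comparing TLS certificates.
func assignIDs(consenters []*etcdraftprotos.Consenter, serverCert []byte) (self uint64, all []uint64) {
    for i, consenter := range consenters {
        id := uint64(i + 1) // node IDs must be non-zero, hence index + 1
        all = append(all, id)
        if bytes.Equal(consenter.ServerTlsCert, serverCert) {
            self = id // this entry describes the local node
        }
    }
    return self, all
}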

When modifying the consenter set via Fabric configuration update transactions

Going back to the example in ​Figs. 5-6​, assume that we push a Fabric

configuration update transaction that removes consenter.org2.org (node ID: 3)

from the consenter list. Later on10, we decide to add consenter.org3.org as

a new consenter to the ordering service.

Notice what happens here. If we were to keep inferring the ID of a Raft node

by its index on the consenters slice, we would assign consenter.org3.org with

ID 3. This was however the ID assigned to consenter.org2.org before we kicked

them off, and ID re-use is forbidden.

Conclusion: past bootstrapping, we can ​no longer​ infer the ID of a consenter

by its index on the consenters slice.

We therefore need to:

1. Come up with a mechanism of assigning global IDs to new, incoming nodes.

2. Persist​ the ID-to-consenter mapping to disk. As we've shown above, the

consenters list on the configuration block can only be consulted during

bootstrapping.

We propose the following:

10
(Set aside the fact that in this example the cluster would temporarily run with just 2 nodes,
which is a bad idea. This is just an example.)

1. Persist the current ID-to-consenter mapping — as well as the highest

assigned ID — in every configuration block's ​Metadata field​ (see ​ORDERER

index). This is in line with how we use this field today to write/read

the Kafka partition offsets. The value written in the metadata field

should make it clear whether this configuration added a node to the

consenter set, removed one, or left the consenter set unchanged. This

will allow the leader to follow up with the appropriate Raft configuration change, as we will see in Section C.II.5 "How do Fabric configuration updates flow throughout this system?".


2. Require that a configuration update can only introduce nodes with IDs

that are higher than the highest ID already assigned, meaning — once an ID

is removed, it is removed forever.


3. Encode logic into each consenting node that rejects Fabric configuration

update transactions modifying the consenter set by more than one node.

This restriction makes it easier for us to handle the case where the

leader crashes before committing the corresponding Raft configuration

changes, and the new leader is to identify which Raft configuration

changes are missing.
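A hedged sketch of the bookkeeping described in items 1 and 2, expressed here as a Go struct for brevity (in practice it would likely be a proto message serialized into the ORDERER slot of the block metadata); all field names are hypothetical.

// RaftMetadata is what the leader writes into every configuration block's
// metadata, mirroring how the Kafka-based service records partition offsets.
type RaftMetadata struct {
    // Consenters maps each active Raft node ID to its consenter entry, so a
    // restarting node never has to re-derive IDs from the genesis block.
    Consenters map[uint64]*etcdraftprotos.Consenter
    // NextNodeID is one greater than the highest ID ever assigned; a newly
    // added consenter must take this value, never a recycled one.
    NextNodeID uint64
    // ConfChange records whether this configuration added a consenter,
    // removed one, or left the set unchanged ("add", "remove", "none"), so
    // that a newly elected leader can tell whether a corresponding Raft
    // configuration change is still outstanding.
    ConfChange string
}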

As these changes are intrinsically connected to the way the Raft-based

ordering service processes Fabric configuration update transactions, we shall

examine them in detail, complete with an example, in Section C.II.5 "How do Fabric configuration updates flow throughout this system?".

C.II.3. How do consenter nodes exchange messages with each other?

CHANGE | COMPONENT
Allow consensus plugins to bind their own service handlers to the gRPC server used by AtomicBroadcast | Fabric
Define gRPC service for node-to-node message exchange | Raft plugin
Write associated server handler | Raft plugin
Implement transport layer that allows the Raft FSM to invoke the gRPC service and replicate messages | Raft plugin

Table 4 — Summary of proposed changes in this section

Note that we are concerned with two types of messages:

1. Fabric transactions (proposals, configuration updates): when these are

received by an OSN that is not the leader for the channel, they need to

be routed to the cluster leader (also see Section C.I.2. "​What do we

order​").
2. Raft-specific messages (​see list here​). The etcd/raft library does ​not

implement network transport; the Raft FSM simply tags outgoing messages

with the recipient's ID and it is up to the application to deliver the

messages.

From the consenter plugin side of things, we then need to:

1. Define a new gRPC service for sending the messages above.

2. Create a handler for this service, i.e. an object that satisfies the

Cluster interface.

3. Write a transport package for the plugin that gives us access to a gRPC

client that can invoke these RPCs. (This will be used by the Raft FSM.)

From the Fabric side of things, we need to:

1. Allow consenter plugins to register their own handlers with the same

gRPC server that is also used by the AtomicBroadcast service (​the

grpcServer​).

Let's expand on all of these items.

A new gRPC service

Let us refer to it as Cluster. This definition will reside in protos/orderer

in the Fabric tree. It consists of two RPCs, Submit and Step — see Fig. 6 for

a tentative definition.

service Cluster {
    rpc Submit(stream SubmitRequest) returns (stream SubmitResponse);
    rpc Step(StepRequest) returns (StepResponse);
}

Figure 6 — The two RPCs that the new Cluster gRPC service should support.

The Submit RPC —

● The clients are OSNs that (a) received a transaction from a Fabric

client, and (b) are currently operating as replicas on the Raft cluster.

They invoke this RPC to submit the incoming transactions to the cluster

leader for ordering.


● On the server side:
   i. we check to confirm that the receiver is indeed still the leader, then
   ii. when a block is cut locally on the leader, the proposed message is passed to the leader's Propose method.

The Step RPC —

● The clients are OSNs, asked by their local Raft FSM to send messages to

other nodes in the cluster. They invoke this RPC in the handling loop

described in Section C.I.3 "​The Raft FSM​".


● The server should pass the incoming message through to the ​local Step

method provided by the etcd/raft library.

Since an OSN may belong to multiple Raft clusters (see Section C.I.1 "​A Raft

cluster per channel​"), on the client side each invocation of the Cluster RPCs

should be tagged with the cluster/channel it refers to. We can do this by

either including this information in the passed-in message11, or by attaching

it as a ​gRPC metadata key-value pair​ during the RPC invocation. On the server

side, the handler for these RPCs should maintain a mapping between a

channel/cluster and the corresponding Raft FSM, so that we can pipe the Submit

(Step) RPC to the local raft.Propose (raft.Step) method.

11
In the Submit case, we can find out the cluster by extracting the envelope's header, similar
to ​how the broadcast handler does it​. For the Step RPC, we will need to create a container proto
message for raftpb.Message that includes a cluster/channel field.
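A minimal sketch of the server-side handler just described, assuming a StepRequest container proto that wraps a marshaled raftpb.Message together with a channel field; all type and field names here are assumptions.

// clusterServer implements the proposed Cluster service on an OSN.
type clusterServer struct {
    lock   sync.RWMutex
    chains map[string]*chain // one Raft FSM per channel served by this OSN
}

// Step receives a Raft protocol message from another OSN and pipes it into
// the FSM of the channel (cluster) it is addressed to.
func (s *clusterServer) Step(ctx context.Context, req *StepRequest) (*StepResponse, error) {
    s.lock.RLock()
    c, ok := s.chains[req.Channel]
    s.lock.RUnlock()
    if !ok {
        return nil, fmt.Errorf("channel %s is not served by this node", req.Channel)
    }
    var msg raftpb.Message
    if err := proto.Unmarshal(req.Payload, &msg); err != nil {
        return nil, err
    }
    // Hand the message over to the local etcd/raft Step method.
    return &StepResponse{}, c.node.Step(ctx, msg)
}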

A transport package for the Raft plugin

Fabric's core/comm package, along with the client stub that is automatically

generated by gRPC when implementing the Cluster service, should get us going

here.

Essentially, whenever we instantiate a Raft node (FSM), we should associate it

with a gRPC client. When the cluster membership changes, this client

establishes/drops connections to newly-added/recently-removed peers. When the

Raft FSM asks to send a message to another node, we tap into the gRPC client

and have it invoke the Step RPC12.

A couple of additional notes on the transport layer:

1. We need to report the status of transmissions back to the Raft FSM via

the provided ​ReportUnreachable​ and ​ReportSnapshot​ methods.


2. We will almost certainly need to give the transport layer access to a

function in the cluster membership package (see Section C.II.2 "​How do

we assign Raft node IDs to consenter nodes?​") that checks whether the

recipient ID has been removed (​example​).
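A sketch of the sending side of this transport layer; the clients map and the per-destination step helper are hypothetical wrappers around core/comm and the generated Cluster stub.

// send is invoked by the handling loop (Section C.I.3) for every outgoing
// message in a Ready batch.
func (c *chain) send(msgs []raftpb.Message) {
    for _, msg := range msgs {
        client, ok := c.clients[msg.To] // msg.To is the recipient's Raft node ID
        if !ok {
            continue // recipient has been removed from the cluster (see C.II.2)
        }
        if err := client.step(msg); err != nil {
            // Tell the FSM the recipient is unreachable so it can back off.
            c.node.ReportUnreachable(msg.To)
            if msg.Type == raftpb.MsgSnap {
                // Failed snapshot transmissions must be reported explicitly.
                c.node.ReportSnapshot(msg.To, raft.SnapshotFailure)
            }
        }
    }
}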

Allowing consenter plugins to register their own handler

Recall that every consenter plugin that is loaded by the ordering service

should ​register its Consenter constructor in a consenters map​. (This is then

passed to the ​multichannel Registrar​).

12
Be mindful of this corner case: ​https://github.com/docker/swarmkit/issues/436

It is during this stage that we should pass a reference to the ordering

server's ​grpcServer​ to the consenter plugin constructor. The consenter plugin

constructor should then use that reference to bind its ClusterServer handler

to it.

See ​Fig. 7​ for an infinitely less verbose and better explanation of the above:

import etcdraftprotos "github.com/hyperledger/fabric/protos/orderer/etcd-raft" // new
import etcdraftplugin "github.com/hyperledger/fabric/orderer/consensus/etcd-raft" // new

func Start(...) { // This gets executed when launching the orderer process
    // ...
    grpcServer := initializeGrpcServer(conf, serverConfig)
    // ...
    manager := initializeMultichannelRegistrar(conf, grpcServer, signer, tlsCallBack) // now takes grpcServer (new)
    // ...
}

func initializeMultichannelRegistrar(...) {
    // ...
    consenters := make(map[string]consensus.Consenter)
    consenters["solo"] = solo.New()
    consenters[etcdraftprotos.PluginName] = etcdraftplugin.New(grpcServer) // new
    // ...
}

Figure 7 — An example of how the orderer's server package should be modified (new and modified lines are marked with a "new" comment) so as to allow consenter plugins to bind an additional gRPC service handler to the server object. What's implied here is that the Raft plugin's New constructor will invoke RegisterClusterServer(grpcServer, customHandler).

C.II.4. What do we snapshot?

CHANGE | COMPONENT
Define application snapshot message | Raft plugin
Parse application snapshot and allow lagging nodes to sync up | Raft plugin

Table 5 — Summary of proposed changes in this section

A Raft ​Snapshot message​ consists of the following fields:

1. Data: This is a byte slice populated by the application (say, via a

getSnapshot function), containing whatever it is that the application

wishes to snapshot. In the text that follows, we shall refer to this as

the ​application snapshot.


2. Metadata: Contains Raft-specific information; namely, the latest

configuration state, and the index/term of the most recent Raft log

message that was applied when the snapshot was captured.

What do we snapshot then? Before we answer that question, it helps to remember

what snapshots are used for in Raft.

1. A snapshot is used to sync up replicas that have fallen behind

significantly. (See Section C.I.4 "​Raft data and garbage collection​".)

2. A node that is recovering from a crash failure (or is simply

restarting) typically loads the most recent snapshot and uses the

information recorded in its Metadata to identify the position at which

it needs to start replaying its WAL.

Until Fabric gets support for state DB snapshots, let us define the

application snapshot for the ordering service as shown in ​Fig. 8​.

// Snapshot is the application snapshot for the Raft-based ordering service.
message Snapshot {
    common.Block newest = 1;
}

Figure 8 — The application snapshot for the ordering service consists simply of the most recent block added to the blockchain, at the time of snapshotting.

OSNs should generate these snapshots at a user-defined pace. See the

configuration parameters we need to expose in Section C.I.4 "​Raft data and

garbage collection​".

Replicas that have fallen behind will use this snapshot information, as in the

example shown below:

1. Lagging replica R1 just got reconnected to the network. Its latest block

is 100.
2. Leader L is at block 196, and is configured to snapshot every 20 blocks.

3. R1 will receive a snapshot with newest set to 180. It should then issue

a Deliver request to L, asking for blocks 101 to 180.

4. Once R1 has received block 180, L should invoke the ReportSnapshot

method on its transport layer with R1's ID and a ​SnapshotFinish status​,

signaling the successful application of the snapshot. (See notes on

transport layer in Section C.II.3 "How do consenter nodes exchange messages

with each other?​".)


5. R1 will then receive the rest of the blocks it needs to catch up

(181-196) via Raft's built-in syncing mechanism (L sending blocks to R1

via the entries temporarily held in the Raft Storage implementation —

again, see Section C.I.4 "​Raft data and garbage collection​").
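A sketch of the replica-side handling in steps 3-4, assuming the Snapshot message of Fig. 8 and a hypothetical pullBlocks helper that wraps a Deliver client.

// catchUpFromSnapshot is called when the handling loop sees a non-empty
// Snapshot in a Ready batch.
func (c *chain) catchUpFromSnapshot(snap raftpb.Snapshot) error {
    appSnap := &Snapshot{} // the application snapshot of Fig. 8
    if err := proto.Unmarshal(snap.Data, appSnap); err != nil {
        return err
    }
    target := appSnap.Newest.Header.Number // e.g. block 180 in the example above
    next := c.support.Height()             // first block this OSN is missing, e.g. 101
    if target < next {
        return nil // the ledger already covers everything the snapshot does
    }
    // Fetch blocks [next, target] via Deliver and append them to the ledger.
    return c.pullBlocks(next, target)
}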

We might want to abstract this 'receive snapshot message' - 'issue appropriate Deliver request based on snapshot content' - 'signal successful reception of missing block range' sequence behind a lightweight gap-filling service that any consensus implementation can tap into13.

13
Once the peer state database supports pruning, the snapshot message could be extended with additional fields (oldest, oldest_config, newest_config) to support range syncing with arbitrary snapshot and prune points.

C.II.5. How do Fabric configuration updates flow throughout this system?

CHANGE | COMPONENT
Remove tight coupling between CreateNextBlock and WriteNextBlock in blockwriter to allow pipelining of block proposals | Fabric
Prevent ordering of normal blocks while configuration block is in flight | Raft plugin
Prevent Fabric configuration changes that modify the consenter set by more than one node at a time | Raft plugin
Translate valid Fabric configuration changes affecting the consenter set into corresponding Raft configuration messages | Raft plugin
Force newly-assigned leader to inspect tip of committed log and issue Raft configuration message if Fabric configuration update hasn't been translated into associated Raft configuration message already | Raft plugin

Table 6 — Summary of proposed changes in this section

A clarification on the terms used:

1. By Fabric configuration updates, we refer to any Fabric transaction that

modifies the configuration of a channel (see ​Fabric guide on how to

create such transactions​).


2. By Raft configuration changes, we refer to ​any Raft configuration

message​ that should be passed to (and committed by) the Raft FSM to

add/remove a node to/from the Raft cluster.

We distinguish between two types of Fabric configuration updates:

A. Those that leave the consenters slice (see Section C.II.2 "​How do we

assign Raft node IDs to consenter nodes?​") untouched, for which the

standard validation rules apply, and


B. Those that do modify the consenters slice, ​because these need to be

translated to Raft configuration changes​. For these transactions, an

additional validation rule should be added: any transaction that

proposes to modify the consenter set by more than one node (i.e.

add/remove two or more nodes) should be rejected.

We are looking then at the following logic that needs to be implemented at the

consenter plugin:

1. Detection of configuration updates that modify the consenters set,

2. Rejection of those updates that modify the set by more than 1 node (this

should be invoked during ​the Configure method​ execution), and


3. Translation of those updates that modify the set by exactly 1 node into

the respective Raft configuration change messages.
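A sketch of the translation in item 3, assuming a consenterDiff helper that compares the old and new consenter sets along with the ID bookkeeping of Section C.II.2; the resulting message would be handed to the FSM's ProposeConfChange method once the Fabric configuration block itself has been ordered.

// confChangeFromUpdate maps a Type B configuration update to the matching
// Raft configuration change. It assumes the update already passed the
// "at most one node changed" validation rule.
func (c *chain) confChangeFromUpdate(oldSet, newSet []*etcdraftprotos.Consenter) (raftpb.ConfChange, bool) {
    added, removed := consenterDiff(oldSet, newSet)
    switch {
    case len(added) == 1:
        return raftpb.ConfChange{
            Type:   raftpb.ConfChangeAddNode,
            NodeID: c.nextNodeID(), // strictly higher than any ID ever assigned
        }, true
    case len(removed) == 1:
        return raftpb.ConfChange{
            Type:   raftpb.ConfChangeRemoveNode,
            NodeID: c.nodeID(removed[0]), // this ID is now retired forever
        }, true
    default:
        return raftpb.ConfChange{}, false // Type A update: nothing to translate
    }
}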

With that in mind, let's see how the two types of Fabric configuration updates

should flow throughout the system:

Type A

On the Fabric side of things, the broadcast handling part remains unchanged:

1. The ​Broadcast handler​ filters, authorizes, and validates the incoming

CONFIG_UPDATE, transforming it into a CONFIGURATION transaction.


2. The handler then ​passes​ the CONFIGURATION transaction to the consensus

plugin's ​Configure method​.

On the consenter plugin side, we follow the standard path for processing

configuration update transactions, minimally adapted for Raft. Namely:

1. If the OSN receiving the transaction ​is the Raft cluster leader​:

a. It cuts the current batch in memory (if any), creates the

corresponding (normal) block via ​CreateNextBlock​, then passes this

block to the local ​raft.Propose method​ (which will in turn feed

the block into the Raft FSM).


b. It then creates a follow-up block containing the configuration

transaction, and again passes it to raft.Propose locally.


2. If the ingress OSN is not the current leader, it forwards the

CONFIGURATION transaction to the leader via the new Submit RPC (see

Section C.II.3 "​How do consenter nodes exchange messages with each

other?​").

Type B

The broadcast handler part remains the same as in Type A.

When the transaction eventually reaches the consenter's Configure method:

1. Regardless of whether the ingress OSN is the leader or not, the

transaction should be rejected if the transaction modifies the

consenters slice by more than one node. Then:


2. If the ingress OSN is a follower, it should forward the transaction to

the leader (via the new Submit RPC).


3. If the ingress OSN is the leader (or if the transaction reaches the

leader via a follower replica), the leader should act as in Type A above

(cut and order the current batch, then order the proposed Fabric

configuration), and it should follow this up by ordering the

corresponding Raft configuration change.

N.B. When a configuration change is pending ordering, the leader should ​not

attempt to order any other block. This implies that the leader has two queues:

one for configuration, and one for non-configuration transactions. The former

takes priority and ordering happens in a blocking manner. Any blocks

originating from the latter queue can be ordered in a pipelined fashion14.

Whenever a Fabric configuration block is created, the leader should add to that

block's metadata the Raft node IDs of the cluster, as well as the highest

assigned Raft node ID, per the guidelines given in Section C.II.2 "​How do we

assign Raft node IDs to consenter nodes?​".

Note that in the case of Type B transactions, we encode the ID of the

newly-added/removed Raft node into the block's metadata, ​before​ we have

actually ordered and committed the actual Raft configuration change. This can

become problematic if the current leader crashes, ​before​ the Raft

configuration change is ordered. We may then have a blockchain that, for

example, expects consenter.org3.org:7050 to be part of the cluster, but this

node will actually never be referenced by any Raft FSM in the cluster. We

prevent this divergence by ensuring that whenever there is a change of

leadership, the leader checks the most recent committed entry in its log; if

it corresponds to a Type B Fabric configuration block, the new leader creates

14
For the leader to be able to order blocks before they're appended to the ledger, we need to change
the way the blockwriter works in Fabric today. Specifically, it is not possible to create a second
new block in a row (read: invoke ​CreateNextBlock​ twice), unless the first new block has been written
to the ledger (see: ​WriteBlock method)​. We should remove this tight coupling between the
CreateNextBlock and WriteBlock methods.

and orders the corresponding Raft configuration change, before ordering any

other block in the cluster15.

15
This means that the leader actually maintains three queues locally, listed in ascending priority:
one for normal blocks, one for configuration blocks, and one for Raft configuration messages.
