Contents

A. Summary
B. Background and motivation
C. Proposed design
    When bootstrapping
    When modifying the consenter set via Fabric configuration update transactions
    Type A
    Type B
A. Summary
For those who interact with Fabric networks (i.e. those who operate clients or
peers), the switch to Raft carries one notable change in semantics. So far, a
transaction that was successfully broadcast was guaranteed to be ordered (i.e.
to make it past the consensus/ordering stage). This will no longer be the case:
the Raft cluster leader may crash before committing the entry into a block.
See Fig. 1 for the obligatory diagram. The cloud in the middle shows one such
Raft-based ordering service, and the arrows capture how clients and peers
interact with it.

Figure 1 — Left: a client that uses the Fabric client SDK to invoke the Broadcast RPC on
the ordering service and have the proposal submitted for ordering. Center: the
Raft-based ordering service, consisting of 3 OSNs. Each of these nodes implements a Raft
finite state machine (Raft FSM). Collectively, these nodes form a Raft cluster,
exchanging protocol-specific messages with each other. OSN-1 is the current leader of
that cluster. Right: a Fabric Peer invokes the Deliver RPC on the ordering service in
order to receive an ordered stream of blocks. Notice how neither the client, nor the
peer need to know who the leader is in order to have their requests served.
B. Background and motivation
Going straight to a BFT ordering service would have us debugging two
moving parts at the same time: (i) writing the BFT library, and (ii) fixing
the Fabric interfaces that wrap around it. This seemed like an unwise
endeavor. Raft, by contrast, is a simpler and well-understood class of
protocols. In that sense, the Raft experiment helps us lay the groundwork for
BFT: it lets us introduce interfaces and changes to the
implementation that only make sense in a BFT context, even though Raft is a
crash fault tolerant protocol. Finally, etcd/raft is a
polished, and actively maintained Raft implementation, which has been tested[1]
extensively.

[1] See, for instance: fault injections and integration testing with etcd as part of GKE.
C. Proposed design
Everything described from this point on constitutes proposed changes.
Past this point, the reader is expected to have a certain level of familiarity
with the Raft protocol and with the existing Fabric ordering service codebase.

When we talk about the "consenter plugin", we refer to the package that
implements the proposed consensus mechanism, and that we will need to develop.
This should (a) use the etcd/raft library and (b) satisfy the Consenter and
Chain interfaces defined in the orderer/consensus package of the Fabric
project tree.
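
For reference, here is an abridged sketch of those two interfaces, roughly as
they appear in the orderer/consensus package (consult the project tree for the
authoritative definitions; the ConsenterSupport type, which exposes the
blockcutter, the block writer, and the shared configuration, is elided):

// Abridged sketch of Fabric's orderer/consensus interfaces.
type Consenter interface {
    // HandleChain returns a Chain for the given channel. For the Raft
    // plugin, this is where the per-channel Raft machinery is wired up.
    HandleChain(support ConsenterSupport, metadata *cb.Metadata) (Chain, error)
}

type Chain interface {
    Order(env *cb.Envelope, configSeq uint64) error         // normal transactions
    Configure(config *cb.Envelope, configSeq uint64) error  // config transactions
    Errored() <-chan struct{} // closed when the chain becomes unavailable
    Start()                   // allocates resources; see "The Raft FSM" below
    Halt()                    // releases them
}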
A Raft cluster per channel

Every channel will run on a separate instance of the Raft protocol. Put
another way: a network that serves N channels means we have N Raft clusters,
each with their own leader. When the clusters are spread across different
sets of nodes, this further
decentralizes the service. Even when these clusters are spread across the same
nodes however (as we expect the current, default case to be)[2], this gives us
a separate leader per channel. And as noted elsewhere in
the document, we are laying the groundwork for BFT, and wish to move as few
parts as possible when that transition happens.

This might be obvious, but since it goes against the way the existing
consenter plugins work, it is worth spelling out: we use
the Raft protocol to order blocks, not individual transactions. Incoming
transactions are automatically routed by the ingress OSNs to the OSN that
serves as the current leader for the channel. (If the ingress OSN is the
current leader, no forwarding takes place.)
[2] This is the problem that led CockroachDB to multiraft. multiraft was unfortunately dismantled a
few months after its introduction, due to tight coupling with the project's storage engine.
Then, as is the case with existing consenter plugins:
1. The transaction is relayed to the consenter plugin via the chain's Order
   (or Configure) call.
2. The leader checks that the configuration sequence number at which this
   transaction was validated matches the current configuration sequence
   number. (It revalidates the transaction if that is not the case, and
   discards it if the revalidation fails — see the sketch after this list.)
3. The leader packs[3] the validated transactions into a block, and hands it
   to its local Raft FSM for ordering and replication to the
   receiving replicas.
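
For illustration, the revalidation step might look as follows inside the
leader's main loop (a sketch modeled on what the existing solo and kafka
chains do; support is the chain's ConsenterSupport):

// Sketch: revalidate a normal transaction if the channel configuration
// has advanced since the transaction was first validated.
if configSeq < support.Sequence() {
    if _, err := support.ProcessNormalMsg(env); err != nil {
        logger.Warningf("discarding invalid transaction: %s", err)
        continue // drop it and move on to the next message
    }
}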
The Raft FSM

At the heart of each consenter process run by an OSN, we need a Raft Finite
State Machine (FSM). This will most likely be initiated during the Start call
on the chain object.[4]
[3] Of course, as is the case with existing implementations, if the incoming transaction is of type
CONFIGURATION, the leader should cut a block with the batch of transactions (if any) it currently has
in buffer, then cut a block only with that CONFIGURATION transaction. Consult the existing
implementations for an example.
[4] This object is returned by the raft consenter plugin's HandleChain method.
We interact with this FSM in three ways:
1. Via a handling (infinite) loop, where we make sure to (a) increment its
   internal logical clock, and (b) read from it the data that we need to save,
   transmit, or apply.
2. Via its Propose call, through which we submit the data that
   the application wishes to have ordered (in our case —as noted above—
   blocks).
3. Via its Step call, through which we feed it the Raft messages received
   from the other nodes in the cluster.
To better understand the details of etcd/raft, the README file of the etcd/raft
package is a good starting point.
We will see how each of these three ways should be used or customized for our
purposes throughout this section.
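
As a rough sketch (assuming a raft.Node named n, a 100ms tick, and helper
functions for persistence, transport, and block delivery), the three
interaction points look like this:

ticker := time.NewTicker(100 * time.Millisecond)
for {
    select {
    case <-ticker.C:
        n.Tick() // (1a) advance the FSM's logical clock
    case rd := <-n.Ready():
        saveToStorage(rd.HardState, rd.Entries, rd.Snapshot) // (1b) persist first...
        send(rd.Messages)                                    // ...then transmit
        for _, entry := range rd.CommittedEntries {
            apply(entry) // committed entries are our ordered blocks
        }
        n.Advance()
    }
}

// (2) Submitting data for ordering (leader-only in our design):
//     n.Propose(ctx, blockBytes)
// (3) Feeding in messages received from other cluster nodes:
//     n.Step(ctx, raftMsg)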
The FSM needs a storage layer that can:
1. Continuously write Raft-related data (incl. uncommitted entries and the
   FSM's hard state) to persistent storage.
2. Persist the snapshots that the FSM produces.
3. Provide the latest snapshot and the most recently committed (ordered)
   entries back to the FSM on demand.

The etcd project ships packages that cover these needs, respectively:
1. coreos/etcd/wal
2. coreos/etcd/etcdserver/api/snap
3. MemoryStorage in coreos/etcd/raft[5]

These artifacts come into play as follows:
1. WAL files and snapshots are used when a node is recovering from a crash
   fault.
2. Snapshots and entries from the Storage implementation are used to sync
   up lagging replicas. The norm is for the storage to hold the entries
   recorded after the latest snapshot.
[5] We can develop the Raft-based ordering service using the default MemoryStorage, but we will need to
move to a disk-backed implementation for production. Entries in our case are blocks (which occupy a
non-trivial amount of space), and with an OSN serving multiple channels, we're looking at a very
limited number of entries per storage if all of them are to be held in memory.
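
A sketch of how these three packages might be wired together when a node
starts up (error handling and nil-snapshot checks elided; directory names are
illustrative; imports: coreos/etcd/{raft,wal,wal/walpb,etcdserver/api/snap}):

snapshotter := snap.New(snapDir)
snapshot, _ := snapshotter.Load() // most recent snapshot, if any

w, _ := wal.Open(walDir, walpb.Snapshot{
    Index: snapshot.Metadata.Index, // replay only entries past the snapshot
    Term:  snapshot.Metadata.Term,
})
_, hardState, entries, _ := w.ReadAll()

storage := raft.NewMemoryStorage() // the default, memory-backed Storage
storage.ApplySnapshot(*snapshot)
storage.SetHardState(hardState)
storage.Append(entries) // the FSM can now be (re)started from this state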
To garbage-collect, we need to:
1. Delete WAL files containing data older than the oldest snapshot (PR 1810
   has relevant references).
2. Compact the storage periodically (this needs to be done for the default
   memory-backed implementation); see the sketch below.
We will cover the first item in the "Snapshotting and garbage collection"
section below.
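
For the second item, a sketch of periodic compaction against the memory-backed
storage (snapIndex is the index of the latest snapshot; the catch-up margin is
an assumption, kept so that slightly-lagging replicas can still be served from
the log):

const catchUpEntries = 4 // illustrative safety margin for slow followers

if snapIndex > catchUpEntries {
    compactIndex := snapIndex - catchUpEntries
    if err := storage.Compact(compactIndex); err != nil && err != raft.ErrCompacted {
        logger.Panicf("unexpected compaction error: %s", err)
    }
}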
Encoding orderer endpoints in the configuration

CHANGE: Allow plugins to register their own configuration metadata translation
COMPONENT: configtxgen (encoder package)

A consensus protocol may require
additional metadata other than just the addresses and the TLS certificates of
the other nodes in the cluster. (For example: voting weights for each node.)
Additionally, down the road[7], we want to be able to have nodes that (a) simply
participate in consensus, (b) others that only receive and relay
transactions, and (c) others that do both. We lay the groundwork for this kind
of separation now.

[7] N.B. To begin with, all OSNs will both implement the AtomicBroadcast service and participate in
consensus, with no option to disable either function. Fabric clients/peers will be able to connect to
the OSNs the same way as before, by decoding the /Channel/Values[OrdererAddresses] list in a
channel's configuration. Down the road, enabling/disabling a role will be possible via a
configuration option, and the endpoints of all OSNs will be moved to their ordering service org
groups, as per FAB-7559.
The consenter set consists of those ordering service nodes that actively
participate in the consensus protocol for a channel. If we encode them as
plain endpoint strings (a host:port list, for
example), then we take a hit on extensibility. (Consider for instance the case
where we wish to assign a different weight to each node in the replica set, as
noted above.) We therefore opt for the richer message shown in Fig. 2:
message Consenter {
    string host = 1;
    uint32 port = 2;
    bytes client_tls_cert = 3;
    bytes server_tls_cert = 4;
}

Figure 2 — A spartan definition for a consenter node using the protocol buffers syntax.
We shall have this information stored in the channel configuration, which
means we need a way to
encode this data into the channel configuration and extract it from it.
To that effect, these are the changes we need to implement on the Fabric side
of things:
1. Extend the Orderer configuration interface in the channelconfig package
   with a ConsensusMetadata getter.
These are the changes that a consensus plugin author should implement: (We
describe them here in terms of the Raft
implementation.)
1. Define the
   metadata that the Raft cluster needs in order to operate. For Raft, with
   the Consenter message of Fig. 2 as the building block, this takes the form
   shown in Figs. 3-4, so
   that the configtxgen tool can generate a genesis block for a Raft-based
   ordering service.
2. Introduce a package (e.g. "etcd-raft") under protos/orderer, that (a)
   carries these message definitions, and (b) exposes the translation logic
   called by the encoder package when the Orderer config group is created.
14
// Metadata is serialized to a configuration block's
// consensus metadata value.
message Metadata {
    repeated Consenter consenters = 1;
}

Figure 3 — Each consensus implementation should define their own Metadata message.
message Consenter {
    string host = 1;
    uint32 port = 2;
    bytes client_tls_cert = 3;
    bytes server_tls_cert = 4;
    string msp_id = 5;
}

Figure 4 — For the Raft implementation, the metadata is a slice (array) containing the
connection info and the owning MSP info for each of the cluster's Raft nodes
(consenters).
etcd-raft:
    Consenters:
        - Host: consenter.org0
          Port: 7050
          ClientTLSCert: /path/to/TLS/clientCert0
          ServerTLSCert: /path/to/TLS/serverCert0
          MSPID: org0
        - Host: consenter.org1
          Port: 7050
          ClientTLSCert: /path/to/TLS/cert1
          ServerTLSCert: /path/to/TLS/serverCert1
          MSPID: org1
        - Host: consenter.org2
          Port: 7050
          ClientTLSCert: /path/to/TLS/cert2
          ServerTLSCert: /path/to/TLS/serverCert2
          MSPID: org2

Figure 5 — This section in configtx.yaml captures the initial state of the Raft cluster,
and is read by the configtxgen tool when creating the genesis block.
The only entities concerned with consenter nodes are the ordering service
nodes. Those that only respond to Broadcast/Deliver RPCs need it to relay the
incoming transactions to the node that can order them.
Following the proposed implementation above, all of the OSNs will have access
to this metadata through the Orderer configuration
interface via the SharedConfig getter, which is itself accessible via the
ConsenterSupport object.
How do we assign Raft node IDs to consenter nodes?

CHANGE: Extract Raft IDs from consenters slice of genesis block
COMPONENT: Raft plugin
Each OSN that's part of a Raft cluster needs to be given a unique ID. From the
etcd/raft README: "[...]
used only once even if the old node has been removed. [...] Node IDs must be
non-zero."
These IDs matter throughout the library. For example, the Raft FSM of an OSN
tags outgoing messages with node IDs, and it falls on the transport layer[8]
to map these IDs to endpoints.
There are two cases where we concern ourselves with node ID assignments:
[8] Which we will need to implement — see notes on transport layer in Section C.II.3 "How do cluster
nodes exchange messages with each other?"
When bootstrapping
When we bootstrap the ordering service, we are also bootstrapping the Raft
cluster[9]. It is essential that all Raft nodes parse the list of consenters in
the genesis block (see notes on consenter set in Section C.II.1 "Encoding
orderer endpoints in the configuration"), and assign to each of them the same
ID. Since we contained that list in a slice (see Fig. 3), every node can
iterate over it in the same order, and therefore derive the exact same ID
assignments.
Given the example in Fig. 5, each of the three nodes that are part of the
bootstrapping cluster should come up with the assignments shown in Fig. 6:
Figure 6 — Node ID assignments after parsing the genesis block generated by the data in
Fig. 5.
During this iteration, each consenter also compares its own TLS certificates
with the ones in the slice in order to detect its own node ID. Both its own
ID, and the list of IDs it collected during the slice iteration, should be
passed on to the Raft FSM constructor. (See Section C.I.3 "The Raft FSM".)

[9] In fact, until we introduce options that allow a node to implement just the AtomicBroadcast service
or just the consensus service, the ordering service is the Raft cluster, and we use the terms OSN,
consenter node, and Raft node interchangeably.
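
A sketch of this iteration (names are illustrative; the Consenter type is the
generated Go counterpart of the message in Fig. 4):

// assignIDs derives the same ID mapping on every node, and detects our own
// ID by comparing TLS certificates against the consenters slice.
func assignIDs(consenters []*Consenter, ownServerCert []byte) (self uint64, peers []uint64) {
    for i, c := range consenters {
        id := uint64(i + 1) // Raft node IDs must be non-zero
        peers = append(peers, id)
        if bytes.Equal(c.ServerTlsCert, ownServerCert) {
            self = id // this slice entry describes us
        }
    }
    return self, peers
}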
When modifying the consenter set via Fabric configuration update transactions
Going back to the example in Figs. 5-6, assume that we push a Fabric
configuration update transaction that removes one of the consenters[10],
followed later by one that adds a new consenter to the set.
Notice what happens here. If we were to keep inferring the ID of a Raft node
from its position in the consenters slice, the newcomer could be assigned an
ID that has already been used, violating the rule quoted above. Inferring IDs
from the slice is therefore only safe when
bootstrapping.

[10] (Set aside the fact that in this example the cluster would temporarily run with just 2 nodes,
which is a bad idea. This is just an example.)
To deal with this, we will:
1. Persist the current ID-to-consenter mapping — as well as the highest
   ID assigned so far — in the metadata of every configuration block (under
   the ORDERER metadata index). This is in line with how we use this field today to write/read
   the Kafka partition offsets. The value written in the metadata field
   also encodes whether the latest configuration update added a node to the
   consenter set, removed one, or left the consenter set unchanged. (A sketch
   follows this list.) This bookkeeping is kept simple because:
2. We reject Fabric configuration
   update transactions modifying the consenter set by more than one node.
   This restriction makes it easier for us to handle the case where the
   leader fails partway through a membership change (see the Type B
   discussion at the end of this document).
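
As an illustration, persisting this bookkeeping could look as follows (the
RaftMetadata message and its field names are hypothetical;
BlockMetadataIndex_ORDERER is the same slot the Kafka plugin uses for its
offsets):

// Sketch: stash the ID bookkeeping in the ORDERER slot of the block metadata.
raftMD := &RaftMetadata{ // hypothetical message defined by the plugin
    ConsenterIds: ids,         // current ID-to-consenter mapping
    NextRaftId:   highest + 1, // highest ID assigned so far, plus one
}
block.Metadata.Metadata[cb.BlockMetadataIndex_ORDERER] = utils.MarshalOrPanic(
    &cb.Metadata{Value: utils.MarshalOrPanic(raftMD)},
)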
How do cluster nodes exchange messages with each other?

CHANGE: Allow consensus plugins to bind their own gRPC services to the server
used by AtomicBroadcast
COMPONENT: orderer (server package)

Cluster nodes exchange two types of messages:
1. Fabric transactions. When these are
   received by an OSN that is not the leader for the channel, they need to
   be relayed to the leader (see the earlier notes on "we use the Raft
   protocol to order blocks").
2. Raft-specific messages (see the message types listed in etcd's raftpb
   package). The etcd/raft library does not
   implement network transport; the Raft FSM simply tags outgoing messages
   with the recipient's node ID, and leaves it to the application to deliver
   these messages.

To support both, we need to:
1. Define a new gRPC service — call it Cluster — for intra-cluster
   communication.
2. Create a handler for this service, i.e. an object that satisfies the
Cluster interface.
3. Write a transport package for the plugin that gives us access to a gRPC
client that can invoke these RPCs. (This will be used by the Raft FSM.)
On the Fabric side of things, we need to:
1. Allow consenter plugins to register their own handlers with the same gRPC
   server used by the AtomicBroadcast service (i.e. give plugins access to
   that grpcServer object).
The Cluster service will be defined under protos/orderer
in the Fabric tree. It consists of two RPCs, Propose and Step — see Fig. 7 for
a tentative definition.
service Cluster {
    rpc Propose(common.Envelope) returns (ProposeResponse); // response type is illustrative
    rpc Step(StepRequest) returns (StepResponse); // StepRequest wraps a raftpb.Message, see footnote 11
}

Figure 7 — The two RPCs that the new Cluster gRPC service should support.
● Propose: The clients are OSNs that (a) received a transaction from a Fabric
  client, and (b) are currently operating as replicas on the Raft cluster.
  They invoke this RPC to submit the incoming transactions to the cluster
  leader. On the leader's side:
  i. incoming transactions are placed into the channel's blockcutter, and
  then
  ii. When a block is cut locally on the leader, pass the proposed
  block to the local Raft FSM for ordering.
● Step: The clients are OSNs, asked by their local Raft FSM to send messages to
  other nodes in the cluster. They invoke this RPC in the handling loop,
  whenever the FSM hands them outgoing messages.
Since an OSN may belong to multiple Raft clusters (see Section C.I.1 "A Raft
cluster per channel"), on the client side each invocation of the Cluster RPCs
needs to identify the channel it refers to. We can pass
it as a gRPC metadata key-value pair during the RPC invocation. On the server
side, the handler for these RPCs should maintain a mapping between a
channel/cluster and the corresponding Raft FSM, so that we can pipe the Propose
and Step requests to the right instance.[11]
[11] In the Propose case, we can find out the cluster by extracting the envelope's header, similar
to how the broadcast handler does it. For the Step RPC, we will need to create a container proto
message for raftpb.Message that includes a cluster/channel field.
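
On the server side, the demultiplexing could be sketched like so (StepRequest
is the hypothetical container message from the footnote above, with a Channel
field and a Payload field carrying the raftpb.Message):

type clusterHandler struct {
    mu   sync.RWMutex
    fsms map[string]raft.Node // channel name -> that channel's Raft FSM
}

func (h *clusterHandler) Step(ctx context.Context, req *StepRequest) (*StepResponse, error) {
    h.mu.RLock()
    n, ok := h.fsms[req.Channel] // the container proto carries the channel
    h.mu.RUnlock()
    if !ok {
        return nil, fmt.Errorf("channel %s is not served here", req.Channel)
    }
    return &StepResponse{}, n.Step(ctx, *req.Payload) // pipe into the right FSM
}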
A transport package for the Raft plugin
Fabric's core/comm package, along with the client stub that is automatically
generated by gRPC when implementing the Cluster service, should get us going
here. For every other node in the cluster, the transport layer pairs a connection
with a gRPC client. When the cluster membership changes, this client
pool should be updated accordingly. Whenever the
Raft FSM asks to send a message to another node, we tap into the gRPC client
for that node and invoke the Step RPC. Two details deserve attention:
1. We need to report the status of transmissions back to the Raft FSM via
   its ReportUnreachable and ReportSnapshot methods (see the sketch after
   this list).
2. We need a check (building on the mapping from Section C.II.2 "How do
   we assign Raft node IDs to consenter nodes?") that checks whether the
   destination ID still maps to an active consenter before dialing it.[12]
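
A sketch of both concerns in the send path (all names are illustrative; wrap
stands in for the container-message construction discussed earlier):

func (t *transport) send(msgs []raftpb.Message) {
    for _, m := range msgs {
        client, ok := t.clients[m.To] // Raft node ID -> gRPC client (item 2's check)
        if !ok {
            t.node.ReportUnreachable(m.To) // not an active consenter anymore
            continue
        }
        if _, err := client.Step(context.TODO(), t.wrap(m)); err != nil {
            t.node.ReportUnreachable(m.To) // item 1: tell the FSM about failures
            if m.Type == raftpb.MsgSnap {
                t.node.ReportSnapshot(m.To, raft.SnapshotFailure)
            }
        }
    }
}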
Recall that every consenter plugin that is loaded by the ordering service
is initialized with a constructor call when the orderer process starts.
It is during this stage that we should pass a reference to the ordering
service's gRPC server to the Raft plugin. The plugin's
constructor should then use that reference to bind its ClusterServer handler
to it.

[12] Be mindful of this corner case: https://github.com/docker/swarmkit/issues/436
See Fig. 8 for an infinitely less verbose and better explanation of the above:

func Start(...) { // This gets executed when launching the orderer process
    ...
    initializeMultichannelRegistrar(..., grpcServer, ...)
}

func initializeMultichannelRegistrar(...) {
    consenters := make(map[string]consensus.Consenter)
    consenters["solo"] = solo.New()
    consenters[etcdraftprotos.PluginName] = etcdraftplugin.New(grpcServer)
    ...
}

Figure 8 — An example of how the orderer's server package should be modified so that the
Raft plugin gets a reference to the orderer's gRPC
server object. What's implied here is that the raftplugin's New constructor will invoke
RegisterClusterServer(grpcServer, customHandler).
Snapshotting and garbage collection

A Raft snapshot captures the application state, the
configuration state, and the index/term of the most recent Raft log
entry it covers. Snapshots serve two purposes:
1. They allow the Raft log (and the WAL files) to be garbage-collected:
   everything older than the snapshot can be deleted.
2. A node that is recovering from a crash failure (or is simply
   restarting), typically loads the most recent snapshot, and uses the
   committed entries that follow it to catch up.
Until Fabric gets support for state DB snapshots, let us define the
application snapshot as simply the most recently committed block[13]:

message Snapshot { // the name of the wrapping message is illustrative
    commonprotos.Block newest = 1;
}

Figure 9 — The application snapshot for the ordering service consists simply of the most
recently committed block. For the deletion of data made obsolete by a snapshot,
see the notes above on garbage collection.
Replicas that have fallen behind will use this snapshot information, as in the
following example:
1. Lagging replica R1 just got reconnected to the network. Its latest block
   is 100.
2. Leader L is at block 196, and is configured to snapshot every 20 blocks.
3. R1 will receive a snapshot with newest set to 180. It should then issue
   Deliver requests to fetch blocks 101 through 180 from the other OSNs, as
   shown in the sketch after this list.
4. Once R1 has received block 180, L should invoke the ReportSnapshot
   method on its local FSM, to mark the snapshot transfer as complete.
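
From R1's point of view, the catch-up could be sketched as follows (pullBlock
stands in for a Deliver-based fetch from another OSN; the Snapshot message is
the one from Fig. 9):

func (c *chain) catchUp(raftSnap raftpb.Snapshot) error {
    appSnap := &Snapshot{} // the application snapshot of Fig. 9
    if err := proto.Unmarshal(raftSnap.Data, appSnap); err != nil {
        return err
    }
    target := appSnap.Newest.Header.Number // 180 in the example above
    for seq := c.lastBlock().Header.Number + 1; seq <= target; seq++ {
        c.writeBlock(c.pullBlock(seq)) // fetch blocks 101..180 via Deliver
    }
    return nil
}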
CHANGE: Decouple the creation of new block proposals from the writing of
blocks to the ledger
COMPONENT: blockwriter
[13] Once the peer state database supports pruning, the snapshot message could be extended with
additional fields (oldest, oldest_config, newest_config) to support range syncing with arbitrary
snapshot and prune points.
Prevent Fabric configuration changes from conflicting with Raft configuration
messages

A Raft configuration change is encoded in a dedicated
message that should be passed to (and committed by) the Raft FSM to
take effect.
Fabric configuration updates fall into two types:
A. Those that leave the consenters slice (see Section C.II.2 "How do we
   assign Raft node IDs to consenter nodes?") untouched, for which the
   existing processing path suffices.
B. Those that modify the consenters slice. Per the restriction above, we
   reject any update that
   proposes to modify the consenter set by more than one node (i.e.
   adding or removing two or more consenters in a single transaction).
We are looking then at the following logic that needs to be implemented at the
consenter plugin:
1. Detection of whether a configuration update modifies the consenter set.
2. Rejection of those updates that modify the set by more than 1 node (this
   is the restriction introduced in Section C.II.2; see the sketch below).
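
A sketch of this validation (diffConsenters is a hypothetical helper that
computes the symmetric difference of the two consenter sets):

func validateConsenterUpdate(current, updated []*Consenter) error {
    added, removed := diffConsenters(current, updated)
    if len(added)+len(removed) > 1 {
        return errors.New("a config update may change the consenter set by at most one node")
    }
    return nil // Type A (unchanged set), or a legal Type B (single-node change)
}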
With that in mind, let's see how the two types of Fabric configuration updates
are handled.

Type A

On the Fabric side of things, the broadcast handling part remains unchanged.
On the consenter plugin side, we follow the standard path for processing
configuration transactions:
1. If the OSN receiving the transaction is the Raft cluster leader: it cuts
   a block with the batch of transactions (if any) it currently has in
   buffer, then cuts a block with just that CONFIGURATION transaction, and
   orders both via its Raft FSM.
2. If it is not the leader: it relays the
   CONFIGURATION transaction to the leader via the new Propose RPC (see
   Section C.II.3 "How do cluster nodes exchange messages with each
   other?").
Type B

When the proposed update reaches the leader (whether directly, or relayed to the
leader via a follower replica), the leader should act as in Type A above
(cut and order the current batch, then order the proposed Fabric
configuration)[14], and it should follow this up by ordering the
corresponding Raft configuration change message.

N.B. When a configuration change is pending ordering, the leader should not
attempt to order any other block. This implies that the leader has two queues:
one for configuration, and one for non-configuration transactions. The former
takes precedence.

Whenever a Fabric configuration block is created, the leader should add to that
block's metadata the Raft node IDs of the cluster, as well as the highest
assigned Raft node ID, per the guidelines given in Section C.II.2 "How do we
assign Raft node IDs to consenter nodes?". Note that a leader may crash after
ordering a Fabric configuration block, but before it has
actually ordered and committed the actual Raft configuration change. This can
mean, for example, that a newly added
node will actually never be referenced by any Raft FSM in the cluster. We
guard against this as follows: upon assuming
leadership, the leader checks the most recent committed entry in its log; if
it corresponds to a Fabric configuration block whose matching Raft
configuration change has not been committed, the new leader creates
and orders the corresponding Raft configuration change, before ordering any
other block.[15]
[14] For the leader to be able to order blocks before they're appended to the ledger, we need to change
the way the blockwriter works in Fabric today. Specifically, it is not possible to create a second
new block in a row (read: invoke CreateNextBlock twice), unless the first new block has been written
to the ledger (see: WriteBlock method). We should remove this tight coupling between the
CreateNextBlock and WriteBlock methods.
[15] This means that the leader actually maintains three queues locally, listed in ascending priority:
one for normal blocks, one for configuration blocks, and one for Raft configuration messages.