
Journal of Information Security and Applications 59 (2021) 102821


A novel Byzantine fault tolerance consensus for Green IoT with intelligence based on reinforcement

Peng Chen (first author) a,1, Dezhi Han (first author) a,1, Tien-Hsiung Weng b, Kuan-Ching Li b,∗, Arcangelo Castiglione c

a Department of Computer Science and Technology, Shanghai Maritime University, Shanghai 200120, China
b Department of Computer Science and Information Engr. (CSIE), Providence University, Taichung 43301, Taiwan
c Department of Computer Science, University of Salerno, Via Giovanni Paolo II, 132, 84084, Fisciano, Salerno, Italy

ARTICLE INFO

Keywords: Blockchain; Green IoT; Byzantine fault tolerance; Reinforcement learning; Consensus algorithm; Smart city

ABSTRACT

To enhance the consensus performance of Blockchain in the Green Internet of Things (G-IoT) and improve the static network structure and communication overheads in the Practical Byzantine Fault Tolerance (PBFT) consensus algorithm, in this paper, we propose a Credit Reinforcement Byzantine Fault Tolerance (CRBFT) consensus algorithm by using reinforcement learning. The CRBFT algorithm divides the nodes into three types, each with different responsibilities: master node, sub-nodes, and candidate nodes, and assigns a credit attribute to each node. The node's credit can be adjusted adaptively through the reinforcement learning algorithm, which can dynamically change the state of nodes. The CRBFT algorithm can automatically identify malicious nodes and invalid nodes, making them exit from the consensus network. Experimental results show that the CRBFT algorithm can effectively improve the consensus network's security. Besides, compared with the PBFT algorithm, in CRBFT, the consensus delay is reduced by about 40%, and the traffic overhead is reduced by more than 45%. This reduction is conducive to saving energy and reducing emissions.

∗ Corresponding author.
E-mail addresses: dzhan@shmtu.edu.cn (D. Han), thweng@pu.edu.tw (T.-H. Weng), kuancli@pu.edu.tw (K.-C. Li), arcastiglione@unisa.it (A. Castiglione).
1 Authors Peng Chen and Dezhi Han contributed equally to this work.

https://doi.org/10.1016/j.jisa.2021.102821
Available online 22 April 2021
2214-2126/© 2021 Published by Elsevier Ltd.

1. Introduction

Green Internet of Things (G-IoT) is a new design concept of the Internet of Things (IoT) that optimizes network devices and introduces new technologies to achieve energy saving, pollution reduction, and lower operating costs [1]. In recent years, with the development of IoT technology, energy consumption problems in network maintenance, communication overhead, and data management have become increasingly prominent. Therefore, the demand for G-IoT is becoming more and more urgent [2]. A Blockchain is a list of records linked and protected by cryptography that holds the following main characteristics: decentralization, distribution, and peer-to-peer (P2P) architecture [3]. The essence of a Blockchain is a distributed database technology [4]. More precisely, through cryptography, consensus algorithms, distributed storage, and point-to-point technology, each node in the network maintains the network data's consistency and validity and constructs a decentralized distributed system [5]. The use of Blockchain in the IoT can enhance security and mitigate energy consumption problems caused by centralized servers [6,7]. As a result, the combination of IoT and Blockchain can effectively reduce energy consumption, which is conducive to IoT energy saving and reducing operation costs and is in line with the design concept of G-IoT. As the technological foundation of smart cities, the IoT provides smart cities with urban perception capabilities. Therefore, the development of G-IoT contributes to the sustainable development of smart cities [8,9].

The consensus algorithm is an essential part of Blockchain technology, and it is necessary to solve Byzantine faults. A suitable consensus algorithm is conducive to reducing consensus energy consumption and resource waste. A Byzantine fault takes its name from the Byzantine generals problem, proposed by Lamport [10]. The Byzantine generals problem is typically used to characterize consistency problems in distributed systems and represents the underlying issue for the Blockchain consensus algorithm. More precisely, the Byzantine generals problem in a computer system can be expressed as follows: assuming there is a reliable channel for the transmission of messages, how to avoid the influence of malicious nodes in the system, so that the whole system can run well without the impact of such nodes, and ensure the integrity, reliability, and consistency of information data [11]?

The Blockchain consensus algorithm aims to make the transaction data verified by more than half of the nodes. Research on consensus algorithms has long been underway. Pease and Lamport first proposed the Byzantine Fault Tolerance (BFT) algorithm in the 1980s [12]. This algorithm relies on the mutual transmission of information between the nodes to reach a deterministic consensus result. However, BFT is not practical since, in this algorithm, the complexity of the messages exchanged between the nodes is exponential. In addition, the process of joining and exiting nodes requires special processing. In 1993, Cynthia Dwork and Moni Naor first proposed the proof-of-work (PoW) algorithm [13]. In this algorithm, the client needs to perform a certain amount of complicated calculations. Therefore, this algorithm requires more nodes and computing power, resulting in a long transaction time. In 1999, Miguel Castro and Barbara Liskov improved the BFT by proposing the Practical Byzantine Fault Tolerance (PBFT) algorithm [14]. The PBFT inherited the advantages of BFT while reducing the algorithm complexity from exponential to polynomial. However, the PBFT algorithm adopts the Client–Server (C/S) structure [15], and its consensus nodes are fixed, making it unable to perceive the changes in the number of nodes dynamically. Therefore, with the further development of Blockchain technology, many researchers have proposed improved algorithms to overcome the limitations of the PBFT algorithm.

In particular, Malkhi et al. introduce the Flexible BFT [16], a new approach for BFT consensus which is resilient to higher malicious levels than possible in a pure Byzantine fault model. However, the new fault model it proposes cannot predict what these replicas would do if they can violate safety. Duan et al. present hBFT, a hybrid, Byzantine fault-tolerant algorithm [17], which can detect and identify faulty clients. However, if clients are participating, it is necessary to ensure that the master node is correct to maintain performance. Liu introduces a Dynamic authorization of the Byzantine fault-tolerant (DDBFT) algorithm [18], which applies the Delegated Proof of Stake (DPoS) algorithm to PBFT. The DDBFT algorithm is dynamic. This characteristic improves throughput and reduces the delay. On the other hand, it causes the system to block when the block size (transmitted data size) exceeds the node processing capacity, resulting in wasted resources. Li and Zhang in [19] describe a Group-Hierarchy (GH) algorithm based on PBFT, which divides replicas into groups. Each group executes the normal-case operation of PBFT concurrently to reach a local consensus. Then, every group's primary, as the consensus representative, reaches an agreement with other primaries to reach a global consensus. This algorithm reaches a consensus faster than PBFT but does not deal with malicious nodes, wasting many resources and weakening system stability. The above algorithms have some drawbacks, especially in resource consumption, and therefore do not meet the G-IoT requirements.

In recent years, with the upsurge of research on machine learning (ML), the combination of Blockchain and machine learning has attracted researchers' attention [9,20,21]. For example, machine learning has been applied to the design of a fair data transaction protocol based on Blockchain [22], and deep learning (DL) finds application in financial investment research based on Blockchain [23,24]. Reinforcement learning (RL) is a behavior-based machine learning method, which can adjust its behavior through the interaction between agent and environment. More precisely, RL has attracted much attention in artificial intelligence (AI) for its excellent decision-making ability [25,26].

Based on the above problems, in this paper, we propose a Credit Reinforcement Byzantine Fault Tolerance (CRBFT) consensus algorithm based on the RL algorithm, where "credit" is the quantitative measure of the credibility of each consensus node, that is, "credit" is an indicator of whether a node can participate in the consensus. Besides, a node with a higher credit value has a higher priority to participate in the consensus. The CRBFT consensus algorithm combines RL with the PBFT algorithm, using the RL algorithm to identify malicious nodes and invalid nodes automatically. Furthermore, CRBFT is reliable, efficient, safe, and dynamic. It can reduce communication overhead and consensus energy consumption, which is conducive to G-IoT development.

In detail, the main contributions of this paper are as follows:

(1) We apply for the first time reinforcement learning to the design of the BFT consensus algorithm, introduce the concept of credit, and propose a reinforcement-based BFT consensus algorithm, CRBFT. With the proposed algorithm, the network has cognitive intelligence, which allows it to automatically identify malicious nodes and failed nodes and to adaptively adjust the credit of nodes so that the state of the nodes can be dynamically adjusted with time. It reduces the interference of malicious nodes with the consensus process and improves the security of the consensus network;
(2) We improve the PBFT algorithm. We change from the C/S paradigm to the P2P paradigm, remove the confirmation phase, and perform synchronous verification when the master node changes. This makes the algorithm conform to decentralization, reduces communication overhead, and lowers consensus energy consumption;
(3) The experimental results show that, compared with the traditional PBFT consensus algorithm, the proposed CRBFT algorithm significantly reduces the consensus delay. In the proposed algorithm, the performance is less affected by the increase in the number of nodes.

The remainder of this paper is organized as follows. Section 2 introduces some preliminary knowledge and background necessary to understand the consensus algorithm design better. Section 3 presents the design ideas and details concerning the CRBFT consensus algorithm based on reinforcement learning. In Section 4, we show the experimental results achieved through simulation. Finally, in Section 5, we provide some concluding remarks and future research prospects.

2. Preliminaries and problem formulation

To better illustrate the design details of the Credit Reinforcement Byzantine Fault Tolerance (CRBFT) consensus algorithm proposed in this paper, in this section, we introduce the basic knowledge of the Practical Byzantine Fault Tolerance (PBFT) algorithm and provide the concepts underlying the reinforcement learning (RL) algorithm used in our proposal.

2.1. Practical Byzantine Fault Tolerance (PBFT)

Lamport et al. showed that an effective Byzantine Fault Tolerant algorithm exists when the number of traitors (which can make mistakes) in the Byzantine system is less than 1/3 of the total number of nodes. Conversely, if it reaches or exceeds 1/3, there is no guarantee that the system will achieve a consistent result [27]. Therefore, the solution to the Byzantine problem can be determined if Eq. (1) is satisfied,

$$n \geq 3f + 1 \tag{1}$$

where $n$ is the total number of nodes, and $f$ is the maximum tolerable number of Byzantine nodes.

The PBFT algorithm requires that all nodes have the same result of operation execution under the same given service state and parameters, and all nodes must start execution from the same state. Under this constraint, even if there are failed nodes, the PBFT algorithm agrees on the total execution order of all non-failed nodes, thereby ensuring safety [28]. The PBFT algorithm divides all nodes into three types: a client node, a master node, and a backup node. Again, it divides the consensus process into three stages: pre-preparation stage, preparation stage, and confirmation stage. We show the consensus process of the PBFT algorithm in Fig. 1.

The result of the process is not valid until the client node receives the same result from at least $f + 1$ normal nodes that do not make mistakes.
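The bound of Eq. (1) and the $f + 1$ reply rule can be checked numerically. The following is a minimal Python sketch under our own assumptions: the function names (`max_faulty`, `min_nodes`, `accepted_result`) are hypothetical and chosen only for illustration, and replies are modeled as plain strings rather than signed messages.

```python
from collections import Counter
from typing import List, Optional


def max_faulty(n: int) -> int:
    """Largest f that satisfies n >= 3f + 1 (Eq. (1))."""
    if n < 1:
        raise ValueError("n must be a positive number of nodes")
    return (n - 1) // 3


def min_nodes(f: int) -> int:
    """Smallest n that tolerates f Byzantine nodes."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def accepted_result(replies: List[str], f: int) -> Optional[str]:
    """Return the result reported by at least f + 1 nodes, or None.

    Illustrative assumption: each entry of `replies` is the result
    reported by one distinct node.
    """
    if not replies:
        return None
    result, count = Counter(replies).most_common(1)[0]
    return result if count >= f + 1 else None
```

For instance, `max_faulty(4)` evaluates to 1 and `min_nodes(3)` to 10, which matches the usual choice of $n = 3f + 1$.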


Fig. 1. The PBFT algorithm consensus process.

2.2. Reinforcement learning (RL)

Reinforcement learning (RL) is a type of machine learning inspired by animal psychology and combined with psychology, control theory, and other related subjects [29,30]. In simple terms, RL is a cyclic process in which an agent takes action to change its state to gain reward and interact with the environment. Markov Decision Processes (MDPs) provide a framework for the study of RL. A typical MDP problem can be expressed as a five-tuple $(S, A, P, R, \gamma)$, where $S$ is a set of states, and $A$ is a set of actions. The transition probability $P$ describes the probability distribution of the agent's transition from the current state $s \in S$ to other states after the action $a \in A$. $R$ is the reward function, which defines the reward for the action $a \in A$. $\gamma$ is a discount factor that is mainly used to balance current and future rewards. The objective of MDPs is to get the maximum return $G_t$ when taking the corresponding action $a$ under state $s$. The return $G_t$ is the total discounted reward from time-step $t$. Formally, $G_t$ is defined by Eq. (2),

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \tag{2}$$

where $0 < \gamma < 1$, and $R_{t+1}$ represents the reward of the transition from state $s_t$ to $s_{t+1}$. The optimal strategy to obtain the maximum return can be found by solving the optimal Bellman equation, which is defined as follows

$$v^*(s) = \max_a q^*(s, a) = \max_a \Big( R_s^a + \gamma \sum_{s'} P_{ss'}^a v^*(s') \Big) \tag{3}$$

$$q^*(s, a) = \max_\pi q^\pi(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \max_{a'} q^*(s', a') \tag{4}$$

where $v^*(s)$ represents the optimal long-term value of the state $s$, that is, the value of the state in which all possible actions are considered, and the optimal action is selected; strategy $\pi$ is a method for the agent to choose actions; $q^\pi(s, a)$ is the action-value function, which represents the expected return of using strategy $\pi$ and taking action $a$ in state $s$; $q^*(s, a)$ represents the optimal value among the action-value functions generated under all strategies; $R_s^a$ is the expected reward of taking action $a$ in state $s$; $P_{ss'}^a$ is the probability of moving from state $s$ to state $s'$ when taking action $a$.

We can solve the optimal Bellman equation by the Temporal-Difference (TD) algorithm [31]. The TD algorithm is a model-free RL algorithm, which updates the state value $V(s_t)$ by predicting the TD target, where the TD target is $R_{t+1} + \gamma V(s_{t+1})$.

$$V(s_t) \leftarrow V(s_t) + \alpha [R_{t+1} + \gamma V(s_{t+1}) - V(s_t)]. \tag{5}$$

The purpose of the TD algorithm update is to minimize the error between the final predicted value and the true value. The TD error $\delta_t$ is defined by Eq. (6):

$$\delta_t \doteq R_{t+1} + \gamma V(s_{t+1}) - V(s_t). \tag{6}$$

The Actor-Critic method [32] is an effective RL algorithm. In this method, the Actor selects the action according to the current system state. On the other hand, the Critic gives the corresponding value evaluation according to the current system state and the action selected by the Actor, parameterizes the action-value function, and updates its parameters by reward $R_{t+1}$ to get a higher return. The Actor then directs the update of the action based on the value derived from the Critic. Then, the objective function is constructed, and the parameters are updated iteratively to make the output meet the threshold of the objective function. The final output is the approximate optimal solution of the Bellman equation. Therefore, the Actor-Critic method can solve the optimal Bellman equation adaptively, as shown by the diagram in Fig. 2.

Fig. 2. Schematic diagram for implementations of the actor-critic.

3. Credit Reinforcement Byzantine Fault Tolerance consensus algorithm design

In this section, we provide a detailed description of our proposal. More precisely, we first present the design approach underlying the Credit Reinforcement Byzantine Fault Tolerance (CRBFT) consensus algorithm, and then we give the relevant details.

3.1. Algorithm design approach

In the Blockchain, the primary purpose is to reach a consensus on the Blockchain transaction information across the entire network, which does not involve the order of the consensus requests. Therefore, to better adapt the Byzantine fault-tolerant consensus algorithm to the Blockchain, we propose a Credit Reinforcement Byzantine Fault Tolerance (CRBFT) consensus algorithm.

First of all, we adapt the PBFT algorithm to make it compatible with the Blockchain system's effective application. More precisely, we propose the following improvements to the PBFT algorithm:

1. According to the Blockchain decentralized architecture, we shift from the Client–Server (C/S) paradigm to the Peer-to-Peer (P2P) paradigm. Thus, there is no client in the system;
2. We divide the consensus nodes into three types: master node, sub-node, and candidate node;
3. We set the "credit" attribute for consensus nodes. In this way, the system can dynamically divide the types of consensus nodes, and nodes can dynamically join and leave the system;
4. We remove the confirmation phase from the consensus process of the PBFT algorithm and carry out the synchronous verification process when the master node changes;
5. We perform adaptive consensus confirmation and reputation adjustment based on reinforcement learning.

We remark that removing the confirmation phase of the protocol reduces the network's bandwidth consumption to a certain extent. However, to remove state inconsistency and ensure the system consistency after the master node change, the synchronous verification process is included after the change of such a node. More precisely, the synchronous verification process works as follows. The sub-node sends a synchronization request to the new master node to verify whether the master node number is consistent. After the synchronization is successful, the master node sends backup data to the sub-node, and then the sub-node verifies the validity of the backup data. Through this operation, we reduce communication overhead and improve consensus efficiency, without compromising the fault tolerance of the system.
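The synchronous verification process described above can be sketched as follows. This is only an illustrative Python sketch under stated assumptions, not the implementation used in this paper: the class and method names (`MasterNode`, `SubNode`, `handle_sync_request`, `synchronize`) are hypothetical, and we assume that a sub-node checks the validity of the backup data by comparing its SHA-256 digest with a digest it already trusts, which the text does not specify.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class MasterNode:
    number: int    # master node number p
    backup: bytes  # serialized backup data (hypothetical format)

    def handle_sync_request(self, claimed_number: int) -> Optional[bytes]:
        # Refuse to synchronize if the sub-node expects another master number.
        if claimed_number != self.number:
            return None
        return self.backup


@dataclass
class SubNode:
    believed_master: int  # master node number this sub-node expects
    trusted_digest: str   # assumption: digest of the last agreed state
    state: Optional[bytes] = None

    def synchronize(self, master: MasterNode) -> bool:
        # Step 1: check that the master node number is consistent.
        data = master.handle_sync_request(self.believed_master)
        if data is None:
            return False
        # Step 2: verify the validity of the received backup data.
        if sha256_hex(data) != self.trusted_digest:
            return False
        self.state = data
        return True
```

In this sketch a sub-node adopts the backup data only when both checks pass; any other validity criterion could be substituted in the second step.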
Furthermore, according to the output of the action performed by reinforcement learning, the consensus system adjusts the credit of the nodes participating in the consensus each time. A node successfully participates in the consensus when its verification message positively impacts the successful release of the block by the master node. Nodes that successfully participate in the consensus will get a "reward" so that the successful nodes have higher credit and are more likely to participate in the next consensus. Conversely, malicious nodes and failed nodes will be "punished". The credit of such nodes will decrease as the number of consensus rounds increases. Besides, when a node's credit is lower than a given threshold (base value), such a node will quit the consensus. In this way, the system can effectively identify malicious nodes and failed nodes and dynamically add and delete the nodes participating in the consensus. Consequently, these design choices make the consensus environment more in line with current network scenarios and make it more stable and secure while reducing resource consumption.

3.2. Algorithm design content

3.2.1. Algorithm notation and preprocessing

The basic parameters used in the system are defined as follows:

• Set "credit" $C$ as the index to measure the reliability of consensus nodes (i.e., the possibility of successfully participating in consensus), and take it as the main basis to divide the consensus nodes;
• Consensus nodes are divided into three types: master node, sub-nodes, and candidate nodes. The master node is the node with the highest credit. The number of consensus nodes is $N$, where $N \geq 3F + 1$ and $F$ is the maximum number of malicious nodes that the system can tolerate. In general, $N$ takes $3F + 1$;
• Preset the value of the admission credit $C_{base}$. $N - 1$ candidate nodes with a credit $C \geq C_{base}$ are selected as sub-nodes. Nodes with a credit lower than $C_{base}$ or newly added nodes are selected as candidate nodes.

3.2.2. An overview of the CRBFT algorithm

When the system is initialized, all the consensus nodes' credit is set to $C_{base}$, and $N$ consensus nodes are randomly selected as sub-nodes. The master node is then elected from the sub-nodes, and the number $p$ is assigned to it. Since the Blockchain's consensus system is P2P-based and decentralized, each consensus node needs to start in the same state. This assumption means that the data stored by each consensus node needs to be consistent. Ensuring consistency requires data backup and verification. After the data backup and verification phases are complete, the system begins the consensus process. We remark that in the CRBFT algorithm, message delivery uses digital signature techniques and the SHA-256 algorithm to ensure the integrity and authenticity of the message.

The master node initiates the consensus process and sets the master node consensus interval to $\Delta t$. The details of the CRBFT algorithm are as follows:

1. The initiator starts its transaction by signing this transaction and broadcasting it to the consensus nodes;
2. The consensus node verifies the legality of the received transaction. If it is legal, the transaction information is cached, and the master node generates a block. If the received transaction is not legal, it is directly discarded;
3. The master node $p$ generates a proposal message for the generated block after $\Delta t$ and assigns a unique number $n$ to this proposal. This number is increased with each new proposal generated. Then the master node $p$ broadcasts this proposal message to the sub-nodes to request them to participate in the consensus. The message format is $\langle\langle \mathit{Proposal}, p, n, s_m, d \rangle, \mathit{block} \rangle$, where $s_m$ is the message signature and $d$ is the message digest computed using the SHA-256 algorithm;
4. The sub-node first checks the message signature, proposal number, and master node number of the proposal. If the verification fails, the sub-node discards the proposal message. On the other hand, if the verification succeeds, the sub-node sends a verification message to the master node. The verification message format is $\langle \mathit{ConsesusCom}, p, n, s_m, d, re, k, C \rangle$, where $k$ is the sub-node number, $C$ is the reputation of node $k$, and $re$ denotes the verification type of the sub-node $k$ for the proposal message digest. $re$ can be as simple as either a "1" or a "−1", corresponding to "true" or "false", respectively, where "true" means that the proposal message digest $d$ is consistent with the data cached by the sub-node, and "false" means that it is inconsistent and that the sub-node therefore suspects the master node;
5. The master node then uses reinforcement learning to perform the consensus confirmation of the received verification messages. If the master node receives $2F$ confirmation messages, it considers that consensus has been reached and publishes a block. If the master node receives $2F$ suspicious messages, the consensus network broadcasts the message of changing the master node and reselects the master node. If not enough messages are received during the consensus timeout, all nodes discard the block generated during the consensus process, replace the master node, and reduce the "credit" of the replaced master node. Then, the master node is reselected, excluding the original master node. Finally, a new round of consensus starts. If the master node can successfully publish the block, after publishing the block, such a node sends the credit adjustment message to the sub-nodes;
6. When the consensus node receives the published block, it considers this round of consensus complete. Then such a node updates the credit according to the adjustment information, clears the cache, updates the consensus node, and starts a new round.

The consensus process of the CRBFT is shown in Fig. 3, and the CRBFT algorithm information flow is shown in Fig. 4.

Fig. 3. The consensus process of the CRBFT.

Fig. 4. The CRBFT algorithm flowchart.

3.2.3. Reinforcement learning consensus confirmation

Consensus confirmation combines the Actor-Critic method, TD algorithm, and neural networks. It is mainly composed of an action network that produces actions and a critic network, which is used to evaluate the actions. The main function of Actor-Critic is to judge whether the sub-nodes are in agreement and provide a basis for credit adjustment. In Fig. 5, we show the neural network structure for the critic network. The structure of the action network is essentially the same as the critic network. More precisely, both networks adopt the back propagation (BP) neural network with a hidden layer.

Fig. 5. Schematic diagram for the implementation of a nonlinear critic network using a feedforward network with one hidden layer.

Take $re_k$ and $C_k$ in the $k$th verification message as the input of the action network, and get the output $u(k)$. The critic network takes $re_k$, $C_k$, and $u(k)$ as inputs, and obtains the output $J(k)$ as the approximate value of the return $G_t$ in Eq. (2), which is used to estimate the output of the critic network approximately. Therefore, this approximation is used to quantify the output of $u(k)$ better. Then, compare the action output $u(k)$ and $re_k$ to select the reinforcement signal $r(k)$. If $u(k)$ and $re_k$ have the same sign, the output is regarded as successful, and $r(k)$ is "0". On the other hand, if the sign is different, it is considered to be a failure, and $r(k)$ is "1".

In the initial stage of consensus, the weights of the action network and critic network are random. During the neural network's learning process, the critic network uses $r(k)$ to update the weights and obtain the optimal $J(k)$. The action network then uses the optimal $J(k)$ to update its weights and get the optimal output. The neural network's detailed design process will be given in parts A and B of this subsection. In Fig. 6, we show a diagram characterizing the process for tuning the Actor-Critic method's parameters.

Fig. 6. Schematic diagram for implementations of the CRBFT.

A. Critic network. The learning objective of the critic network is to minimize the error between the approximation value of the value function and the actual value, while optimizing the maximum return. The prediction error function $e_c(k)$ of the critic network and the minimized objective function $E_c(k)$ are defined as follows:

$$e_c(k) = \alpha J(k) - [J(k-1) - r(k)] \tag{7}$$

$$E_c(k) = \frac{1}{2} e_c^2(k). \tag{8}$$

In the critic network, the output $J(k)$ is defined as follows:

$$\varphi_i(k) = \sum_{j=1}^{n_{in}+1} w_{c_{ij}}^{(1)}(k)\, x_j(k), \quad i = 1, \ldots, N_h \tag{9}$$

$$\phi_i(k) = \frac{1 - e^{-\varphi_i(k)}}{1 + e^{-\varphi_i(k)}}, \quad i = 1, \ldots, N_h \tag{10}$$

$$J(k) = \sum_{i=1}^{N_h} w_{c_i}^{(2)}(k)\, \phi_i(k) \tag{11}$$

where $\varphi_i$ is the $i$th hidden node input of the critic network; $n_{in} + 1$ is the total number of inputs into the critic network, including the analog action value from the action network; $\phi_i$ is the corresponding output; $N_h$ is the total number of hidden nodes in the critic network; $w_c$ is the weight vector in the critic network.

According to the error propagation equation of the back propagation algorithm and the chain rule, the gradient of the neural network objective function with respect to the weights can be obtained. We summarize the adaptation of the critic network as follows:

• $\Delta w_c^{(2)}$ (hidden to output layer)

$$\begin{aligned}
\Delta w_{c_i}^{(2)}(k) &= l_c(k)\left[-\frac{\partial E_c(k)}{\partial w_{c_i}^{(2)}(k)}\right] \\
&= -l_c(k)\left[\frac{\partial E_c(k)}{\partial J(k)} \frac{\partial J(k)}{\partial w_{c_i}^{(2)}(k)}\right] \\
&= -l_c(k)\left[\alpha e_c(k) \phi_i(k)\right]
\end{aligned} \tag{12}$$

• $\Delta w_c^{(1)}$ (input to hidden layer)

$$\begin{aligned}
\Delta w_{c_{ij}}^{(1)}(k) &= l_c(k)\left[-\frac{\partial E_c(k)}{\partial w_{c_{ij}}^{(1)}(k)}\right] \\
&= -l_c(k)\left[\frac{\partial E_c(k)}{\partial J(k)} \frac{\partial J(k)}{\partial \phi_i(k)} \frac{\partial \phi_i(k)}{\partial \varphi_i(k)} \frac{\partial \varphi_i(k)}{\partial w_{c_{ij}}^{(1)}(k)}\right] \\
&= -\alpha l_c(k) e_c(k) w_{c_i}^{(2)}(k) \cdot \left[\frac{1}{2}\left(1 - \phi_i^2(k)\right)\right] x_j(k)
\end{aligned} \tag{13}$$

where $l_c(k) > 0$ is the learning rate of the critic network when dealing with the $k$th message. We remark that this rate usually decreases over time to a small value.

B. Action network. Previously we defined that if the output is regarded as successful, $r(k)$ is "0". That is, "0" was defined as the reinforcement signal for "success". To satisfy the Bellman equation and maximize the state value function, the ultimate learning target of the action network, denoted by $U_c$, is set to "0" in the algorithm. Through observation, we found that the parameter adjustment principle of the action network is to indirectly back-propagate the error between $J$ and $U_c$. We define $e_a(k)$ and $E_a(k)$ as follows:

$$e_a(k) = J(k) - U_c(k) \tag{14}$$

$$E_a(k) = \frac{1}{2} e_a^2(k). \tag{15}$$

The action network adopts a neural network structure, which is similar to the critic network shown in Fig. 5. In detail, the relevant equations characterizing the action network are defined as follows:

$$m_i(k) = \sum_{j=1}^{n_{in}} w_{a_{ij}}^{(1)}(k)\, x_j(k), \quad i = 1, \ldots, N_h \tag{16}$$

$$z_i(k) = \frac{1 - e^{-m_i(k)}}{1 + e^{-m_i(k)}}, \quad i = 1, \ldots, N_h \tag{17}$$

$$h(k) = \sum_{i=1}^{N_h} w_{a_i}^{(2)}(k)\, z_i(k) \tag{18}$$

$$u(k) = \frac{1 - e^{-h(k)}}{1 + e^{-h(k)}} \tag{19}$$

where $m_i$ is the input of the $i$th hidden node of the action network, $z_i$ is the corresponding output; $h$ is the input of the output node; $w_a$ is the weight vector in the action network.

Similar to the critic network, the parameter update rules of the action network are summarized in Eqs. (20)–(21).

• $\Delta w_a^{(2)}$ (hidden to output layer)

$$\begin{aligned}
\Delta w_{a_i}^{(2)}(k) &= l_a(k)\left[-\frac{\partial E_a(k)}{\partial w_{a_i}^{(2)}(k)}\right] \\
&= -l_a(k)\left[\frac{\partial E_a(k)}{\partial J(k)} \frac{\partial J(k)}{\partial u(k)} \frac{\partial u(k)}{\partial h(k)} \frac{\partial h(k)}{\partial w_{a_i}^{(2)}(k)}\right] \\
&= -l_a(k) e_a(k) \left[\frac{1}{2}\left(1 - u^2(k)\right)\right] z_i(k) \sum_{i'=1}^{N_h} \left[\frac{1}{2} w_{c_{i'}}^{(2)}(k) \left(1 - \phi_{i'}^2(k)\right) w_{c_{i',n_{in}+1}}^{(1)}(k)\right]
\end{aligned} \tag{20}$$

• $\Delta w_a^{(1)}$ (input to hidden layer)

$$\begin{aligned}
\Delta w_{a_{ij}}^{(1)}(k) &= l_a(k)\left[-\frac{\partial E_a(k)}{\partial w_{a_{ij}}^{(1)}(k)}\right] \\
&= -l_a(k)\left[\frac{\partial E_a(k)}{\partial J(k)} \frac{\partial J(k)}{\partial u(k)} \frac{\partial u(k)}{\partial h(k)} \frac{\partial h(k)}{\partial z_i(k)} \frac{\partial z_i(k)}{\partial m_i(k)} \frac{\partial m_i(k)}{\partial w_{a_{ij}}^{(1)}(k)}\right] \\
&= -l_a(k) e_a(k) \left[\frac{1}{2}\left(1 - u^2(k)\right)\right] \cdot w_{a_i}^{(2)}(k) \left[\frac{1}{2}\left(1 - z_i^2(k)\right)\right] x_j(k) \\
&\quad \cdot \sum_{i'=1}^{N_h} \left[\frac{1}{2} w_{c_{i'}}^{(2)}(k) \left(1 - \phi_{i'}^2(k)\right) w_{c_{i',n_{in}+1}}^{(1)}(k)\right]
\end{aligned} \tag{21}$$

where $l_a(k) > 0$ is the learning rate of the action network in dealing with the $k$th message.

We define the list NT to record nodes where $re$ is "1", and the list NF to record nodes where $re$ is "−1". The consensus is confirmed when the number of nodes in NT is greater than or equal to $2F$. Conversely, when the number of nodes in NF is greater than or equal to $2F$, the sub-nodes should not trust the master node.

4. Results and discussion

This section first explains the simulation settings. Then, we test

Table 1
Experimental environment.

Hardware   CPU                   Intel(R) Core(TM) i7-6500U @ 2.60 GHz
           Memory                8 GB RAM
Software   Operating system      Windows 10
           Simulation software   PyCharm 2019.2 (Python 3.8)

4.1. Simulation setup

The experimental environment for performance testing is shown in Table 1. The rules for adjusting the credit of the CRBFT algorithm are as follows:

1. The initial credit of all nodes, denoted as $C_{base}$, is set to 3;
2. If the master node completes a consensus, its credit is incremented by 0.2. If the sub-node's verification type is "true", the credit of this sub-node is incremented by 0.1. On the other hand, if the verification type is "false", this sub-node's credit is decremented by 0.1. Also, the credit of a sub-node that has not responded before the timeout is reduced by 0.1;
3. If the sub-nodes successfully suspect the master node, the credit of the master node is set to 2;
4. If the master node does not complete the consensus within the timeout, the master node's credit is decremented by 1.

Notice that the Actor-Critic structure implements the calculation of the credit adjustment value in step 2. The relation between the output $u$ and $\Delta C$ is defined by Eq. (22):

$$\Delta C = \begin{cases} mag, & u \geq 0 \\ -2\,mag, & u < 0 \end{cases} \tag{22}$$

where $mag$ is the adjustment rate of credit, and it can be set freely. In this paper, we set it to 0.1.

In addition, the specific settings of the neural network are as follows:

- $l_c(0)$: 0.3, initial learning rate of the critic network;
- $l_a(0)$: 0.3, initial learning rate of the action network;
- $l_c(t)$: learning rate of the critic network at time $t$, which is decreased by 0.05 every five time steps until it reaches 0.005 and stays at 0.005 after that;
- $l_a(t)$: learning rate of the action network at time $t$, which is decreased by 0.05 every five time steps until it reaches 0.005 and stays at 0.005 after that;
- $N_c$: 50, internal cycle of the critic network;
- $N_a$: 100, internal cycle of the action network;
- $T_c$: 0.05, internal training error threshold for the critic network (the threshold of $E_c(k)$);
- $T_a$: 0.005, internal training error threshold for the action network (the threshold of $E_a(k)$);
- $N_h$: 6, number of hidden nodes.

4.2. Performance evaluation

4.2.1. Algorithm performance test

In the experiment, we set $F$ to 3, $N$ to 13, and we consider 10 consensuses. Again, we set node 3 to become malicious at the 3rd consensus, node 2 to become failed at the 4th consensus, and node 5 to become malicious at the 6th consensus. We show the credit trends of each node in Fig. 7.

In Fig. 7, we can observe that in the 3rd consensus, node 3 becomes a malicious node and its credit decreases. When the credit of node 3 is
the performance of the CRBFT algorithm and compare the CRBFT lower than 𝐶𝑏𝑎𝑠𝑒 , node 11 changes from a candidate node to a sub-node,
algorithm with other related works to verify its effectiveness and and node 3 becomes a candidate node. Node 2 becomes a failed node at
availability. Finally, we list two algorithm applications. the 4𝑡ℎ consensus, and the credit also decreases, but the decrease rate

is slower than that of malicious nodes. When its credit falls below $C_{base}$, it becomes a candidate node.

Fig. 7. Credit trend of nodes during the 10 consensus rounds.

Fig. 8. Comparison diagram of single consensus time.

Fig. 9. Consensus delay comparison between the two algorithms.
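
The credit adjustment rules and Eq. (22) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Node`, `delta_c`, `update_sub_node`, and `swap_if_below_base` are hypothetical, and the sketch assumes that a malicious verification corresponds to $u < 0$ in Eq. (22), while a timed-out (failed) sub-node receives the fixed 0.1 penalty of rule 2. The master-node rules are omitted.

```python
from dataclasses import dataclass
from typing import Optional

C_BASE = 3.0            # rule 1: initial credit of every node
MAG = 0.1               # adjustment rate "mag" (0.1 in the paper)
TIMEOUT_PENALTY = 0.1   # rule 2: sub-node that did not respond in time


def delta_c(u: float, mag: float = MAG) -> float:
    """Eq. (22): credit change derived from the action-network output u."""
    return mag if u >= 0 else -2.0 * mag


@dataclass
class Node:
    node_id: int
    role: str               # "master", "sub" or "candidate"
    credit: float = C_BASE


def update_sub_node(node: Node, u: Optional[float]) -> None:
    """Apply one consensus round; u=None models a sub-node that timed out."""
    change = -TIMEOUT_PENALTY if u is None else delta_c(u)
    # Rounding keeps repeated 0.1 steps from accumulating float drift.
    node.credit = round(node.credit + change, 10)


def swap_if_below_base(sub: Node, candidate: Node) -> bool:
    """Demote a sub-node whose credit fell below C_BASE and promote a candidate."""
    if sub.role == "sub" and sub.credit < C_BASE:
        sub.role, candidate.role = "candidate", "sub"
        return True
    return False


if __name__ == "__main__":
    node3, node11 = Node(3, "sub"), Node(11, "candidate")
    for u in (0.7, 0.7, -0.7, -0.7):   # honest twice, then malicious twice
        update_sub_node(node3, u)
    print(node3.credit, swap_if_below_base(node3, node11), node11.role)
```

Under these assumptions a malicious sub-node loses credit twice as fast as a failed one, which is one way to read the different decrease rates described for nodes 3 and 2.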
We can conclude that the proposed algorithm can adaptively adjust the credit of consensus nodes, effectively identify malicious nodes and failed nodes, and dynamically adjust the types of nodes, improving consensus security. It can also be observed that a successful sub-node is given priority to take part in the next consensus, and the current master node has an advantage in remaining the master node in the next round.
The comparison of single consensus time is shown in Fig. 8. We can observe that the first consensus takes the longest time and that, as the neural network learns, the consensus time decreases until the learning ends. The consensus time stabilizes at around 200 ms.

4.2.2. Delay test
Consensus delay is an important index for measuring the speed of a consensus algorithm. A low consensus delay allows transactions to be confirmed quickly, which makes the Blockchain more secure and practical. More precisely, the consensus delay $T_d$ tested in this paper is the consensus completion time, and it is defined by Eq. (23):

$$ T_d = T_{tc} - T_{tr} \tag{23} $$

where $T_{tr}$ is the transaction start time, and $T_{tc}$ is the consensus completion time. Then, $F$ is set to values 1, 2, and 3, and $N$ is set to values 4, 7, and 10, respectively. The average value is obtained through 10 tests and compared with the PBFT algorithm. The statistical results are shown in Fig. 9.
In Fig. 9, we can observe that as the number of nodes increases, the consensus delay of both algorithms increases, but the increase for PBFT is larger than for CRBFT, so the growth in consensus nodes has a more significant impact on PBFT. Besides, with the same number of nodes, the consensus delay of CRBFT is significantly lower than that of PBFT: it is reduced by about 40%.

4.2.3. Communication frequency comparison
According to the PBFT algorithm execution flow, the number of messages in each phase is simplified to obtain Eq. (24):

$$ Z_{PBFT} = 2n \cdot (n - 1) \tag{24} $$

where $n$ is the number of nodes participating in the consensus. Similarly, the number of messages in a single consensus for CRBFT is:

$$ Z_{CRBFT} = 4n - 3 \tag{25} $$

The total number of messages in GH [19] is given below:

$$ Z_{GH} = 2\,\frac{n^2}{g} + 2 \sum_{j=1}^{g} M_j^2 + 2g^2 + g(n - 2) - n \tag{26} $$

where $g$ is the number of groups, and $M_j$ is the number of nodes in the $j$-th group. We set the number of nodes in each group to 2 in the experiment.
We compare the number of messages of these algorithms for different numbers of nodes, as shown in Fig. 10. The number of messages exchanged by all three algorithms increases as the number of nodes increases, but PBFT increases fastest. Although GH has fewer messages than PBFT, it still has more messages than CRBFT. In other words, the message count of CRBFT grows only slightly; compared with PBFT and GH, the number of messages is reduced by more than 45%. CRBFT can therefore effectively reduce the communication frequency of a single consensus process and reduce the resource consumption of the network.

4.3. Algorithm application

4.3.1. Smart contracts in power system
Blockchain technology can realize more accurate and reliable perception, transmission, and recording of power systems, and more intelligent and efficient mining, integration, and analysis of physical characteristics and internal connections. Besides, Blockchain technology can provide an open, transparent, and credible platform for electricity market transactions, reducing contract execution risks and regulatory costs. Simultaneously, it assists people in thinking and decision-making in the power system and its related social systems, which improves the
management level [33]. Its related applications include, for example, the Blockchain mode of power system state estimation shown in Fig. 11.

Fig. 10. The number of messages between the related algorithms.

Fig. 11. The Blockchain mode of power system state estimation.

Fig. 11 is a blockchain-based power system state estimation model. In Fig. 11, the data center and users form a blockchain power system, and they both have communication and verification functions. The data center receives power state estimation data from users and stores the data. If an attacker wants to tamper with the information of a data center or a single user, it needs to attack most users in the blockchain system at the same time, which requires a huge cracking cost. Therefore, the blockchain model can better prevent data intrusion and form a data protection barrier.
More precisely, combining Fig. 11 and Blockchain technology knowledge, we can apply the CRBFT algorithm to power systems. Based on the tamper-proof characteristics of Blockchain data, Blockchain nodes store energy consumption information collected from the IoT's intelligent metering devices, and then consumers in the Blockchain self-execute smart contracts through the CRBFT algorithm. When the smart contract legally satisfies all the judgment conditions and the consensus nodes reach a consensus, the smart contract is automatically enforced, which improves transaction flexibility and efficiency. Concurrently, the increase or decrease of credibility in the CRBFT algorithm is used as rewards and punishments to balance energy demand and grid energy production rules, thereby balancing energy supply and demand. When there is an external attack, the CRBFT algorithm can screen out the error node and take defensive measures to ensure the information security of the power system. The CRBFT algorithm ensures each transaction's legitimacy, guarantees the power system's safety, and is conducive to the construction and development of smart grids.

4.3.2. Transaction settlement and clearing in finance
In the field of clearing and settlement, the traditional transaction model is that both parties keep separate accounts, and as the data is recorded by each party separately, its authenticity is difficult to guarantee. In contrast, the data in the blockchain is distributed, each node has access to all the transaction information, and once changes are detected the whole network can be notified to prevent tampering.
If we apply the CRBFT consensus algorithm to the financial blockchain, its transaction process and clearing process are synchronized in real time, and the bookkeeping initiated by the seller must be approved by the buyer to complete the transaction. More importantly, in the transaction process, the CRBFT consensus algorithm can dynamically adjust the credibility of the transaction object, which is conducive to creating a safe and credible clearing and settlement environment.

5. Conclusion and future work

This paper proposes a Credit Reinforcement Byzantine Fault Tolerance Consensus (CRBFT) algorithm based on reinforcement learning. In the CRBFT algorithm, a credit attribute is assigned to each node. The node credit is then adaptively adjusted through the reinforcement learning algorithm, giving the node a specific network cognitive ability. The proposed algorithm can automatically identify malicious nodes and failed nodes in the consensus network, thus improving consensus network security, reducing consensus delay, and promoting energy saving and emission reduction. Moreover, compared with the PBFT algorithm, the CRBFT algorithm has lower consensus delay, fewer communication times, and less network resource consumption. These characteristics make our proposal suitable for constructing Green IoT and promoting smart cities' development.
As future research, we want to optimize the algorithm details, for example, parameter selection in the credit adjustment rules. In particular, we intend to find new solutions to speed up the learning process and improve the stability of the neural network, besides further improving the algorithm's security, in order to apply it to different types of Blockchain in IoT, such as DAG-structured blockchains [34]–[35]. Meanwhile, we intend to explore the possibility of applying the Blockchain in the smart grid to improve the algorithm further and promote the smart grid's development.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This study was funded by the National Natural Science Foundation of China, under grants 61873160 and 61672338.

References

[1] Huang J, Meng Y, Gong X, Liu Y, Duan Q. A novel deployment scheme for green internet of things. IEEE Internet Things J 2014;1(2):196–205. http://dx.doi.org/10.1109/JIOT.2014.2301819.
[2] Zhang X, Huang Y, Wang WB. Green internet of things: Requirements, development status and key technologies. Telecommun Sci 2012;28(8):96–104.
[3] Chang J, Han F. Blockchain: from digital currency to credit society. Beijing: China CITIC Press; 2016.
[4] Yuan Y, Wang FY. Blockchain: The state of the art and future trends. Acta Automat Sinica 2016.
[5] Bonneau J, Miller A, Clark J, Narayanan A, Kroll JA, Felten EW. SoK: Research perspectives and challenges for bitcoin and cryptocurrencies. In: 2015 IEEE Symposium on security and privacy, 2015, pp. 104–121.
[6] Tian Q, Han D, Li K-C, Liu X, Duan L, Castiglione A. An intrusion detection approach based on improved deep belief network. Appl Intell 2020;50:3162–78.
[7] Xiao L, Han D, Meng X, Liang W, Li K-C. A secure framework for data sharing in private blockchain-based WBANs. IEEE Access 2020;8:153956–68.
[8] Dorri A, Kanhere SS, Jurdak R, Gauravaram P. Blockchain for IoT security and privacy: The case study of a smart home. In: 2017 IEEE international conference on pervasive computing and communications workshops (PerCom Workshops), 2017.
[9] Liang W, Zhang D, Lei X, Tang M, Li K-C, Zomaya A. Circuit copyright blockchain: Blockchain-based homomorphic encryption for IP circuit protection. IEEE Trans Emerg Top Comput 2020. http://dx.doi.org/10.1109/TETC.2020.2993032.
[10] Drdobbs. The byzantine generals problem. ACM Trans Program Lang Syst 1982;4(3):382–401.
[11] Lamport L. Seminal research document related to the field of byzantine fault tolerance. 1982.
[12] Pease M, Shostak R, Lamport L. Reaching agreement in the presence of faults. J ACM 1980;27:228–34.
[13] Dwork C, Naor M. Pricing via processing or combatting junk mail. In: Annual international cryptology conference, 1992.
[14] Castro M, Liskov B. Practical byzantine fault tolerance. ACM Trans Comput Syst 2002;20(4):398–461.
[15] Androutsellis-Theotokis S, Spinellis D. A survey of peer-to-peer content distribution technologies. ACM Comput Surv 2004;36(4):335–71.
[16] Malkhi D, Nayak K, Ren L. Flexible byzantine fault tolerance. In: Proceedings of the 2019 ACM SIGSAC conference on computer and communications security. CCS ’19, New York, NY, USA: Association for Computing Machinery; 2019, p. 1041–53. http://dx.doi.org/10.1145/3319535.3354225.
[17] Duan S, Peisert S, Levitt KN. hBFT: Speculative byzantine fault tolerance with minimum cost. IEEE Trans Dependable Secure Comput 2015;12(1):58–70. http://dx.doi.org/10.1109/TDSC.2014.2312331.
[18] Liu XF. Research on blockchain performance improvement based on byzantine fault tolerance consensus algorithm based on dynamic authorization. [Ph.D. thesis], Hangzhou: Zhejiang University; 2017.
[19] Li QW. Research on consensus efficiency based on practical byzantine fault tolerance. In: ICMIC. Guiyang; 2018.
[20] Liang W, Fan Y, Li K-C, Zhang D, Gaudiot J-L. Secure data storage and recovery in industrial blockchain network environments. IEEE Trans Ind Inf 2020;16(10):6543–52.
[21] Liang W, Huang W, Long J, Zhang K, Li K-C, Zhang D. Deep reinforcement learning for resource protection and real-time detection in IoT environment. IEEE Internet Things J 2020;7(7):6392–401.
[22] Zhao YQ, Yu Y, Li YN, Han G, Du XJ. Machine learning based privacy-preserving fair data trading in big data market. Inform Sci 2019;478:449–60.
[23] Xie MH, Li HY, Zhao YJ. Blockchain financial investment based on deep learning network algorithm. J Comput Appl Math 2020;372:112723.
[24] Pang XW, Zhou YQ, Wang P, Lin WW, Chang V. An innovative neural network approach for stock market prediction. J Supercomput 2020;76(3):2098–118.
[25] Mendel JM, Mclaren RW. Reinforcement learning control and pattern recognition systems. In: A prelude to neural networks. 1970.
[26] Busoniu L, Babuska R, Schutter BD, Ernst D. Reinforcement learning and dynamic programming using function approximators, first ed. USA: CRC Press, Inc.; 2010.
[27] Pease M, Shostak R, Lamport L. Reaching agreement in the presence of faults. J ACM 1980;27:228–34.
[28] Reiter MK. A secure group membership protocol. IEEE Trans Softw Eng 1996;22(1):31–42.
[29] Sutton R, Barto A. Reinforcement learning: An introduction (adaptive computation and machine learning). 1998.
[30] Lee D, Seo H, Jung MW. Neural basis of reinforcement learning and decision making. Annu Rev Neurosci 2012;35(1):287.
[31] Sutton RS. Learning to predict by the methods of temporal differences. Mach Learn 1988;3(1):9–44.
[32] Werbos P. Approximate dynamic programming for real-time control and neural modeling. 1992.
[33] Wang S, Guo CX, Feng B, Zhang H, Du ZD. Application of blockchain technology in power system: Prospects and ideas. Autom Electr Power Syst 2020;44(11):10–24.
[34] Suhail S, Hussain R, Khan A, Hong CS. Orchestrating product provenance story: When IOTA ecosystem meets electronics supply chain space. Comput Ind 2020;123:103334. http://dx.doi.org/10.1016/j.compind.2020.103334, https://www.sciencedirect.com/science/article/pii/S0166361520305686.
[35] Suhail S, Hussain R, Jurdak R, Hong CS. Trustworthy digital twins in the industrial internet of things with blockchain. 2020, arXiv:2010.12168.
