
UNIT IV CONSENSUS AND RECOVERY 10

• Consensus and Agreement Algorithms: Problem Definition – Overview of
Results – Agreement in a Failure-Free System (Synchronous and
Asynchronous) – Agreement in Synchronous Systems with Failures;
• Checkpointing and Rollback Recovery: Introduction – Background and
Definitions – Issues in Failure Recovery – Checkpoint-based Recovery
– Coordinated Checkpointing Algorithm – Algorithm for Asynchronous
Checkpointing and Recovery
Consensus and Agreement Algorithms
• Agreement among the processes in a distributed system is a
fundamental requirement for a wide range of applications.
• Many forms of coordination require the processes to exchange
information to negotiate with one another and eventually reach a
common understanding or agreement, before taking application-
specific actions.
• Example: the commit decision in database systems, wherein the
processes collectively decide whether to commit or abort a
transaction that they participate in.
• Here, we design algorithms to reach agreement under various
system models and failure models, and examine some representative
algorithms.
• We first state some assumptions underlying our study of agreement
algorithms:
Failure models
• A faulty process can behave in any manner allowed by
the failure model assumed.
• In the fail-stop model, a process may crash in the middle
of a step, which could be the execution of a local
operation or the processing of a message for a send or receive
event.
• It may send a message to only a subset of the
destination set before crashing.
• In the Byzantine failure model, a process may behave
arbitrarily.
• The choice of the failure model determines the feasibility
and complexity of solving consensus.
Synchronous/asynchronous communication
• If a failure-prone process chooses to send a message to process Pi
but fails, then Pi cannot detect the non-arrival of the message in an
asynchronous system, because this scenario is indistinguishable from
the scenario in which the message takes a very long time in transit.
• In a synchronous system, however, the intended recipient can
recognize, at the end of the round, that a message has not been sent.
• The intended recipient can deal with the non-arrival of the
expected message by assuming the arrival of a message containing
some default data, and then proceeding with the next round of the
algorithm.
Network connectivity
• The system has full logical connectivity, i.e., each process can
communicate with any other by direct message passing.
Sender identification
• A process that receives a message always knows the identity of the
sender process.
• This assumption is important because, even with Byzantine behavior,
although the payload of the message can contain fictitious data
sent by a malicious sender, the underlying network layer protocols can
reveal the true identity of the sender process.
• When multiple messages are expected from the same sender in a single
round, we implicitly assume a scheduling algorithm that sends these
messages in sub-rounds, so that each message sent within the round
can be uniquely identified.
Channel reliability
• The channels are reliable, and only the processes may fail (under one of
various failure models).
• Even with this simplifying assumption, the agreement problem is either
unsolvable, or solvable only in a complex manner.
Authenticated vs. non-authenticated messages
• We will be dealing only with unauthenticated messages.
• With unauthenticated messages, when a faulty process relays a
message to other processes,
• (i) it can forge the message and claim that it was received from
another process,
• (ii) it can also tamper with the contents of a received
message before relaying it.
• When a process receives a message, it has no way to verify its
authenticity.
• An unauthenticated message is also called an oral message or an
unsigned message.
• Using authentication via techniques such as digital signatures, it is
easier to solve the agreement problem because, if some process
forges a message or tampers with the contents of a
received message before relaying it, the recipient can detect the
forgery or tampering.
• Thus, faulty processes can inflict less damage.
Agreement variable
• The agreement variable may be boolean or multivalued,
and need not be an integer.
• When studying some of the more complex algorithms, we
will use a boolean variable.
• This simplifying assumption does not affect the results
for other data types, but helps in the abstraction while
presenting the algorithms.
Example: Byzantium war
• Consider the difficulty of reaching agreement using the following
example, that is inspired by the long wars fought by the Byzantine
Empire in the Middle Ages.
• Four camps of the attacking army, each commanded by a general, are
camped around the fort of Byzantium.
• They can succeed in attacking only if they attack simultaneously.
Hence, they need to reach agreement on the time of attack.
• The only way they can communicate is to send messengers among
themselves.
• The messengers model the messages.
• An asynchronous system is modeled by messengers taking an
unbounded time to travel between two camps.
• A lost message is modeled by a messenger being captured by the
enemy.
• A Byzantine process is modeled by a general being a traitor.
• The traitor will attempt to subvert the agreement-reaching mechanism,
by giving misleading information to the other generals.
• For example, a traitor may inform one general to attack at 10 a.m.,
and inform the other generals to attack at noon.
• Or he may not send a message at all to some general.
• Likewise, he may tamper with the messages he
gets from other generals before relaying those
messages.
• A simple example of Byzantine behavior is shown in Figure 14.1.
• Four generals are shown, and a consensus decision is to be reached
about a boolean value.
• The various generals are conveying potentially misleading values of
the decision variable to the other generals, which results in
confusion.
• The challenge is to determine whether it is possible to reach
agreement and, if so, under what conditions.
• If agreement is reachable, then protocols to reach it need to be
devised.
Assumptions

System assumptions
• Failure models
• Synchronous/asynchronous communication
• Network connectivity
• Sender identification
• Channel reliability
• Authenticated vs. non-authenticated messages
• Agreement variable

[Figure 14.1: Byzantine generals G1, G2, G3 and G4 sending conflicting 0/1 values to one another.]

A. Kshemkalyani and M. Singhal (Distributed Computing), Consensus and Agreement, CUP 2008
Problem Specifications
Byzantine Agreement (single source has an initial value)
Agreement: All non-faulty processes must agree on the same value.
Validity: If the source process is non-faulty, then the agreed upon value by all the
non-faulty processes must be the same as the initial value of the source.
Termination: Each non-faulty process must eventually decide on a value.
Consensus Problem (all processes have an initial value)
Agreement: All non-faulty processes must agree on the same (single) value.
Validity: If all the non-faulty processes have the same initial value, then the agreed upon
value by all the non-faulty processes must be that same value.
Termination: Each non-faulty process must eventually decide on a value.
Interactive Consistency (all processes have an initial value)
Agreement: All non-faulty processes must agree on the same array of values A[v1 . . . vn ].
Validity: If process i is non-faulty and its initial value is vi , then all non-faulty processes
agree on vi as the i th element of the array A. If process j is faulty, then the
non-faulty processes can agree on any value for A[j].
Termination: Each non-faulty process must eventually decide on the array A.
These problems are equivalent to one another! Show using reductions.
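
The three conditions of the Consensus Problem can be checked mechanically on a finished run. The following is a minimal sketch, not part of the original specification; the function name check_consensus and its dictionary-based inputs are assumptions made for illustration.

```python
def check_consensus(initial, decided, faulty):
    """Check the consensus conditions for one finished run.

    initial: dict process id -> initial value
    decided: dict process id -> decided value (absent or None = undecided)
    faulty:  set of process ids assumed to be faulty
    All three arguments are hypothetical inputs chosen for this sketch.
    """
    correct = [p for p in initial if p not in faulty]
    decisions = [decided.get(p) for p in correct]
    made = [d for d in decisions if d is not None]

    # Termination: every non-faulty process has decided.
    termination = len(made) == len(correct)
    # Agreement: the non-faulty processes decided on a single value.
    agreement = len(set(made)) <= 1
    # Validity: only constrains runs where all non-faulty inputs are equal.
    inputs = {initial[p] for p in correct}
    validity = len(inputs) != 1 or all(d in inputs for d in made)

    return {"agreement": agreement, "validity": validity,
            "termination": termination}
```

Only the non-faulty processes are inspected, which mirrors the wording of all three conditions above.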

Overview of Results – attain consensus
• It is worth understanding the relation between the consensus problem
and the problem of attaining common knowledge of the agreement
value.
• For the “no failure” case, consensus is attainable.
• Further, in a synchronous system, common knowledge of the
consensus value is also attainable, whereas in the asynchronous case,
concurrent common knowledge of the consensus value is attainable.
• Consensus is not solvable in asynchronous systems even if only one
process can fail by crashing.
Overview of Results

Failure mode       Synchronous system                    Asynchronous system
                   (message-passing and shared memory)   (message-passing and shared memory)
No failure         agreement attainable;                 agreement attainable;
                   common knowledge also attainable      concurrent common knowledge attainable
Crash failure      agreement attainable;                 agreement not attainable
                   f < n processes;
                   Ω(f + 1) rounds
Byzantine failure  agreement attainable;                 agreement not attainable
                   f ≤ ⌊(n − 1)/3⌋ Byzantine processes;
                   Ω(f + 1) rounds

Table: Overview of results on agreement. f denotes the number of failure-prone processes;
n is the total number of processes.
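
The synchronous Byzantine row of the table can be read as a simple feasibility test. The sketch below only encodes the two bounds stated in the table; the function names are assumptions introduced here.

```python
def byzantine_agreement_feasible(n, f):
    """True if f Byzantine processes among n satisfy f <= floor((n - 1) / 3)."""
    return f <= (n - 1) // 3


def min_rounds(f):
    """Lower bound on the number of rounds from the table: f + 1."""
    return f + 1
```

For example, n = 3 with f = 1 fails the test while n = 4 with f = 1 passes it, matching the two cases discussed later for synchronous systems.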

In a failure-free system, consensus can be attained in a straightforward manner.

Some Solvable Variants of the Consensus Problem in Async Systems

Solvable variant    Failure model and overhead        Definition
Reliable broadcast  crash failures, n > f (MP)        Validity, Agreement, Integrity conditions
k-set consensus     crash failures, f < k < n         size of the set of values agreed
                    (MP and SM)                       upon must be at most k
ε-agreement         crash failures, n ≥ 5f + 1 (MP)   values agreed upon are within ε of each other
Renaming            up to f fail-stop processes,      select a unique name from a set of names
                    n ≥ 2f + 1 (MP);
                    crash failures, f ≤ n − 1 (SM)

Table: Some solvable variants of the agreement problem in asynchronous systems.
The overhead bounds are for the given algorithms, and not necessarily tight bounds for
the problem. (MP = message-passing, SM = shared memory.)

Solvable Variants of the Consensus Problem in Async Systems

[Figure: Circumventing the impossibility results for consensus in asynchronous systems.]
Message-passing:
• k-set consensus
• epsilon-consensus
• Renaming
• Reliable broadcast
Shared memory:
• k-set consensus, epsilon-consensus and Renaming, using atomic registers
and atomic snapshot objects constructed from atomic registers
• Consensus, using more powerful objects than atomic registers. This is
the study of universal objects and universal constructions.

Agreement in a failure-free system (synchronous or
asynchronous)
• In a failure-free system, consensus can be reached by
collecting information from the different processes, arriving
at a “decision,” and distributing this decision in the system.
• A distributed mechanism would have each process
broadcast its values to others, and each process computes
the same function on the values received.
• The decision can be reached by using an application-specific
function, some simple examples being the majority, max,
and min functions.
• Algorithms to collect the initial values and then distribute
the decision may be based on token circulation on a
logical ring, on the three-phase tree-based
broadcast-convergecast-broadcast, or on direct communication
with all nodes.
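
The broadcast-and-apply-the-same-function idea can be sketched as follows. This is an illustrative single-machine simulation under the failure-free assumption, not a networked implementation; the names failure_free_consensus and majority, and the tie-breaking rule inside majority, are assumptions made here.

```python
from collections import Counter


def majority(values):
    """Most frequent value; ties are broken by the smallest value (an assumption)."""
    counts = Counter(values)
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)


def failure_free_consensus(initial_values, decide=majority):
    """initial_values: dict process id -> initial value.

    With reliable channels and no failures, every process receives the
    same collection of values, so applying the same deterministic
    function everywhere yields the same decision everywhere.
    """
    received = {p: list(initial_values.values()) for p in initial_values}
    return {p: decide(vals) for p, vals in received.items()}
```

Passing decide=min or decide=max swaps in the other application-specific functions mentioned above.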
• In a synchronous system, this can be done simply in a
constant number of rounds (depending on the specific logical
topology and algorithm used).
• Further, common knowledge of the decision value can be
obtained using an additional round
• In an asynchronous system, consensus can similarly be
reached in a constant number of message hops.
• Concurrent common knowledge of the consensus value can
also be attained, using any of the algorithms.
• Reaching agreement is straightforward in a failure-free
system.
• Hence, we focus on failure-prone systems
Consensus Algorithm for Crash Failures (MP, synchronous)
Up to f (< n) crash failures possible.
In f + 1 rounds, at least one round has no failures.
Now justify: agreement, validity, termination conditions are satisfied.
Complexity: O((f + 1) · n²) messages.
f + 1 is a lower bound on the number of rounds.

(global constants)
integer: f; // maximum number of crash failures tolerated
(local variables)
integer: x ← local value;

(1) Process Pi (1 ≤ i ≤ n) executes the consensus algorithm for up to f crash failures:
(1a) for round from 1 to f + 1 do
(1b)     if the current value of x has not been broadcast then
(1c)         broadcast(x);
(1d)     yj ← value (if any) received from process j in this round;
(1e)     x ← min(x, yj);
(1f) output x as the consensus value.
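
The round structure above can be simulated on one machine to see why f + 1 rounds are needed. This is a sketch under stated assumptions: processes are numbered from 0, and the crashes argument (a hypothetical way of scripting failures) names the round in which a process crashes and the subset of recipients it still reaches in that round.

```python
def crash_consensus(values, f, crashes=None):
    """Simulate the (f + 1)-round min-based algorithm.

    values:  list of initial values, index = process id
    crashes: dict pid -> (round, recipients); pid crashes during `round`
             after reaching only `recipients` and is silent afterwards
    Returns the outputs of the processes that did not crash.
    """
    crashes = crashes or {}
    n = len(values)
    x = list(values)
    last_sent = [None] * n          # value each process broadcast most recently
    crashed = set()
    for rnd in range(1, f + 2):     # rounds 1 .. f + 1
        inbox = [[] for _ in range(n)]
        for p in range(n):
            if p in crashed:
                continue
            crashing = p in crashes and crashes[p][0] == rnd
            targets = set(crashes[p][1]) if crashing else set(range(n)) - {p}
            if x[p] != last_sent[p]:            # step (1b)
                for q in targets:               # step (1c), possibly partial
                    inbox[q].append(x[p])
                last_sent[p] = x[p]
            if crashing:
                crashed.add(p)
        for q in range(n):                      # steps (1d) and (1e)
            if q not in crashed and inbox[q]:
                x[q] = min(x[q], min(inbox[q]))
    return {p: x[p] for p in range(n) if p not in crashed}
```

With values [0, 1, 1, 1], f = 1, and process 0 crashing in round 1 after reaching only process 1, the value 0 is known only to process 1 after the first round; the second round is what spreads it to the remaining processes.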
Upper Bound on Byzantine Processes (sync)
Agreement impossible when f = 1, n = 3.

[Figure: two scenarios, (a) and (b), each with commander Pc and processes Pa and Pb exchanging 0/1 values. Legend: malicious process, correct process, first round message, second round message.]

• Taking a simple majority decision does not help because the loyal process Pa
cannot distinguish between the possible scenarios (a) and (b);
• hence it does not know which action to take.
• Proof using induction that the problem is solvable if f ≤ ⌊(n − 1)/3⌋.



Consensus Solvable when f = 1, n = 4

[Figure: two scenarios, (a) and (b), each with commander Pc and processes Pa, Pb and Pd exchanging 0/1 values. Legend: first round exchange, second round exchange, malicious process, correct process.]

There is no ambiguity at any loyal commander when taking the majority decision.
The majority decision is over the 2nd round messages and the 1st round message
received directly from the commander-in-chief process.



Byzantine Generals (recursive formulation), (sync, msg-passing)
(variables)
boolean: v ← initial value;
integer: f ← maximum number of malicious processes, ≤ ⌊(n − 1)/3⌋;
(message type)
OM(v, Dests, List, faulty), where
v is a boolean,
Dests is a set of destination process ids to which the message is sent,
List is a list of process ids traversed by this message, ordered from most
recent to earliest,
faulty is an integer indicating the number of malicious processes to be
tolerated.

Oral_Msg(f), where f > 0:

1 The algorithm is initiated by the commander, who sends his source value v to all other processes using a OM(v, N, ⟨i⟩, f) message. The commander returns his own value v and terminates.
2 [Recursion unfolding:] For each message of the form OM(vj, Dests, List, f′) received in this round from some process j, the process i uses the value vj it receives from the source and, using that value, acts as a new source. (If no value is received, a default value is assumed.)
To act as a new source, the process i initiates Oral_Msg(f′ − 1), wherein it sends
OM(vj, Dests − {i}, concat(⟨i⟩, L), (f′ − 1))
to destinations not in concat(⟨i⟩, L) in the next round.
3 [Recursion folding:] For each message of the form OM(vj, Dests, List, f′) received in Step 2, each process i has computed the agreement value vk, for each k not in List and k ≠ i, corresponding to the value received from Pk after traversing the nodes in List, at one level lower in the recursion. If it receives no value in this round, it uses a default value. Process Pi then uses the value majority over all k ∉ List, k ≠ i of (vj, vk) as the agreement value and returns it to the next higher level in the recursive invocation.

Oral_Msg(0):

1 [Recursion unfolding:] Process acts as a source and sends its value to each other process.
2 [Recursion folding:] Each process uses the value it receives from the other sources, and uses that value as the agreement value. If no value
is received, a default value is assumed.
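
The recursion unfolding and folding steps can be illustrated with a small single-machine simulation. This is a sketch, not the message-level algorithm: it omits the Dests and List bookkeeping, and the traitor behaviour in send (sending dst % 2 regardless of the real value) is a hypothetical choice made only so that the example is deterministic.

```python
from collections import Counter


def majority(values, default=0):
    """Most frequent value, or `default` when the top two counts tie."""
    ranked = Counter(values).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]


def send(src, dst, value, traitors):
    # Assumed traitor behaviour: ignore the real value and send dst % 2,
    # so that different receivers may see different values.
    return dst % 2 if src in traitors else value


def oral_msg(commander, lieutenants, value, m, traitors):
    """Simulate Oral_Msg(m).

    Returns {lieutenant: value that lieutenant attributes to the commander}.
    """
    received = {p: send(commander, p, value, traitors) for p in lieutenants}
    if m == 0:
        return received
    # Unfolding: each lieutenant j acts as a new source in Oral_Msg(m - 1).
    relayed = {
        j: oral_msg(j, [p for p in lieutenants if p != j],
                    received[j], m - 1, traitors)
        for j in lieutenants
    }
    # Folding: majority over the direct value and the relayed values.
    return {
        i: majority([received[i]] +
                    [relayed[j][i] for j in lieutenants if j != i])
        for i in lieutenants
    }
```

Calling oral_msg(0, [1, 2, 3], 1, 1, traitors={3}) models n = 4, f = 1 with a loyal commander; calling it with traitors={0} models a disloyal commander. In both cases the loyal lieutenants end up with the same value under the assumed traitor behaviour.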



Relationship between # Messages and Rounds

round    a message has     aims to tolerate      and each message   total number of
number   already visited   these many failures   gets sent to       messages in round
1        1                 f                     n − 1              n − 1
2        2                 f − 1                 n − 2              (n − 1) · (n − 2)
...      ...               ...                   ...                ...
x        x                 (f + 1) − x           n − x              (n − 1)(n − 2) . . . (n − x)
x + 1    x + 1             (f + 1) − x − 1       n − x − 1          (n − 1)(n − 2) . . . (n − x − 1)
f + 1    f + 1             0                     n − f − 1          (n − 1)(n − 2) . . . (n − f − 1)

Table: Relationships between messages and rounds in the Oral Messages algorithm for
Byzantine agreement.

Complexity: f + 1 rounds, exponential amount of space, and
(n − 1) + (n − 1)(n − 2) + . . . + (n − 1)(n − 2) . . . (n − f − 1) messages.
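
The message-count expression can be evaluated directly to get a feel for the exponential growth. A minimal sketch follows; the function name is an assumption, and the per-round product is the one given in the last column of the table.

```python
from math import prod


def oral_messages_per_round(n, f):
    """Messages in rounds 1 .. f + 1: round x carries (n-1)(n-2)...(n-x)."""
    return [prod(n - k for k in range(1, x + 1)) for x in range(1, f + 2)]
```

For n = 10 and f = 3 this gives 9, 72, 504 and 3024 messages in rounds 1 to 4, for a total of 3609.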



Byzantine Generals (iterative formulation), Sync, Msg-passing
(variables)
boolean: v ← initial value;
integer: f ← maximum number of malicious processes, ≤ ⌊(n − 1)/3⌋;
tree of boolean:
level 0 root is $v_{init}^{L}$, where L = ⟨⟩;
level h (f ≥ h > 0) nodes: for each $v_j^{L}$ at level h − 1 = sizeof(L), its n − 2 − sizeof(L) descendants at level h are $v_k^{concat(\langle j \rangle, L)}$, ∀k
such that k ≠ j, i and k is not a member of list L.
(message type)
OM(v, Dests, List, faulty), where the parameters are as in the recursive
formulation.

(1) Initiator (i.e., Commander) initiates Oral Byzantine agreement:
(1a) send OM(v, N − {i}, ⟨Pi⟩, f) to N − {i};
(1b) return(v).

(2) (Non-initiator, i.e., Lieutenant) receives Oral Message OM:
(2a) for rnd = 0 to f do
(2b)     for each message OM that arrives in this round, do
(2c)         receive OM(v, Dests, L = ⟨$P_{k_1}$ . . . $P_{k_{f+1-faulty}}$⟩, faulty) from $P_{k_1}$;
                 // faulty + rnd = f; |Dests| + sizeof(L) = n
(2d)         $v_{head(L)}^{tail(L)}$ ← v; // sizeof(L) + faulty = f + 1. fill in estimate.
(2e)         send OM(v, Dests − {i}, ⟨$P_i$, $P_{k_1}$ . . . $P_{k_{f+1-faulty}}$⟩, faulty − 1) to Dests − {i} if rnd < f;
(2f) for level = f − 1 down to 0 do
(2g)     for each of the 1 · (n − 2) · . . . · (n − (level + 1)) nodes $v_x^{L}$ in level level, do
(2h)         $v_x^{L}$ (x ≠ i, x ∉ L) = $majority_{y \notin concat(\langle x \rangle, L);\, y \neq i}(v_x^{L}, v_y^{concat(\langle x \rangle, L)})$;



Tree Data Structure for Agreement Problem (Byzantine Generals)

[Figure: Some branches of the tree at P3. Level 0 holds $v_0^{\langle\rangle}$ (entered after round 1); level 1 holds $v_x^{\langle 0 \rangle}$ for x ∈ {1, 2, 4, 5, 6, 7, 8, 9} (round 2); level 2 shows $v_x^{\langle 5,0 \rangle}$ for x ∈ {1, 2, 4, 6, 7, 8, 9} (round 3); level 3 shows $v_x^{\langle 7,5,0 \rangle}$ for x ∈ {1, 2, 4, 6, 8, 9} (round 4).]
In this example, n = 10, f = 3, commander is P0.
(round 1) P0 sends its value to all other processes using Oral_Msg(3), including to P3.
(round 2) P3 sends 8 messages to others (excl. P0 and P3) using Oral_Msg(2). P3 also
receives 8 messages.
(round 3) P3 sends 8 × 7 = 56 messages to all others using Oral_Msg(1); P3 also receives
56 messages.
(round 4) P3 sends 56 × 6 = 336 messages to all others using Oral_Msg(0); P3 also
receives 336 messages. The received values are used as estimates of the majority function
at this level of recursion.
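Exponential Algorithm: An example

An example of the majority computation is as follows.

P3 revises its estimate of $v_7^{\langle 5,0 \rangle}$ by taking
majority($v_7^{\langle 5,0 \rangle}$, $v_1^{\langle 7,5,0 \rangle}$, $v_2^{\langle 7,5,0 \rangle}$, $v_4^{\langle 7,5,0 \rangle}$, $v_6^{\langle 7,5,0 \rangle}$, $v_8^{\langle 7,5,0 \rangle}$, $v_9^{\langle 7,5,0 \rangle}$).
Similarly for the other nodes at level 2 of the tree.

P3 revises its estimate of $v_5^{\langle 0 \rangle}$ by taking
majority($v_5^{\langle 0 \rangle}$, $v_1^{\langle 5,0 \rangle}$, $v_2^{\langle 5,0 \rangle}$, $v_4^{\langle 5,0 \rangle}$, $v_6^{\langle 5,0 \rangle}$, $v_7^{\langle 5,0 \rangle}$, $v_8^{\langle 5,0 \rangle}$, $v_9^{\langle 5,0 \rangle}$).
Similarly for the other nodes at level 1 of the tree.

P3 revises its estimate of $v_0^{\langle\rangle}$ by taking
majority($v_0^{\langle\rangle}$, $v_1^{\langle 0 \rangle}$, $v_2^{\langle 0 \rangle}$, $v_4^{\langle 0 \rangle}$, $v_5^{\langle 0 \rangle}$, $v_6^{\langle 0 \rangle}$, $v_7^{\langle 0 \rangle}$, $v_8^{\langle 0 \rangle}$, $v_9^{\langle 0 \rangle}$).
This is the consensus value.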



Impact of a Loyal and of a Disloyal Commander

[Figure: Effects of a loyal or a disloyal commander in a system with n = 14 and f = 4. The
subsystems that need to tolerate k and k − 1 traitors are shown for two cases.
(a) Loyal commander. (b) No assumptions about commander. Legend: correct process, malicious process.]

(a) The commander who invokes Oral_Msg(x) is loyal, so all the loyal
processes have the same estimate. Although the subsystem of 3x processes has
x malicious processes, all the loyal processes have the same view to begin with.
Even if this case repeats for each nested invocation of Oral_Msg, even after x rounds,
among the processes, the loyal processes are in a simple majority, so the majority
function works in having them maintain the same common view of the loyal
commander’s value.
(b) The commander who invokes Oral_Msg(x) may be malicious and can send
conflicting values to the loyal processes. The subsystem of 3x processes has x − 1
malicious processes, but all the loyal processes do not have the same view to
begin with.
The Phase King Algorithm

Operation
Each phase has a unique "phase king" derived, say, from PID.
Each phase has two rounds:
1 in 1st round, each process sends its estimate to all other processes.
2 in 2nd round, the "phase king" process arrives at an estimate based on the
values it received in 1st round, and broadcasts its new estimate to all
others.

[Figure: message pattern among the processes (labels P0, P1, Pf+1, Pk) over phase 1, phase 2, ..., phase f + 1.]



The Phase King Algorithm: Code
(variables)
boolean: v ← initial value;
integer: f ← maximum number of malicious processes, f < ⌈n/4⌉;

(1) Each process executes the following f + 1 phases, where f < n/4:
(1a) for phase = 1 to f + 1 do
(1b)     Execute the following Round 1 actions: // actions in round one of each phase
(1c)         broadcast v to all processes;
(1d)         await value vj from each process Pj;
(1e)         majority ← the value among the vj that occurs > n/2 times (default if no maj.);
(1f)         mult ← number of times that majority occurs;
(1g)     Execute the following Round 2 actions: // actions in round two of each phase
(1h)         if i = phase then // only the phase leader executes this send step
(1i)             broadcast majority to all processes;
(1j)         receive tiebreaker from Pphase (default value if nothing is received);
(1k)         if mult > n/2 + f then
(1l)             v ← majority;
(1m)         else v ← tiebreaker;
(1n)         if phase = f + 1 then
(1o)             output decision value v.
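
The two rounds of each phase can be simulated on one machine. The following is a sketch under assumptions rather than a faithful distributed implementation: processes are numbered from 0 (so the king of phase k is process k − 1), each process counts its own value in round 1, and a traitor is assumed, hypothetically, to send dst % 2 to process dst.

```python
from collections import Counter


def phase_king(values, f, traitors=frozenset(), default=0):
    """values: list of initial boolean estimates (0/1), index = process id.

    Returns the decision of every non-traitor after f + 1 phases.
    """
    n = len(values)
    if not 4 * f < n:
        raise ValueError("phase king needs f < n/4")
    v = list(values)

    def send(src, dst, honest_value):
        # Assumed (hypothetical) traitor behaviour: send dst % 2 to dst.
        return dst % 2 if src in traitors else honest_value

    for phase in range(1, f + 2):
        king = phase - 1
        majority, mult = [default] * n, [0] * n
        for i in range(n):          # round 1: steps (1c) to (1f)
            counts = Counter(send(j, i, v[j]) for j in range(n))
            value, count = counts.most_common(1)[0]
            if count > n / 2:
                majority[i], mult[i] = value, count
            else:
                majority[i], mult[i] = default, counts[default]
        for i in range(n):          # round 2: steps (1h) to (1m)
            tiebreaker = send(king, i, majority[king])
            v[i] = majority[i] if mult[i] > n / 2 + f else tiebreaker
    return {i: v[i] for i in range(n) if i not in traitors}
```

With n = 5, f = 1, initial values [0, 1, 0, 1, 1] and process 0 (the king of phase 1) as the traitor, the loyal processes disagree after phase 1 but all adopt the value of the loyal king in phase 2.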



The Phase King Algorithm

(f + 1) phases, (f + 1)[(n − 1)(n + 1)] messages, and can tolerate up
to f < ⌈n/4⌉ malicious processes.

Correctness Argument
1 Among f + 1 phases, there is at least one phase k where the phase-king is non-malicious.
2 In phase k, all non-malicious processes Pi and Pj will have the same estimate of the
consensus value as Pk does. There are three cases:
    1 Pi and Pj use their own majority values (Hint: ⟹ Pi’s mult > n/2 + f)
    2 Pi uses its majority value; Pj uses the phase-king’s tie-breaker value. (Hint: Pi’s
    mult > n/2 + f, Pj’s mult > n/2 for the same value)
    3 Pi and Pj use the phase-king’s tie-breaker value. (Hint: In the phase in
    which Pk is non-malicious, it sends the same value to Pi and Pj)
In all 3 cases, argue that Pi and Pj end up with the same value as estimate.
3 If all non-malicious processes have the value x at the start of a phase,
they will continue to have x as the consensus value at the end of
the phase.
