
Fault Tolerance

Part II
Distributed Systems
Erkay Savas
Sabanci University

Reliable Group Communication
• Reliable multicasting:
– A message that is sent to a process group should be
delivered to each member of the group.
• Assumptions for simplicity:
– An agreement exists on who is a member of the group
– Processes do not fail
– Processes do not join or leave the group while
communication is going on.
• What is reliable multicasting then when these
assumptions do not hold?
– A message that is sent to a process group should be
delivered to each current non-faulty member of the
group.
Basic Reliable-Multicasting Schemes
[Figure: (a) The sender multicasts M25; three receivers have Last = 24 and one receiver, which missed message 24, has Last = 23. (b) The three up-to-date receivers return ACK 25; the fourth reports that it missed message 24.]
• A simple solution to reliable multicasting when all
receivers are known and are assumed not to fail
Scalability in Reliable Multicasting
• Problem 1:
– The sender is flooded with ACK messages when there
are too many receivers (feedback implosion)
• Solution:
– Receivers return only a negative ACK (NACK) when they notice
that they missed a message
• Problem 2:
– When only negative ACKs are returned, the sender has to
keep each message in its history buffer forever (or at
least for a long time)
• Solution:
– Use an expiration time on messages in the history buffer
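The NACK-plus-expiration idea above can be sketched as follows. This is a minimal illustration, not a real protocol implementation: the class names `Sender` and `Receiver`, the `ttl` parameter and the explicit `now` argument are assumptions made for the example.

```python
class Sender:
    """Keeps sent messages in a history buffer for possible retransmission."""

    def __init__(self, ttl):
        self.ttl = ttl          # assumed expiration time for buffered messages
        self.next_seq = 0
        self.history = {}       # seq -> (payload, time_sent)

    def send(self, payload, now):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = (payload, now)
        return seq, payload

    def expire(self, now):
        # Drop every message that has been buffered longer than ttl.
        self.history = {s: (p, t) for s, (p, t) in self.history.items()
                        if now - t <= self.ttl}

    def handle_nack(self, seq, now):
        """Return (seq, payload) for retransmission, or None if it expired."""
        self.expire(now)
        entry = self.history.get(seq)
        return None if entry is None else (seq, entry[0])


class Receiver:
    """Delivers messages in sequence order and reports gaps as NACKs."""

    def __init__(self):
        self.last = -1          # highest sequence number delivered so far
        self.pending = {}       # out-of-order messages held back
        self.delivered = []

    def receive(self, seq, payload):
        """Return the list of sequence numbers to NACK (empty if none)."""
        if seq <= self.last:
            return []           # duplicate, ignore
        self.pending[seq] = payload
        while self.last + 1 in self.pending:
            self.last += 1
            self.delivered.append(self.pending.pop(self.last))
        if not self.pending:
            return []
        return [s for s in range(self.last + 1, max(self.pending))
                if s not in self.pending]
```

No ACK is ever sent here: the sender hears from a receiver only when a gap is detected, and a NACK for an expired message simply cannot be served.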
Nonhierarchical Feedback Control
• Feedback suppression: the goal is to reduce the number of
feedback messages returned to the sender (SRM protocol)
• A process that notices a missing message multicasts its feedback
(a retransmission request) to the group after waiting for a random
amount of time
[Figure: a sender and four receivers with random timers T=3, T=5, T=1 and T=4. The receiver whose timer expires first multicasts its NACK; the other receivers suppress their feedback.]
• Several receivers have scheduled a request for
retransmission, but the first retransmission request leads
to the suppression of others.
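A small sketch of the suppression rule, using the timer values from the figure. It assumes that a multicast NACK reaches all other receivers instantly, which a real network does not guarantee; the function name `run_round` and the receiver names are made up for the example.

```python
import random


def run_round(timers):
    """timers: dict receiver -> delay before it would multicast its NACK.

    Returns (senders, suppressed): who actually sent a NACK and who
    cancelled its own request after seeing someone else's.
    """
    senders, suppressed = [], []
    nack_seen = False
    # Process receivers in the order in which their timers expire.
    for receiver, _delay in sorted(timers.items(), key=lambda kv: kv[1]):
        if nack_seen:
            suppressed.append(receiver)   # someone already asked: stay quiet
        else:
            senders.append(receiver)      # first timer to expire sends the NACK
            nack_seen = True
    return senders, suppressed


def random_timers(receivers, max_delay, rng=random):
    """Each receiver independently picks a random waiting time."""
    return {r: rng.uniform(0, max_delay) for r in receivers}
```

With the timers from the figure, `run_round({"R1": 3, "R2": 5, "R3": 1, "R4": 4})` lets only `R3` send feedback.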
Hierarchical Feedback Control (1)
• Essence:
– Organize processes into subgroups and appoint a local
coordinator to each subgroup
– For simplicity, assume only one sender
– Set up a tree where the subgroup of the sender process is
the root node.
– The local coordinator is responsible for handling
retransmission requests of receivers within its subgroup
– The local coordinator keeps a history buffer
– If the local coordinator itself misses a message it asks
the coordinator of its parent subgroup to retransmit
the message
Hierarchical Feedback Control (2)




• The essence of hierarchical reliable multicasting.
Atomic Multicast
• Goal: To achieve reliable multicasting in the
presence of process failures
– Guarantees that a message is delivered to either all
processes or to none at all.
– All messages must be delivered in the same order to all
processes
– Some processes in the group may crash
– In order to achieve reliable atomic multicasting, all the
nonfaulty members must agree on the group
membership; e.g. a crashed process is no longer a
group member
– When the process recovers, it is forced to join the
group again.
– Joining the group requires that the state of the
process be brought up to date.
Receiving vs. Delivering Messages
• The logical organization of a distributed system to
distinguish between message receipt and message
delivery
[Figure: layers from top to bottom.
Application: message is delivered to the application.
Comm. layer: message is received by the communication layer and
buffered there until it can be delivered to the application.
Local OS: message comes in from the network.]

Message Ordering (1)
• Four different orderings in multicast are
1. Unordered (reliable) multicast
2. FIFO-ordered multicast
3. Causally-ordered multicast
4. Totally-ordered multicast
Process P1    Process P2     Process P3
sends m1      receives m1    receives m2
sends m2      receives m2    receives m1

• Three communicating processes in the same group. The
ordering of events per process is shown along the vertical
axis.
Message Ordering (2)

Process P1    Process P2     Process P3     Process P4
sends m1      receives m1    receives m3    sends m3
sends m2      receives m3    receives m1    sends m4
              receives m2    receives m2
              receives m4    receives m4

• Four processes in the same group with two different
senders, and a possible delivery order of messages
under FIFO-ordered multicasting
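The table can be checked mechanically. The sketch below tests whether a delivery order respects FIFO ordering, i.e. whether messages from the same sender are delivered in the order they were sent. The function name `respects_fifo` and the `sent_by` dictionary are names chosen for this example.

```python
def respects_fifo(delivery, sent_by):
    """delivery: list of message ids in the order one process delivered them.
    sent_by: dict sender -> list of its message ids in sending order.
    """
    for _sender, msgs in sent_by.items():
        # Positions, in the delivery order, of this sender's messages.
        positions = [delivery.index(m) for m in msgs if m in delivery]
        if positions != sorted(positions):
            return False
    return True


sent_by = {"P1": ["m1", "m2"], "P4": ["m3", "m4"]}
p2_order = ["m1", "m3", "m2", "m4"]
p3_order = ["m3", "m1", "m2", "m4"]
```

Both `p2_order` and `p3_order` pass the check even though they differ from each other, which is exactly why FIFO ordering alone does not give totally-ordered delivery.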

Message Ordering (3)
• Six different versions of reliable multicasting.
Multicast                  Basic message ordering     Total-ordered delivery?
Reliable multicast         None                       No
FIFO multicast             FIFO-ordered delivery      No
Causal multicast           Causal-ordered delivery    No
Atomic multicast           None                       Yes
FIFO atomic multicast      FIFO-ordered delivery      Yes
Causal atomic multicast    Causal-ordered delivery    Yes
Virtual Synchrony (1)
• Group view: the list of processes to which a multicast
message is delivered (the delivery list); denoted as G
• Each process on that list should have the same
group view.
• A view change vc may occur (e.g. a process joins
or leaves the group) during transmission of
message m
• The message m must be delivered to each
nonfaulty process in G before the view change
comes into effect.
• Otherwise, the message m must not be delivered
at all.
Virtual Synchrony (2)
• For example, a process multicasts a message m to
a group of processes
– Right after that, a process leaves or joins the group
– Another process notices the view change and multicasts a
view change message (vc) to the group
– Any message sent in view G must be delivered to each
correct process before the view change message is
delivered
• A reliable multicast with this property is said to
be virtually synchronous
• In other words, a view change acts as a barrier
across which no multicast can pass
Virtual Synchrony (3)
• A message sent to view G can be delivered only to
processes in G, and is discarded by successive views
[Figure: timelines of processes P1 to P4 with the labels: reliable multicast; P1 joins the group; P3 crashes; P3 rejoins. The view changes from G = {P1, P2, P3, P4} to G = {P1, P2, P4}.]
• The principle of virtually synchronous multicast.

Virtual Synchrony: Examples

[Figure: four example timelines, each showing a view change vc from view G to view G'.]
Implementing Virtual Synchrony (1)
• Isis system (fault-tolerant distributed system)
– Reliable point-to-point communication facilities exist and
the ordering is assumed to be FIFO
– Can TCP provide reliable FIFO-ordered point-to-point
communication?
• If a message m has been received by all members in
G, m is said to be stable
– Only stable messages are allowed to be delivered.
– Otherwise, it is kept in a buffer in the communication
layer
• Assume
– The current view is Gi and the next view Gi+1 is to be
installed
– Gi and Gi+1 differ by one process (without loss of generality)
Implementing Virtual Synchrony (2)
• For example,
– The process that notices a view change (e.g. a process
crashes or a process joins the group probably after
recovery) sends a view change message to other
nonfaulty processes
– Any other process P notices the view change when it
receives a view change message.
– P first forwards all unstable messages in its buffer to
every process in Gi+1 using reliable point-to-point
communication
– Afterwards, it multicasts a flush message
– After P has received a flush message from every other
process, it can safely install the new view
– It is also possible to elect a coordinator to forward all
unstable messages
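The steps above can be sketched from the point of view of a single process P. This is only an outline under simplifying assumptions: messages are returned as a list instead of being sent over a network, the new view contains only processes that will send a flush, and the class and method names (`Member`, `on_view_change`, `on_flush`) are invented for the example.

```python
class Member:
    def __init__(self, pid, view):
        self.pid = pid
        self.view = set(view)     # current view G_i
        self.unstable = []        # messages not yet known to be received by all
        self.next_view = None     # G_i+1 while a view change is in progress
        self.flushed = set()      # processes whose flush message has arrived

    def on_view_change(self, new_view):
        """Return the messages P sends: unstable messages first, then a flush."""
        self.next_view = set(new_view)
        outgoing = [("FORWARD", m) for m in self.unstable]
        outgoing.append(("FLUSH", self.pid))
        self.flushed.add(self.pid)
        return outgoing

    def on_flush(self, sender):
        """Record a flush; return True once the new view has been installed."""
        self.flushed.add(sender)
        if self.next_view is not None and self.next_view <= self.flushed:
            self.view, self.next_view = self.next_view, None
            self.flushed = set()
            # Assumption: everything forwarded over reliable FIFO channels
            # has now been received by all members, so it counts as stable.
            self.unstable = []
            return True
        return False
```

Because the flush is sent after the forwarded messages on FIFO channels, receiving a flush from a process implies that all of its unstable messages have already arrived.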
Implementing Virtual Synchrony (3)
[Figure: eight processes, numbered 0 to 7, shown in three panels (a) to (c); the legend distinguishes unstable messages from flush messages.]
a) Process 4 notices that process 7 has crashed and sends a view change
message
b) Process 6 sends out all its unstable messages, followed by a flush
message
c) Process 6 installs the new view when it has received a flush
message from everyone else
Distributed Commit
• Essential issue: an operation is performed by every
member of a process group, or by none at all.
– e.g. committing a transaction
• Distributed commit problem
• A coordinator is present to initiate the commit
• One-phase commit
• Two-phase commit
• Three-phase commit

Two-Phase Commit - 2PC (1)
• Consider a distributed transaction involving the
participation of a number of processes each
running on a different machine.
– Phase 1 a: Coordinator sends VOTE_REQUEST to
all participants
– Phase 1 b: When a participant receives VOTE_REQUEST
it returns either VOTE_COMMIT or VOTE_ABORT to the
coordinator
– Phase 2 a: Coordinator collects all votes; if all are
VOTE_COMMIT it sends GLOBAL_COMMIT to all
participants; otherwise it sends GLOBAL_ABORT.
– Phase 2 b: Each participant waits for GLOBAL_COMMIT
or GLOBAL_ABORT and acts accordingly.
2PC (2)
[Figure: two state diagrams starting in state INIT, with transitions labeled Commit, Vote-request, Vote-commit, Vote-abort, Global-commit, Global-abort and ACK.]
a) The finite state machine for the coordinator in 2PC.
b) The finite state machine for a participant.

2PC – Failing Participant (1)
• How does this affect other participants?
• INIT: No problem
• READY: A participant P is waiting for either
GLOBAL_COMMIT or GLOBAL_ABORT. If the coordinator
crashes before its message reaches P, P cannot know
what to do.
1. It may block until the coordinator recovers
2. It can ask another participant Q. The decision depends
on which state Q is in
i. INIT: They can both abort
ii. COMMIT: They can both commit
iii. ABORT: They both abort
iv. READY: Contact another participant. If all the participants
it contacted are in this state, they have to wait until the
coordinator recovers (apparently the coordinator has failed)

2PC – Failing Participant (2)

State of Q    Action by P
COMMIT        Make transition to COMMIT
ABORT         Make transition to ABORT
INIT          Make transition to ABORT
READY         Contact another participant

• Actions taken by a participant P when residing in
state READY and having contacted another
participant Q.
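The table translates directly into a small decision function. The sketch below assumes P has collected the states of the participants it managed to contact; the function name `resolve_2pc` and the return value `"BLOCK"` are labels chosen for this example.

```python
def resolve_2pc(contacted_states):
    """Decide what a participant P in state READY does after a timeout.

    contacted_states: states of the participants P was able to contact,
    each one of "INIT", "READY", "COMMIT", "ABORT".
    """
    for state in contacted_states:
        if state == "COMMIT":
            return "COMMIT"     # the coordinator must have decided to commit
        if state in ("ABORT", "INIT"):
            return "ABORT"      # abort decided, or Q never voted: abort is safe
    # Every contacted participant is in READY (or nobody answered):
    # P has to wait for the coordinator to recover.
    return "BLOCK"
```

COMMIT and INIT cannot both show up in the same answer set, because the coordinator only commits after every participant has voted and therefore left INIT.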
2PC - Steps Taken by Coordinator
write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected {
    wait for any incoming vote;
    if timeout {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
        exit;
    }
    record vote;
}
if all participants sent VOTE_COMMIT and coordinator
   votes COMMIT {
    write GLOBAL_COMMIT to local log;
    multicast GLOBAL_COMMIT to all participants;
} else {
    write GLOBAL_ABORT to local log;
    multicast GLOBAL_ABORT to all participants;
}
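A runnable approximation of the coordinator's decision logic is given below. It leaves out the network: the votes arrive as a dictionary, and a value of `None` stands for a participant whose vote did not arrive before the timeout. The function name and parameters are assumptions made for the example.

```python
def coordinator(votes, own_vote="COMMIT"):
    """votes: dict participant -> "VOTE_COMMIT", "VOTE_ABORT", or None (timeout).

    Returns (decision, log), where log mirrors the entries written to the
    coordinator's local log.
    """
    log = ["START_2PC"]
    if any(v is None for v in votes.values()):
        decision = "GLOBAL_ABORT"          # a vote timed out
    elif own_vote == "COMMIT" and all(v == "VOTE_COMMIT" for v in votes.values()):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    log.append(decision)                   # logged before it would be multicast
    return decision, log
```

Writing the decision to the log before multicasting it is what allows a recovering coordinator to resend the same decision.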
2PC - Steps Taken by a Participant
write INIT to local log;
wait for VOTE_REQUEST from coordinator;
if timeout {
    write VOTE_ABORT to local log;
    exit;
}
if participant votes COMMIT {
    write VOTE_COMMIT to local log;
    send VOTE_COMMIT to coordinator;
    wait for DECISION from coordinator;
    if timeout {
        multicast DECISION_REQUEST to other participants;
        wait until DECISION is received; /* remain blocked */
        write DECISION to local log;
    }
    if DECISION == GLOBAL_COMMIT
        write GLOBAL_COMMIT to local log;
    else if DECISION == GLOBAL_ABORT
        write GLOBAL_ABORT to local log;
} else {
    write VOTE_ABORT to local log;
    send VOTE_ABORT to coordinator;
}
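The participant's side can be approximated in the same style. In this sketch a decision of `None` models a timeout, and the `"BLOCKED"` entry is a hypothetical marker for the case where the pseudocode would simply keep waiting; neither is part of the protocol itself.

```python
def participant(got_vote_request, wants_commit,
                coordinator_decision, peer_decision=None):
    """Return the participant's local log for one 2PC round.

    coordinator_decision and peer_decision are "GLOBAL_COMMIT",
    "GLOBAL_ABORT", or None (no answer before the timeout).
    """
    log = ["INIT"]
    if not got_vote_request:          # timed out waiting for VOTE_REQUEST
        log.append("VOTE_ABORT")
        return log
    if not wants_commit:
        log.append("VOTE_ABORT")      # would also be sent to the coordinator
        return log
    log.append("VOTE_COMMIT")         # logged before the vote is sent
    decision = coordinator_decision
    if decision is None:              # coordinator silent: ask other participants
        decision = peer_decision
    if decision is None:
        log.append("BLOCKED")         # hypothetical marker: still waiting
    else:
        log.append(decision)
    return log
```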
2PC - When a Participant is Asked
for a Decision …
actions for handling decision requests:
/* executed by separate thread */
while true {
    wait until any incoming DECISION_REQUEST is received;
    /* remain blocked */
    read most recently recorded STATE from the local log;
    if STATE == GLOBAL_COMMIT
        send GLOBAL_COMMIT to requesting participant;
    else if STATE == INIT or STATE == GLOBAL_ABORT
        send GLOBAL_ABORT to requesting participant;
    else
        skip; /* participant remains blocked */
}

• Steps taken for handling incoming decision requests.

2PC – Wait for the Coordinator to Recover
• All participants need to block until the
coordinator recovers when
– All participants have received and processed the
VOTE_REQUEST from the coordinator (i.e. they are all
in state READY), and the coordinator has crashed in
the meantime.
– In that case, participants cannot cooperatively decide
on the final action to take (COMMIT or ABORT)
• Not all participants may be reachable (perhaps they
have crashed as well), and an uncontacted participant
may be in (or recover to) state INIT, ABORT or COMMIT.
– This is why another protocol is needed to avoid
blocking
Three-Phase Commit – 3PC
• Avoids blocking processes in the presence of fail-stop
crashes
– Phase 1 a: Coordinator sends VOTE_REQUEST to
all participants
– Phase 1 b: When a participant receives VOTE_REQUEST it
returns either VOTE_COMMIT or VOTE_ABORT to
the coordinator
– Phase 2 a: Coordinator collects all votes; if all are
VOTE_COMMIT it sends PREPARE to all participants;
otherwise it sends ABORT
– Phase 2 b: Each participant waits for PREPARE or
ABORT

3PC (2)
– Phase 3 a (prepare to commit):
Coordinator waits until all participants have ACKed
(READY-COMMIT) receipt of the PREPARE message, and
then sends COMMIT to all.
– Phase 3 b (prepare to commit):
Participants wait for COMMIT
• The states of the coordinator and of each participant
satisfy the following two conditions:
1. There is no single state from which it is possible to
make a transition directly to either COMMIT or ABORT
2. There is no state in which it is not possible to make a
final decision, and from which a transition to COMMIT
can be made
3PC (3)
[Figure: two state diagrams with states including INIT and WAIT, and transitions labeled Commit, Vote-request, Vote-commit, Vote-abort, Global-abort, Prepare-commit, Ready-commit, Global-commit and ACK.]
a) Finite state machine for the coordinator in 3PC
b) Finite state machine for a participant
3PC – Failing Participant (1)
• Coordinator blocks
– WAIT: The coordinator sends GLOBAL_ABORT after
a timeout
– PRECOMMIT: On a timeout, it will conclude that one of
the participants crashed (and that participant is known to
have voted COMMIT); it will send GLOBAL_COMMIT to the
remaining participants
• Participant P blocks
– INIT: abort on a timeout
– READY: On a timeout, P contacts Q
• If Q is still in INIT, they can safely abort (since no
other participant can be in state PRECOMMIT)
3PC – Failing Participant (2)
• Participant P blocks (cont.)
– READY: On a timeout, P contacts Q
1. If each of the participants P contacted is in state
READY, the transaction should be aborted (an
uncontacted process may be in INIT). Even if one of the
participants not contacted by P is in state
PRECOMMIT, it can still abort.
2. If all contacted processes are in state
PRECOMMIT, the transaction can safely commit
3. If a contacted process is in state ABORT (or
COMMIT), then P moves to the corresponding
state
• A decision can be taken
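The rules for a participant P that times out in state READY can be collected into one function. This is a sketch of the cases listed above only; a mix of READY and PRECOMMIT answers is not covered by those rules, so the function reports it as `"UNDECIDED"` (a label invented here), as it does when nobody could be contacted.

```python
def resolve_3pc(contacted_states):
    """Decide what participant P (in READY) does after a timeout in 3PC.

    contacted_states: states of the participants P reached, each one of
    "INIT", "READY", "PRECOMMIT", "COMMIT", "ABORT".
    """
    if not contacted_states:
        return "UNDECIDED"
    if "COMMIT" in contacted_states:
        return "COMMIT"          # move to the corresponding state
    if "ABORT" in contacted_states:
        return "ABORT"
    if "INIT" in contacted_states:
        return "ABORT"           # nobody can be in PRECOMMIT yet
    if all(s == "PRECOMMIT" for s in contacted_states):
        return "COMMIT"
    if all(s == "READY" for s in contacted_states):
        return "ABORT"
    return "UNDECIDED"           # mixed READY/PRECOMMIT: not covered above
```

Unlike the 2PC case, the all-READY answer leads to a decision (abort) instead of blocking.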
Recovery
• Once a failure occurs, it is essential that the
failing process be able to recover to a correct
state
• What does recovering to a correct state actually
mean?
• How can the state of a distributed system be
recorded and recovered to?
• Methods
– Check-pointing
– Message logging

Recovery: Background
• Essence: When a failure occurs, we need to bring
the system into an error-free state.
• Backward recovery:
– bring the system from its present erroneous state
back into a previously correct state.
– From time to time, the system state (at least part of
it) must be recorded (check-pointing) in persistent
storage
• Forward recovery:
– Instead of a previous check-pointed state, find a
correct new state from which the system can continue
to execute.
• In practice:
– By and large backward error recovery is widely applied
Forms of Recovery: Example
• Backward recovery:
– Retransmitting a lost message
• Forward recovery:
– Constructing the missing packets from successfully
delivered packets
– (n, k) block erasure codes

• Forward recovery requires that error types be
known in advance so that appropriate recovery
mechanisms can be deployed.
• Backward recovery can be used as a general
mechanism
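As an illustration of forward recovery with an erasure code, the sketch below uses a single XOR parity packet. This is the simplest (k+1, k) code and can repair exactly one lost packet; it is chosen here only because it is short, and the function names are made up for the example.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(packets):
    """k equal-length data packets -> n = k + 1 packets (data plus parity)."""
    parity = reduce(xor_bytes, packets)
    return list(packets) + [parity]


def recover(received):
    """received: the n coded packets, with at most one replaced by None.

    Returns the k original data packets without any retransmission.
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("a single parity packet can repair only one loss")
    received = list(received)
    if missing:
        present = [p for p in received if p is not None]
        # The XOR of all n packets is zero, so the missing packet equals
        # the XOR of the packets that did arrive.
        received[missing[0]] = reduce(xor_bytes, present)
    return received[:-1]
```

The receiver never asks the sender for anything, which is what distinguishes this from the backward-recovery approach of retransmitting a lost message.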
Backward Recovery: Problems
• Restoring a previous state is a costly operation
– Saving system state is not free
• Loop of recovery
– No guarantee that the same (or a similar) failure does
not happen again.
• Rolling back is not always possible
– Think of an ATM mistakenly handing out $1000.
– Imagine a UNIX command like /bin/rm -fr *

Recovery: Stable Storage

[Figure: two drives, drive 1 and drive 2, each holding blocks a to h, shown in three panels.]
a) Stable Storage
b) Crash after drive 1 is updated
c) Bad spot (as a result of general wear and tear)
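The two failure cases in the figure can be sketched with two in-memory "drives". The block layout (a dictionary with data and a CRC) and all function names are assumptions made for this example; only the write order (drive 1 first) and the two repair rules come from the figure.

```python
import zlib


def make_block(data):
    """A block stores its data together with a checksum."""
    return {"data": data, "crc": zlib.crc32(data)}


def is_good(block):
    return zlib.crc32(block["data"]) == block["crc"]


def write(drive1, drive2, index, data, crash_after_first=False):
    drive1[index] = make_block(data)      # drive 1 is always updated first
    if crash_after_first:
        return                            # simulated crash between the writes
    drive2[index] = make_block(data)


def recover(drive1, drive2):
    """Compare the drives block by block and repair differences."""
    for i in range(len(drive1)):
        b1, b2 = drive1[i], drive2[i]
        if is_good(b1) and is_good(b2):
            if b1["data"] != b2["data"]:
                drive2[i] = dict(b1)      # case (b): drive 1 holds the newer value
        elif is_good(b1):
            drive2[i] = dict(b1)          # bad spot on drive 2
        elif is_good(b2):
            drive1[i] = dict(b2)          # case (c): bad spot on drive 1
```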
Checkpointing
• In a fault-tolerant distributed system, backward error
recovery requires that the system regularly save its
(global) state onto stable storage.
• A consistent global state can be captured using a distributed
snapshot algorithm.
• A recovery line corresponds to the most recent
distributed snapshot
[Figure: process timelines from the initial state, showing checkpoints, a consistent cut and an inconsistent cut along the time axis.]
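Whether a set of checkpoints forms a consistent cut can be checked with a few lines of code. The sketch assumes each event has a local timestamp and that a cut is given as one checkpoint time per process; the function name and the tuple layout are assumptions for the example.

```python
def is_consistent(cut, messages):
    """cut: dict process -> local time of the checkpoint taken by that process.
    messages: list of (sender, send_time, receiver, receive_time).

    A cut is inconsistent if it records the receipt of a message whose
    sending is not recorded as well.
    """
    for sender, send_time, receiver, receive_time in messages:
        received_in_cut = receive_time <= cut[receiver]
        sent_in_cut = send_time <= cut[sender]
        if received_in_cut and not sent_in_cut:
            return False
    return True
```

A message that was sent before the cut but received after it is still acceptable: it is simply in transit.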
Independent Checkpointing
• Processes save their local state independently
• Each process rolls back to the most recently saved state
on a crash
• If these local states jointly do not form a consistent cut,
then processes will have to further roll back to another
previous checkpoint.

• The domino effect.

Coordinated Checkpointing
• Essence: All processes synchronize to jointly
write their state to local stable storage.
– Saved state is automatically globally consistent.
• Two-phase blocking protocol:
– A coordinator first multicasts a
CHECKPOINT_REQUEST message to all processes
– A receiving process takes a local checkpoint, stops
sending messages (queues them and blocks), and tells
the coordinator it has taken the checkpoint (ACK).
– When the coordinator has received an ACK from all
processes, it multicasts a CHECKPOINT_DONE message
to allow blocked processes to continue
– Question: What could happen if a process did not stop
sending regular messages after saving its local state?
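A toy version of the two-phase blocking protocol is sketched below. Messages are appended to a shared list that stands in for the network, and the class and method names are invented for the example.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.state = 0            # stand-in for the application state
        self.checkpoint = None
        self.blocked = False
        self.queued = []          # messages held back while blocked

    def send(self, message, network):
        if self.blocked:
            self.queued.append(message)        # queue instead of sending
        else:
            network.append((self.name, message))

    def on_checkpoint_request(self):
        self.checkpoint = self.state           # take the local checkpoint
        self.blocked = True                    # stop sending messages
        return "ACK"

    def on_checkpoint_done(self, network):
        self.blocked = False
        for message in self.queued:            # release held-back messages
            network.append((self.name, message))
        self.queued.clear()


def coordinate(processes, network):
    """Phase 1: request checkpoints and collect ACKs. Phase 2: unblock."""
    acks = [p.on_checkpoint_request() for p in processes]
    if all(ack == "ACK" for ack in acks):
        for p in processes:
            p.on_checkpoint_done(network)
        return True
    return False
```

Because no process sends between taking its checkpoint and receiving CHECKPOINT_DONE, no checkpoint can record the receipt of a message that was sent after the sender's checkpoint.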
Message Logging (1)
• Checkpointing is an expensive operation
– Message logging makes it possible to reduce the number of
checkpoints while still enabling recovery
– Message logging and checkpointing are used together
• Idea:
– If the transmission of messages can be replayed, we
can still reach a globally consistent state.
– A checkpointed state is taken as a starting point.
• Piecewise deterministic model:
– The execution of each process is considered to take
place as a sequence of intervals, where events occur
– Each interval starts with a nondeterministic event (e.g.
receipt of a message)
– Execution in an interval is completely deterministic
Message Logging (2)
• Conclusion (Piecewise deterministic model)
– If we record non-deterministic events (to replay them
later), we obtain a deterministic execution model that
will allow us to do a complete replay
• Problem:
– When should we actually log messages?
• Issue: Avoid orphan processes
– An orphan process is a process that survives the crash of
another process, but whose state is inconsistent with
the crashed process after its recovery
• Goal:
– Devise message logging schemes in which orphans do
not occur.
Orphan Process: Example
• Process Q has received m1 and m2 and subsequently
sent m3 before it crashes.
• Assume that m2 is not logged.
• When Q crashes and subsequently recovers, only m1 is going
to be replayed; m2 is certainly not, and m3 probably is not.
[Figure: message timelines before and after Q crashes and recovers, showing m1, m2 and m3; the legend distinguishes unlogged messages from logged messages.]
• Incorrect replay of messages after recovery, leading to an orphan process
(Question: which one is the orphan process here?).
Message Logging Schemes (1)
• HDR(m): The header of message m containing its source,
destination, sequence number, and delivery number
– The header contains all information for resending a
message and delivering it in the correct order
• A message is stable if it can no longer be lost (e.g. it is
written to stable storage, along with its header).
• DEP(m): the set of processes to which the message m
has been delivered. It includes the processes to which
another message m’, which is causally dependent on m,
has been delivered.
• COPY(m): the set of processes that have a copy of the
message (and its header), but not (yet) in their local
stable storage.
Message Logging Schemes (2)
• The processes in COPY(m) can hand over m. If all
processes in this set crash, the retransmission
of m is not possible.
• Using this notation,
– Process Q is an orphan if there is a message m, such that Q
is contained in DEP(m), while at the same time every
process in COPY(m) has crashed. There is no way to
replay the transmission of m.
• To avoid orphan processes,
– We can enforce that DEP(m) ⊆ COPY(m).
– In other words, whenever a process becomes
dependent on the delivery of m, it will always keep a
copy of m (i.e. the message along with its header)
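The orphan condition can be written down directly in terms of DEP(m) and COPY(m). The sketch below uses generic process names A, B and C, and treats a message listed in `stable` as always replayable; the function name and data layout are assumptions for the example.

```python
def find_orphans(dep, copy, crashed, stable=frozenset()):
    """dep, copy: dict message -> set of processes (DEP(m) and COPY(m)).
    crashed: set of crashed processes.
    stable: messages already written to stable storage.

    Returns the surviving processes that depend on a message which can
    no longer be replayed.
    """
    orphans = set()
    for m, dependents in dep.items():
        if m in stable:
            continue                        # m can always be replayed
        holders = copy.get(m, set())
        if holders <= crashed:              # every process in COPY(m) crashed
            orphans |= dependents - crashed
    return orphans


def no_orphans_possible(dep, copy):
    """The sufficient condition from the slide: DEP(m) is a subset of COPY(m)."""
    return all(dep[m] <= copy.get(m, set()) for m in dep)
```

When `no_orphans_possible` holds, every dependent process also holds a copy, so a survivor that depends on m can itself hand m over and `find_orphans` returns an empty set.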
Message Logging Schemes (3)
• Pessimistic logging protocol
– For each unstable message m, there is at most one process
dependent on m, that is |DEP(m)| ≤ 1.
– In other words, this protocol ensures that each unstable
message m is delivered to at most one process.
– A process P, after receiving m, also becomes a member of
COPY(m)
– P is forced to write m to stable storage before sending a
message to another process
– If P crashes before it logs m there will be no problem since
no other process will be dependent on the delivery of m.
• Optimistic logging protocol
– If each process in COPY(m) has crashed, any orphan
process in DEP(m) is rolled back to a state in which it
no longer belongs to DEP(m).
