Part II
CS403/534
Distributed Systems
Erkay Savas
Sabanci University
Reliable Group Communication
• Reliable multicasting:
– A message that is sent to a process group should be
delivered to each member of the group.
• Assumptions for simplicity:
– An agreement exists on who is a member of the group
– Processes do not fail
– Processes do not join or leave the group while
communication is going on.
• What does reliable multicasting mean when these
assumptions do not hold?
– A message that is sent to a process group should be
delivered to each current non-faulty member of the
group.
Basic Reliable-Multicasting Schemes
• A simple feedback-based scheme:
[Figure: the sender keeps transmitted messages in a history buffer and multicasts message M25; three receivers have delivered everything up to message 24 (Last = 24), while one has Last = 23, so it missed M24 and will request its retransmission.]
• The essence of hierarchical reliable multicasting:
[Figure: the sender S multicasts over the network to local coordinators C, each of which handles retransmissions for the receivers in its own local-area group.]
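The basic feedback-based scheme can be sketched in a few lines of Python (all class and method names are hypothetical): the sender keeps every transmitted message in a history buffer, and a receiver that detects a gap in the sequence numbers asks the sender to retransmit the missing messages.

```python
class Sender:
    def __init__(self):
        self.seq = 0
        self.history = {}          # seq -> message (the history buffer)

    def multicast(self, msg, receivers):
        self.seq += 1
        self.history[self.seq] = msg
        for r in receivers:        # a receiver missing from the list models a lost packet
            r.receive(self.seq, msg, self)

    def retransmit(self, seq, receiver):
        receiver.receive(seq, self.history[seq], self)

class Receiver:
    def __init__(self):
        self.last = 0              # last sequence number delivered
        self.delivered = []

    def receive(self, seq, msg, sender):
        if seq == self.last + 1:   # next expected message: deliver it
            self.last = seq
            self.delivered.append(msg)
        elif seq > self.last + 1:  # gap detected: request the missing messages
            for missing in range(self.last + 1, seq):
                sender.retransmit(missing, self)
            self.receive(seq, msg, sender)
        # seq <= self.last: duplicate, ignore
```

Note that the sender's history buffer grows without bound here; a real implementation would prune messages once every receiver has acknowledged them.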
Atomic Multicast
• Goal: To achieve reliable multicasting in the
presence of process failures
– Guarantees that a message is delivered to either all
processes or to none at all.
– All messages must be delivered in the same order to all
processes
– Some processes in the group may crash
– In order to achieve reliable atomic multicasting, all the
nonfaulty members must have agreed on the group
membership; e.g. the crashed process is no longer a
group member
– When the process recovers, it is forced to join the
group again.
– Joining the group requires that the state of the
process be brought up to date.
Receiving vs. Delivering Messages
• The logical organization of a distributed system to
distinguish between message receipt and message
delivery
[Figure: a message arrives from the network at the local OS, is buffered in the message-handling layer of the communication system, and only later is delivered to the application.]
Message Ordering (1)
• Four different orderings in multicast are
distinguished:
1. Unordered (reliable) multicast
2. FIFO-ordered multicast
3. Causally-ordered multicast
4. Totally-ordered multicast
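FIFO-ordered multicast (case 2 above) can be sketched with a holdback queue; the names below are hypothetical. Each message carries a per-sender sequence number, and the receiver delays any message that arrives ahead of its predecessor from the same sender:

```python
class FifoReceiver:
    def __init__(self):
        self.next_seq = {}     # sender -> next expected sequence number
        self.holdback = {}     # (sender, seq) -> message held back
        self.delivered = []

    def receive(self, sender, seq, msg):
        self.holdback[(sender, seq)] = msg
        expected = self.next_seq.get(sender, 1)
        # Deliver as many consecutive messages from this sender as possible
        while (sender, expected) in self.holdback:
            self.delivered.append(self.holdback.pop((sender, expected)))
            expected += 1
        self.next_seq[sender] = expected
```

Messages from different senders are delivered independently, which is exactly why FIFO ordering is weaker than causal or total ordering.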
Message Ordering (3)
• Six different versions of reliable multicasting.
  Multicast                 Basic Message Ordering     Total-Ordered Delivery?
  -----------------------   ------------------------   -----------------------
  Reliable multicast        None                       No
  FIFO multicast            FIFO-ordered delivery      No
  Causal multicast          Causal-ordered delivery    No
  Atomic multicast          None                       Yes
  FIFO atomic multicast     FIFO-ordered delivery      Yes
  Causal atomic multicast   Causal-ordered delivery    Yes
Virtual Synchrony (1)
• Group view: the list of processes to which a multicast
message should be delivered (delivery list); denoted as G
• Each process on that list should have the same
group view.
• A view change vc may occur (e.g. a process joins
or leaves the group) during transmission of
message m
• The message m must be delivered to each
nonfaulty process in G before the view change
comes into effect.
• Otherwise, the message m must not be delivered
at all.
Virtual Synchrony (2)
• For example, a process multicasts a message m to
a group of processes
– Right after that, a process leaves or joins the group
– Another process notices the change and multicasts a
view-change message (vc) to the group
– Any message sent in view G must be delivered to each
correct process before the view-change message is
delivered
• A reliable multicast with this property is said to
be virtually synchronous
• In other words, a view change acts as a barrier
across which no multicast can pass
Virtual Synchrony (3)
• A message sent to view G can be delivered only to
processes in G, and is discarded by successive views
[Figure: a timeline of four processes. P1 joins the group; later P3 crashes during a reliable multicast and eventually rejoins. The view changes from G = {P1, P2, P3, P4} to G = {P1, P2, P4}; messages multicast in the old view are delivered to its members before the view change takes effect, and P3 must be brought up to date when it rejoins.]
[Figure: cases of a message sent in view G relative to the view-change message vc: a message may be delivered before vc installs the new view G', but no message may be delivered across the view-change barrier into G'.]
Implementing Virtual Synchrony (1)
• Isis system (fault-tolerant distributed system)
– Reliable point-to-point communication facilities exist,
and message ordering is assumed to be FIFO
– Can TCP provide a reliable FIFO ordered point-to-point
communication?
• If a message m has been received by all members in
G, m is said to be stable
– Only stable messages are allowed to be delivered.
– Otherwise, it is kept in a buffer in the communication
layer.
• Assume
– The current view is Gi and the next view Gi+1 is to be
installed
– Gi and Gi+1 differ by at most one process (without loss
of generality)
Implementing Virtual Synchrony (2)
• For example,
– The process that notices a view change (e.g. a process
crashes or a process joins the group probably after
recovery) sends a view change message to other
nonfaulty processes
– Any other process P notices the view change when it
receives a view change message.
– P first forwards all unstable messages in the buffer to
every process in Gi+1 using a reliable point-to-point
communication
– Afterwards, it multicasts a flush message
– After P has received a flush message from every other
process, it can safely install the new view
– It is also possible to elect a coordinator to forward all
unstable messages
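The steps above can be sketched as follows (a simplified model with hypothetical names; "sending" is modelled as direct method calls, and every process in the new view is assumed to be correct):

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.delivered = set()     # messages delivered so far
        self.unstable = set()      # delivered, but not yet known stable
        self.flushes = set()       # peers we have received a flush from
        self.view = None

    def deliver(self, m):
        if m not in self.delivered:
            self.delivered.add(m)
            self.unstable.add(m)

    def on_view_change(self, new_view, peers):
        # 1. Forward all unstable messages to every process in the new
        #    view using reliable point-to-point communication
        for m in self.unstable:
            for p in peers:
                p.deliver(m)       # duplicates are simply ignored
        # 2. Multicast a flush message
        for p in peers:
            p.on_flush(self.pid, new_view, peers)

    def on_flush(self, sender, new_view, peers):
        self.flushes.add(sender)
        others = {p.pid for p in peers} - {self.pid}
        if others <= self.flushes: # flush received from every other process
            self.view = new_view   # safe to install the new view
```

Once a process has installed the new view, every message sent in the old view is guaranteed to have reached every surviving member, which is precisely the virtual-synchrony barrier.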
Implementing Virtual Synchrony (3)
[Figure: three stages of the flush protocol in a group of processes 0-7: (a) process 4 notices that process 7 has crashed and multicasts a view-change message vc; (b) process 6 forwards its unstable messages to the new view and then multicasts a flush message; (c) process 6 installs the new view once it has received a flush message from every other process.]
Two-Phase Commit - 2PC (1)
• Consider a distributed transaction involving the
participation of a number of processes each
running on a different machine.
– Phase 1 a: Coordinator sends VOTE_REQUEST to
participants
– Phase 1 b: When a participant receives VOTE_REQUEST
it returns either VOTE_COMMIT or VOTE_ABORT to the
coordinator.
– Phase 2 a: coordinator collects all votes; if all are
VOTE_COMMIT it sends GLOBAL_COMMIT to all
participants; otherwise it sends GLOBAL_ABORT.
– Phase 2 b: Each participant waits for GLOBAL_COMMIT
or GLOBAL_ABORT and acts accordingly.
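In the failure-free case, the protocol reduces to a vote collection followed by a broadcast decision. A minimal sketch (participants are modelled as vote functions; the names are illustrative):

```python
def two_phase_commit(participants):
    """participants: callables returning 'VOTE_COMMIT' or 'VOTE_ABORT'."""
    # Phase 1a/1b: coordinator requests and collects all votes
    votes = [vote() for vote in participants]
    # Phase 2a: commit only if every participant voted to commit
    if all(v == "VOTE_COMMIT" for v in votes):
        return "GLOBAL_COMMIT"
    return "GLOBAL_ABORT"
    # Phase 2b: each participant receives the decision and acts on it
```

The interesting part of 2PC is of course not this happy path but what happens when the coordinator or a participant crashes, which the following slides examine.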
2PC (2)
[Figure: the 2PC finite-state machines.
 Coordinator: INIT --(Commit / send Vote-request)--> WAIT; WAIT --(Vote-abort / send Global-abort)--> ABORT, or WAIT --(Vote-commit / send Global-commit)--> COMMIT.
 Participant: INIT --(Vote-request / send Vote-commit)--> READY, or INIT --(Vote-request / send Vote-abort)--> ABORT; READY --(Global-commit / send ACK)--> COMMIT, or READY --(Global-abort / send ACK)--> ABORT.]
2PC – Failing Participant (1)
• How does a failed process affect the other participants?
It depends on the state the waiting participant P is in:
• INIT: No problem; P is still waiting for VOTE_REQUEST
and can abort locally after a timeout.
• READY: A participant P is waiting for either
GLOBAL_COMMIT or GLOBAL_ABORT. If the coordinator
crashes before its message reached P, P cannot know
what to do.
1. It may block until the coordinator recovers
2. It can ask another participant Q; the decision depends
on which state Q is in:
i. INIT: they can both abort
ii. COMMIT: They can commit
iii. ABORT: They both abort
iv. READY: Contact another participant. If all the participants
it contacted are in this state, they have to wait until the
coordinator recovers (apparently the coordinator is failing)
2PC – Failing Participant (2)
• Actions taken by a participant P when it learns the state
of another participant Q:

  State of Q   Action by P
  COMMIT       Make transition to COMMIT
  ABORT        Make transition to ABORT
  INIT         Make transition to ABORT
  READY        Contact another participant
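The decision rule for P, given Q's state, is a straightforward lookup (a sketch; the state names follow the slides):

```python
def action_by_p(state_of_q):
    # Q in INIT means Q has not yet voted, so the coordinator cannot
    # have decided to commit; aborting is therefore safe.
    return {
        "COMMIT": "commit",
        "ABORT":  "abort",
        "INIT":   "abort",
        "READY":  "contact another participant",
    }[state_of_q]
```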
3PC (2)
– Phase 3 a (prepare to commit):
The coordinator waits until all participants have
acknowledged (READY_COMMIT) receipt of the PREPARE
message, and then sends GLOBAL_COMMIT to all.
– Phase 3 b (prepare to commit):
Each participant waits for GLOBAL_COMMIT.
• The states of the coordinator and of each participant
satisfy the following two conditions:
1. There is no single state from which it is possible to
make a transition directly to either the COMMIT or the
ABORT state.
2. There is no state in which it is not possible to make a
final decision, and from which a transition to the
COMMIT state can be made.
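Condition 1 can be checked mechanically against the participant state machines. The sketch below (the successor maps are my own encoding) shows that 2PC violates it, since READY can move directly to both COMMIT and ABORT, while 3PC does not, thanks to the intermediate PRECOMMIT state:

```python
def violates_condition_1(fsm):
    """True if some state can move directly to both COMMIT and ABORT."""
    return any({"COMMIT", "ABORT"} <= set(succ) for succ in fsm.values())

# Participant state machine of 2PC
participant_2pc = {
    "INIT":  ["READY", "ABORT"],
    "READY": ["COMMIT", "ABORT"],   # the problematic state: blocks on crash
}

# Participant state machine of 3PC
participant_3pc = {
    "INIT":      ["READY", "ABORT"],
    "READY":     ["PRECOMMIT", "ABORT"],
    "PRECOMMIT": ["COMMIT"],        # PRECOMMIT buffers the final step
}
```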
3PC (3)
[Figure: the 3PC finite-state machines.
 Coordinator: INIT --(Commit / send Vote-request)--> WAIT; WAIT --(Vote-abort / send Global-abort)--> ABORT, or WAIT --(Vote-commit / send Prepare-commit)--> PRECOMMIT; PRECOMMIT --(Ready-commit / send Global-commit)--> COMMIT.
 Participant: INIT --(Vote-request / send Vote-commit)--> READY, or INIT --(Vote-request / send Vote-abort)--> ABORT; READY --(Prepare-commit / send Ready-commit)--> PRECOMMIT, or READY --(Global-abort / send ACK)--> ABORT; PRECOMMIT --(Global-commit / send ACK)--> COMMIT.]
Recovery: Background
• Essence: When a failure occurs, we need to bring
the system into an error-free state.
• Backward recovery:
– bring the system from its present erroneous state
back into a previously correct state.
– From time to time, the system state (at least part of
it) must be recorded (check-pointing) in a persistent
storage.
• Forward recovery:
– Instead of a previous check-pointed state, find a
correct new state from which the system can continue
to execute.
• In practice:
– Backward error recovery is by far the more widely
applied
Forms of Recovery: Example
• Backward recovery:
– Retransmitting a lost message
• Forward recovery:
– Constructing the missing packets from successfully
delivered packets
– (n, k) block erasure codes
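As a toy illustration of forward recovery with the simplest possible erasure code (a (k+1, k) XOR scheme; function names are illustrative): the sender transmits k equal-sized data packets plus one XOR parity packet, and any single lost packet can be rebuilt from the other k without retransmission:

```python
def xor_bytes(a, b):
    # Packets are assumed to be of equal length
    return bytes(x ^ y for x, y in zip(a, b))

def encode(packets):
    """XOR all k data packets into one parity packet; send n = k + 1."""
    parity = packets[0]
    for p in packets[1:]:
        parity = xor_bytes(parity, p)
    return packets + [parity]

def recover(received):
    """received: the n packets with exactly one replaced by None."""
    missing = received.index(None)
    acc = None
    for i, p in enumerate(received):
        if i != missing:
            acc = p if acc is None else xor_bytes(acc, p)
    rebuilt = received[:missing] + [acc] + received[missing + 1:]
    return rebuilt[:-1]        # drop parity, return the k data packets
```

Real (n, k) codes such as Reed-Solomon tolerate up to n - k losses; the XOR scheme above is the k = n - 1 special case.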
Recovery: Stable Storage
[Figure: stable storage implemented with two mirrored drives.
 (a) Stable storage: drives 1 and 2 hold identical copies of every block.
 (b) Crash after drive 1 is updated: drive 1 holds the new value a', drive 2 still holds the old value a; on recovery, the block is copied from drive 1 to drive 2.
 (c) A bad spot appears on one drive (as a result of general wear and tear); the block is restored from the corresponding block on the other drive.]
Checkpointing
• In a fault-tolerant distributed system, backward error
recovery requires that the system regularly save its
(global) state onto stable storage.
• Consistent global state can be captured using distributed
snapshot algorithm.
• A recovery line corresponds to the most recent
distributed snapshot
[Figure: two processes P1 and P2 take checkpoints along a time axis, starting from the initial state. A consistent cut crosses no message that is recorded as received before it was sent; an inconsistent cut does. After a failure of P2, the recovery line is the most recent consistent cut through the checkpoints.]
Independent Checkpointing
• Processes save their local state independently
• Each process rolls back to the most recently saved state
on a crash
• If these local states jointly do not form a consistent cut,
processes have to roll back further, to earlier
checkpoints; such cascaded rollback is known as the
domino effect.
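A sketch of the resulting rollback (my own minimal model): each checkpoint records the sets of messages the process had sent and received when it was taken; a cut is inconsistent if it contains an orphan message, i.e. one recorded as received but not as sent, and processes roll back until no orphans remain:

```python
def consistent(cut):
    """A cut (one checkpoint per process) is consistent if every
    received message was also sent within the cut."""
    sent = set().union(*(c["sent"] for c in cut))
    received = set().union(*(c["received"] for c in cut))
    return received <= sent            # no orphan messages

def recovery_line(checkpoint_lists):
    """checkpoint_lists: one chronological checkpoint list per process.
    Checkpoint 0 of each process is assumed to be the (empty) initial
    state, which guarantees termination."""
    indices = [len(cps) - 1 for cps in checkpoint_lists]   # start at latest
    while True:
        cut = [cps[i] for cps, i in zip(checkpoint_lists, indices)]
        if consistent(cut):
            return indices
        # Roll back every process whose checkpoint holds an orphan receive
        sent = set().union(*(c["sent"] for c in cut))
        for p, c in enumerate(cut):
            if not c["received"] <= sent and indices[p] > 0:
                indices[p] -= 1
```

In the worst case the rollback cascades all the way back to the initial state, which is exactly the domino effect, and is why coordinated checkpointing is often preferred.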