
CS8603 - DISTRIBUTED SYSTEMS

UNIT II MESSAGE ORDERING & SNAPSHOTS

Message ordering and group communication: Message ordering paradigms – Asynchronous execution with synchronous communication – Synchronous program order on an asynchronous system – Group communication – Causal order (CO) – Total order. Global state and snapshot recording algorithms: Introduction – System model and definitions – Snapshot algorithms for FIFO channels

MESSAGE ORDERING AND GROUP COMMUNICATION

Outline and Notations

Outline

• Message orders: non-FIFO, FIFO, causal order, synchronous order


• Group communication with multicast: causal order, total order
• Expected behaviour semantics when failures occur
• Multicasts: application layer on overlays; also at network layer

Notations
• Network (N, L); event set (E, ≺)
• Message mi: send and receive events si and ri
• M, send(M), and receive(M)
• a ∼ b denotes that events a and b occur at the same process
• Set of send–receive pairs T = {(s, r) ∈ Ei × Ej | s corresponds to r}

MESSAGE ORDERING PARADIGMS

Asynchronous and FIFO Executions

Figure 2.1: (a) An A-execution that is FIFO (b) An A-execution that is not FIFO
Asynchronous executions

A-execution: (E, ≺) for which the causality relation is a partial order.

Prepared By: Mr. P. J. Merbin Jose, AP/CSE Page 1



• No causality cycles on any logical link; delivery is not necessarily FIFO, e.g., the connectionless IPv4 network-layer service.
• All physical links obey FIFO.

FIFO executions

An A -execution in which:

for all (s, r) and (s′, r′) ∈ T, (s ∼ s′ and r ∼ r′ and s ≺ s′) ⇒ r ≺ r′

• A logical link is inherently non-FIFO.
• Can assume connection-oriented service at the transport layer, e.g., TCP.
• To implement FIFO over a non-FIFO link: use a 〈seqnum, connid〉 per message; the receiver uses a buffer to order messages.
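The 〈seqnum, connid〉 scheme above can be sketched as follows for a single connection. This is a minimal illustrative sketch, not from the source; the class and field names are assumptions.

```python
# Sketch: deliver messages in send order over a channel that may
# reorder them, using a per-connection sequence number and a
# receiver-side buffer, as described above.

class FifoReceiver:
    def __init__(self):
        self.next_seq = 0      # next sequence number expected
        self.buffer = {}       # out-of-order messages, keyed by seqnum
        self.delivered = []    # messages handed to the application

    def on_arrival(self, seqnum, payload):
        self.buffer[seqnum] = payload
        # deliver any consecutive run starting at next_seq
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

rx = FifoReceiver()
for seq, msg in [(2, "c"), (0, "a"), (1, "b")]:   # network reorders
    rx.on_arrival(seq, msg)
print(rx.delivered)   # -> ['a', 'b', 'c']
```

Even though "c" arrives first, it stays buffered until "a" and "b" fill in the gap, so the application sees send order.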

Causal Order: Definition

Causal order (CO)

A CO execution is an A-execution in which, for all (s, r) and (s′, r′) ∈ T, (r ∼ r′ and s ≺ s′) ⇒ r ≺ r′

If send events s and s′ are related by causality ordering (not physical-time ordering), then their corresponding receive events r and r′ occur in the same order at all common destinations. If s and s′ are not related by causality, then CO is vacuously satisfied.

Figure 2.2: (a) Violates CO as s1 ≺ s3 but r3 ≺ r1. (b) Satisfies CO. (c) Satisfies CO; no send events related by causality. (d) Satisfies CO.
Causal Order: Definition from Implementation Perspective

CO alternate definition

If send(m1) ≺ send(m2) then for each common destination d of messages m1 and m2, deliver_d(m1) ≺ deliver_d(m2) must be satisfied.


Message arrival vs. delivery:


• A message m that arrives in the OS buffer at Pi may have to be delayed until the messages sent to Pi causally before m was sent (the “overtaken” messages) have arrived.
• The event of an application processing an arrived message is referred to as a delivery event (instead of as a receive event).
• No message is overtaken by a chain of messages between the same (sender, receiver) pair. In Fig. 2.1(a), m1 is overtaken by the chain 〈m2, m3〉.
• CO degenerates to FIFO when m1 and m2 are sent by the same process.
• Uses: updates to shared data, implementing distributed shared memory, fair resource allocation, collaborative applications, event notification systems, distributed virtual environments.
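The buffering described above can be sketched with vector clocks, a standard way to realize causal delivery of broadcasts (this specific mechanism is an assumption, not the algorithm of this section; class names are illustrative).

```python
# Hedged sketch: causal delivery of broadcast messages using vector
# clocks. Process i increments V[i] on send and piggybacks V on the
# message. Receiver j delays message m from sender i until
# m.V[i] == Vj[i] + 1 and m.V[k] <= Vj[k] for all k != i, i.e.,
# all messages sent causally before m have already been delivered.

class CausalReceiver:
    def __init__(self, n, pid):
        self.V = [0] * n       # local vector clock (delivery counts)
        self.pid = pid
        self.pending = []      # buffered (sender, vector, payload)
        self.delivered = []

    def deliverable(self, sender, mv):
        return mv[sender] == self.V[sender] + 1 and all(
            mv[k] <= self.V[k] for k in range(len(mv)) if k != sender)

    def on_arrival(self, sender, mv, payload):
        self.pending.append((sender, mv, payload))
        progress = True
        while progress:
            progress = False
            for item in list(self.pending):
                s, v, p = item
                if self.deliverable(s, v):
                    self.pending.remove(item)
                    self.V[s] += 1
                    self.delivered.append(p)
                    progress = True

# P0 broadcasts m1 (V=[1,0,0]); P1 delivers m1 and then broadcasts m2
# (V=[1,1,0]). P2 receives m2 before m1: m2 must wait for m1.
p2 = CausalReceiver(3, 2)
p2.on_arrival(1, [1, 1, 0], "m2")    # buffered: depends on m1
p2.on_arrival(0, [1, 0, 0], "m1")    # delivers m1, then m2
print(p2.delivered)   # -> ['m1', 'm2']
```

The overtaken message m2 is held in the buffer until its causal predecessor m1 is delivered, exactly the safety behavior described above.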

Message Order (MO)

An A-execution in which, for all (s, r) and (s′, r′) ∈ T, s ≺ s′ ⇒ ¬(r′ ≺ r)

Fig 2.3(a): s1 ≺ s3 but ¬(r3 ≺ r1) is false ⇒ MO is not satisfied. Under MO, a message cannot be overtaken by a chain.

Figure 2.3: (a) Violates CO as s1 ≺ s3 but r3 ≺ r1. (b) Satisfies CO. (c) Satisfies CO; no send events related by causality. (d) Satisfies CO.


Causal Order: Other Characterizations

Figure 2.4: (a) Violates CO as s1 ≺ s3 but r3 ≺ r1. (b) Satisfies CO. (c) Satisfies CO; no send events related by causality. (d) Satisfies CO.

Empty-Interval (EI) property

(E,≺ ) is an EI execution if for each (s, r ) ∈ T , the open interval {x ∈ E | s ≺ x ≺ r } in the partial
order is empty.

• Fig 2.4(b): consider M2. There is no event x such that s2 ≺ x ≺ r2. This holds for all messages ⇒ EI.

For EI 〈s, r〉, there exists some linear extension < such that the corresponding interval {x ∈ E | s < x < r} is also empty.

• An empty (s, r) interval in a linear extension implies that s and r may be arbitrarily close; this is shown by a vertical arrow in a timing diagram.
• An execution (E, ≺) is CO iff for each message M, there exists some space-time diagram in which that message can be drawn as a vertical arrow.
• A linear extension of a partial order (E, ≺) is any total order on E in which each ordering relation of the partial order is preserved.
• CO does not imply that all messages can be drawn as vertical arrows in the same space-time diagram (that would require all (s, r) intervals to be empty in the same linear extension, i.e., a synchronous execution).

Common Past and Future


An execution (E, ≺) is CO iff for each pair (s, r) ∈ T and each event e ∈ E:
• Weak common past: e ≺ r ⇒ ¬(s ≺ e)
• Weak common future: s ≺ e ⇒ ¬(e ≺ r)
If the pasts of both s and r are identical (and analogously for the futures),


we get a subclass of CO executions, called synchronous executions.

Synchronous Executions (SYNC)

Figure 2.5: (a) Execution in an asynchronous system. (b) Equivalent synchronous execution.
• Handshake between sender and receiver.
• Instantaneous communication ⇒ a modified definition of causality, in which s and r are atomic and simultaneous, neither preceding the other.

Synchronous Executions: Definition


Causality in a synchronous execution.
The synchronous causality relation on E is the smallest transitive relation that satisfies the
following.

Synchronous execution (or S-execution).

An execution (E, ≪) for which the causality relation ≪ is a partial order.

Timestamping a synchronous execution.

An execution (E, ≺) is synchronous iff there exists a mapping from E to T (scalar timestamps) such that:
• for any message M, T(s(M)) = T(r(M));
• for each process Pi, if ei ≺ ei′ then T(ei) < T(ei′).


ASYNCHRONOUS EXECUTION WITH SYNCHRONOUS COMMUNICATION

Will a program written for an asynchronous system (an A-execution) run correctly when run with synchronous primitives?

Figure 2.6: A-execution deadlocks when using synchronous primitives.

An A-execution that can be realized under synchronous communication is called a realizable-with-synchronous-communication (RSC) execution.

Figure 2.7: Illustration of non-RSC A-executions.

RSC Executions

A non-separated linear extension of (E, ≺) is a linear extension of (E, ≺) such that for each pair (s, r) ∈ T, the interval {x ∈ E | s ≺ x ≺ r} is empty.

Examples

• Figure 2.4(d) admits one linear extension that is non-separated, and another that is separated.
• Figure 2.5(b) likewise admits both a non-separated and a separated linear extension.

RSC execution

An A-execution (E, ≺) is an RSC execution iff there exists a non-separated linear extension of the
partial order (E, ≺).

Checking all linear extensions has exponential cost! A practical test uses the crown characterization.

Crown: Definition

Let E be an execution. A crown of size k in E is a sequence 〈(s^i, r^i), i ∈ {0, . . ., k−1}〉 of pairs of corresponding send and receive events such that s^0 ≺ r^1, s^1 ≺ r^2, . . ., s^(k−2) ≺ r^(k−1), and s^(k−1) ≺ r^0.

Figure 2.8: Illustration of non-RSC A-executions and crowns.

• Fig 2.8(a): the crown is 〈(s1, r1), (s2, r2)〉 as we have s1 ≺ r2 and s2 ≺ r1.
• Fig 2.8(b): the crown is 〈(s1, r1), (s2, r2)〉 as we have s1 ≺ r2 and s2 ≺ r1.
• Fig 2.8(c): the crown is 〈(s1, r1), (s3, r3), (s2, r2)〉 as we have s1 ≺ r3, s3 ≺ r2, and s2 ≺ r1.
• Fig 2.8(d): the crown is 〈(s1, r1), (s2, r2), (s3, r3)〉 as we have s1 ≺ r2, s2 ≺ r3, and s3 ≺ r1.

Some observations


In a crown, s^i and r^(i+1) may or may not be on the same process.

A non-CO execution must have a crown.

Even CO executions that are not synchronous can have a crown (see Fig 2.4(b)).

The cyclic dependencies of a crown ⇒ the messages cannot be scheduled serially ⇒ not RSC.

Crown Test for RSC executions

• Define the relation ‹→ : T × T on messages in the execution (E, ≺) as follows. Let ‹→([s, r], [s′, r′]) iff s ≺ r′. Observe that the condition s ≺ r′ (which has the form used in the definition of a crown) is implied by each of the four conditions: (i) s ≺ s′, (ii) s ≺ r′, (iii) r ≺ s′, and (iv) r ≺ r′.
• Now define a directed graph G‹→ = (T, ‹→), where the vertex set is the set of messages T and the edge set is defined by ‹→. Observe that ‹→ : T × T is a partial order iff G‹→ has no cycle, i.e., there must not be a cycle with respect to ‹→ on the set of corresponding (s, r) events.
• Observe from the definition of a crown that G‹→ has a directed cycle iff (E, ≺) has a crown.

Crown criterion

An A-computation is RSC, i.e., it can be realized on a system with synchronous communication, iff it contains no crown.

Crown test complexity: O(|E|) (more precisely, linear in the number of communication events).
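The crown test above amounts to cycle detection on the message graph G‹→. A small sketch (the event encoding and function names are ours; causality is computed as program order plus send→receive edges):

```python
# Hedged sketch: the crown test as cycle detection. Events are
# (process, local_index) pairs; the causality relation is the
# transitive closure of program order plus send->receive edges.
# Edge m -> m' is added iff s(m) precedes r(m'). A directed cycle
# means a crown exists, so the execution is not RSC.

def happens_before(messages, procs):
    succ = {}                                  # adjacency over events
    for p, n in procs.items():
        for i in range(n - 1):
            succ.setdefault((p, i), set()).add((p, i + 1))
    for s, r in messages:
        succ.setdefault(s, set()).add(r)
    def reach(e):                              # DFS reachability
        seen, stack = set(), [e]
        while stack:
            x = stack.pop()
            for y in succ.get(x, ()):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen
    return {e: reach(e)
            for p, n in procs.items() for i in range(n) for e in [(p, i)]}

def is_rsc(messages, procs):
    hb = happens_before(messages, procs)
    edges = {i: [j for j, (s2, r2) in enumerate(messages)
                 if i != j and r2 in hb[messages[i][0]]]
             for i in range(len(messages))}
    color = {i: 0 for i in edges}              # 0 white, 1 grey, 2 black
    def dfs(u):                                # grey-on-grey = cycle
        color[u] = 1
        for v in edges[u]:
            if color[v] == 1 or (color[v] == 0 and dfs(v)):
                return True
        color[u] = 2
        return False
    return not any(color[i] == 0 and dfs(i) for i in edges)

# Crossed pair (crown of size 2): each process sends before receiving.
procs = {"P0": 2, "P1": 2}
crossed = [(("P0", 0), ("P1", 1)), (("P1", 0), ("P0", 1))]
serial = [(("P0", 0), ("P1", 0))]
print(is_rsc(crossed, procs), is_rsc(serial, procs))   # -> False True
```

The crossed pair is the classic size-2 crown (Fig 2.8(a)): neither send can be scheduled before the other's receive, so no synchronous realization exists.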

Timestamps for an RSC execution

An execution (E, ≺) is RSC iff there exists a mapping from E to T (scalar timestamps) such that:
• for any message M, T(s(M)) = T(r(M));
• for each (a, b) in (E × E) \ T, a ≺ b ⇒ T(a) < T(b).


Hierarchy of Message Ordering Paradigms

Figure 2.9: Hierarchy of message ordering paradigms. (a) Venn diagram (b) Example
executions.
An A-execution is RSC iff it is an S-execution.
RSC ⊂ CO ⊂ FIFO ⊂ A.

There are more restrictions on the possible message orderings in the smaller classes; the degree of concurrency is highest in A and lowest in SYNC.
A program using synchronous communication is easiest to develop and verify. A program using non-FIFO communication, resulting in an A-execution, is hardest to design and verify.

Simulations: Async Programs on Sync Systems

• RSC execution: schedule events as per a non-separated linear extension.
• Adjacent (s, r) events are scheduled sequentially.
• The partial order of the original A-execution is unchanged.
• If the A-execution is not RSC:
o the partial order has to be changed; or
o model each channel Ci,j by a control process Pi,j and use synchronous communication (see Fig 2.10).


Figure 2.10: Modeling channels as processes to simulate an execution using asynchronous primitives on a synchronous system.

This enables decoupling of the sender from the receiver, but the implementation is expensive.

Simulations: Sync Programs on Async Systems

• Schedule messages in the order in which they appear in the S-program.
• The partial order of the S-execution is unchanged.
• Communication on the asynchronous system uses asynchronous primitives; when a synchronous send is scheduled:
o wait for the ack before the send completes.
SYNC PROGRAM ORDER ON ASYNC SYSTEMS

• Deterministic program: repeated runs produce the same partial order.
• Deterministic receives ⇒ deterministic execution ⇒ (E, ≺) is fixed.

Sources of nondeterminism (besides unpredictable message delays):
• A receive call does not specify the sender.
• Multiple sends and receives may be enabled at a process and can be executed in interchangeable order:
∗[G1 −→ CL1 || G2 −→ CL2 || · · · || Gk −→ CLk ]
• In the deadlock example of Fig 2.6, if the event order at a process is permuted, there is no deadlock!

How to schedule (nondeterministic) synchronous communication calls over an asynchronous system?
• Match each send or receive with its corresponding event.
• Binary rendezvous (implementation using tokens): a token for each enabled interaction.
• Schedule online, atomically, in a distributed manner.
• Crown-free scheduling (safety); progress must also be guaranteed.
• Fairness and efficiency in scheduling.

Bagrodia’s Algorithm for Binary Rendezvous: Assumptions


• Receives are always enabled.
• A send, once enabled, remains enabled.
• To break deadlocks, PIDs are used to introduce asymmetry.
• Each process schedules one send at a time.
• Message types: M, ack(M), request(M), permission(M).
• A process blocks when it knows it can successfully synchronize the current message.

Fig 2.11: Rules to prevent message cycles. (a) High-priority process blocks. (b) Low-priority process does not block.

Bagrodia’s Algorithm for Binary Rendezvous: Code


• A higher-priority Pi blocks on a lower-priority Pj to avoid cyclic waits (whether or not it is the intended sender or receiver of the message being scheduled).
• Before sending M to Pi, Pj requests permission in a nonblocking manner. While waiting:
o if M′ arrives from another higher-priority process, ack(M′) is returned;
o if request(M′) arrives from a lower-priority process, Pj returns permission(M′) and blocks until M′ arrives.
• Note: the receive(M′) event gets permuted with the send(M) event.

Figure 2.12: Scheduling messages with sync communication

GROUP COMMUNICATION

• Unicast vs. multicast vs. broadcast

Network-layer or hardware-assisted multicast cannot easily provide:

• Application-specific semantics on message delivery order
• Adapting groups to dynamic membership
• Multicast to an arbitrary process set at each send
• Multiple fault-tolerance semantics

Closed group (source is part of the group) vs. open group.

The number of groups can be O(2^n).


Causal order (CO)

Figure 2.13: (a) Updates to 3 replicas. (b) Causal order (CO) and total order violated. (c) Causal order violated.

If message m did not exist, (b) and (c) would not violate CO.


The following two criteria must be met by a causal ordering protocol:

Safety

In order to prevent causal order from being violated, a message M that arrives at a process may need to be buffered until all system-wide messages sent in the causal past of the send(M) event to that same destination have already arrived.

Liveness

A message that arrives at a process must eventually be delivered to the process.

The Raynal–Schiper–Toueg algorithm

Each message M should carry a log of all other messages (or their identifiers) sent causally before M’s send event to the same destination dest(M). This log can then be examined to determine whether it is safe to deliver a message. All algorithms aim to reduce this log overhead, as well as the space and time overhead of maintaining the log information at the processes. Refining the notion of broadcasting, multicasting sends a message to a certain subset of the processes in the system, identified as a group. At the other extreme is unicasting, the familiar point-to-point message communication. Network-layer-assisted multicast cannot efficiently provide features such as the following:
• Application-specific ordering semantics on the order of delivery of messages.
• Adapting groups to dynamically changing membership.
• Sending multicasts to an arbitrary set of processes at each send event.
• Providing various fault-tolerance semantics.


If a multicast algorithm requires the sender to be part of the destination group, it is said to be a closed-group algorithm. If the sender of the multicast can be outside the destination group, it is said to be an open-group algorithm.

How does the algorithm simplify if all messages are broadcast?
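The Raynal–Schiper–Toueg idea above can be sketched in miniature: each message carries the identifiers of messages sent causally before it to the same destination, and the receiver buffers a message until everything in its log has been delivered. This toy model tracks only message ids, not the full per-process matrices of the actual algorithm; the names are illustrative.

```python
# Hedged sketch of the RST principle: a message is deliverable only
# once every causally-preceding message named in its piggybacked log
# has been delivered at this destination.

class RSTReceiver:
    def __init__(self):
        self.done = set()       # ids already delivered here
        self.pending = []       # buffered (msg_id, log, payload)
        self.delivered = []

    def on_arrival(self, msg_id, log, payload):
        self.pending.append((msg_id, set(log), payload))
        changed = True
        while changed:
            changed = False
            for item in list(self.pending):
                mid, lg, p = item
                if lg <= self.done:          # all causal predecessors in
                    self.pending.remove(item)
                    self.done.add(mid)
                    self.delivered.append(p)
                    changed = True

rx = RSTReceiver()
rx.on_arrival("m2", {"m1"}, "second")   # sent causally after m1: waits
rx.on_arrival("m1", set(), "first")     # unblocks m2 as well
print(rx.delivered)   # -> ['first', 'second']
```

The log examined in the `lg <= self.done` test is exactly the safety check described above; liveness follows because every buffered message is retried whenever a delivery occurs.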

Causal order (CO)


Optimal KS Algorithm for CO: Principles

• Mi,a: the a-th multicast message sent by Pi

Delivery Condition for correctness:

A message M∗ that carries the information “d ∈ M.Dests”, where message M was sent to d in the causal past of Send(M∗), is not delivered to d if M has not yet been delivered to d.

Necessary and Sufficient Conditions for Optimality:

• For how long should the information “d ∈ Mi,a.Dests” be stored in the log at a process, and piggybacked on messages?
• As long as, and only as long as:
o (Propagation Constraint I:) it is not known that the message Mi,a is delivered to d, and
o (Propagation Constraint II:) it is not known that a message has been sent to d in the causal future of Send(Mi,a), and hence it is not guaranteed, using a reasoning based on transitivity, that the message Mi,a will be delivered to d in CO.

⇒ if either (I) or (II) is false, “d ∈ M.Dests” must not be stored or propagated, even to remember
that (I) or (II) has been falsified.

“d ∈ Mi,a.Dests” must be available in the causal future of event ei,a, but not in the causal future of Deliverd(Mi,a), and not in the causal future of ek,c, where d ∈ Mk,c.Dests and there is no other message sent causally between Mi,a and Mk,c to the same destination d.

In the causal future of Deliverd (Mi,a ), and Send (Mk,c ), the information is redundant; elsewhere,
it is necessary.

Information about what messages have been delivered (or are guaranteed to be delivered without
violating CO) is necessary for the Delivery Condition.

• For optimality, this cannot be stored. Algorithm infers this using set-operation logic.

“d ∈ M.Dests”

• must exist at e1 and e2 because (I) and (II) are true
• must not exist at e3 because (I) is false
• must not exist at e4, e5, e6 because (II) is false
• must not exist at e7, e8 because (I) and (II) are false

• Information about messages (i) not known to be delivered and (ii) not guaranteed to be delivered in CO is explicitly tracked using (source, timestamp, destination) triples.
• This information must be deleted as soon as either (i) or (ii) becomes false.
• Information about messages already delivered or guaranteed to be delivered in CO is implicitly tracked, without storing or propagating it:
o it is derived from the explicit information;
o it is used to determine when (i) or (ii) becomes false for the explicit information being stored/piggybacked.


Explicit tracking of (source, timestamp, destination) per multicast in the Log and OM.

Implicit tracking of messages that are (i) delivered, or (ii) guaranteed to be delivered in CO:

• (Type 1:) ∃d ∈ Mi,a.Dests such that d ∉ li,a.Dests and d ∉ oi,a.Dests
o When does li,a.Dests = ∅ or oi,a.Dests = ∅?
o Entries of the form li,ak for k = 1, 2, . . . will accumulate
o Implemented in Step (2d)
• (Type 2:) if a1 < a2 and li,a2 ∈ LOGj, then li,a1 ∈ LOGj (likewise for messages)
o Entries of the form li,a1.Dests = ∅ can be inferred by their absence, and should not be stored
o Implemented in Step (2d) and PURGE NULL ENTRIES


Figure 2.14: Tracking of information about M5,1.Dests

TOTAL MESSAGE ORDER

Total order

For each pair of processes Pi and Pj and for each pair of messages Mx and My that are delivered to both processes, Pi delivers Mx before My if and only if Pj delivers Mx before My.

Centralized algorithm for Total Order:


(1) When Pi wants to multicast M to group G:
(1a) send M(i, G) to the coordinator.

(2) When M(i, G) arrives from Pi at the coordinator:
(2a) send M(i, G) to the members of G.

(3) When M(i, G) arrives at Pj from the coordinator:
(3a) deliver M(i, G) to the application.

The same order is seen by all members; this solves the coherence problem.


Fig 2.15: (a) Updates to 3 replicas. (b) Total order violated. (c) Total order not violated

Time complexity: 2 hops per transmission.

Message complexity: n.
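The centralized algorithm above can be sketched in a few lines: every multicast is funneled through a coordinator (sequencer), which relays messages to all members in a single order. The in-process "network" here is illustrative.

```python
# Hedged sketch of the centralized total-order algorithm: step (1)
# sends each multicast to the coordinator; step (2) has the
# coordinator relay messages to every member of G in arrival order;
# step (3) is each member delivering in that relayed order.

def coordinator(inbox, group):
    # (2) relay each arriving multicast to every member of G, in order
    for msg in inbox:
        for member in group.values():
            member.append(msg)          # (3) member delivers

group = {"P1": [], "P2": [], "P3": []}
inbox = [("P1", "write x=1"), ("P3", "write x=2")]   # (1) sends arrive
coordinator(inbox, group)
print(group["P1"] == group["P2"] == group["P3"])     # -> True
```

Because a single process imposes the order, every member's delivery sequence is identical, which is exactly the coherence property claimed above.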

Three-phase distributed algorithm:


The three phases of the algorithm are first described from the viewpoint of the sender, and
then from the viewpoint of the receiver.

Sender

Phase 1 : In the first phase, a process multicasts (line 1b) the message M with a locally
unique tag and the local timestamp to the group members.

Phase 2: In the second phase, the sender process awaits a reply from all the group members, who respond with a tentative proposal for a revised timestamp for that message M. The await call in line 1d is non-blocking, i.e., any other messages received in the meanwhile are processed. Once all expected replies are received, the process computes the maximum of the proposed timestamps for M and uses the maximum as the final timestamp.

Phase 3: In the third phase, the process multicasts the final timestamp to the group in line
(1f).

Receivers

Phase 1 In the first phase, the receiver receives the message with a tentative/proposed
timestamp. It updates the variable priority that tracks the highest proposed timestamp (line
2a), then revises the proposed timestamp to the priority, and places the message with its
tag and the revised timestamp at the tail of the queue temp_Q (line 2b). In the queue, the
entry is marked as undeliverable.
Phase 2 In the second phase, the receiver sends the revised timestamp (and the tag) back
to the sender (line 2c). The receiver then waits in a non-blocking manner for the final
timestamp (correlated by the message tag).


Phase 3 In the third phase, the final timestamp is received from the multicaster (line 3).
The corresponding message entry in temp_Q is identified using the tag (line 3a), and is
marked as deliverable (line 3b) after the revised timestamp is overwritten by the final
timestamp (line 3c). The queue is then resorted using the timestamp field of the entries as
the key (line 3c). As the queue is already sorted except for the modified entry for the
message under consideration, that message entry has to be placed in its sorted position in
the queue. If the message entry is at the head of the temp_Q, that entry, and all consecutive
subsequent entries that are also marked as deliverable, are dequeued from temp_Q, and
enqueued in deliver_Q in that order (the loop in lines 3d–3g).
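The three phases can be sketched compactly. This is a simplified in-process model: the phases run synchronously per multicast for clarity, whereas the real protocol interleaves them across concurrent multicasts; the class and variable names mirror the description above but the encoding is ours.

```python
# Hedged sketch of the three-phase total-order algorithm: the sender
# collects proposed timestamps from all members, takes the maximum as
# the final timestamp, and each member marks the entry deliverable,
# re-sorts temp_Q, and dequeues deliverable head entries into deliver_Q.

class Member:
    def __init__(self):
        self.priority = 0        # highest timestamp seen so far
        self.temp_q = {}         # tag -> [timestamp, deliverable?]
        self.deliver_q = []

    def propose(self, tag):                  # receiver phases 1-2
        self.priority += 1
        self.temp_q[tag] = [self.priority, False]
        return self.priority

    def finalize(self, tag, final_ts):       # receiver phase 3
        self.priority = max(self.priority, final_ts)
        self.temp_q[tag] = [final_ts, True]
        # dequeue deliverable entries from the head, in timestamp order
        for t, (ts, ok) in sorted(self.temp_q.items(),
                                  key=lambda kv: kv[1][0]):
            if not ok:
                break                        # head is still tentative
            self.deliver_q.append(t)
            del self.temp_q[t]

def multicast(tag, members):                 # sender phases 1-3
    final_ts = max(m.propose(tag) for m in members)
    for m in members:
        m.finalize(tag, final_ts)

members = [Member(), Member(), Member()]
multicast("M1", members)
multicast("M2", members)
print([m.deliver_q for m in members])   # identical queues at all members
```

Taking the maximum of the proposals is what guarantees agreement: every member overwrites its tentative timestamp with the same final value, so the re-sorted queues converge to one delivery order.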

Total Message Order: 3-phase Algorithm Code


Total Order: Distributed Algorithm: Example and Complexity

Figure 2.16: (a) A snapshot for PROPOSED TS and REVISE TS messages. The dashed lines show
the further execution after the snapshot. (b) The FINAL TS messages.

Complexity:

• Three phases
• 3(n − 1) messages for n − 1 dests
• Delay: 3 message hops
• Also implements causal order

GLOBAL STATE AND SNAPSHOT RECORDING ALGORITHMS

Introduction

• Recording the global state of a distributed system on-the-fly is an important


paradigm.
• The lack of globally shared memory, global clock and unpredictable message delays
in a distributed system make this problem non-trivial.
• This chapter first defines consistent global states and discusses issues to be addressed
to compute consistent distributed snapshots.

Then several algorithms to determine such snapshots on the fly are presented for several types of networks.
System model

• The system consists of a collection of n processes p1, p2, ..., pn that are connected by
channels.
• There are no globally shared memory and physical global clock and processes
communicate by passing messages through communication channels.
• Cij denotes the channel from process pi to process pj and its state is denoted by SCij .


• The actions performed by a process are modeled as three types of events: Internal
events,the message send event and the message receive event.
• For a message mij that is sent by process pi to process pj , let send (mij ) and rec (mij )
denote its send and receive events.

• At any instant, the state of process pi , denoted by LSi , is a result of the sequence of all the
events executed by pi till that instant.
• For an event e and a process state LSi, e ∈ LSi iff e belongs to the sequence of events that have taken process pi to state LSi.
• For an event e and a process state LSi, e ∉ LSi iff e does not belong to that sequence of events.
• For a channel Cij, the following set of messages can be defined based on the local states of the processes pi and pj:

• Transit: transit(LSi, LSj) = {mij | send(mij) ∈ LSi ∧ rec(mij) ∉ LSj}

Models of communication

Recall, there are three models of communication: FIFO, non-FIFO, and causal order (CO).
• In FIFO model, each channel acts as a first-in first-out message queue and thus, message
ordering is preserved by a channel.
• In non-FIFO model, a channel acts like a set in which the sender process adds messages
and the receiver process removes messages from it in a random order.
• A system that supports causal delivery of messages satisfies the following property: “For
any two messages mij and mkj , if send (mij ) −→ send (mkj ), then rec (mij ) −→ rec (mkj
)”.

Consistent global state

• The global state of a distributed system is a collection of the local states of the processes
and the channels.
• Notationally, global state GS is defined as,

GS = {∪i LSi , ∪i ,j SCij }

• A global state GS is a consistent global state iff it satisfies the following two conditions:
C1: send(mij) ∈ LSi ⇒ mij ∈ SCij ⊕ rec(mij) ∈ LSj (⊕ is the exclusive-OR operator).


• C2: send(mij) ∉ LSi ⇒ mij ∉ SCij ∧ rec(mij) ∉ LSj.
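Conditions C1 and C2 can be checked mechanically on a recorded state. In this sketch, local states are modeled as sets of send/receive events and channel states as sets of in-transit messages; the encoding is illustrative, not from the source.

```python
# Hedged sketch: checking C1 and C2 on a recorded global state.
# C1: a sent message is either in the channel XOR already received.
# C2: a message never sent appears neither in the channel nor as
# received.

def consistent(LS, SC, messages):
    # messages: list of (m, i, j), meaning m travels on channel C[i][j]
    for m, i, j in messages:
        sent = ("send", m) in LS[i]
        recd = ("rec", m) in LS[j]
        in_ch = m in SC[(i, j)]
        if sent and not (in_ch ^ recd):      # C1 violated
            return False
        if not sent and (in_ch or recd):     # C2 violated
            return False
    return True

LS = {1: {("send", "m12")}, 2: set()}
SC = {(1, 2): {"m12"}}                       # m12 recorded in transit
print(consistent(LS, SC, [("m12", 1, 2)]))   # -> True

SC_bad = {(1, 2): set()}                     # m12 vanished: violates C1
print(consistent(LS, SC_bad, [("m12", 1, 2)]))  # -> False
```

The exclusive-OR in the C1 test captures exactly the statement above: a sent message is in the channel state or in the receiver's state, but never both and never neither.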

Interpretation in terms of cuts

• A cut in a space-time diagram is a line joining an arbitrary point on each process


line that slices the space-time diagram into a PAST and a FUTURE.
• A consistent global state corresponds to a cut in which every message received in
the PAST of the cut was sent in the PAST of that cut.
• Such a cut is known as a consistent cut.
• For example, consider the space-time diagram for the computation illustrated in
Figure 4.1.
• Cut C1 is inconsistent because message m1 is flowing from the FUTURE to the
PAST.
• Cut C2 is consistent and message m4 must be captured in the state of channel C21.

Figure 2.17: An Interpretation in Terms of a Cut.

Issues in recording a global state


The following two issues need to be addressed:
I1: How to distinguish between the messages to be recorded in the snapshot from those not to be
recorded.
-Any message that is sent by a process before recording its snapshot, must be recorded in
the global snapshot (from C1).
-Any message that is sent by a process after recording its snapshot, must not be recorded
in the global snapshot (from C2).

I2: How to determine the instant when a process takes its snapshot.
-A process pj must record its snapshot before processing a message mij that was sent by
process pi after recording its snapshot.


Snapshot Algorithms For FIFO Channels

Chandy-Lamport algorithm

• The Chandy-Lamport algorithm uses a control message, called a marker, whose role in a FIFO system is to separate messages in the channels.
• After a site has recorded its snapshot, it sends a marker, along all of its outgoing channels
before sending out any more messages.
• A marker separates the messages in the channel into those to be included in the snapshot
from those not to be recorded in the snapshot.
• A process must record its snapshot no later than when it receives a marker on any of its
incoming channels.
• The algorithm can be initiated by any process by executing the “Marker Sending Rule” by
which it records its local state and sends a marker on each outgoing channel.
• A process executes the “Marker Receiving Rule” on receiving a marker. If the process has
not yet recorded its local state, it records the state of the channel on which the marker is
received as empty and executes the “Marker Sending Rule” to record its local state.
• The algorithm terminates after each process has received a marker on all of its incoming
channels.
• All the local snapshots get disseminated to all other processes and all the processes can
determine the global state.
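The marker sending/receiving rules above can be exercised in a small deterministic simulation. Channels are modeled as FIFO queues and the event loop is ours; the rule logic follows the description above.

```python
# Hedged sketch of the Chandy-Lamport marker rules on FIFO channels.
# On recording (Marker Sending Rule) a process saves its local state
# and sends a marker on each outgoing channel; on a marker (Marker
# Receiving Rule) it records the channel as empty if it had not yet
# recorded, else it closes recording on that channel.

from collections import deque

class Proc:
    def __init__(self, pid, peers):
        self.pid, self.state = pid, 0
        self.recorded = None               # recorded local state
        self.chan_state = {}               # incoming channel -> messages
        self.open = set()                  # channels still being recorded
        self.peers = peers

    def record(self, channels):            # Marker Sending Rule
        self.recorded = self.state
        self.open = set(self.peers)
        for q in self.peers:
            channels[(self.pid, q)].append("MARKER")

    def on_msg(self, frm, msg, channels):  # Marker Receiving Rule
        if msg == "MARKER":
            if self.recorded is None:
                self.record(channels)
                self.chan_state[frm] = []  # channel recorded empty
            else:
                self.chan_state.setdefault(frm, [])  # close channel frm
            self.open.discard(frm)
        else:
            self.state += msg              # application message
            if self.recorded is not None and frm in self.open:
                self.chan_state.setdefault(frm, []).append(msg)

# Two processes; P0 sends 5 to P1, then initiates the snapshot.
channels = {(0, 1): deque(), (1, 0): deque()}
p = [Proc(0, [1]), Proc(1, [0])]
channels[(0, 1)].append(5)                 # application message in flight
p[0].record(channels)                      # FIFO: marker follows the 5
while any(channels.values()):
    for (i, j), q in channels.items():
        if q:
            p[j].on_msg(i, q.popleft(), channels)
print(p[0].recorded, p[1].recorded)   # -> 0 5 (P1 applied 5, then recorded)
```

Because the channel is FIFO, the marker cannot overtake the application message 5, so P1 applies 5 before recording; the recorded global state (LS0 = 0, LS1 = 5, both channels empty) satisfies C1 and C2.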

Correctness and Complexity

Correctness
Due to FIFO property of channels, it follows that no message sent after the marker on that
channel is recorded in the channel state. Thus, condition C2 is satisfied.


When a process pj receives message mij that precedes the marker on channel Cij , it acts
as follows: If process pj has not taken its snapshot yet, then it includes mij in its recorded
snapshot. Otherwise, it records mij in the state of the channel Cij . Thus, condition C1 is
satisfied.
Complexity
The recording part of a single instance of the algorithm requires O(e ) messages and O(d )
time, where e is the number of edges in the network and d is the diameter of the network.

Properties of the recorded global state

The recorded global state may not correspond to any of the global states that occurred during the
computation.

This happens because a process can change its state asynchronously before the markers it sent are
received by other sites and the other sites record their states.

• But the system could have passed through the recorded global states in some equivalent
executions.
• The recorded global state is a valid state in an equivalent execution and if a stable property
(i.e., a property that persists) holds in the system before the snapshot algorithm begins, it
holds in the recorded global snapshot.
• Therefore, a recorded global state is useful in detecting stable properties.

Spezialetti-Kearns algorithm

There are two phases in obtaining a global snapshot: locally recording the snapshot at every
process and distributing the resultant global snapshot to all the initiators.

Efficient snapshot recording


• In the Spezialetti-Kearns algorithm, a marker carries the identifier of the initiator of the algorithm. Each process has a variable master to keep track of the initiator of the algorithm.
• A key notion used by the optimizations is that of a region in the system. A region
encompasses all the processes whose master field contains the identifier of the same
initiator.
• When the initiator’s identifier in a marker received along a channel is different from the
value in the master variable, the sender of the marker lies in a different region.
• The identifier of the concurrent initiator is recorded in a local variable id -border -set.
• The state of the channel is recorded just as in the Chandy-Lamport algorithm (including
those that cross a border between regions).


• Snapshot recording at a process is complete after it has received a marker along each of its
channels.
• After every process has recorded its snapshot, the system is partitioned into as many
regions as the number of concurrent initiations of the algorithm.
• Variable id -border -set at a process contains the identifiers of the neighboring regions.

Efficient dissemination of the recorded snapshot


In the snapshot recording phase, a forest of spanning trees is implicitly created in the system.
The initiator of the algorithm is the root of a spanning tree and all processes in its region
belong to its spanning tree.


• If pi receives its first marker from pj, then process pj is the parent of process pi in the spanning tree.
• When an intermediate process in a spanning tree has received the recorded states from all its child processes and has recorded the states of all incoming channels, it forwards its locally recorded state and the locally recorded states of all its descendant processes to its parent.
• When the initiator receives the locally recorded states of all its descendants from its child processes, it assembles the snapshot for all the processes in its region and the channels incident on these processes.
• The initiator exchanges the snapshot of its region with the initiators in adjacent regions in rounds.
• The message complexity of snapshot recording is O(e) irrespective of the number of concurrent initiations of the algorithm. The message complexity of assembling and disseminating the snapshot is O(rn²), where r is the number of concurrent initiations.

