
Distributed Computing

Introduction
Dr. Barsha Mitra
CSIS Dept, BITS Pilani, Hyderabad Campus
BITS Pilani
Hyderabad Campus

1
Course Details
•Course will be run in FLIPPED MODE
•Course handout is uploaded on https://elearn.bits-pilani.ac.in
No     Name                 Type   Weight   Day, Date, Session, Time
EC-1   Quiz-1                      5%       August 16-30, 2022
       Quiz-2                      5%       September 16-30, 2022
       Assignment                  10%      October 16-30, 2022
EC-2   Mid-Semester Test           30%
EC-3   Comprehensive Exam          50%

Course ID: SS ZG526, Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus
Course Details
•Quizzes and Assignment will be conducted on https://elearn.bits-pilani.ac.in
•Course slides will be uploaded on https://elearn.bits-pilani.ac.in
and will be available on Microsoft Teams
•Course recordings will be available on Microsoft Teams
•Lab capsules are available on https://bitscsis.vlabs.platifi.com/

Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
Definition of Distributed Systems
•Collection of independent entities cooperating to collectively solve a task

•Autonomous processors communicating over a communication network

Course ID: SS ZG526, Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus
Features of Distributed System
•No common physical clock
•No shared memory - Requires message passing for communication
•Geographical separation
•Processors may be geographically far apart
•Processors may reside on a WAN or LAN
•Autonomy and Heterogeneity
•Loosely coupled processors
•Different processor speeds and operating systems
•Processors are not part of a dedicated system
•Cooperate with one another
•May offer services or solve a problem jointly

Course ID: SS ZG526, Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus
Middleware
•Middleware –
• distributed software
• drives the distributed system
• provides transparency of heterogeneity
at platform level
•Middleware standards:
•Common Object Request Broker
Architecture (CORBA)
•Remote Procedure Call (RPC) mechanism
•Distributed Component Object Model
(DCOM)
•Remote Method Invocation (RMI)
Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Motivation for Distributed System
•Inherently distributed computation
•Resource sharing
•Access to remote resources
•Increased performance/cost ratio
•Reliability
•Scalability
•Modularity and Incremental Expandability

Course ID: SS ZG526, Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus
Coupling
• Interdependency/binding and/or homogeneity among modules,
whether hardware or software (e.g., OS, middleware)
• High coupling → tightly coupled modules
• Low coupling → loosely coupled modules

Course ID: SS ZG526, Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus
Parallel Systems
• Multiprocessor systems
• E.g., Omega, Butterfly Networks
• Multicomputer parallel systems
• E.g., NYU Ultracomputer, CM* Connection Machine, IBM Blue Gene
• Array processors (collocated, tightly coupled, common system clock,
may not have shared memory)
• E.g., DSP applications, image processing

Course ID: SS ZG526, Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus
UMA Model
• Direct access to shared memory
• Access latency - waiting time to complete an
access to any memory location from any
processor
• Access latency is the same for all processors
• Processors
• Remain in close proximity
• Connected by an interconnection network

Course ID: SS ZG526, Title: Distributed Computing 10 BITS Pilani, Hyderabad Campus
UMA Model
• Interprocess communication occurs through read and write operations
on shared memory
• Processors are of same type
• Remain within same physical box/container
• Processors run the same operating system
• Hardware & software are very tightly coupled
• Interconnection network to access memory may be
• Bus
• Multistage switch

Course ID: SS ZG526, Title: Distributed Computing 11 BITS Pilani, Hyderabad Campus
Omega Network
•Multi-stage networks formed by 2 x 2
switching elements
•Data can be sent on any one of the
input wires
•n-input and n-output network uses:
• log2(n) stages
• log2(n) bits for addressing
•n processors, n memory banks

Course ID: SS ZG526, Title: Distributed Computing 12 BITS Pilani, Hyderabad Campus
Omega Network
• (n/2)log2(n) switching elements
of size 2x2
• Interconnection function: output i of a stage is connected to input j of the next stage, where
   j = 2i, for 0 ≤ i ≤ n/2 − 1
   j = 2i + 1 − n, for n/2 ≤ i ≤ n − 1
• equivalently, j is obtained by a left-rotation operation on the binary representation of i
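A minimal Python sketch (not part of the slides) of this interconnection function; it assumes n is a power of 2 and checks that the piecewise formula and the left-rotation view give the same j.

```python
# Omega-network perfect shuffle: output i of a stage feeds input j of the
# next stage, where j is a left rotation of i's log2(n)-bit label.

def shuffle(i: int, n: int) -> int:
    """Return j for output i in an n-input omega network stage."""
    if i < n // 2:
        return 2 * i                   # j = 2i
    return 2 * i + 1 - n               # j = 2i + 1 - n

def shuffle_by_rotation(i: int, n: int) -> int:
    """Same mapping, expressed as a left rotation of the bit label."""
    bits = n.bit_length() - 1          # log2(n) address bits
    return ((i << 1) | (i >> (bits - 1))) & (n - 1)

if __name__ == "__main__":
    n = 8
    for i in range(n):
        assert shuffle(i, n) == shuffle_by_rotation(i, n)
        print(i, "->", shuffle(i, n))
```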

Course ID: SS ZG526, Title: Distributed Computing 13 BITS Pilani, Hyderabad Campus
References
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 1, “Distributed
Computing: Principles, Algorithms, and Systems”, Cambridge
University Press, 2008.
• https://medium.com/@rashmishehana_48965/middleware-technologies-e5def95da4e

Course ID: SS ZG526, Title: Distributed Computing 14 BITS Pilani, Hyderabad Campus
Thank You

Course ID: SS ZG526, Title: Distributed Computing 15 BITS Pilani, Hyderabad Campus
Distributed Computing
Introduction
Dr. Barsha Mitra
CSIS Dept, BITS Pilani, Hyderabad Campus
BITS Pilani
Hyderabad Campus

1
Multicomputer Parallel Systems
• multiple processors do not have direct access to shared memory
• usually do not have a common clock
• processors are in close physical proximity
• very tightly coupled (homogeneous hardware and software)
• connected by an interconnection network
• communicate either via a common address space or via message-passing

Course ID: SS ZG526, Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus
https://en.wikipedia.org/wiki/Parallel_computing#/media/File:IBM_Blue_Gene_P_supercomputer.jpg

Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
https://www.cs.iusb.edu/~danav/teach/b424/b424_24_networks.html

NUMA Model

• non-uniform memory access architecture
• multicomputer system that has a common address space
• latency to access various shared memory locations from different processors varies

Course ID: SS ZG526, Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus
Multicomputer Parallel Systems

• 2-D wraparound mesh
• a k × k mesh contains k² processors
[Figure: k × k wraparound mesh; each node is a processor + memory unit]

Course ID: SS ZG526, Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus
Multicomputer Parallel Systems

• a k-dimensional hypercube has 2^k processor-and-memory units (nodes)
• each dimension is associated with a bit position in the node label
• labels of two adjacent nodes differ in the kth bit position for dimension k
• shortest path between any two processors = Hamming distance between their labels
• Hamming distance ≤ k
[Figure: hypercube; each node is a processor + memory unit]

Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Multicomputer Parallel Systems

• routing in the hypercube is done hop-by-hop


• multiple routes exist between any pair of nodes → provides fault tolerance and a congestion control mechanism
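A minimal Python sketch (an illustration, not the slides' algorithm) of hop-by-hop routing by bit correction: each hop flips one bit on which the current label still differs from the destination, so the number of hops equals the Hamming distance.

```python
def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def route(src: int, dst: int, k: int) -> list[int]:
    """Return one shortest path of node labels from src to dst in a k-cube."""
    path, cur = [src], src
    for bit in range(k):                 # correct differing bits one at a time
        if (cur ^ dst) & (1 << bit):
            cur ^= (1 << bit)
            path.append(cur)
    return path

if __name__ == "__main__":
    # 3-D hypercube: 000 -> 110 takes Hamming distance = 2 hops
    print(route(0b000, 0b110, 3), hamming_distance(0b000, 0b110))
```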

Course ID: SS ZG526, Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus
Parallelism/Speedup
• measure of the relative speedup of a specific program on a
given machine
• depends on
• number of processors
• mapping of the code to the processors
• speedup = T(1)/T(n)
• T(1) = time with a single processor
• T(n) = time with n processors
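A small illustrative computation of the speedup ratio (the timings below are made up):

```python
def speedup(t1: float, tn: float) -> float:
    """speedup = T(1) / T(n): T(1) on one processor, T(n) on n processors."""
    return t1 / tn

if __name__ == "__main__":
    print(speedup(t1=120.0, tn=20.0))   # e.g., a 6x speedup on n processors
```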

Course ID: SS ZG526, Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus
Parallel/Distributed Program Concurrency
• concurrency of a parallel/distributed program =
   (number of local operations, i.e., non-communication and non-shared memory access operations) /
   (total number of operations, including the communication or shared memory access operations)

Course ID: SS ZG526, Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus
Distributed Communication Models

• Remote Procedure Call (RPC)


• Publish/Subscribe Model

Course ID: SS ZG526, Title: Distributed Computing 10 BITS Pilani, Hyderabad Campus
Publish/ Subscribe Model
• allows any number of publishers to
communicate with any number of subscribers
asynchronously
• Used in serverless and microservices architectures
• form of multicasting
• events are produced by publishers
• interested subscribers are notified when the
events are available
• information is provided as notifications
• subscribers continue their tasks until they
receive notifications
• Oracle, AWS, Azure Web PubSub service
Course ID: SS ZG526, Title: Distributed Computing 11 BITS Pilani, Hyderabad Campus
References
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 1, “Distributed Computing: Principles,
Algorithms, and Systems”, Cambridge University Press, 2008.
• https://medium.com/@rashmishehana_48965/middleware-technologies-e5def95da4e
• http://wiki.c2.com/?PublishSubscribeModel
• https://www.slideshare.net/ishraqabd/publish-subscribe-model-overview-13368808/5
• https://www.bmc.com/blogs/pub-sub-publish-subscribe/
• https://aws.amazon.com/pub-sub-messaging/#:~:text=Publish%2Fsubscribe%20messaging%2C%20or%20pub,the%20subscribers%20to%20the%20topic.

Course ID: SS ZG526, Title: Distributed Computing 12 BITS Pilani, Hyderabad Campus
Thank You

Course ID: SS ZG526, Title: Distributed Computing 13 BITS Pilani, Hyderabad Campus
Distributed Computing
A Model of Distributed Computations
Dr. Barsha Mitra
CSIS Dept, BITS Pilani, Hyderabad Campus
BITS Pilani
Hyderabad Campus
Preliminaries
•distributed system consists of a set of processors
•connected by a communication network
•communication delay is finite but unpredictable
•processors do not share a common global memory
•asynchronous message passing
•no common physical global clock
•communication medium may deliver messages out of order

Course ID: SS ZG526, Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus
A Distributed Program
•Cij : channel from process pi to process pj
•mij: message sent by pi to pj
•Events –
•Internal Events
•Message Send Events
•Message Receive Events

Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
Space-Time Diagram

Course ID: SS ZG526, Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus
A Model of Distributed Executions
Causal Precedence/Dependency or Happens Before Relation
•ei → ej : event ei causally precedes event ej
•ei → ej holds iff there is a path of process-order and message edges from ei to ej in the space-time diagram

Course ID: SS ZG526, Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus
Models of Communication Networks
• FIFO model:
• each channel acts as a first-in first-out message queue
• message ordering is preserved by channel

• Non-FIFO model
• channel acts like a set
• sender process adds messages to channel
• receiver process removes messages from it

Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Models of Communication Networks
•Causal ordering model is based on Lamport’s “happens before”
relation
•A system supporting causal ordering model satisfies :
CO: for mij and mkj , if send(mij ) → send(mkj ),
then rec(mij ) → rec(mkj )
•Causally ordered delivery of messages implies FIFO message delivery
•CO ⊂ FIFO ⊂ Non-FIFO

Course ID: SS ZG526, Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus
Reference

• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 2,


“Distributed Computing: Principles, Algorithms, and Systems”,
Cambridge University Press, 2008.

Course ID: SS ZG526, Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus
Course ID: SS ZG526, Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus
Distributed Computing
Logical Time
Dr. Barsha Mitra
CSIS Dept, BITS Pilani, Hyderabad Campus
BITS Pilani
Hyderabad Campus
Introduction
•not possible to have global physical time
•realize only an approximation of physical time using logical time
•logical time advances in jumps
•captures monotonicity property
• monotonicity property: if event a causally affects event b, then timestamp
of a is smaller than timestamp of b
• Ways to represent logical time:
•scalar time
•vector time
Course ID: SS ZG526, Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus
Logical Clocks

•every process has a logical clock


•logical clock is advanced using a set of rules
•each event is assigned a timestamp

Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
Scalar Time - Definition

•time domain is the set of non-negative integers

•Ci:
•integer variable
•denotes logical clock of pi

Course ID: SS ZG526, Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus
Scalar Time - Definition
Rules for updating the clock:
•R1: Before executing an event (send, receive, or internal), process pi executes:
   Ci = Ci + d (d > 0)
•d can have different values; here d = 1
•R2: When process pi receives a message with timestamp Cmsg, it executes the following actions:
1. Ci = max (Ci , Cmsg);
2. execute R1;
3. deliver the message.
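A minimal Python sketch of rules R1 and R2 with d = 1 (illustrative; the class and method names are my own):

```python
class ScalarClock:
    def __init__(self):
        self.c = 0                      # Ci, the local logical clock

    def tick(self, d: int = 1):         # R1: before any event
        self.c += d

    def on_send(self) -> int:
        self.tick()                     # R1 for the send event
        return self.c                   # timestamp piggybacked on the message

    def on_receive(self, c_msg: int):
        self.c = max(self.c, c_msg)     # R2 step 1
        self.tick()                     # R2 step 2 (execute R1)
        # R2 step 3: deliver the message to the application

if __name__ == "__main__":
    p1, p2 = ScalarClock(), ScalarClock()
    ts = p1.on_send()                   # p1 sends with timestamp 1
    p2.on_receive(ts)                   # p2's clock becomes max(0, 1) + 1 = 2
    print(p1.c, p2.c)
```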

Course ID: SS ZG526, Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus
Evolution of Scalar Time
[Space-time diagram: processes p1-p4 exchanging messages; scalar timestamps are assigned to events along each process line]

Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Evolution of Scalar Time
[Space-time diagram with d = 1: scalar timestamps assigned to events on processes p1-p4; on a receive, the clock jumps to max(local clock, message timestamp) + 1]

Course ID: SS ZG526, Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus
Scalar Time – Basic Properties

•Consistency
•Total Ordering
•Event Counting
•No Strong Consistency
•From Recorded Lecture

Course ID: SS ZG526, Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus
Scalar Time – Basic Properties
• d = 1
• with d = 1, the scalar timestamp of an event equals the height of the event
[Space-time diagram: the same execution with scalar timestamps at events on processes p1-p4]

Course ID: SS ZG526, Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus
Vector Time - Definition
• time domain is represented by a set of n-dimensional non-
negative integer vectors
• vti[i] : local logical clock of pi (counts pi's own events)
• vti[j], j ≠ i : pi's latest knowledge of pj's local clock

Course ID: SS ZG526, Title: Distributed Computing 10 BITS Pilani, Hyderabad Campus
Vector Time - Definition
[Space-time diagram: processes p1-p4; vector timestamps are assigned to events]
Course ID: SS ZG526, Title: Distributed Computing 11 BITS Pilani, Hyderabad Campus
Vector Time - Definition
[Space-time diagram: vector timestamps at events on processes p1-p4, e.g., p1's events carry [1 0 0 0], [2 0 0 0], [3 0 0 0], [4 0 0 0], [5 3 0 0]; a receive merges the incoming vector component-wise with max]
Course ID: SS ZG526, Title: Distributed Computing 12 BITS Pilani, Hyderabad Campus
Vector Time
• Comparing Vector timestamps from Recorded Lecture
• Vector Time properties from Recorded Lecture

Course ID: SS ZG526, Title: Distributed Computing 13 BITS Pilani, Hyderabad Campus
Vector Time – Basic Properties
Event counting
• if d is always 1 and vti[i] is the ith component of vector clock at process pi
• then vti[i] = no. of events that have occurred at pi until that instant

Course ID: SS ZG526, Title: Distributed Computing 14 BITS Pilani, Hyderabad Campus
Vector Time – Basic Properties
Event counting
•if an event e has timestamp vh
• vh[j] = number of events executed by process pj that causally
precede e
• Σj vh[j] − 1 : total no. of events that causally precede e in the distributed computation
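A minimal Python sketch of a vector clock with d = 1 that also shows the event-counting property above (illustrative; the class and method names are my own):

```python
class VectorClock:
    def __init__(self, pid: int, n: int):
        self.pid = pid
        self.vt = [0] * n                # vt[i]: own events; vt[j]: knowledge of pj

    def tick(self):                      # internal / send event
        self.vt[self.pid] += 1

    def on_send(self) -> list[int]:
        self.tick()
        return list(self.vt)             # timestamp piggybacked on the message

    def on_receive(self, vt_msg: list[int]):
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_msg)]
        self.tick()

    def events_causally_preceding(self) -> int:
        # events that causally precede the latest event at this process
        return sum(self.vt) - 1

if __name__ == "__main__":
    p1, p2 = VectorClock(0, 2), VectorClock(1, 2)
    m = p1.on_send()                     # p1: [1, 0]
    p2.on_receive(m)                     # p2: [1, 1]
    print(p2.vt, p2.events_causally_preceding())
```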

Course ID: SS ZG526, Title: Distributed Computing 15 BITS Pilani, Hyderabad Campus
Singhal–Kshemkalyani’s Differential
Technique

[Space-time diagram: processes p1-p4; the annotated version is on the next slide]

Course ID: SS ZG526, Title: Distributed Computing 16 BITS Pilani, Hyderabad Campus
Singhal–Kshemkalyani’s Differential
Technique

[Space-time diagram: each process piggybacks on a message only the vector entries that have changed since its last message to the same destination, e.g., p1's three messages carry {(1, 1)}, {(1, 2)}, {(1, 3)} instead of the full vectors [1 0 0 0], [2 0 0 0], [3 0 0 0]]

Course ID: SS ZG526, Title: Distributed Computing 17 BITS Pilani, Hyderabad Campus
Fowler–Zwaenepoel’s Direct-
Dependency Technique

• reduces the size of messages by transmitting only a scalar value


• process only maintains information regarding direct dependencies on
other processes
• transitive (indirect) dependencies are not maintained by this method

Course ID: SS ZG526, Title: Distributed Computing 18 BITS Pilani, Hyderabad Campus
Fowler–Zwaenepoel’s Direct-
Dependency Technique
[Space-time diagram: each message carries only the sender's own (scalar) event count, e.g., p1's messages carry {1}, {2}, {3}; each process records only the direct dependencies it learns this way]

Course ID: SS ZG526, Title: Distributed Computing 19 BITS Pilani, Hyderabad Campus
Reference
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 3, “Distributed
Computing: Principles, Algorithms, and Systems”, Cambridge
University Press, 2008.

Course ID: SS ZG526, Title: Distributed Computing 20 BITS Pilani, Hyderabad Campus
Course ID: SS ZG526, Title: Distributed Computing 21 BITS Pilani, Hyderabad Campus
Distributed Computing
Global State & Snapshot Recording
Algorithms
Dr. Barsha Mitra
BITS Pilani CSIS Dept, BITS Pilani, Hyderabad Campus
Hyderabad Campus
Introduction
•problems in recording global state
•lack of a globally shared memory
•lack of a global clock
•message transmission is asynchronous
•message transfer delays are finite but unpredictable

Course No.- SS ZG526, Course Title - Distributed Computing 2 BITS Pilani, Hyderabad Campus
A Distributed Program
•Cij : channel from process pi to process pj
•mij: message sent by pi to pj
•Global state of a distributed computation consists of
•states of the processes
•states of the communication channels
•State of a process is characterized by its local memory state
•Set of messages in transit in the channel characterizes state of channel
• occurrence of events
•changes processes' states
•changes channels' states
•causes transition in global system state
Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
System Model

•if a snapshot recording algorithm records the states of pi and pj as LSi and LSj, respectively, it must record the state of channel Cij as transit(LSi, LSj)
•For Cij, the in-transit messages are:
   transit(LSi, LSj) = {mij | send(mij) ∈ LSi ∧ rec(mij) ∉ LSj}

Course No.- SS ZG526, Course Title - Distributed Computing 4 BITS Pilani, Hyderabad Campus
A Consistent Global State
• global state GS is a consistent global state iff:
• C1: every message mij that is recorded as sent in the local state of
sender pi must be captured either in the state of Cij or in the
collected local state of the receiver pj
• C1 states the law of conservation of messages
• C2: send(mij) ∉ LSi ⇒ mij ∉ SCij ∧ rec(mij) ∉ LSj
• Condition C2 states that for every effect, its cause must be present

Course No.- SS ZG526, Course Title - Distributed Computing 5 BITS Pilani, Hyderabad Campus
Chandy–Lamport Algorithm
Marker sending rule for process pi
(1) Process pi records its state.
(2) For each outgoing channel C on which a marker has not been sent, pi sends
a marker along C before pi sends further messages along C.
Marker receiving rule for process pj
On receiving a marker along channel C:
if pj has not recorded its state then
Record the state of C as the empty set
Execute the “marker sending rule”
else
Record the state of C as the set of messages received along C after pj’s
state was recorded and before pj received the marker along C
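A minimal Python sketch of the marker rules at a single process, assuming FIFO channels; the send(channel, message) transport helper is hypothetical:

```python
class SnapshotProcess:
    def __init__(self, out_channels, in_channels, send):
        self.out = out_channels        # outgoing channel ids
        self.inc = in_channels         # incoming channel ids
        self.send = send
        self.state = None              # recorded local state
        self.chan_state = {}           # channel id -> recorded in-transit messages
        self.recording = {}            # channel id -> still recording that channel?

    def record_state(self, local_state):
        # Marker sending rule: record own state, then send a marker on every
        # outgoing channel before sending any further application message.
        self.state = local_state
        self.recording = {c: True for c in self.inc}
        for c in self.out:
            self.send(c, "MARKER")

    def on_marker(self, channel, local_state):
        # Marker receiving rule.
        if self.state is None:
            self.record_state(local_state)
            self.chan_state[channel] = []        # state of this channel = empty set
        else:
            self.chan_state.setdefault(channel, [])
        self.recording[channel] = False          # done with this channel

    def on_app_message(self, channel, msg):
        # Messages arriving after the local snapshot but before that channel's
        # marker belong to the recorded state of the channel.
        if self.state is not None and self.recording.get(channel, False):
            self.chan_state.setdefault(channel, []).append(msg)

    def done(self):
        # Termination: a marker has been received on every incoming channel.
        return self.state is not None and not any(self.recording.values())
```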
Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Chandy–Lamport Algorithm
Putting together recorded snapshots - each process can send its local
snapshot to the initiator of the algorithm

termination criterion - each process has received a marker on all of its


incoming channels

Course ID: SS ZG526, Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus
Chandy–Lamport Algorithm
[Example: sites S1 (account A) and S2 (account B); A's balance evolves 500, 450, 450, 350, 370, 370 and B's balance 1000, 1000, 1050, 1030, 1130, 1130 over times t0-t5 as transfer messages of 100, 50 and 20 are exchanged]

Course ID: SS ZG526, Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus
Snapshot Algorithms for Non-FIFO
Channels
• a marker cannot be used to delineate messages into those to be recorded in the global state from those not to be recorded
• the non-FIFO algorithm by Lai and Yang uses message piggybacking to distinguish computation messages sent after the marker from those sent before the marker

Course ID: SS ZG526, Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus
Lai–Yang Algorithm

• coloring scheme
• every process is initially white
• process turns red while taking a snapshot
• equivalent of the “marker sending rule” is executed when
a process turns red
• every message sent by a white process is colored white
• a white message is a message that was sent before the
sender of that message recorded its local snapshot

Course ID: SS ZG526, Title: Distributed Computing 10 BITS Pilani, Hyderabad Campus
Lai–Yang Algorithm

• every message sent by a red process is colored red


• a red message is a message that was sent after the
sender of that message recorded its local snapshot

• every white process takes its snapshot no later than the


instant it receives a red message
• when a white process receives a red message, it records
its local snapshot before processing the message

Course ID: SS ZG526, Title: Distributed Computing 11 BITS Pilani, Hyderabad Campus
Lai–Yang Algorithm
• every white process records a history of all white messages sent or
received by it along each channel
• when a process turns red, it sends these histories along with its
snapshot to the initiator process that collects the global snapshot
• initiator process evaluates transit(LSi, LSj) to compute the state of a
channel Cij as:
SCij = {white messages sent by pi on Cij} − {white messages
received by pj on Cij}
= {mij | send(mij) ∈ LSi} − {mij | rec(mij) ∈ LSj}
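A minimal Python sketch of the channel-state computation at the initiator (illustrative; the message identifiers are made up):

```python
# SC_ij = {white messages sent by pi on C_ij} - {white messages received by pj on C_ij}

def channel_state(white_sent_by_pi, white_received_by_pj):
    """Both arguments are collections of message ids for channel C_ij."""
    return set(white_sent_by_pi) - set(white_received_by_pj)

if __name__ == "__main__":
    # e.g., pi sent white messages m1, m2, m3; pj had received only m1
    print(channel_state({"m1", "m2", "m3"}, {"m1"}))   # {'m2', 'm3'} in transit
```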

Course ID: SS ZG526, Title: Distributed Computing 12 BITS Pilani, Hyderabad Campus
Lai–Yang Algorithm
[Example: processes P1 (variable X, values 800, 780, 750, 740, 740, 740, 770, 770, 790) and P2 (variable Y, values 200, 200, 220, 220, 190, 200, 230, 210) exchange messages of 30, 10, 20, 20 and 30 over times t1-t9]

SC12 = {white messages sent by P1 on C12} – {white messages received by P2 on C12}

SC21 = {white messages sent by P2 on C21} – {white messages received by P1 on C21}

Course ID: SS ZG526, Title: Distributed Computing 13 BITS Pilani, Hyderabad Campus
Reference
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 4,
“Distributed Computing: Principles, Algorithms, and Systems”,
Cambridge University Press, 2008.

Course No.- SS ZG526, Course Title - Distributed Computing 14 BITS Pilani, Hyderabad Campus
Course No.- SS ZG526, Course Title - Distributed Computing 15 BITS Pilani, Hyderabad Campus
Message ordering

Dr. Barsha Mitra


BITS Pilani CSIS Dept, BITS Pilani, Hyderabad Campus
Hyderabad Campus
Group Communication
•broadcast - sending a message to all members in the
distributed system
•multicasting - a message is sent to a certain subset,
identified as a group, of the processes in the system
•unicasting - point-to-point message communication

Course ID: SS ZG526, Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus
Causal Order
•A system supporting causal ordering model satisfies
CO: for mij and mkj , if send(mij ) → send(mkj ),
then rec(mij ) → rec(mkj )
•2 criteria must be satisfied by a causal ordering
protocol:
•Safety
•Liveness

Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
Causal Order
Safety
•a message M arriving at a process may need to be buffered until all
system-wide messages sent in the causal past of the send(M) event to the
same destination have already arrived
•distinction is made between
•arrival of a message at a process
•event at which the message is given to the application process
Liveness
A message that arrives at a process must eventually be delivered to the
process

Course ID: SS ZG526, Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm
• each message M should carry a log of
• all other messages
• sent causally before M’s send event, and sent to the same
destination dest(M)
• log can be examined to ensure whether it is safe to deliver a
message
• channels are FIFO

Course ID: SS ZG526, Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm
local variables:
• array of int SENT[1 …… n, 1 …… n] (n x n array)
• SENTi[j, k] = no. of messages sent by Pj to Pk as known to Pi
• array of int DELIV [1 …… n]
• DELIVi[j] = no. of messages from Pj that have been delivered to Pi

(1) send event, where Pi wants to send message M to Pj :


(1a) send (M, SENT) to Pj
(1b) SENT[i, j] = SENT[i, j] + 1

Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm

(2) message arrival, when (M, ST) arrives at Pi from Pj :
(2a) deliver M to Pi when for each process x,
(2b)    DELIVi[x] ≥ ST[x, i]
(2c) ∀x, y, SENT[x, y] = max(SENT[x, y], ST[x, y])
(2d) DELIVi[j] = DELIVi[j] + 1
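A minimal Python sketch of the send and arrival rules for one process Pi, assuming FIFO channels; the transport object and its send method are hypothetical:

```python
class RST:
    def __init__(self, i, n, transport):
        self.i, self.n = i, n
        self.SENT = [[0] * n for _ in range(n)]   # SENT[j][k]: msgs Pj->Pk known to Pi
        self.DELIV = [0] * n                       # DELIV[j]: msgs from Pj delivered here
        self.transport = transport
        self.pending = []                          # (j, M, ST) awaiting safe delivery

    def send(self, j, M):
        self.transport.send(j, (M, [row[:] for row in self.SENT]))   # (1a)
        self.SENT[self.i][j] += 1                                    # (1b)

    def on_arrival(self, j, M, ST):
        self.pending.append((j, M, ST))
        self._try_deliver()

    def _try_deliver(self):
        progress = True
        while progress:
            progress = False
            for idx, (j, M, ST) in enumerate(self.pending):
                if all(self.DELIV[x] >= ST[x][self.i] for x in range(self.n)):  # (2a)-(2b)
                    for x in range(self.n):                                      # (2c)
                        for y in range(self.n):
                            self.SENT[x][y] = max(self.SENT[x][y], ST[x][y])
                    self.DELIV[j] += 1                                           # (2d)
                    self.pending.pop(idx)
                    self.deliver(M)
                    progress = True
                    break

    def deliver(self, M):
        print("delivered", M)
```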

Course ID: SS ZG526, Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm

Complexity:
•space requirement at each process: O(n2) integers
•space overhead per message: n2 integers
•time complexity at each process for each send and deliver
event: O(n2)

Course ID: SS ZG526, Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm
• P1 has sent 3 messages to P2 and 4 messages to P3
• P2 has sent 5 messages to P1 and 2 messages to P3
• P3 has sent 4 messages to P2
• SENT of P1 = SENT of P2 =
   0 3 4
   5 0 2
   0 4 0
• DELIV1 = [0 4 0], DELIV2 = [3 0 4], DELIV3 = [3 2 0]
• Now if P1 sends m1 to P2:
   • send(m1, SENT) to P2, then SENT1[1, 2] = 4, so updated SENT of P1 =
     0 4 4
     5 0 2
     0 4 0
   • at P2, m1 is delivered when DELIV2[1] ≥ ST[1, 2] and DELIV2[3] ≥ ST[3, 2]; both hold (3 ≥ 3 and 4 ≥ 4), so m1 is delivered and DELIV2 becomes [4 0 4]
Course ID: SS ZG526, Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm
• same initial message counts as before
• Now if P3 sends m2 to P2:
   • send(m2, SENT) to P2, then SENT3[3, 2] = 5
   • SENT of P3 before the send:
     0 3 4
     5 0 2
     0 4 0
   • updated SENT of P3:
     0 3 4
     5 0 2
     0 5 0
   • SENT of P2 is still:
     0 3 4
     5 0 2
     0 4 0
   • at P2, m2 is delivered when DELIV2[1] ≥ ST[1, 2] and DELIV2[3] ≥ ST[3, 2]; with DELIV2 = [4 0 4] both hold, so m2 is delivered and DELIV2 becomes [4 0 5]
• DELIV1 = [0 4 0], DELIV3 = [3 2 0]
Course ID: SS ZG526, Title: Distributed Computing 10 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm
• Now if P1 sends m3 to P3:
   • send(m3, SENT) to P3, then SENT1[1, 3] = 5
   • SENT of P1 before the send:
     0 4 4
     5 0 2
     0 4 0
   • updated SENT of P1:
     0 4 5
     5 0 2
     0 4 0
   • SENT of P3:
     0 3 4
     5 0 2
     0 5 0
   • at P3, m3 is delivered when DELIV3[1] ≥ ST[1, 3] and DELIV3[2] ≥ ST[2, 3]
   • DELIV1 = [0 4 0], DELIV2 = [4 0 5], DELIV3 = [3 2 0]; since DELIV3[1] = 3 < ST[1, 3] = 4, m3 is buffered until the missing message from P1 is delivered to P3
Course ID: SS ZG526, Title: Distributed Computing 11 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol
• Ci = vector clock of Pi
• Ci[j] = jth element of Ci

• tm = vector timestamp for message m

Pi sends a message m to Pj

• Pi increments Ci[i]

• Pi sets the timestamp tm = Ci for message m

Course ID: SS ZG526, Title: Distributed Computing 12 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol

Pj receives a message m from Pi


• when Pj (j ≠ i) receives m with timestamp tm, it delays message
delivery until:
• Cj[i] = tm[i] − 1
• ∀k ≤ n and k ≠ i, Cj[k] ≥ tm[k]
• when m is delivered to Pj, update Pj's vector clock:
   ∀i, Cj[i] = max(Cj[i], tm[i])
• check buffered messages to see if any can be delivered
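A minimal Python sketch of the delivery test and clock update at Pj (illustrative; process ids are mapped to 0-based indices and d = 1):

```python
def can_deliver(Cj: list[int], tm: list[int], i: int) -> bool:
    """True iff Pj may deliver m from Pi (index i) with timestamp tm."""
    n = len(Cj)
    if Cj[i] != tm[i] - 1:                 # exactly the next message from Pi
        return False
    return all(Cj[k] >= tm[k] for k in range(n) if k != i)   # no missing causal msgs

def deliver(Cj: list[int], tm: list[int]) -> list[int]:
    """Update Pj's clock on delivery: Cj[k] = max(Cj[k], tm[k])."""
    return [max(a, b) for a, b in zip(Cj, tm)]

if __name__ == "__main__":
    Cj = [0, 0, 0]
    tm = [0, 0, 1]                          # first multicast from P3 (index 2)
    if can_deliver(Cj, tm, i=2):
        Cj = deliver(Cj, tm)
    print(Cj)                               # [0, 0, 1]
```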

Course ID: SS ZG526, Title: Distributed Computing 13 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol

[Space-time diagram: P1, P2 and P3, all starting with clock (0 0 0); P3 multicasts a message b with timestamp tb = (0 0 1)]

Course ID: SS ZG526, Title: Distributed Computing 14 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol

• P2 can deliver b because C2[3] = tb[3] − 1, C2[1] ≥ tb[1] and C2[2] ≥ tb[2]
• after delivery, C2 = (0 0 1)
[Space-time diagram: P3's multicast b reaching P1 and P2]

Course ID: SS ZG526, Title: Distributed Computing 15 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol

• P2 then multicasts a message r with timestamp tr = (0 1 1)
• P3 can deliver r because C3[2] = tr[2] − 1, C3[1] ≥ tr[1] and C3[3] ≥ tr[3]
• after delivery, C3 = (0 1 1)

Course ID: SS ZG526, Title: Distributed Computing 16 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol

• r reaches P1 before b and is buffered: C1[2] = tr[2] − 1 and C1[1] ≥ tr[1] hold, but C1[3] ≥ tr[3] does not
• when b arrives, P1 delivers it because C1[3] = tb[3] − 1, C1[1] ≥ tb[1] and C1[2] ≥ tb[2]; C1 becomes (0 0 1)

Course ID: SS ZG526, Title: Distributed Computing 17 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol

• the delivery condition for r now holds, so P1 delivers r from the buffer; C1 becomes (0 1 1)
• SES Protocol: from Recorded Lecture

Course ID: SS ZG526, Title: Distributed Computing 18 BITS Pilani, Hyderabad Campus
References
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 6,
“Distributed Computing: Principles, Algorithms, and Systems”,
Cambridge University Press, 2008.
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 7,
“Distributed Computing: Principles, Algorithms, and Systems”,
Cambridge University Press, 2008.
• https://courses.csail.mit.edu/6.006/fall11/rec/rec14.pdf

Course ID: SS ZG526, Title: Distributed Computing 19 BITS Pilani, Hyderabad Campus
Course ID: SS ZG526, Title: Distributed Computing 20 BITS Pilani, Hyderabad Campus
Distributed Computing
Terminology & Basic Algorithms
Dr. Barsha Mitra
BITS Pilani CSIS Dept, BITS Pilani, Hyderabad Campus
Hyderabad Campus
Notation & Definitions
• undirected unweighted graph G = (N, L)
represents topology
• vertices are nodes
• edges are channels
• n = |N|
• l = |L|
• diameter of a graph: the maximum, over all pairs of nodes, of the minimum number of edges that need to be traversed to go from one node to the other
• diameter = max over i, j ∈ N of {length of the shortest path between i and j}
• example (nodes A-F):
   distance 1: (A, B), (A, D), (A, E), (B, C), (C, E), (C, F), (D, E), (E, F)
   distance 2: (A, C), (A, F), (B, D), (B, E), (B, F), (C, D), (D, F)
   so the diameter of this graph is 2
Course ID: SS ZG526, Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus
Synchronous Single-Initiator Spanning
Tree Algorithm using Flooding
• algorithm proceeds in rounds (synchronous)
• root initiates the algorithm
• flooding of QUERY messages
• produce a spanning tree rooted at the root node
• each process Pi (Pi ≠ root) should output its own parent for the spanning tree

Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
Synchronous Single-Initiator Spanning
Tree Algorithm using Flooding
Local variables maintained at each Pi:
• int visited
• int depth
• int parent
• set of int Neighbors
Initially at each Pi:
• visited = 0
• depth = 0
• parent = NULL
• Neighbors = set of neighbors

Course ID: SS ZG526, Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus
Synchronous Single-Initiator Spanning
Tree Algorithm using Flooding
Algorithm for Pi
Round r = 1
if Pi = root then
visited = 1
depth = 0
send QUERY to Neighbors
if Pi receives a QUERY message then
visited = 1
depth = r
parent = root
plan to send QUERYs to Neighbors at next round

Course ID: SS ZG526, Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus
Synchronous Single-Initiator Spanning
Tree Algorithm using Flooding
Algorithm for Pi
Round r > 1 and r <= diameter
if Pi planned to send in previous round then
Pi sends QUERY to Neighbors
if visited = 0 then
if Pi receives QUERY messages then
visited = 1
depth = r
parent = any randomly selected node from which QUERY was received
plan to send QUERY to Neighbors \ {senders of QUERYs received in r}
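A minimal Python sketch that simulates the rounds on an adjacency-list graph (a simplification: newly reached nodes send QUERY to all neighbours rather than Neighbors \ {senders}, and depth is not tracked); the example graph matches the A-F graph used earlier:

```python
import random

def spanning_tree(adj: dict, root, diameter: int) -> dict:
    """Return parent[node] for every node reached within `diameter` rounds."""
    visited = {v: False for v in adj}
    parent = {v: None for v in adj}
    visited[root] = True
    to_send = {root}                         # processes that send QUERY this round
    for _ in range(diameter):
        received = {}                        # node -> set of QUERY senders this round
        for u in to_send:
            for v in adj[u]:
                received.setdefault(v, set()).add(u)
        to_send = set()
        for v, senders in received.items():
            if not visited[v]:
                visited[v] = True
                parent[v] = random.choice(sorted(senders))   # any sender will do
                to_send.add(v)               # plan to send QUERY next round
    return parent

if __name__ == "__main__":
    g = {"A": ["B", "D", "E"], "B": ["A", "C"], "C": ["B", "E", "F"],
         "D": ["A", "E"], "E": ["A", "C", "D", "F"], "F": ["C", "E"]}
    print(spanning_tree(g, root="A", diameter=2))
```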

Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Synchronous Single-Initiator Spanning Tree Algorithm using Flooding
[Example, slides 7-13: QUERY messages flood outward from the root round by round; each newly reached node records its depth (the round in which it was reached, shown in parentheses) and picks one of the senders as its parent; the final figure shows the resulting spanning tree]
Course ID: SS ZG526, Title: Distributed Computing 13 BITS Pilani, Hyderabad Campus
Broadcast and Convergecast on a Tree
•A spanning tree is useful for distributing (via a broadcast) and collecting
(via a convergecast) information to/from all the nodes
Broadcast Algorithm:
•BC1:
•The root sends the information to
be broadcast to all its children.
•BC2:
•When a (non-root) node receives
information from its parent, it
copies it and forwards it to its
children.

Course ID: SS ZG526, Title: Distributed Computing 14 BITS Pilani, Hyderabad Campus
Broadcast and Convergecast on a Tree
Convergecast Algorithm:
• CVC1:
• Leaf node sends its report to its parent.
• CVC2:
• At a non-leaf node that is not the root:
When a report is received from all the
child nodes, the collective report is sent
to the parent.
• CVC3:
• At the root: When a report is received
from all the child nodes, the global
function is evaluated using the reports.
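A minimal Python sketch of BC1/BC2 and CVC1-CVC3 on a rooted tree given as child lists, using recursion in place of real messages (the tree, node values and global function below are made up):

```python
def broadcast(tree: dict, root, info, deliver):
    """Send `info` from the root down to every node (BC1 / BC2)."""
    deliver(root, info)
    for child in tree.get(root, []):
        broadcast(tree, child, info, deliver)

def convergecast(tree: dict, node, report, combine):
    """Collect reports upward; returns the collective report at `node`."""
    children = tree.get(node, [])
    if not children:
        return report(node)                   # CVC1: leaf sends its report
    child_reports = [convergecast(tree, c, report, combine) for c in children]
    return combine(report(node), child_reports)   # CVC2 / CVC3

if __name__ == "__main__":
    t = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
    broadcast(t, "A", "hello", deliver=lambda n, m: print(n, "got", m))
    # global function at the root: e.g., the minimum of all node values
    vals = {"A": 7, "B": 3, "C": 9, "D": 5}
    print(convergecast(t, "A", report=lambda n: vals[n],
                       combine=lambda v, ch: min([v] + ch)))
```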
Course ID: SS ZG526, Title: Distributed Computing 15 BITS Pilani, Hyderabad Campus
Broadcast and Convergecast on a Tree

Complexity

• each broadcast and each convergecast requires n − 1 messages


• each broadcast and each convergecast requires time equal to
the maximum height h of the tree, which is O(n)

Course ID: SS ZG526, Title: Distributed Computing 16 BITS Pilani, Hyderabad Campus
Reference
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 5,
“Distributed Computing: Principles, Algorithms, and Systems”,
Cambridge University Press, 2008.

Course ID: SS ZG526, Title: Distributed Computing 17 BITS Pilani, Hyderabad Campus
Course ID: SS ZG526, Title: Distributed Computing 18 BITS Pilani, Hyderabad Campus
Distributed Mutual Exclusion

Dr. Barsha Mitra


BITS Pilani CSIS Dept, BITS Pilani, Hyderabad Campus
Hyderabad Campus
Types of Approaches

• Token-based
• Assertion-based
• From Recorded Lecture
• Quorum-based

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Quorum-Based Approach

• each site requests permission to execute the CS from a subset of sites called a quorum
• quorums are formed in such a way that when two sites
concurrently request access to the CS
• at least one site receives both the requests
• makes sure that only one site executes the CS at any time

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Requirements of Mutual Exclusion
Algorithms
•Safety property
•at any instant, only one process can execute the CS
•Liveness property
•a site must not wait indefinitely to execute the CS while other sites
are repeatedly executing the CS
•Fairness
•each process gets a fair chance to execute the CS
•CS execution requests are executed in order of their arrival time
•time is determined by a logical clock

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Performance Metrics
•From Recorded Lecture

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Lamport’s Algorithm

•every site Si keeps a queue, request_queuei


•request_queuei contains mutual exclusion requests ordered
by their timestamps
•FIFO
•CS requests are executed in increasing order of timestamps

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Lamport’s Algorithm

Requesting the critical section:


• When a site Si wants to enter the CS, it broadcasts a
REQUEST(tsi, i) message to all other sites and places the
request on request_queuei. ((tsi, i) denotes the timestamp of
the request)
• When a site Sj receives the REQUEST(tsi, i) message from site
Si, it places site Si’s request on request_queuej and returns a
timestamped REPLY message to Si
Course Title: Distributed Computing BITS Pilani, Hyderabad Campus
Lamport’s Algorithm
Executing the critical section:
Site Si enters the CS when the following two conditions hold:
• L1: Si has received a REPLY message with timestamp larger than
(tsi, i) from all other sites
• L2: Si’s request is at the top of request_queuei

Releasing the critical section:


• Site Si, upon exiting the CS, removes its request from the top of its
request queue and broadcasts a timestamped RELEASE message to all
other sites
• When a site Sj receives a RELEASE message from site Si, it removes Si’s
request from its request queue
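A minimal Python sketch of the per-site bookkeeping (illustrative; broadcast/send are hypothetical transport helpers, and the L1 timestamp check is simplified to counting the REPLYs received for the current request):

```python
import heapq

class LamportMutex:
    def __init__(self, site_id, n, broadcast, send):
        self.id, self.n = site_id, n
        self.broadcast, self.send = broadcast, send
        self.queue = []                        # request_queue_i of (ts, site) pairs
        self.replies = set()

    def request_cs(self, ts):
        heapq.heappush(self.queue, (ts, self.id))
        self.replies.clear()
        self.broadcast(("REQUEST", ts, self.id))

    def on_request(self, ts, j, my_ts):
        heapq.heappush(self.queue, (ts, j))
        self.send(j, ("REPLY", my_ts, self.id))

    def on_reply(self, j):
        self.replies.add(j)

    def can_enter_cs(self):
        # L1: a REPLY from every other site; L2: own request at the queue head
        return (len(self.replies) == self.n - 1 and
                self.queue and self.queue[0][1] == self.id)

    def release_cs(self, ts):
        heapq.heappop(self.queue)              # remove own request from the top
        self.broadcast(("RELEASE", ts, self.id))

    def on_release(self, j):
        self.queue = [(t, s) for (t, s) in self.queue if s != j]
        heapq.heapify(self.queue)
```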
Course Title: Distributed Computing BITS Pilani, Hyderabad Campus
Lamport’s Algorithm

Sites S1 and S2 make requests for the CS

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Lamport’s Algorithm

[Figure: S1 and S2 exchange REQUEST and timestamped REPLY messages; the queue may also be written as request_queue1 = ((1, 1), (1, 2))]
Site S1 enters the CS


Course Title: Distributed Computing BITS Pilani, Hyderabad Campus
Lamport’s Algorithm

Site S1 exits the CS and sends RELEASE messages

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Lamport’s Algorithm

Site S2 enters the CS


Performance - Requires 3(N − 1) messages per CS invocation
Course Title: Distributed Computing BITS Pilani, Hyderabad Campus
Ricart–Agrawala Algorithm
•communication channels are not required to be FIFO
•REQUEST and REPLY messages
•each process pi maintains the request-deferred array, RDi
•size of RDi = no. of processes in the system
•initially, ∀i ∀j: RDi[j] = 0
•whenever pi defers the request sent by pj, it sets RDi[j] = 1,
•after it has sent a REPLY message to pj , it sets RDi[j] = 0

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Ricart–Agrawala Algorithm
Requesting the critical section:
(a) When a site Si wants to enter the CS, it broadcasts a
timestamped REQUEST message to all other sites

(b) When site Sj receives a REQUEST message from site Si, it


sends a REPLY message to site Si if site Sj is neither requesting
nor executing the CS, or if the site Sj is requesting and Si’s
request’s timestamp is smaller than site Sj’s own request’s
timestamp. Otherwise, the reply is deferred and Sj sets RDj [i] = 1

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Ricart–Agrawala Algorithm
Executing the critical section:
(c) Site Si enters the CS after it has received a REPLY message from every
site it sent a REQUEST message to

Releasing the critical section:


(d) When site Si exits the CS, it sends all the deferred REPLY messages:
∀j if RDi [j] = 1, then Si sends a REPLY message to Sj and sets RDi [j] = 0
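A minimal Python sketch of the request-deferral logic at one site (illustrative; broadcast/send are hypothetical transport helpers and clock management is left out):

```python
class RicartAgrawala:
    def __init__(self, site_id, n, broadcast, send):
        self.id, self.n = site_id, n
        self.broadcast, self.send = broadcast, send
        self.RD = [0] * n                 # request-deferred array
        self.requesting = False
        self.in_cs = False
        self.my_ts = None
        self.replies = 0

    def request_cs(self, ts):
        self.requesting, self.my_ts, self.replies = True, ts, 0
        self.broadcast(("REQUEST", ts, self.id))

    def on_request(self, ts, j):
        # defer if executing the CS, or requesting with an earlier timestamp
        defer = self.in_cs or (self.requesting and (self.my_ts, self.id) < (ts, j))
        if defer:
            self.RD[j] = 1
        else:
            self.send(j, ("REPLY", self.id))

    def on_reply(self):
        self.replies += 1
        if self.replies == self.n - 1:    # (c) enter the CS
            self.in_cs = True

    def release_cs(self):
        self.in_cs, self.requesting = False, False
        for j in range(self.n):           # (d) answer all deferred requests
            if self.RD[j]:
                self.RD[j] = 0
                self.send(j, ("REPLY", self.id))
```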

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Ricart–Agrawala Algorithm

Sites S1 and S2 each make a request for the CS

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Ricart–Agrawala Algorithm

Site S1 enters the CS

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Ricart–Agrawala Algorithm

Site S1 exits the CS and sends a REPLY message to S2's deferred request

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Ricart–Agrawala Algorithm

Site S2 enters the CS

Performance - requires 2(N − 1) messages per CS execution


Course Title: Distributed Computing BITS Pilani, Hyderabad Campus
Maekawa’s Algorithm

• quorum-based mutual exclusion algorithm


• request sets for sites (i.e., quorums) are constructed to satisfy
the following conditions:
• M1: (∀i ∀j : i ≠ j, 1 ≤ i, j ≤ N :: Ri ∩ Rj ≠ ∅)
• M2: (∀i : 1 ≤ i ≤ N :: Si ∈ Ri)
• M3: (∀i : 1 ≤ i ≤ N :: |Ri| = K for some K)
• M4: Any site Sj is contained in K number of Ri's, 1 ≤ i, j ≤ N
• Maekawa showed that N = K(K − 1) + 1
• This relation gives |Ri| = K = √N
• Uses REQUEST, REPLY and RELEASE messages
• Performance → 3√N messages per CS invocation
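A minimal Python sketch that checks conditions M1-M4 for a given family of request sets; plugging in the request sets from the problem on the next slide reproduces the violations listed there (illustrative, not part of Maekawa's algorithm itself):

```python
from itertools import combinations

def check_quorums(R: dict) -> list[str]:
    """R maps site id -> set of site ids (its quorum). Returns the violations."""
    sites = sorted(R)
    K = len(R[sites[0]])
    problems = []
    for i, j in combinations(sites, 2):                 # M1: pairwise intersection
        if not (R[i] & R[j]):
            problems.append(f"M1: R{i} and R{j} are disjoint")
    for i in sites:                                     # M2: Si in Ri
        if i not in R[i]:
            problems.append(f"M2: S{i} not in R{i}")
    for i in sites:                                     # M3: equal size K
        if len(R[i]) != K:
            problems.append(f"M3: |R{i}| != {K}")
    for j in sites:                                     # M4: each site in K quorums
        count = sum(1 for i in sites if j in R[i])
        if count != K:
            problems.append(f"M4: S{j} appears in {count} quorums, expected {K}")
    return problems

if __name__ == "__main__":
    R = {1: {1, 2, 4}, 2: {2, 3, 5}, 3: {1, 3, 7}, 4: {2, 5, 6},
         5: {1, 4, 5}, 6: {3, 4, 6}, 7: {2, 6, 7}}
    print(check_quorums(R))
```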
Course Title: Distributed Computing BITS Pilani, Hyderabad Campus
Maekawa’s Algorithm: Problem

7 sites – S1, S2, S3, S4, S5, S6 and S7, K = 3.
R1 = {S1, S2, S4}
R2 = {S2, S3, S5}
R3 = {S1, S3, S7}
R4 = {S2, S5, S6}
R5 = {S1, S4, S5}
R6 = {S3, S4, S6}
R7 = {S2, S6, S7}
R3 ∩ R4 = ∅, which violates M1.
R5 ∩ R7 = ∅, which violates M1.
S4 does not belong to R4, which violates M2.
S2 belongs to R1, R2, R4 and R7 while K = 3, which violates M4.
S7 belongs to only R3 and R7 while K = 3, which violates M4.

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Maekawa’s Algorithm

• Algorithm Steps: from Recorded Lecture


• Deadlocks: from Recorded Lecture

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Reference

• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 9,


“Distributed Computing: Principles, Algorithms, and
Systems”, Cambridge University Press, 2008.

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Course ID: SS ZG526, Title: Distributed Computing 24 BITS Pilani, Hyderabad Campus
Distributed Mutual Exclusion

Dr. Barsha Mitra


BITS Pilani CSIS Dept, BITS Pilani, Hyderabad Campus
Hyderabad Campus
Raymond’s Tree-Based Algorithm

• uses a spanning tree of the network


• each node maintains a HOLDER variable that provides information
about the placement of the privilege in relation to the node itself
• a node stores in its HOLDER variable the identity of a node that it
thinks has the privilege or leads to the node having the privilege
• for 2 nodes X and Y, if HOLDERX = Y, the undirected edge between X
and Y can be redrawn as a directed edge from X to Y
• Node containing the privilege is also known as the root node

Course Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus


Raymond’s Tree-Based Algorithm

• messages between nodes traverse along undirected edges of the tree


• node needs to hold information about and communicate only to its
immediate-neighboring nodes
• HOLDERN1 = N2
• HOLDERN2 = N3
• HOLDERN3 = self
• HOLDERN4 = N3
• HOLDERN5 = N4
• HOLDERN6 = N5
• HOLDERN7 = N5
• HOLDERN8 = N6
• HOLDERN9 = N6
[Figure: spanning tree of nodes N1–N9 with each HOLDER edge directed toward the privileged node N3]

Course Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus


Raymond’s Tree-Based Algorithm

[Figure: two views of the spanning tree of nodes N1–N9 with edges directed according to the HOLDER variables]

Course Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus


Raymond’s Tree-Based Algorithm

Data Structures
• HOLDER
• USING
• possible values true or false
• indicates if the current node is executing the critical section
• ASKED
• possible values true or false
• indicates if node has sent a request for the privilege
• prevents the sending of duplicate requests for privilege

Course Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus


Raymond’s Tree-Based Algorithm

Data Structures
• REQUEST_Q
• FIFO queue that can contain “self ” or the identities of
immediate neighbors as elements
• REQUEST_Q of a node consists of the identities of
those immediate neighbors that have requested for
privilege but have not yet been sent the privilege
• maximum size of REQUEST_Q of a node is the number
of immediate neighbors + 1 (for “self ”)
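As an illustrative sketch only (the algorithm steps themselves are covered in the recorded lecture), the per-node state described on these slides could be grouped as follows; MAX_NEIGHBORS, SELF and the circular-queue bookkeeping fields are assumptions, not part of Raymond's original description.

/* Per-node state for Raymond's tree-based algorithm (illustrative sketch). */
#include <stdbool.h>

#define MAX_NEIGHBORS 8            /* assumed upper bound on tree degree */
#define SELF (-1)                  /* marker meaning "this node itself"  */

typedef struct {
    int  holder;                   /* HOLDER: neighbor believed to hold or lead to
                                      the privilege, or SELF if this node is the root */
    bool using_cs;                 /* USING: currently executing the CS               */
    bool asked;                    /* ASKED: a REQUEST has already been sent          */
    int  request_q[MAX_NEIGHBORS + 1];  /* REQUEST_Q: FIFO of neighbor ids or SELF    */
    int  q_head, q_len;            /* circular-queue bookkeeping (assumed)            */
} RaymondNode;

int main(void) {
    RaymondNode n = { .holder = 2, .using_cs = false, .asked = false,
                      .q_head = 0, .q_len = 0 };
    (void)n;                       /* state only; message handling omitted */
    return 0;
}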

Course Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus


Suzuki–Kasami’s Broadcast Algorithm
• broadcasts a REQUEST message for the token to all other sites
• site that possesses the token sends it to the requesting site
upon the receipt of its REQUEST message
• if a site receives a REQUEST message when it is executing the
CS, it sends the token only after it has completed the CS
execution

Course Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus


Suzuki–Kasami’s Broadcast Algorithm
• Distinguish outdated and current REQUEST messages
• REQUEST(j, sn) where sn (sn = 1, 2…..) is a sequence number that indicates that Sj is
requesting its snth CS execution
• Si keeps an array of integers RNi[1, … ,n] where RNi[j] is the largest sequence number
received in a REQUEST message so far from Sj
• when Si receives a REQUEST(j, sn) message, it sets RNi[j] = max(RNi[j], sn)
• when Si receives a REQUEST(j,sn) message, the request is outdated if RNi[j] > sn

Example with 3 sites and RN1 = [3 4 1] (see the sketch below):
• S1 receives REQUEST(2, 5): RN1[2] = max(4, 5) = 5, so RN1 becomes [3 5 1] and the request is current
• S1 receives REQUEST(2, 3): RN1[2] = max(4, 3) = 4, so RN1 remains [3 4 1]; since RN1[2] > 3, the request is outdated
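A small sketch (illustrative) of how a site updates RN on a REQUEST and recognises the second request above as outdated; the function name on_request and the 0-based array indexing are assumptions.

/* Suzuki-Kasami: update RN on a REQUEST and test whether it is outdated. */
#include <stdio.h>

#define N 3

/* RN[j-1] holds RNi[j]; returns 1 if REQUEST(j, sn) is outdated, 0 if current */
int on_request(int RN[], int j, int sn) {
    int outdated = RN[j - 1] > sn;
    if (sn > RN[j - 1]) RN[j - 1] = sn;     /* RNi[j] := max(RNi[j], sn) */
    return outdated;
}

int main(void) {
    int RN1[N] = {3, 4, 1};
    printf("REQUEST(2,5): %s, RN1 = [%d %d %d]\n",
           on_request(RN1, 2, 5) ? "outdated" : "current", RN1[0], RN1[1], RN1[2]);
    int RN1b[N] = {3, 4, 1};                /* same starting state as in the example */
    printf("REQUEST(2,3): %s, RN1 = [%d %d %d]\n",
           on_request(RN1b, 2, 3) ? "outdated" : "current", RN1b[0], RN1b[1], RN1b[2]);
    return 0;
}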

Course Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus


Suzuki–Kasami’s Broadcast Algorithm
• sites with outstanding requests for the CS are determined in the following manner:
• token consists of a queue of requesting sites, Q, and an array of integers LN[1, … ,n],
where LN[j] is the sequence number of the request which Sj executed most recently
• after executing its CS, a site Si updates LN[i] = RNi[i]
• token array LN[1, … ,n] permits a site to determine if a site has an outstanding request
for the CS
• at Si, if RNi[j] = LN[j] + 1, then Sj is currently requesting a token
• after executing the CS, a site checks this condition for all the j’s to determine all the
sites that are requesting the token and places their i.d.’s in queue Q if these i.d.’s are
not already present in Q
• finally, the site sends the token to the site whose i.d. is at the head of Q

Course Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus


Suzuki–Kasami’s Broadcast Algorithm
Requesting the critical section:
(a) If requesting site Si does not have the token, then it increments
its sequence number, RNi[i], and sends a REQUEST(i, sn) message to
all other sites. (“sn” is the updated value of RNi[i].)

(b) When a site Sj receives this message, it sets RNj[i] to max(RNj[i],


sn). If Sj has the idle token, then it sends the token to Si if RNj[i] =
LN[i]+1.

Executing the critical section:


(c) Site Si executes the CS after it has received the token.

Course Title: Distributed Computing 10 BITS Pilani, Hyderabad Campus


Suzuki–Kasami’s Broadcast Algorithm

Releasing the critical section: Having finished the execution of


the CS, site Si takes the following actions:
(d) It sets LN[i] element of the token array equal to RNi[i].

(e) For every site Sj whose i.d. is not in the token queue, it
appends its i.d. to the token queue if RNi[j] = LN[j] + 1.

(f) If the token queue is non-empty after the above update, Si


deletes the top site i.d. from the token queue and sends the
token to the site indicated by the i.d.
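Steps (d)–(f) can be sketched as a single local routine operating on the token. The following is a minimal, single-process illustration (not the book's pseudocode); the Token struct and the name release_cs are assumptions.

/* Suzuki-Kasami release steps (d)-(f): update LN, enqueue outstanding
   requesters, and pick the next token holder (illustrative sketch).     */
#include <stdio.h>
#include <stdbool.h>

#define N 4

typedef struct {
    int LN[N];            /* LN[j]: sequence number of Sj's last executed request */
    int Q[N];             /* queue of requesting site ids (0-based)               */
    int q_len;
} Token;

static bool in_queue(const Token *t, int j) {
    for (int k = 0; k < t->q_len; k++) if (t->Q[k] == j) return true;
    return false;
}

/* Site i finishes its CS; RN is its local request-number array.
   Returns the id of the next token holder, or -1 if the token stays idle. */
int release_cs(Token *t, const int RN[N], int i) {
    t->LN[i] = RN[i];                                   /* step (d) */
    for (int j = 0; j < N; j++)                         /* step (e) */
        if (j != i && !in_queue(t, j) && RN[j] == t->LN[j] + 1)
            t->Q[t->q_len++] = j;
    if (t->q_len == 0) return -1;                       /* step (f) */
    int next = t->Q[0];
    for (int k = 1; k < t->q_len; k++) t->Q[k - 1] = t->Q[k];
    t->q_len--;
    return next;                                        /* send token to 'next' */
}

int main(void) {
    Token t = { .LN = {0, 0, 0, 0}, .q_len = 0 };
    int RN0[N] = {1, 1, 0, 1};      /* S0 just ran request 1; S1 and S3 are waiting */
    int next = release_cs(&t, RN0, 0);
    printf("token goes to S%d, queue length now %d\n", next, t.q_len);
    return 0;
}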

Course Title: Distributed Computing 11 BITS Pilani, Hyderabad Campus


Suzuki–Kasami’s Broadcast Algorithm

Performance
• if a site holds the token, no message is required
• if a site does not hold the token, N messages are required

Course Title: Distributed Computing 12 BITS Pilani, Hyderabad Campus


Reference

• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 9,


“Distributed Computing: Principles, Algorithms, and
Systems”, Cambridge University Press, 2008.

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Course ID: SS ZG526, Title: Distributed Computing 14 BITS Pilani, Hyderabad Campus
SS ZG 526: Distributed Computing
Introduction: Part 1

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Course Overview
Introduction to distributed computing
Logical time
Global snapshot
Causal ordering of messages in group communication

Distributed mutual exclusion


Consensus and agreement protocols

Distributed deadlock detection

Peer-to-Peer computing
Cluster computing

Grid computing

Internet of Things (IoT)

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Text and References

T1 Ajay D. Kshemkalyani, and Mukesh Singhal “Distributed Computing:


Principles, Algorithms, and Systems”, Cambridge University Press, 2008
(Reprint 2013).

R1 Kai Hwang, Geoffrey C. Fox, and Jack J. Dongarra, “Distributed and


Cloud Computing: From Parallel processing to the Internet of Things”,
Morgan Kaufmann, 2012 Elsevier Inc.

R2 John F. Buford, Heather Yu, and Eng K. Lua, “P2P Networking and
Applications”, Morgan Kaufmann, 2009 Elsevier Inc.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Golden era in Computing

Drivers of the golden era in computing:
• Powerful multi-core processors
• General-purpose graphic processors
• Explosion of domain applications
• Superior software methodologies
• Proliferation of devices
• Wider bandwidth for communication
• Virtualization leveraging the powerful hardware

Image source: Cloud Futures 2011, Redmond

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
What is a Distributed System?

• Loosely coupled collection of autonomous computers connected by a


network running a distributed operating system to produce a single
integrated computing environment ( a virtual computer).

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Relation between software components

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Networked vs. distributed systems

Networking OS Distributed OS

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Motivation for Distributed Computing

• Resource sharing
• Access to remote resources
• Increased performance/cost ratio
• Reliability
• Availability, integrity, fault-tolerance
• Scalability
• Modularity and incremental expandability

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Few Examples …

Linux cluster
Global access to a resource using a distributed architecture

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Continued…

Tor P2P Overlay Image source: Wiki

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next session

Multiprocessor vs multicomputer systems, Distributed


systems design issues

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Introduction: Part 2

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Multiprocessor Vs Multi-computer systems

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Multiprocessors

a) A crossbar switch, b) An omega switching network

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Distributed Systems: Pros and Cons
• Advantages
– Communication and resource sharing possible
– Economy : Price-performance
– Reliability & scalability
– Potential for incremental growth
• Disadvantages
– Distribution-aware OSs and applications
– High network connectivity is essential
– Security and privacy

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Design Issues: Scaling
Technique (1): Hiding latency
• If possible, use asynchronous communication
– Not always possible if the client has nothing else to do
• Alternatively, move part of the computation to the client


SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Scaling Technique (2): Distribution aware

Example: DNS name resolution

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
More Design Issues

• Lack of Global Knowledge


• Naming
• Compatibility
• Process Synchronization
• Resource Management
• Security

Image source: www.bbc.co.uk

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next session

Distributed Communication models

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Communication Models

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
RPC: Remote Procedure Call

• Issues:
– identifying and accessing the remote procedure
– parameters
– return value
• Sun RPC
• Microsoft’s DCOM
• OMG’s CORBA
• Java RMI
• XML/RPC
• SOAP/.NET
• AJAX (Asynchronous Javascript and XML)
SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Remote Procedure Calls (RPC)

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Procedure arguments

• To reduce the complexity of the interface specification, Sun RPC includes


support for a single argument to a remote procedure.*

• Typically the single argument is a structure that contains a number of values.

• *Newer versions can handle multiple arguments.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Procedure identification

• Each procedure is identified by:


– Hostname (IP Address)
– Program identifier (32 bit integer)
– Procedure identifier (32 bit integer)
– Program version identifier (for testing and migration)

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Program identifiers

• Each remote program has a unique ID.


• Sun divided up the IDs:
– 0x00000000 - 0x1fffffff Sun
– 0x20000000 - 0x3fffffff SysAdmin
– 0x40000000 - 0x5fffffff Transient
– 0x60000000 - 0xffffffff Reserved

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SUN RPC

struct square_in {
long arg1;
};
struct square_out {
long res1;
};
program SQUARE_PROG {
version SQUARE_VERS {
square_out SQUAREPROC(square_in) = 1;
} = 1;
} = 0x13451111;

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Rpcgen

rpcgen takes a protocol description as an input file and generates C source code:
client stubs, XDR filters, a header file, and a server skeleton

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Rpcgen continued…

bash$ rpcgen square.x

produces:
– square.h header
– square_svc.c server stub
– square_clnt.c client stub
– square_xdr.c XDR conversion routines

Function names derived from IDL function names and version numbers

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Square Client: Client.c
#include "square.h"
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char **argv)
{ CLIENT *cl;
square_in in;
square_out *out;
if (argc != 3) { printf("client <localhost> <integer>"); exit(1); }
cl = clnt_create(argv[1], SQUARE_PROG, SQUARE_VERS, "tcp");
in.arg1 = atol(argv[2]);
if ((out = squareproc_1(&in, cl)) == NULL)
{ printf("Error"); exit(1); }
printf("Result %ld\n", out->res1);
exit(0);
}
SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Square server: server.c

#include "square.h"
#include <stdio.h>
square_out *squareproc_1_svc (square_in *inp, struct svc_req *rqstp)
{
static square_out outp;
outp.res1 = inp->arg1 * inp->arg1;
return (&outp);
}

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Exe creation

• gcc -o client client.c square_clnt.c square_xdr.c -lnsl

• gcc -o server server.c square_svc.c square_xdr.c -lrpcsvc -lnsl

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next module

How to model a distributed computation, and logical


clocks?

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Computation Model, and Logical Clocks

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
A distributed program

• A distributed program is composed of a set of n


asynchronous processes, p1, p2, ..., pi , ..., pn.

• The processes do not share a global memory and


communicate solely by message passing.

• The processes do not share a global clock that is


instantaneously accessible to these processes.

• Process executions and message transfers are


asynchronous.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
A model of distributed executions

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Models of Communication Networks

• There are several models of services provided by


communication networks, namely, FIFO, Non-FIFO, and
causal ordering.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Global state of a distributed system

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Consistent global state

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next session

Lamport’s logical clocks

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Mutual Exclusion (DME)

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Distributed Mutual Exclusion

What is mutual exclusion?

1. Simultaneous update and read of a directory/file

2. Can two processes send their data to a printer?

So, it is exclusive access to a shared resource or to the


critical region.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Performance of DME Algorithms

• Performance of each algorithm is measured in terms of


- no. of messages required for CS invocation
- synchronization delay (leaving & entering)
- response time (arrival, sending out, Entry & Exit)

• System throughput = 1/(SD + E), where SD is the synchronization delay and E is the average CS execution time (e.g., SD = 2 ms and E = 8 ms give a throughput of 100 CS executions per second)

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Continued…

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
A Centralized Algorithm

• A Central controller with a queue for deferring


replies.
• Request, Reply, and Release messages.
• Reliability and Performance bottleneck.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next session

Lamport's DME

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Mutual Exclusion (DME)

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Lamport’s Distributed Mutual Exclusion
Requesting the critical section.

1. When a site Si wants to enter the CS, it sends a REQUEST(T=tsi, i)


message to all the sites in its request set Ri and places the request on
request_queuei.
2. When a site Sj receives the REQUEST(tsi , i) message from site Si, it
returns a timestamped REPLY message to Si and places site Si’s request
on request_queuej.

Executing the critical section.

Site Si enters the CS when the two following conditions hold:


1. Si has received a message with timestamp larger than (tsi, i) from all
other sites.
2. Si’s request is at the top of request_queuei.
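A minimal sketch (illustrative) of the two entry conditions, assuming request_queue_i is kept sorted by timestamp and that latest[j] records the timestamp of the most recent message received from site Sj; the names can_enter and Req are assumptions.

/* Lamport's DME: test whether site i may enter the CS (illustrative sketch). */
#include <stdio.h>
#include <stdbool.h>

#define N 3                         /* sites S1..S3 */

typedef struct { long ts; int site; } Req;

static bool ts_less(Req a, Req b) {
    return a.ts < b.ts || (a.ts == b.ts && a.site < b.site);
}

/* queue[]   : request_queue_i, kept sorted by (ts, site); queue[0] is the head
   latest[j] : timestamp of the most recent message received from site Sj (1..N) */
bool can_enter(const Req queue[], int qlen, const long latest[], Req my_req) {
    if (qlen == 0 || queue[0].site != my_req.site) return false;   /* condition 2 */
    for (int j = 1; j <= N; j++)                                   /* condition 1 */
        if (j != my_req.site && !ts_less(my_req, (Req){latest[j], j}))
            return false;
    return true;
}

int main(void) {
    Req q[2] = { {1, 1}, {1, 2} };      /* request_queue1 = ((1,1),(1,2)) as in the example */
    long latest[N + 1] = {0, 0, 3, 2};  /* assumed timestamps of REPLYs from S2 and S3 */
    printf("%s\n", can_enter(q, 2, latest, (Req){1, 1}) ? "S1 may enter the CS"
                                                        : "S1 must wait");
    return 0;
}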

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Continued…
Releasing the critical section.

1. Site Si, upon exiting the CS, removes its request from the top of its request
queue and sends a timestamped RELEASE message to all the sites in its
request set.

2. When a site Sj receives a RELEASE message from site Si, it removes Si’s
request from its request queue.

3. When a site removes a request from its request queue, its own request may
become the top of the queue, enabling it to enter the CS. The algorithm
executes CS requests in the increasing order of timestamps.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Correctness
• Suppose that both Si and Sj were in CS at the same time (t).
• Then we have:

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next Session

Ricart-Agrawala, and Maekawa’s DME

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Mutual Exclusion (DME)

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Ricart-Agrawala DME

Requesting Site:
– A requesting site Pi sends a message request(ts,i) to all
sites.
Receiving Site:
– Upon reception of a request(ts,i) message, the receiving
site Pj will immediately send a timestamped reply(ts,j)
message if and only if:
• Pj is not requesting or executing the critical section OR
• Pj is requesting the critical section but sent a request
with a higher timestamp than the timestamp of Pi
– Otherwise, Pj will defer the reply message.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Maekawa’s DME

• A site requests permission only from a subset of sites.

• Request set of sites Si & Sj: Ri, Rj such that Ri and Rj will have
at-least one common site (Sk). Sk mediates conflicts between
Ri and Rj.

• A site can send only one REPLY message at a time, i.e., a site
can send a REPLY message only after receiving a
RELEASE message for the previous REPLY message.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Request set

• Conditions M1 and M2 are necessary for correctness; whereas


conditions M3 and M4 provide other desirable features to the algorithm.
• Condition M3 states that the size of the request sets of all sites must be
equal implying that all sites should have to do equal amount of work to
invoke mutual exclusion.
• Condition M4 enforces that exactly the same number of sites should
request permission from any site implying that all sites have “equal
responsibility” in granting permission to other sites.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Maekawa’s Request set
R1 = { 1, 2, 3, 4 }
R2 = { 2, 5, 8, 11 }
R3 = { 3, 6, 8, 13 }
R4 = { 4, 6, 10, 11 }
R5 = { 1, 5, 6, 7 }
R6 = { 2, 6, 9, 12 }
R7 = { 2, 7, 10, 13 }          (N = 13)
R8 = { 1, 8, 9, 10 }
R9 = { 3, 7, 9, 11 }
R10 = { 3, 5, 10, 12 }
R11 = { 1, 11, 12, 13 }
R12 = { 4, 7, 8, 12 }
R13 = { 4, 5, 9, 13 }

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next Session

Maekawa’s DME algorithm

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Mutual Exclusion (DME)

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Maekawa’s DME Algo
Requesting the critical section
1. A site Si requests access to the CS by sending REQUEST(i)
messages to all the sites in its request set Ri.

2. When a site Sj receives the REQUEST(i) message, it sends a


– REPLY(j) message to Si provided it hasn’t sent a REPLY
message to a site from the time it received the last RELEASE
message.
– Otherwise, it queues up the REQUEST for later consideration.

Executing the critical section


1. Site Si accesses the CS only after receiving REPLY messages
from all the sites in Ri .
SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Continued…

Releasing the critical section


1. After the execution of the CS, site Si sends RELEASE(i) message to all the
sites in Ri .

2. When a site Sj receives a RELEASE(i) message from site Si, it sends a


REPLY message to the next site waiting in the queue and deletes that entry
from the queue. If the queue is empty, then the site updates its state to reflect
that the site has not sent out any REPLY message.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example
R1 = {S1, S2, S3}
R2 = {S2, S4, S6}
R3 = {S3, S5, S6}
R4 = {S1, S4, S5}
R5 = {S2, S5, S7}
R6 = {S1, S6, S7}
R7 = {S3, S4, S7}

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Deadlocks
R1 = {S1, S2, S3}
R2 = {S2, S4, S6}
R3 = {S3, S5, S6}
R4 = {S1, S4, S5}
R5 = {S2, S5, S7}
R6 = {S1, S6, S7}
R7 = {S3, S4, S7}

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Deadlock handling
FAILED
A FAILED message from site Si to site Sj indicates that Si
cannot grant Sj's request because it has currently granted
permission to a site with a higher priority request. This tells Sj
that it is not merely waiting for a delayed REPLY message.

INQUIRE
An INQUIRE message from Si to Sj indicates that Si would like
to find out from Sj if it has succeeded in locking all the sites in
its request set.

YIELD
A YIELD message from site Si to Sj indicates that Si is
returning the permission to Sj (to yield to a higher priority
request at Sj ).

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next Session

Token based DME

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Computation Model, and Logical Clocks

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Logical Clocks: Scalar time and Vector time

• Causality among events in a distributed system is a


powerful concept in reasoning, analysing, and drawing
inferences about a computation.

• The knowledge of the causal precedence relation among


the events of processes helps solve a variety of problems
in distributed systems, such as distributed algorithms
design, tracking of dependent events, knowledge about the
progress of a computation, and concurrency measures.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Scalar Time
• Proposed by Lamport as an attempt to totally order events
in a distributed system.
• Time domain is the set of non-negative integers.
• The logical local clock of a process pi and its local view of
the global time are squashed into one integer variable Ci.
• Rules R1 and R2 to update the clocks are as follows:
• R1: Before executing an event (send, receive, or internal),
process pi executes the following:
Ci := Ci + d (d > 0)
In general, every time R1 is executed, d can have a
different value; however, typically d is kept at 1.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Scalar Time

• R2: Each message piggybacks the clock value of its sender at


sending time. When a process pi receives a message with
timestamp Cmsg, it executes the following actions:
– Ci := max(Ci, Cmsg)
– Execute R1
– Deliver the message

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
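A minimal sketch (illustrative) of rules R1 and R2 above with d = 1; the Process struct and function names are assumptions.

/* Lamport scalar clock: rules R1 and R2 with d = 1 (illustrative sketch). */
#include <stdio.h>

typedef struct { long C; } Process;

/* R1: executed before every send, receive, or internal event */
static void tick(Process *p) { p->C += 1; }

/* R2: executed when a message stamped C_msg is received */
static void on_receive(Process *p, long C_msg) {
    p->C = p->C > C_msg ? p->C : C_msg;   /* C_i := max(C_i, C_msg) */
    tick(p);                              /* then execute R1        */
}

int main(void) {
    Process p1 = {0}, p2 = {0};
    tick(&p1);                   /* p1 sends a message              */
    long stamp = p1.C;           /* message piggybacks C = 1        */
    tick(&p2); tick(&p2);        /* p2 performs two local events    */
    on_receive(&p2, stamp);      /* p2 receives the message         */
    printf("C1 = %ld, C2 = %ld\n", p1.C, p2.C);   /* C1 = 1, C2 = 3 */
    return 0;
}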
Scalar Time Examples

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Basic Properties

Consistency Property
• Scalar clocks satisfy the monotonicity and hence the
consistency property:
for two events ei and ej: ei → ej ⇒ C(ei) < C(ej)

Total Ordering
• Scalar clocks can be used to totally order events in a distributed
system.
• The main problem in totally ordering events is that two or more
events at different processes may have identical timestamp.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Total Ordering

• A tie-breaking mechanism is needed to order such events. A tie


is broken as follows:
– Process identifiers are linearly ordered and a tie among events with
identical scalar timestamps is broken on the basis of their process
identifiers.
– The lower the process identifier in the ranking, the higher the priority.
– The timestamp of an event is denoted by a tuple (t, i) where t is its time of
occurrence and i is the identity of the process where it occurred.
– The total order relation ≺ on two events x and y with timestamps (h, i) and
(k, j), respectively, is defined as: x ≺ y ⇔ (h < k) or (h = k and i < j)

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Properties. . .

Event counting
• If the increment value d is always 1, the scalar time has the
following interesting property: if event e has a timestamp h, then
h-1 represents the minimum logical duration, counted in units of
events, required before producing the event e;
• We call it the height of the event e.
• In other words, h-1 events have been produced sequentially
before the event e regardless of the processes that produced
these events.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Properties. . .

No Strong Consistency

• The system of scalar clocks is not strongly consistent; that is,
for two events ei and ej, C(ei) < C(ej) does not imply ei → ej
• The reason that scalar clocks are not strongly consistent is
that the logical local clock and logical global clock of a process
are squashed into one, resulting in the loss of causal
dependency information among events at different processes.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next session

Vector clocks

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Computation Model, and Logical Clocks

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Vector Time

• The system of vector clocks was developed independently


by Fidge, Mattern and Schmuck.

• In the system of vector clocks, the time domain is


represented by a set of n-dimensional non-negative integer
vectors.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Vector Time
Process pi uses the following two rules R1 and R2 to update its
clock:
• R1: Before executing an event, process pi updates its local
logical time as follows:
vti[i] := vti[i] + d (d > 0)
• R2: Each message m is piggybacked with the vector clock vt of
the sender process at sending time. On the receipt of such a
message (m,vt), process pi executes the following sequence of
actions:
– Update its global logical time as follows: 1 ≤ k ≤ n : vti[k] := max(vti[k], vt[k])
– Execute R1
– Deliver the message m
SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
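A minimal sketch (illustrative) of rules R1 and R2 above for n = 3 processes with d = 1; the array representation and function names are assumptions.

/* Vector clock rules R1 and R2 for n = 3 processes, d = 1 (illustrative sketch). */
#include <stdio.h>

#define N 3

/* R1: before executing an event at process i */
static void tick(long vt[N], int i) { vt[i] += 1; }

/* R2: on receipt of message m carrying the sender's vector clock vt_msg */
static void on_receive(long vt[N], int i, const long vt_msg[N]) {
    for (int k = 0; k < N; k++)                     /* vt_i[k] := max(vt_i[k], vt[k]) */
        if (vt_msg[k] > vt[k]) vt[k] = vt_msg[k];
    tick(vt, i);                                    /* then execute R1 */
}

int main(void) {
    long vt0[N] = {0}, vt1[N] = {0};
    tick(vt0, 0);                       /* p0 sends m, piggybacking its clock */
    long m[N]; for (int k = 0; k < N; k++) m[k] = vt0[k];
    tick(vt1, 1);                       /* a local event at p1 */
    on_receive(vt1, 1, m);              /* p1 receives m */
    printf("vt1 = [%ld %ld %ld]\n", vt1[0], vt1[1], vt1[2]);   /* [1 2 0] */
    return 0;
}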
Vector Time Example

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Vector Time
Comparing Vector Timestamps

• The following relations are defined to compare two vector
timestamps, vh and vk:
– vh ≤ vk ⇔ ∀x : vh[x] ≤ vk[x]
– vh < vk ⇔ vh ≤ vk and ∃x : vh[x] < vk[x]
– vh || vk ⇔ ¬(vh < vk) and ¬(vk < vh)

• If the process at which an event occurred is known, the test to
compare two timestamps can be simplified as follows: if events x
and y respectively occurred at processes pi and pj and are
assigned timestamps vh and vk, respectively, then
– x → y ⇔ vh[i] ≤ vk[i]
– x || y ⇔ vh[i] > vk[i] and vk[j] > vh[j]

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
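A small comparator sketch (illustrative) for the relations on the previous slide; the return-code convention is an assumption.

/* Compare two vector timestamps: returns -1 if vh < vk, 1 if vk < vh,
   0 if they are equal, and 2 if they are concurrent (illustrative sketch). */
#include <stdio.h>

#define N 3

int vt_compare(const long vh[N], const long vk[N]) {
    int le = 1, ge = 1;                 /* vh <= vk ?  vh >= vk ? */
    for (int x = 0; x < N; x++) {
        if (vh[x] > vk[x]) le = 0;
        if (vh[x] < vk[x]) ge = 0;
    }
    if (le && ge) return 0;             /* identical timestamps            */
    if (le) return -1;                  /* vh < vk: x causally precedes y  */
    if (ge) return 1;                   /* vk < vh                         */
    return 2;                           /* concurrent (vh || vk)           */
}

int main(void) {
    long a[N] = {1, 2, 0}, b[N] = {1, 3, 1}, c[N] = {2, 0, 0};
    printf("a vs b: %d\n", vt_compare(a, b));   /* -1: a < b       */
    printf("a vs c: %d\n", vt_compare(a, c));   /*  2: concurrent  */
    return 0;
}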
Vector Time
Strong Consistency
• The system of vector clocks is strongly consistent; thus, by examining the
vector timestamp of two events, we can determine if the events are causally
related.
• However, Charron-Bost showed that the dimension of vector clocks cannot be
less than n, the total number of processes in the distributed computation, for
this property to hold.
Event Counting
• If d=1 (in rule R1), then the ith component of vector clock at process pi, vti[i],
denotes the number of events that have occurred at pi until that instant.
• So, if an event e has timestamp vh, vh[j] denotes the number of events
executed by process pj that causally precede e. Clearly, Σvh[j] − 1 represents
the total number of events that causally precede e in the distributed
computation.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Efficient Implementations of Vector Clocks

• If the number of processes in a distributed computation is large,


then vector clocks will require piggybacking of huge amount of
information in messages.
• The message overhead grows linearly with the number of
processors in the system and when there are thousands of
processors in the system, the message size becomes huge even if
there are only a few events occurring in few processors.
• We discuss an efficient way to maintain vector clocks.
• Charron-Bost showed that if vector clocks have to satisfy the strong
consistency property, then in general vector timestamps must be at
least of size n, the total number of processes.
• However, optimizations are possible, and next we discuss a
technique to implement vector clocks efficiently.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Singhal-Kshemkalyani’s Differential Technique

• based on the fact that between successive message sends to the same process, only
a few entries of the vector clock at the sender process are likely to change.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next module

Global state recording and Causal ordering of


messages

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Global State Recording

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Global state recording

• Recording the global state of a distributed system on-the-fly is


an important problem.
• Why is it non-trivial?

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Uses of Global State Recording

• Recording a “consistent” state of the global


computation
– checkpointing in fault tolerance (rollback,
recovery)

• Detecting stable properties in a distributed


system via snapshots. A property is “stable”
if, once it holds in a state, it holds in all
subsequent states.

– termination detection
– deadlock detection

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Issues in recording a global state

• The following two issues need to be addressed:


I1: How to distinguish between the messages to be recorded in
the snapshot from those not to be recorded?
-Any message that is sent by a process before recording its
snapshot, must be recorded in the global snapshot.
-Any message that is sent by a process after recording its
snapshot, must not be recorded in the global snapshot.
I2: How to determine the instant when a process takes its
snapshot?
-A process pj must record its snapshot before processing a
message mij that was sent by process pi after recording its
snapshot.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Difficulties due to Non-determinism

• Deterministic Computation
• - At any point in computation there is at
most one event that can happen next.

• Non-Deterministic Computation
• - At any point in computation there can
be more than one event that can happen
next.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example

Non-deterministic

Deterministic

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Consistent / Inconsistent global states

[Figure: space-time diagrams of processes p and q exchanging messages m1, m2 and m3, illustrating cuts that give consistent and inconsistent global states]
SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Properties of the recorded global state

• The recorded global state may not correspond to


any of the global states that occurred during the
computation.

• This happens because a process can change its


state asynchronously before the markers it sent
are received by other sites and other sites record
their states.
 But the system could have passed through the recorded global
states in some equivalent executions.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next session

Chandy Lamports’ Global state recording algorithm

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Global State Recording

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Snapshot algorithms for FIFO channels

Chandy-Lamport algorithm
• The Chandy-Lamport algorithm uses a control message, called
a marker whose role in a FIFO system is to separate
messages in the channels.
• After a site has recorded its snapshot, it sends a marker, along
all of its outgoing channels before sending out any more
messages.
• A marker separates the messages in the channel into those to
be included in the snapshot from those not to be recorded in
the snapshot.
• A process must record its snapshot no later than when it
receives a marker on any of its incoming channels.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Chandy-Lamport algorithm continued…

• The algorithm can be initiated by any process by executing the


“Marker Sending Rule” by which it records its local state and
sends a marker on each outgoing channel.
• A process executes the “Marker Receiving Rule” on receiving a
marker. If the process has not yet recorded its local state, it
records the state of the channel on which the marker is
received as empty and executes the “Marker Sending Rule” to
record its local state.
• The algorithm terminates after each process has received a
marker on all of its incoming channels.
• All the local snapshots get disseminated to all other processes
and all the processes can determine the global state.
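A compact sketch (illustrative, not the textbook's pseudocode) of the marker sending and receiving rules at a single process with two incoming channels; the Proc struct and the counter used to record in-transit messages are assumptions.

/* Chandy-Lamport marker rules at one process (illustrative sketch). */
#include <stdio.h>
#include <stdbool.h>

#define C 2                         /* number of incoming channels (assumed) */

typedef struct {
    bool recorded;                  /* has this process recorded its local state? */
    bool channel_closed[C];         /* marker already received on this channel?   */
    int  channel_msgs[C];           /* messages recorded as "in transit"          */
} Proc;

static void record_local_state_and_send_markers(Proc *p) {
    p->recorded = true;             /* Marker Sending Rule: record state, then    */
    /* ... send a marker on every outgoing channel before any other message ...   */
}

static void on_marker(Proc *p, int ch) {
    if (!p->recorded) {
        p->channel_msgs[ch] = 0;    /* state of channel ch is recorded as empty   */
        record_local_state_and_send_markers(p);
    }
    p->channel_closed[ch] = true;   /* stop recording messages on this channel    */
}

static void on_message(Proc *p, int ch) {
    if (p->recorded && !p->channel_closed[ch])
        p->channel_msgs[ch]++;      /* message belongs to the channel's snapshot  */
}

int main(void) {
    Proc p = { false, {false, false}, {0, 0} };
    on_marker(&p, 0);               /* first marker: record state, channel 0 empty */
    on_message(&p, 1);              /* arrives before the marker on channel 1      */
    on_marker(&p, 1);
    printf("channel states: C0 = %d msgs, C1 = %d msgs\n",
           p.channel_msgs[0], p.channel_msgs[1]);    /* C0 = 0, C1 = 1 */
    return 0;
}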

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
State Recording Example

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Global state recording for non-FIFO channels

• A marker (like in Chandy-Lamport) cannot be used


to delineate messages.

• Here, either some degree of inhibition or


piggybacking of control information on computation
messages is to be used to capture out-of-sequence
messages.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Lai-Yang algorithm for non-FIFO channels
The Lai-Yang algorithm uses a coloring scheme on computation
messages that works as follows:
1 Every process is initially white and turns red while taking a
snapshot. The equivalent of the “Marker Sending Rule” is
executed when a process turns red.
2 Every message sent by a white (red) process is colored white
(red).
3 Thus, a white (red) message is a message that was sent before
(after) the sender of that message recorded its local snapshot.
4 Every white process takes its snapshot at its convenience, but
no later than the instant it receives a red message.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Lai-Yang algorithm continued…
5 Every white process records a history of all white messages
sent or received by it along each channel.
6 When a process turns red, it sends these histories along with its
snapshot to the initiator process that collects the global
snapshot.
7 The initiator process evaluates transit(LSi, LSj) to compute the
state of a channel Cij as given below:
SCij = {white messages sent by Pi on Cij} − {white messages
received by Pj on Cij}

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next module

Causal ordering of messages

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Causal Ordering of Messages

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Group communication

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Causal ordering of Messages

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Birman- Schiper- Stephenson (BSS) protocol

• Before broadcasting m, process Pi increments the vector time
VTPi[i] and timestamps m.
• A process Pj (j ≠ i) that receives m with timestamp VTm from Pi
delays its delivery until both:
– VTPj[i] = VTm[i] − 1 // Pj has received all previous broadcasts from Pi
– VTPj[k] ≥ VTm[k] for every k ∈ {1, 2, …, n} − {i}
// Pj has received all messages that Pi had received before sending m
• When Pj delivers m, VTPj is updated by rule IR2 of the vector clock
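A minimal sketch (illustrative) of the two delivery conditions checked at Pj; the can_deliver name and 0-based process indices are assumptions.

/* BSS protocol: can Pj deliver message m broadcast by Pi now? (illustrative sketch) */
#include <stdio.h>
#include <stdbool.h>

#define N 3

bool can_deliver(const long VT_Pj[N], const long VT_m[N], int i) {
    if (VT_Pj[i] != VT_m[i] - 1) return false;    /* all previous broadcasts from Pi seen  */
    for (int k = 0; k < N; k++)                   /* all messages Pi had seen are seen too */
        if (k != i && VT_Pj[k] < VT_m[k]) return false;
    return true;
}

int main(void) {
    long VT_Pj[N] = {1, 0, 2};
    long VT_m[N]  = {2, 0, 1};     /* second broadcast by P0 */
    printf("%s\n", can_deliver(VT_Pj, VT_m, 0) ? "deliver" : "buffer");
    long VT_m2[N] = {3, 0, 1};     /* third broadcast by P0: one is still missing */
    printf("%s\n", can_deliver(VT_Pj, VT_m2, 0) ? "deliver" : "buffer");
    return 0;
}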

SS ZG 526: Distributed Computing


Example1 of BSS

SS ZG 526: Distributed Computing


Example2 of BSS

SS ZG 526: Distributed Computing


Next session

SES protocol for Causal ordering

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Causal Ordering of Messages

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Schiper-Eggli-Sandoz (SES) protocol

• No need for broadcast messages.


• Each process maintains a vector V_P of size N - 1, N being the
number of processes in the system.
• V_P is a vector of tuple (P’,t): P’ the destination process id and t, a
vector timestamp.
• Tm: logical time of sending message ‘m’
• Tpi: present logical time at pi
• Initially, V_P is empty.

SS ZG 526: Distributed Computing


SES Continued…
Sending a Message:
– Send message M, time stamped tm, along with V_P1 to P2.
– Insert (P2, tm) into V_P1. Overwrite the previous value of
(P2,t), if any.
– (P2,tm) is not sent. Any future message carrying (P2,tm) in
V_P1 cannot be delivered to P2 until tm < Tp2.
Delivering a message
– If V_M (in the message) does not contain any pair (P2, t), it
can be delivered.
– Otherwise, a pair (P2, t) exists in V_M: if t ≮ Tp2, buffer the
message (do not deliver it yet);
– else (t < Tp2), deliver it
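A minimal sketch (illustrative) of the delivery test at destination P2, using the componentwise comparison described above; the Pair struct and can_deliver name are assumptions.

/* SES protocol: delivery test at destination process P2 (illustrative sketch). */
#include <stdio.h>
#include <stdbool.h>

#define N 3

typedef struct { int dest; long t[N]; bool valid; } Pair;   /* (P', t) entry in V_M */

/* componentwise test t < T, as used on the slide */
static bool vt_lt(const long t[N], const long T[N]) {
    bool strict = false;
    for (int k = 0; k < N; k++) {
        if (t[k] > T[k]) return false;
        if (t[k] < T[k]) strict = true;
    }
    return strict;
}

/* V_M: the vector of (destination, timestamp) pairs carried by the message */
bool can_deliver(const Pair V_M[], int len, int me, const long T_me[N]) {
    for (int k = 0; k < len; k++)
        if (V_M[k].valid && V_M[k].dest == me && !vt_lt(V_M[k].t, T_me))
            return false;               /* a pair (me, t) with t not < T_me: buffer */
    return true;
}

int main(void) {
    long T_P2[N] = {2, 1, 0};
    Pair V_M[1] = { { .dest = 2, .t = {1, 0, 0}, .valid = true } };
    printf("%s\n", can_deliver(V_M, 1, 2, T_P2) ? "deliver" : "buffer");
    return 0;
}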

SS ZG 526: Distributed Computing


Example1 of SES

SS ZG 526: Distributed Computing


Example2 of SES

SS ZG 526: Distributed Computing


Next module

Distributed mutual exclusion

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Mutual Exclusion (DME)

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Suzuki-kasami broadcast DME

Each site Si maintains an array RNi[1..n], where RNi[j] is the largest request
sequence number received so far from site Sj.

The token carries a queue Q of requesting sites and an array LN[1..n], where
LN[j] is the sequence no. of the request that site Sj executed most recently.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
The DME algorithm
Requesting the critical section.
1. If the requesting site Si does not have the token, then it increments its sequence number,
RNi [i], and sends a REQUEST(i, sn) message to all other sites. (sn is the updated value of
RNi [i].)
2. When a site Sj receives this message, it sets RNj [i] to max(RNj [i], sn). If Sj has the idle
token, it sends the token to Si if RNj [i] = LN[i] + 1.

Executing the critical section.


3. Site Si executes the CS when it has received the token.

Releasing the critical section.


Having finished the execution of the CS, site Si takes the following actions:
4. It sets LN[i] element of the token array equal to RNi [i].
5. For every site Sj whose ID is not in the token queue, it appends its ID to the token queue if
RNi [j] = LN[j] + 1.
6. If token queue is nonempty after the above update, then it deletes the top site ID from the
queue and sends the token to the site indicated by the ID.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Analysis of the DME algorithm
Correctness
Mutex is trivial.
– Theorem:
A requesting site enters the CS in finite amount of time.
– Proof:
A request enters the token queue in finite time. The queue is
in FIFO order, and there can be at most N − 1 sites ahead
of the request.

Performance
0 or N messages per CS invocation.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next Session

Raymonds tree based DME

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Mutual Exclusion (DME)

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Raymond's Tree-Based Algorithm


SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
The Algorithm
Requesting the critical section.
1. When a site wants to enter the CS, it sends a REQUEST message to the
node along the directed path to the root, provided it does not hold the token and
its request_q is empty. It then adds its request to its request_q.

2. When a site on the path receives this message, it places the REQUEST in its
request_q and sends a REQUEST message along the directed path to the root,
provided it has not already sent out a REQUEST message on its outgoing edge.

3. When the root site receives a REQUEST message, it sends the token to the
site from which it received the REQUEST message and sets its holder
variable to point at that site.

4. When a site receives the token, it deletes the top entry from its request_q,
sends the token to the site indicated in this entry, and sets its holder variable
to point at that site. If the request_q is nonempty at this point, then the site
sends a REQUEST message to the site which is pointed at by the holder
variable.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Continued…

Executing the critical section


1. A site enters the CS when it receives the token and its own entry is at the
top of its request_q. In this case, the site deletes the top entry from its
request_q and enters the CS.

Releasing the critical section


1. If its request_q is nonempty, then it deletes the top entry from its
request_q, sends the token to that site, and sets its holder variable to point at
that site.
2. If the request_q is nonempty at this point, then the site sends a REQUEST
message to the site which is pointed at by the holder variable.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example1

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example2

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Analysis

Proof of Correctness
Mutex is trivial.
Finite waiting: all the requests in the system form a FIFO queue and the
token is passed in that order.

Performance
O(logN) messages per CS invocation. (Average distance between two
nodes in a tree)

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next module

Deadlock detection in distributed systems

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
