
Distributed Computing

Introduction
Dr. Barsha Mitra
CSIS Dept, BITS Pilani, Hyderabad Campus
BITS Pilani
Hyderabad Campus

1
Course Details
•Course will be run in FLIPPED MODE
•Course handout is uploaded on https://elearn.bits-pilani.ac.in
No     Name                 Type   Weight   Day, Date, Session, Time
EC-1   Quiz-1                      5%       August 16-30, 2022
       Quiz-2                      5%       September 16-30, 2022
       Assignment                  10%      October 16-30, 2022
EC-2   Mid-Semester Test           30%
EC-3   Comprehensive Exam          50%

Course ID: SS ZG526, Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus
Course Details
•Quizzes and Assignment will be conducted on https://elearn.bits-pilani.ac.in
•Course slides will be uploaded on https://elearn.bits-pilani.ac.in
and will be available on Microsoft Teams
•Course recordings will be available on Microsoft Teams
•Lab capsules are available on https://bitscsis.vlabs.platifi.com/

Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
Definition of Distributed Systems
•Collection of independent entities cooperating to collectively solve a task

•Autonomous processors communicating over a communication network

Course ID: SS ZG526, Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus
Features of Distributed System
•No common physical clock
•No shared memory - Requires message passing for communication
•Geographical separation
•Processors may be geographically far apart
•Processors may reside on a WAN or LAN
•Autonomy and Heterogeneity
•Loosely coupled processors
•Different processor speeds and operating systems
•Processors are not part of a dedicated system
•Cooperate with one another
•May offer services or solve a problem jointly

Course ID: SS ZG526, Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus
Middleware
•Middleware –
• distributed software
• drives the distributed system
• provides transparency of heterogeneity
at platform level
•Middleware standards:
•Common Object Request Broker
Architecture (CORBA)
•Remote Procedure Call (RPC) mechanism
•Distributed Component Object Model
(DCOM)
•Remote Method Invocation (RMI)
Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Motivation for Distributed System
•Inherently distributed computation
•Resource sharing
•Access to remote resources
•Increased performance/cost ratio
•Reliability
•Scalability
•Modularity and Incremental Expandability

Course ID: SS ZG526, Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus
Coupling
• Interdependency/binding and/or homogeneity among modules,
whether hardware or software (e.g., OS, middleware)
• High coupling → tightly coupled modules
• Low coupling → loosely coupled modules

Course ID: SS ZG526, Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus
Parallel Systems
• Multiprocessor systems
• E.g., Omega, Butterfly Networks
• Multicomputer parallel systems
• E.g., NYU Ultracomputer, CM* Connection Machine, IBM Blue Gene
• Array processors (collocated, tightly coupled, common system clock,
may not have shared memory)
• E.g., DSP applications, image processing

Course ID: SS ZG526, Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus
UMA Model
• Direct access to shared memory
• Access latency - waiting time to complete an
access to any memory location from any
processor
• Access latency is the same for all processors
• Processors
• Remain in close proximity
• Connected by an interconnection network

Course ID: SS ZG526, Title: Distributed Computing 10 BITS Pilani, Hyderabad Campus
UMA Model
• Interprocess communication occurs through read and write operations
on shared memory
• Processors are of same type
• Remain within same physical box/container
• Processors run the same operating system
• Hardware & software are very tightly coupled
• Interconnection network to access memory may be
• Bus
• Multistage switch

Course ID: SS ZG526, Title: Distributed Computing 11 BITS Pilani, Hyderabad Campus
Omega Network
•Multi-stage networks formed by 2 x 2
switching elements
•Data can be sent on any one of the
input wires
•n-input and n-output network uses:
• log2(n) stages
• log2(n) bits for addressing
•n processors, n memory banks

Course ID: SS ZG526, Title: Distributed Computing 12 BITS Pilani, Hyderabad Campus
Omega Network
• (n/2)log2(n) switching elements
of size 2x2
• Interconnection function: output i of a stage is connected to input j of the next stage, where
   j = 2i, for 0 ≤ i ≤ n/2 − 1
   j = 2i + 1 − n, for n/2 ≤ i ≤ n − 1
• equivalently, j is obtained by a left-rotation operation on the binary representation of i
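A minimal Python sketch (not part of the slides) of this interconnection function; it assumes n is a power of 2 and checks that the piecewise formula and the left-rotation view give the same j.

```python
# Omega-network perfect shuffle: output i of a stage feeds input j of the
# next stage, where j is a left rotation of i's log2(n)-bit label.

def shuffle(i: int, n: int) -> int:
    """Return j for output i in an n-input omega network stage."""
    if i < n // 2:
        return 2 * i                   # j = 2i
    return 2 * i + 1 - n               # j = 2i + 1 - n

def shuffle_by_rotation(i: int, n: int) -> int:
    """Same mapping, expressed as a left rotation of the bit label."""
    bits = n.bit_length() - 1          # log2(n) address bits
    return ((i << 1) | (i >> (bits - 1))) & (n - 1)

if __name__ == "__main__":
    n = 8
    for i in range(n):
        assert shuffle(i, n) == shuffle_by_rotation(i, n)
        print(i, "->", shuffle(i, n))
```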

Course ID: SS ZG526, Title: Distributed Computing 13 BITS Pilani, Hyderabad Campus
References
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 1, “Distributed
Computing: Principles, Algorithms, and Systems”, Cambridge
University Press, 2008.
• https://medium.com/@rashmishehana_48965/middleware-technologies-e5def95da4e

Course ID: SS ZG526, Title: Distributed Computing 14 BITS Pilani, Hyderabad Campus
Thank You

Course ID: SS ZG526, Title: Distributed Computing 15 BITS Pilani, Hyderabad Campus
Distributed Computing
Introduction
Dr. Barsha Mitra
CSIS Dept, BITS Pilani, Hyderabad Campus
BITS Pilani
Hyderabad Campus

1
Multicomputer Parallel Systems
• multiple processors do not have direct access to shared memory
• usually do not have a common clock
• processors are in close physical proximity
• very tightly coupled (homogeneous hardware and software)
• connected by an interconnection network
• communicate either via a common address space or via message-passing

Course ID: SS ZG526, Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus
https://en.wikipedia.org/wiki/Parallel_computing#/media/File:IBM_Blue_Gene_P_supercomputer.jpg

Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
https://www.cs.iusb.edu/~danav/teach/b424/b424_24_networks.html

NUMA Model

• non-uniform memory access architecture
• multicomputer system that has a common address space
• latency to access various shared memory locations from different processors varies

Course ID: SS ZG526, Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus
Multicomputer Parallel Systems

• 2-D wraparound mesh
• a k × k mesh contains k² processors
[Figure: k × k wraparound mesh; each node is a processor + memory unit]

Course ID: SS ZG526, Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus
Multicomputer Parallel Systems

• a k-dimensional hypercube has 2^k processor-and-memory units (nodes)
• each dimension is associated with a bit position in the node label
• labels of two adjacent nodes differ in the kth bit position for dimension k
• shortest path between any two processors = Hamming distance between their labels
• Hamming distance ≤ k
[Figure: hypercube; each node is a processor + memory unit]

Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Multicomputer Parallel Systems

• routing in the hypercube is done hop-by-hop


• multiple routes exist between any pair of nodes → provides fault tolerance and a congestion control mechanism
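A minimal Python sketch (an illustration, not the slides' algorithm) of hop-by-hop routing by bit correction: each hop flips one bit on which the current label still differs from the destination, so the number of hops equals the Hamming distance.

```python
def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def route(src: int, dst: int, k: int) -> list[int]:
    """Return one shortest path of node labels from src to dst in a k-cube."""
    path, cur = [src], src
    for bit in range(k):                 # correct differing bits one at a time
        if (cur ^ dst) & (1 << bit):
            cur ^= (1 << bit)
            path.append(cur)
    return path

if __name__ == "__main__":
    # 3-D hypercube: 000 -> 110 takes Hamming distance = 2 hops
    print(route(0b000, 0b110, 3), hamming_distance(0b000, 0b110))
```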

Course ID: SS ZG526, Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus
Parallelism/Speedup
• measure of the relative speedup of a specific program on a
given machine
• depends on
• number of processors
• mapping of the code to the processors
• speedup = T(1)/T(n)
• T(1) = time with a single processor
• T(n) = time with n processors
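A small illustrative computation of the speedup ratio (the timings below are made up):

```python
def speedup(t1: float, tn: float) -> float:
    """speedup = T(1) / T(n): T(1) on one processor, T(n) on n processors."""
    return t1 / tn

if __name__ == "__main__":
    print(speedup(t1=120.0, tn=20.0))   # e.g., a 6x speedup on n processors
```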

Course ID: SS ZG526, Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus
Parallel/Distributed Program Concurrency
• concurrency of a parallel/distributed program =
   (number of local operations, i.e., non-communication and non-shared memory access operations) /
   (total number of operations, including the communication or shared memory access operations)

Course ID: SS ZG526, Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus
Distributed Communication Models

• Remote Procedure Call (RPC)


• Publish/Subscribe Model

Course ID: SS ZG526, Title: Distributed Computing 10 BITS Pilani, Hyderabad Campus
Publish/ Subscribe Model
• allows any number of publishers to
communicate with any number of subscribers
asynchronously
• Used in serverless and microservices architectures
• form of multicasting
• events are produced by publishers
• interested subscribers are notified when the
events are available
• information is provided as notifications
• subscribers continue their tasks until they
receive notifications
• Oracle, AWS, Azure Web PubSub service
Course ID: SS ZG526, Title: Distributed Computing 11 BITS Pilani, Hyderabad Campus
References
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 1, “Distributed Computing: Principles,
Algorithms, and Systems”, Cambridge University Press, 2008.
• https://medium.com/@rashmishehana_48965/middleware-technologies-e5def95da4e
• http://wiki.c2.com/?PublishSubscribeModel
• https://www.slideshare.net/ishraqabd/publish-subscribe-model-overview-13368808/5
• https://www.bmc.com/blogs/pub-sub-publish-subscribe/
• https://aws.amazon.com/pub-sub-messaging/#:~:text=Publish%2Fsubscribe%20messaging%2C%20or%20pub,the%20subscribers%20to%20the%20topic.

Course ID: SS ZG526, Title: Distributed Computing 12 BITS Pilani, Hyderabad Campus
Thank You

Course ID: SS ZG526, Title: Distributed Computing 13 BITS Pilani, Hyderabad Campus
Distributed Computing
A Model of Distributed Computations
Dr. Barsha Mitra
CSIS Dept, BITS Pilani, Hyderabad Campus
BITS Pilani
Hyderabad Campus
Preliminaries
•distributed system consists of a set of processors
•connected by a communication network
•communication delay is finite but unpredictable
•processors do not share a common global memory
•asynchronous message passing
•no common physical global clock
•communication medium may deliver messages out of order

Course ID: SS ZG526, Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus
A Distributed Program
•Cij : channel from process pi to process pj
•mij: message sent by pi to pj
•Events –
•Internal Events
•Message Send Events
•Message Receive Events

Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
Space-Time Diagram

Course ID: SS ZG526, Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus
A Model of Distributed Executions
Causal Precedence/Dependency or Happens Before Relation
•ei → ej : event ei causally precedes event ej
•ei → ej holds iff there is a path of process-order and message edges from ei to ej in the space-time diagram

Course ID: SS ZG526, Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus
Models of Communication Networks
• FIFO model:
• each channel acts as a first-in first-out message queue
• message ordering is preserved by channel

• Non-FIFO model
• channel acts like a set
• sender process adds messages to channel
• receiver process removes messages from it

Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Models of Communication Networks
•Causal ordering model is based on Lamport’s “happens before”
relation
•A system supporting causal ordering model satisfies :
CO: for mij and mkj , if send(mij ) → send(mkj ),
then rec(mij ) → rec(mkj )
•Causally ordered delivery of messages implies FIFO message delivery
•CO ⊂ FIFO ⊂ Non-FIFO

Course ID: SS ZG526, Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus
Reference

• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 2,


“Distributed Computing: Principles, Algorithms, and Systems”,
Cambridge University Press, 2008.

Course ID: SS ZG526, Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus
Course ID: SS ZG526, Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus
Distributed Computing
Logical Time
Dr. Barsha Mitra
CSIS Dept, BITS Pilani, Hyderabad Campus
BITS Pilani
Hyderabad Campus
Introduction
•not possible to have global physical time
•realize only an approximation of physical time using logical time
•logical time advances in jumps
•captures monotonicity property
• monotonicity property: if event a causally affects event b, then timestamp
of a is smaller than timestamp of b
• Ways to represent logical time:
•scalar time
•vector time
Course ID: SS ZG526, Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus
Logical Clocks

•every process has a logical clock


•logical clock is advanced using a set of rules
•each event is assigned a timestamp

Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
Scalar Time - Definition

•time domain is the set of non-negative integers

•Ci:
•integer variable
•denotes logical clock of pi

Course ID: SS ZG526, Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus
Scalar Time - Definition
Rules for updating the clock:
•R1: Before executing an event (send, receive, or internal), process pi executes:
   Ci = Ci + d (d > 0)
•d can have different values; here d = 1
•R2: When process pi receives a message with timestamp Cmsg, it executes the following actions:
1. Ci = max (Ci , Cmsg);
2. execute R1;
3. deliver the message.
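A minimal Python sketch of rules R1 and R2 with d = 1 (illustrative; the class and method names are my own):

```python
class ScalarClock:
    def __init__(self):
        self.c = 0                      # Ci, the local logical clock

    def tick(self, d: int = 1):         # R1: before any event
        self.c += d

    def on_send(self) -> int:
        self.tick()                     # R1 for the send event
        return self.c                   # timestamp piggybacked on the message

    def on_receive(self, c_msg: int):
        self.c = max(self.c, c_msg)     # R2 step 1
        self.tick()                     # R2 step 2 (execute R1)
        # R2 step 3: deliver the message to the application

if __name__ == "__main__":
    p1, p2 = ScalarClock(), ScalarClock()
    ts = p1.on_send()                   # p1 sends with timestamp 1
    p2.on_receive(ts)                   # p2's clock becomes max(0, 1) + 1 = 2
    print(p1.c, p2.c)
```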

Course ID: SS ZG526, Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus
Evolution of Scalar Time
[Space-time diagram: processes p1-p4 exchanging messages; scalar timestamps are assigned to events along each process line]

Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Evolution of Scalar Time
[Space-time diagram with d = 1: scalar timestamps assigned to events on processes p1-p4; on a receive, the clock jumps to max(local clock, message timestamp) + 1]

Course ID: SS ZG526, Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus
Scalar Time – Basic Properties

•Consistency
•Total Ordering
•Event Counting
•No Strong Consistency
•From Recorded Lecture

Course ID: SS ZG526, Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus
Scalar Time – Basic Properties
• d = 1
• with d = 1, the scalar timestamp of an event equals the height of the event
[Space-time diagram: the same execution with scalar timestamps at events on processes p1-p4]

Course ID: SS ZG526, Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus
Vector Time - Definition
• time domain is represented by a set of n-dimensional non-
negative integer vectors
• vti[i] : local logical clock of pi (counts pi's own events)
• vti[j], j ≠ i : pi's latest knowledge of pj's local clock

Course ID: SS ZG526, Title: Distributed Computing 10 BITS Pilani, Hyderabad Campus
Vector Time - Definition
[Space-time diagram: processes p1-p4; vector timestamps are assigned to events]
Course ID: SS ZG526, Title: Distributed Computing 11 BITS Pilani, Hyderabad Campus
Vector Time - Definition
[Space-time diagram: vector timestamps at events on processes p1-p4, e.g., p1's events carry [1 0 0 0], [2 0 0 0], [3 0 0 0], [4 0 0 0], [5 3 0 0]; a receive merges the incoming vector component-wise with max]
Course ID: SS ZG526, Title: Distributed Computing 12 BITS Pilani, Hyderabad Campus
Vector Time
• Comparing Vector timestamps from Recorded Lecture
• Vector Time properties from Recorded Lecture

Course ID: SS ZG526, Title: Distributed Computing 13 BITS Pilani, Hyderabad Campus
Vector Time – Basic Properties
Event counting
• if d is always 1 and vti[i] is the ith component of vector clock at process pi
• then vti[i] = no. of events that have occurred at pi until that instant

Course ID: SS ZG526, Title: Distributed Computing 14 BITS Pilani, Hyderabad Campus
Vector Time – Basic Properties
Event counting
•if an event e has timestamp vh
• vh[j] = number of events executed by process pj that causally
precede e
• Σj vh[j] − 1 : total no. of events that causally precede e in the distributed computation
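A minimal Python sketch of a vector clock with d = 1 that also shows the event-counting property above (illustrative; the class and method names are my own):

```python
class VectorClock:
    def __init__(self, pid: int, n: int):
        self.pid = pid
        self.vt = [0] * n                # vt[i]: own events; vt[j]: knowledge of pj

    def tick(self):                      # internal / send event
        self.vt[self.pid] += 1

    def on_send(self) -> list[int]:
        self.tick()
        return list(self.vt)             # timestamp piggybacked on the message

    def on_receive(self, vt_msg: list[int]):
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_msg)]
        self.tick()

    def events_causally_preceding(self) -> int:
        # events that causally precede the latest event at this process
        return sum(self.vt) - 1

if __name__ == "__main__":
    p1, p2 = VectorClock(0, 2), VectorClock(1, 2)
    m = p1.on_send()                     # p1: [1, 0]
    p2.on_receive(m)                     # p2: [1, 1]
    print(p2.vt, p2.events_causally_preceding())
```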

Course ID: SS ZG526, Title: Distributed Computing 15 BITS Pilani, Hyderabad Campus
Singhal–Kshemkalyani’s Differential
Technique

[Space-time diagram: processes p1-p4; the annotated version is on the next slide]

Course ID: SS ZG526, Title: Distributed Computing 16 BITS Pilani, Hyderabad Campus
Singhal–Kshemkalyani’s Differential
Technique

[Space-time diagram: each process piggybacks on a message only the vector entries that have changed since its last message to the same destination, e.g., p1's three messages carry {(1, 1)}, {(1, 2)}, {(1, 3)} instead of the full vectors [1 0 0 0], [2 0 0 0], [3 0 0 0]]

Course ID: SS ZG526, Title: Distributed Computing 17 BITS Pilani, Hyderabad Campus
Fowler–Zwaenepoel’s Direct-
Dependency Technique

• reduces the size of messages by transmitting only a scalar value


• process only maintains information regarding direct dependencies on
other processes
• transitive (indirect) dependencies are not maintained by this method

Course ID: SS ZG526, Title: Distributed Computing 18 BITS Pilani, Hyderabad Campus
Fowler–Zwaenepoel’s Direct-
Dependency Technique
[Space-time diagram: each message carries only the sender's own (scalar) event count, e.g., p1's messages carry {1}, {2}, {3}; each process records only the direct dependencies it learns this way]

Course ID: SS ZG526, Title: Distributed Computing 19 BITS Pilani, Hyderabad Campus
Reference
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 3, “Distributed
Computing: Principles, Algorithms, and Systems”, Cambridge
University Press, 2008.

Course ID: SS ZG526, Title: Distributed Computing 20 BITS Pilani, Hyderabad Campus
Course ID: SS ZG526, Title: Distributed Computing 21 BITS Pilani, Hyderabad Campus
Distributed Computing
Global State & Snapshot Recording
Algorithms
Dr. Barsha Mitra
BITS Pilani CSIS Dept, BITS Pilani, Hyderabad Campus
Hyderabad Campus
Introduction
•problems in recording global state
•lack of a globally shared memory
•lack of a global clock
•message transmission is asynchronous
•message transfer delays are finite but unpredictable

Course No.- SS ZG526, Course Title - Distributed Computing 2 BITS Pilani, Hyderabad Campus
A Distributed Program
•Cij : channel from process pi to process pj
•mij: message sent by pi to pj
•Global state of a distributed computation consists of
•states of the processes
•states of the communication channels
•State of a process is characterized by its local memory state
•Set of messages in transit in the channel characterizes state of channel
• occurrence of events
•changes processes' states
•changes channels' states
•causes transition in global system state
Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
System Model

•if a snapshot recording algorithm records the states of pi and pj as LSi and LSj, respectively, it must record the state of channel Cij as transit(LSi, LSj)
•For Cij, the in-transit messages are:
   transit(LSi, LSj) = {mij | send(mij) ∈ LSi ∧ rec(mij) ∉ LSj}

Course No.- SS ZG526, Course Title - Distributed Computing 4 BITS Pilani, Hyderabad Campus
A Consistent Global State
• global state GS is a consistent global state iff:
• C1: every message mij that is recorded as sent in the local state of
sender pi must be captured either in the state of Cij or in the
collected local state of the receiver pj
• C1 states the law of conservation of messages
• C2: send(mij) ∉ LSi ⇒ mij ∉ SCij ∧ rec(mij) ∉ LSj
• Condition C2 states that for every effect, its cause must be present

Course No.- SS ZG526, Course Title - Distributed Computing 5 BITS Pilani, Hyderabad Campus
Chandy–Lamport Algorithm
Marker sending rule for process pi
(1) Process pi records its state.
(2) For each outgoing channel C on which a marker has not been sent, pi sends
a marker along C before pi sends further messages along C.
Marker receiving rule for process pj
On receiving a marker along channel C:
if pj has not recorded its state then
Record the state of C as the empty set
Execute the “marker sending rule”
else
Record the state of C as the set of messages received along C after pj’s
state was recorded and before pj received the marker along C
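A minimal Python sketch of the marker rules at a single process, assuming FIFO channels; the send(channel, message) transport helper is hypothetical:

```python
class SnapshotProcess:
    def __init__(self, out_channels, in_channels, send):
        self.out = out_channels        # outgoing channel ids
        self.inc = in_channels         # incoming channel ids
        self.send = send
        self.state = None              # recorded local state
        self.chan_state = {}           # channel id -> recorded in-transit messages
        self.recording = {}            # channel id -> still recording that channel?

    def record_state(self, local_state):
        # Marker sending rule: record own state, then send a marker on every
        # outgoing channel before sending any further application message.
        self.state = local_state
        self.recording = {c: True for c in self.inc}
        for c in self.out:
            self.send(c, "MARKER")

    def on_marker(self, channel, local_state):
        # Marker receiving rule.
        if self.state is None:
            self.record_state(local_state)
            self.chan_state[channel] = []        # state of this channel = empty set
        else:
            self.chan_state.setdefault(channel, [])
        self.recording[channel] = False          # done with this channel

    def on_app_message(self, channel, msg):
        # Messages arriving after the local snapshot but before that channel's
        # marker belong to the recorded state of the channel.
        if self.state is not None and self.recording.get(channel, False):
            self.chan_state.setdefault(channel, []).append(msg)

    def done(self):
        # Termination: a marker has been received on every incoming channel.
        return self.state is not None and not any(self.recording.values())
```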
Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Chandy–Lamport Algorithm
Putting together recorded snapshots - each process can send its local
snapshot to the initiator of the algorithm

termination criterion - each process has received a marker on all of its


incoming channels

Course ID: SS ZG526, Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus
Chandy–Lamport Algorithm
[Example: sites S1 (account A) and S2 (account B); A's balance evolves 500, 450, 450, 350, 370, 370 and B's balance 1000, 1000, 1050, 1030, 1130, 1130 over times t0-t5 as transfer messages of 100, 50 and 20 are exchanged]

Course ID: SS ZG526, Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus
Snapshot Algorithms for Non-FIFO
Channels
• a marker cannot be used to delineate messages into those to be recorded in the global state from those not to be recorded
• the non-FIFO algorithm by Lai and Yang uses message piggybacking to distinguish computation messages sent after the marker from those sent before the marker

Course ID: SS ZG526, Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus
Lai–Yang Algorithm

• coloring scheme
• every process is initially white
• process turns red while taking a snapshot
• equivalent of the “marker sending rule” is executed when
a process turns red
• every message sent by a white process is colored white
• a white message is a message that was sent before the
sender of that message recorded its local snapshot

Course ID: SS ZG526, Title: Distributed Computing 10 BITS Pilani, Hyderabad Campus
Lai–Yang Algorithm

• every message sent by a red process is colored red


• a red message is a message that was sent after the
sender of that message recorded its local snapshot

• every white process takes its snapshot no later than the


instant it receives a red message
• when a white process receives a red message, it records
its local snapshot before processing the message

Course ID: SS ZG526, Title: Distributed Computing 11 BITS Pilani, Hyderabad Campus
Lai–Yang Algorithm
• every white process records a history of all white messages sent or
received by it along each channel
• when a process turns red, it sends these histories along with its
snapshot to the initiator process that collects the global snapshot
• initiator process evaluates transit(LSi, LSj) to compute the state of a
channel Cij as:
SCij = {white messages sent by pi on Cij} − {white messages
received by pj on Cij}
= {mij | send(mij) ∈ LSi} − {mij | rec(mij) ∈ LSj}
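A minimal Python sketch of the channel-state computation at the initiator (illustrative; the message identifiers are made up):

```python
# SC_ij = {white messages sent by pi on C_ij} - {white messages received by pj on C_ij}

def channel_state(white_sent_by_pi, white_received_by_pj):
    """Both arguments are collections of message ids for channel C_ij."""
    return set(white_sent_by_pi) - set(white_received_by_pj)

if __name__ == "__main__":
    # e.g., pi sent white messages m1, m2, m3; pj had received only m1
    print(channel_state({"m1", "m2", "m3"}, {"m1"}))   # {'m2', 'm3'} in transit
```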

Course ID: SS ZG526, Title: Distributed Computing 12 BITS Pilani, Hyderabad Campus
Lai–Yang Algorithm
[Example: processes P1 (variable X, values 800, 780, 750, 740, 740, 740, 770, 770, 790) and P2 (variable Y, values 200, 200, 220, 220, 190, 200, 230, 210) exchange messages of 30, 10, 20, 20 and 30 over times t1-t9]

SC12 = {white messages sent by P1 on C12} – {white messages received by P2 on C12}

SC21 = {white messages sent by P2 on C21} – {white messages received by P1 on C21}

Course ID: SS ZG526, Title: Distributed Computing 13 BITS Pilani, Hyderabad Campus
Reference
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 4,
“Distributed Computing: Principles, Algorithms, and Systems”,
Cambridge University Press, 2008.

Course No.- SS ZG526, Course Title - Distributed Computing 14 BITS Pilani, Hyderabad Campus
Course No.- SS ZG526, Course Title - Distributed Computing 15 BITS Pilani, Hyderabad Campus
Message ordering

Dr. Barsha Mitra


BITS Pilani CSIS Dept, BITS Pilani, Hyderabad Campus
Hyderabad Campus
Group Communication
•broadcast - sending a message to all members in the
distributed system
•multicasting - a message is sent to a certain subset,
identified as a group, of the processes in the system
•unicasting - point-to-point message communication

Course ID: SS ZG526, Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus
Causal Order
•A system supporting causal ordering model satisfies
CO: for mij and mkj , if send(mij ) → send(mkj ),
then rec(mij ) → rec(mkj )
•2 criteria must be satisfied by a causal ordering
protocol:
•Safety
•Liveness

Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
Causal Order
Safety
•a message M arriving at a process may need to be buffered until all
system-wide messages sent in the causal past of the send(M) event to the
same destination have already arrived
•distinction is made between
•arrival of a message at a process
•event at which the message is given to the application process
Liveness
A message that arrives at a process must eventually be delivered to the
process

Course ID: SS ZG526, Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm
• each message M should carry a log of
• all other messages
• sent causally before M’s send event, and sent to the same
destination dest(M)
• log can be examined to ensure whether it is safe to deliver a
message
• channels are FIFO

Course ID: SS ZG526, Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm
local variables:
• array of int SENT[1 …… n, 1 …… n] (n x n array)
• SENTi[j, k] = no. of messages sent by Pj to Pk as known to Pi
• array of int DELIV [1 …… n]
• DELIVi[j] = no. of messages from Pj that have been delivered to Pi

(1) send event, where Pi wants to send message M to Pj :


(1a) send (M, SENT) to Pj
(1b) SENT[i, j] = SENT[i, j] + 1

Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm

(2) message arrival, when (M, ST) arrives at Pi from Pj :
(2a) deliver M to Pi when for each process x,
(2b)    DELIVi[x] ≥ ST[x, i]
(2c) ∀x, y, SENT[x, y] = max(SENT[x, y], ST[x, y])
(2d) DELIVi[j] = DELIVi[j] + 1
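A minimal Python sketch of the send and arrival rules for one process Pi, assuming FIFO channels; the transport object and its send method are hypothetical:

```python
class RST:
    def __init__(self, i, n, transport):
        self.i, self.n = i, n
        self.SENT = [[0] * n for _ in range(n)]   # SENT[j][k]: msgs Pj->Pk known to Pi
        self.DELIV = [0] * n                       # DELIV[j]: msgs from Pj delivered here
        self.transport = transport
        self.pending = []                          # (j, M, ST) awaiting safe delivery

    def send(self, j, M):
        self.transport.send(j, (M, [row[:] for row in self.SENT]))   # (1a)
        self.SENT[self.i][j] += 1                                    # (1b)

    def on_arrival(self, j, M, ST):
        self.pending.append((j, M, ST))
        self._try_deliver()

    def _try_deliver(self):
        progress = True
        while progress:
            progress = False
            for idx, (j, M, ST) in enumerate(self.pending):
                if all(self.DELIV[x] >= ST[x][self.i] for x in range(self.n)):  # (2a)-(2b)
                    for x in range(self.n):                                      # (2c)
                        for y in range(self.n):
                            self.SENT[x][y] = max(self.SENT[x][y], ST[x][y])
                    self.DELIV[j] += 1                                           # (2d)
                    self.pending.pop(idx)
                    self.deliver(M)
                    progress = True
                    break

    def deliver(self, M):
        print("delivered", M)
```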

Course ID: SS ZG526, Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm

Complexity:
•space requirement at each process: O(n2) integers
•space overhead per message: n2 integers
•time complexity at each process for each send and deliver
event: O(n2)

Course ID: SS ZG526, Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm
• P1 has sent 3 messages to P2 and 4 messages to P3
• P2 has sent 5 messages to P1 and 2 messages to P3
• P3 has sent 4 messages to P2
• SENT of P1 = SENT of P2 =
   0 3 4
   5 0 2
   0 4 0
• DELIV1 = [0 4 0], DELIV2 = [3 0 4], DELIV3 = [3 2 0]
• Now if P1 sends m1 to P2:
   • send(m1, SENT) to P2, then SENT1[1, 2] = 4, so updated SENT of P1 =
     0 4 4
     5 0 2
     0 4 0
   • at P2, m1 is delivered when DELIV2[1] ≥ ST[1, 2] and DELIV2[3] ≥ ST[3, 2]; both hold (3 ≥ 3 and 4 ≥ 4), so m1 is delivered and DELIV2 becomes [4 0 4]
Course ID: SS ZG526, Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm
• same initial message counts as before
• Now if P3 sends m2 to P2:
   • send(m2, SENT) to P2, then SENT3[3, 2] = 5
   • SENT of P3 before the send:
     0 3 4
     5 0 2
     0 4 0
   • updated SENT of P3:
     0 3 4
     5 0 2
     0 5 0
   • SENT of P2 is still:
     0 3 4
     5 0 2
     0 4 0
   • at P2, m2 is delivered when DELIV2[1] ≥ ST[1, 2] and DELIV2[3] ≥ ST[3, 2]; with DELIV2 = [4 0 4] both hold, so m2 is delivered and DELIV2 becomes [4 0 5]
• DELIV1 = [0 4 0], DELIV3 = [3 2 0]
Course ID: SS ZG526, Title: Distributed Computing 10 BITS Pilani, Hyderabad Campus
Raynal–Schiper–Toueg Algorithm
• Now if P1 sends m3 to P3:
   • send(m3, SENT) to P3, then SENT1[1, 3] = 5
   • SENT of P1 before the send:
     0 4 4
     5 0 2
     0 4 0
   • updated SENT of P1:
     0 4 5
     5 0 2
     0 4 0
   • SENT of P3:
     0 3 4
     5 0 2
     0 5 0
   • at P3, m3 is delivered when DELIV3[1] ≥ ST[1, 3] and DELIV3[2] ≥ ST[2, 3]
   • DELIV1 = [0 4 0], DELIV2 = [4 0 5], DELIV3 = [3 2 0]; since DELIV3[1] = 3 < ST[1, 3] = 4, m3 is buffered until the missing message from P1 is delivered to P3
Course ID: SS ZG526, Title: Distributed Computing 11 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol
• Ci = vector clock of Pi
• Ci[j] = jth element of Ci

• tm = vector timestamp for message m

Pi sends a message m to Pj

• Pi increments Ci[i]

• Pi sets the timestamp tm = Ci for message m

Course ID: SS ZG526, Title: Distributed Computing 12 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol

Pj receives a message m from Pi


• when Pj (j ≠ i) receives m with timestamp tm, it delays message
delivery until:
• Cj[i] = tm[i] − 1
• ∀k ≤ n and k ≠ i, Cj[k] ≥ tm[k]
• when m is delivered to Pj, update Pj's vector clock:
   ∀i, Cj[i] = max(Cj[i], tm[i])
• check buffered messages to see if any can be delivered
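A minimal Python sketch of the delivery test and clock update at Pj (illustrative; process ids are mapped to 0-based indices and d = 1):

```python
def can_deliver(Cj: list[int], tm: list[int], i: int) -> bool:
    """True iff Pj may deliver m from Pi (index i) with timestamp tm."""
    n = len(Cj)
    if Cj[i] != tm[i] - 1:                 # exactly the next message from Pi
        return False
    return all(Cj[k] >= tm[k] for k in range(n) if k != i)   # no missing causal msgs

def deliver(Cj: list[int], tm: list[int]) -> list[int]:
    """Update Pj's clock on delivery: Cj[k] = max(Cj[k], tm[k])."""
    return [max(a, b) for a, b in zip(Cj, tm)]

if __name__ == "__main__":
    Cj = [0, 0, 0]
    tm = [0, 0, 1]                          # first multicast from P3 (index 2)
    if can_deliver(Cj, tm, i=2):
        Cj = deliver(Cj, tm)
    print(Cj)                               # [0, 0, 1]
```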

Course ID: SS ZG526, Title: Distributed Computing 13 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol

[Space-time diagram: P1, P2 and P3, all starting with clock (0 0 0); P3 multicasts a message b with timestamp tb = (0 0 1)]

Course ID: SS ZG526, Title: Distributed Computing 14 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol

• P2 can deliver b because C2[3] = tb[3] − 1, C2[1] ≥ tb[1] and C2[2] ≥ tb[2]
• after delivery, C2 = (0 0 1)
[Space-time diagram: P3's multicast b reaching P1 and P2]

Course ID: SS ZG526, Title: Distributed Computing 15 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol

• P2 then multicasts a message r with timestamp tr = (0 1 1)
• P3 can deliver r because C3[2] = tr[2] − 1, C3[1] ≥ tr[1] and C3[3] ≥ tr[3]
• after delivery, C3 = (0 1 1)

Course ID: SS ZG526, Title: Distributed Computing 16 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol

• r reaches P1 before b and is buffered: C1[2] = tr[2] − 1 and C1[1] ≥ tr[1] hold, but C1[3] ≥ tr[3] does not
• when b arrives, P1 delivers it because C1[3] = tb[3] − 1, C1[1] ≥ tb[1] and C1[2] ≥ tb[2]; C1 becomes (0 0 1)

Course ID: SS ZG526, Title: Distributed Computing 17 BITS Pilani, Hyderabad Campus
Birman-Schiper-Stephenson Protocol

• the delivery condition for r now holds, so P1 delivers r from the buffer; C1 becomes (0 1 1)
• SES Protocol: from Recorded Lecture

Course ID: SS ZG526, Title: Distributed Computing 18 BITS Pilani, Hyderabad Campus
References
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 6,
“Distributed Computing: Principles, Algorithms, and Systems”,
Cambridge University Press, 2008.
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 7,
“Distributed Computing: Principles, Algorithms, and Systems”,
Cambridge University Press, 2008.
• https://courses.csail.mit.edu/6.006/fall11/rec/rec14.pdf

Course ID: SS ZG526, Title: Distributed Computing 19 BITS Pilani, Hyderabad Campus
Course ID: SS ZG526, Title: Distributed Computing 20 BITS Pilani, Hyderabad Campus
Distributed Computing
Terminology & Basic Algorithms
Dr. Barsha Mitra
BITS Pilani CSIS Dept, BITS Pilani, Hyderabad Campus
Hyderabad Campus
Notation & Definitions
• undirected unweighted graph G = (N, L)
represents topology
• vertices are nodes
• edges are channels
• n = |N|
• l = |L|
• diameter of a graph: the maximum, over all pairs of nodes, of the minimum number of edges that need to be traversed to go from one node to the other
• diameter = max over i, j ∈ N of {length of the shortest path between i and j}
• example (nodes A-F):
   distance 1: (A, B), (A, D), (A, E), (B, C), (C, E), (C, F), (D, E), (E, F)
   distance 2: (A, C), (A, F), (B, D), (B, E), (B, F), (C, D), (D, F)
   so the diameter of this graph is 2
Course ID: SS ZG526, Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus
Synchronous Single-Initiator Spanning
Tree Algorithm using Flooding
• algorithm proceeds in rounds (synchronous)
• root initiates the algorithm
• flooding of QUERY messages
• produce a spanning tree rooted at the root node
• each process Pi (Pi ≠ root) should output its own parent for the spanning tree

Course ID: SS ZG526, Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus
Synchronous Single-Initiator Spanning
Tree Algorithm using Flooding
Local variables maintained at each Pi:
• int visited
• int depth
• int parent
• set of int Neighbors
Initially at each Pi:
• visited = 0
• depth = 0
• parent = NULL
• Neighbors = set of neighbors

Course ID: SS ZG526, Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus
Synchronous Single-Initiator Spanning
Tree Algorithm using Flooding
Algorithm for Pi
Round r = 1
if Pi = root then
visited = 1
depth = 0
send QUERY to Neighbors
if Pi receives a QUERY message then
visited = 1
depth = r
parent = root
plan to send QUERYs to Neighbors at next round

Course ID: SS ZG526, Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus
Synchronous Single-Initiator Spanning
Tree Algorithm using Flooding
Algorithm for Pi
Round r > 1 and r <= diameter
if Pi planned to send in previous round then
Pi sends QUERY to Neighbors
if visited = 0 then
if Pi receives QUERY messages then
visited = 1
depth = r
parent = any randomly selected node from which QUERY was received
plan to send QUERY to Neighbors \ {senders of QUERYs received in r}
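A minimal Python sketch that simulates the rounds on an adjacency-list graph (a simplification: newly reached nodes send QUERY to all neighbours rather than Neighbors \ {senders}, and depth is not tracked); the example graph matches the A-F graph used earlier:

```python
import random

def spanning_tree(adj: dict, root, diameter: int) -> dict:
    """Return parent[node] for every node reached within `diameter` rounds."""
    visited = {v: False for v in adj}
    parent = {v: None for v in adj}
    visited[root] = True
    to_send = {root}                         # processes that send QUERY this round
    for _ in range(diameter):
        received = {}                        # node -> set of QUERY senders this round
        for u in to_send:
            for v in adj[u]:
                received.setdefault(v, set()).add(u)
        to_send = set()
        for v, senders in received.items():
            if not visited[v]:
                visited[v] = True
                parent[v] = random.choice(sorted(senders))   # any sender will do
                to_send.add(v)               # plan to send QUERY next round
    return parent

if __name__ == "__main__":
    g = {"A": ["B", "D", "E"], "B": ["A", "C"], "C": ["B", "E", "F"],
         "D": ["A", "E"], "E": ["A", "C", "D", "F"], "F": ["C", "E"]}
    print(spanning_tree(g, root="A", diameter=2))
```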

Course ID: SS ZG526, Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus
Synchronous Single-Initiator Spanning Tree Algorithm using Flooding
[Example, slides 7-13: QUERY messages flood outward from the root round by round; each newly reached node records its depth (the round in which it was reached, shown in parentheses) and picks one of the senders as its parent; the final figure shows the resulting spanning tree]
Course ID: SS ZG526, Title: Distributed Computing 13 BITS Pilani, Hyderabad Campus
Broadcast and Convergecast on a Tree
•A spanning tree is useful for distributing (via a broadcast) and collecting
(via a convergecast) information to/from all the nodes
Broadcast Algorithm:
•BC1:
•The root sends the information to
be broadcast to all its children.
•BC2:
•When a (non-root) node receives
information from its parent, it
copies it and forwards it to its
children.

Course ID: SS ZG526, Title: Distributed Computing 14 BITS Pilani, Hyderabad Campus
Broadcast and Convergecast on a Tree
Convergecast Algorithm:
• CVC1:
• Leaf node sends its report to its parent.
• CVC2:
• At a non-leaf node that is not the root:
When a report is received from all the
child nodes, the collective report is sent
to the parent.
• CVC3:
• At the root: When a report is received
from all the child nodes, the global
function is evaluated using the reports.
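A minimal Python sketch of BC1/BC2 and CVC1-CVC3 on a rooted tree given as child lists, using recursion in place of real messages (the tree, node values and global function below are made up):

```python
def broadcast(tree: dict, root, info, deliver):
    """Send `info` from the root down to every node (BC1 / BC2)."""
    deliver(root, info)
    for child in tree.get(root, []):
        broadcast(tree, child, info, deliver)

def convergecast(tree: dict, node, report, combine):
    """Collect reports upward; returns the collective report at `node`."""
    children = tree.get(node, [])
    if not children:
        return report(node)                   # CVC1: leaf sends its report
    child_reports = [convergecast(tree, c, report, combine) for c in children]
    return combine(report(node), child_reports)   # CVC2 / CVC3

if __name__ == "__main__":
    t = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
    broadcast(t, "A", "hello", deliver=lambda n, m: print(n, "got", m))
    # global function at the root: e.g., the minimum of all node values
    vals = {"A": 7, "B": 3, "C": 9, "D": 5}
    print(convergecast(t, "A", report=lambda n: vals[n],
                       combine=lambda v, ch: min([v] + ch)))
```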
Course ID: SS ZG526, Title: Distributed Computing 15 BITS Pilani, Hyderabad Campus
Broadcast and Convergecast on a Tree

Complexity

• each broadcast and each convergecast requires n − 1 messages


• each broadcast and each convergecast requires time equal to
the maximum height h of the tree, which is O(n)

Course ID: SS ZG526, Title: Distributed Computing 16 BITS Pilani, Hyderabad Campus
Reference
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 5,
“Distributed Computing: Principles, Algorithms, and Systems”,
Cambridge University Press, 2008.

Course ID: SS ZG526, Title: Distributed Computing 17 BITS Pilani, Hyderabad Campus
Course ID: SS ZG526, Title: Distributed Computing 18 BITS Pilani, Hyderabad Campus
Distributed Mutual Exclusion

Dr. Barsha Mitra


BITS Pilani CSIS Dept, BITS Pilani, Hyderabad Campus
Hyderabad Campus
Types of Approaches

• Token-based
• Assertion-based
• From Recorded Lecture
• Quorum-based

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Quorum-Based Approach

• each site requests permission to execute the CS from a subset of sites called a quorum
• quorums are formed in such a way that when two sites
concurrently request access to the CS
• at least one site receives both the requests
• makes sure that only one site executes the CS at any time

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Requirements of Mutual Exclusion
Algorithms
•Safety property
•at any instant, only one process can execute the CS
•Liveness property
•a site must not wait indefinitely to execute the CS while other sites
are repeatedly executing the CS
•Fairness
•each process gets a fair chance to execute the CS
•CS execution requests are executed in order of their arrival time
•time is determined by a logical clock

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Performance Metrics
•From Recorded Lecture

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Lamport’s Algorithm

•every site Si keeps a queue, request_queuei


•request_queuei contains mutual exclusion requests ordered
by their timestamps
•FIFO
•CS requests are executed in increasing order of timestamps

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Lamport’s Algorithm

Requesting the critical section:


• When a site Si wants to enter the CS, it broadcasts a
REQUEST(tsi, i) message to all other sites and places the
request on request_queuei. ((tsi, i) denotes the timestamp of
the request)
• When a site Sj receives the REQUEST(tsi, i) message from site
Si, it places site Si’s request on request_queuej and returns a
timestamped REPLY message to Si
Course Title: Distributed Computing BITS Pilani, Hyderabad Campus
Lamport’s Algorithm
Executing the critical section:
Site Si enters the CS when the following two conditions hold:
• L1: Si has received a REPLY message with timestamp larger than
(tsi, i) from all other sites
• L2: Si’s request is at the top of request_queuei

Releasing the critical section:


• Site Si, upon exiting the CS, removes its request from the top of its
request queue and broadcasts a timestamped RELEASE message to all
other sites
• When a site Sj receives a RELEASE message from site Si, it removes Si’s
request from its request queue
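A minimal Python sketch of the per-site bookkeeping (illustrative; broadcast/send are hypothetical transport helpers, and the L1 timestamp check is simplified to counting the REPLYs received for the current request):

```python
import heapq

class LamportMutex:
    def __init__(self, site_id, n, broadcast, send):
        self.id, self.n = site_id, n
        self.broadcast, self.send = broadcast, send
        self.queue = []                        # request_queue_i of (ts, site) pairs
        self.replies = set()

    def request_cs(self, ts):
        heapq.heappush(self.queue, (ts, self.id))
        self.replies.clear()
        self.broadcast(("REQUEST", ts, self.id))

    def on_request(self, ts, j, my_ts):
        heapq.heappush(self.queue, (ts, j))
        self.send(j, ("REPLY", my_ts, self.id))

    def on_reply(self, j):
        self.replies.add(j)

    def can_enter_cs(self):
        # L1: a REPLY from every other site; L2: own request at the queue head
        return (len(self.replies) == self.n - 1 and
                self.queue and self.queue[0][1] == self.id)

    def release_cs(self, ts):
        heapq.heappop(self.queue)              # remove own request from the top
        self.broadcast(("RELEASE", ts, self.id))

    def on_release(self, j):
        self.queue = [(t, s) for (t, s) in self.queue if s != j]
        heapq.heapify(self.queue)
```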
Course Title: Distributed Computing BITS Pilani, Hyderabad Campus
Lamport’s Algorithm

Sites S1 and S2 make requests for the CS

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Lamport’s Algorithm

[Figure: S1 and S2 exchange REQUEST and timestamped REPLY messages; the queue may also be written as request_queue1 = ((1, 1), (1, 2))]
Site S1 enters the CS


Course Title: Distributed Computing BITS Pilani, Hyderabad Campus
Lamport’s Algorithm

Site S1 exits the CS and sends RELEASE messages

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Lamport’s Algorithm

Site S2 enters the CS


Performance - Requires 3(N − 1) messages per CS invocation
Course Title: Distributed Computing BITS Pilani, Hyderabad Campus
Ricart–Agrawala Algorithm
•communication channels are not required to be FIFO
•REQUEST and REPLY messages
•each process pi maintains the request-deferred array, RDi
•size of RDi = no. of processes in the system
•initially, ∀i ∀j: RDi[j] = 0
•whenever pi defers the request sent by pj, it sets RDi[j] = 1,
•after it has sent a REPLY message to pj , it sets RDi[j] = 0

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Ricart–Agrawala Algorithm
Requesting the critical section:
(a) When a site Si wants to enter the CS, it broadcasts a
timestamped REQUEST message to all other sites

(b) When site Sj receives a REQUEST message from site Si, it


sends a REPLY message to site Si if site Sj is neither requesting
nor executing the CS, or if the site Sj is requesting and Si’s
request’s timestamp is smaller than site Sj’s own request’s
timestamp. Otherwise, the reply is deferred and Sj sets RDj [i] = 1

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Ricart–Agrawala Algorithm
Executing the critical section:
(c) Site Si enters the CS after it has received a REPLY message from every
site it sent a REQUEST message to

Releasing the critical section:


(d) When site Si exits the CS, it sends all the deferred REPLY messages:
∀j if RDi [j] = 1, then Si sends a REPLY message to Sj and sets RDi [j] = 0
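A minimal Python sketch of the request-deferral logic at one site (illustrative; broadcast/send are hypothetical transport helpers and clock management is left out):

```python
class RicartAgrawala:
    def __init__(self, site_id, n, broadcast, send):
        self.id, self.n = site_id, n
        self.broadcast, self.send = broadcast, send
        self.RD = [0] * n                 # request-deferred array
        self.requesting = False
        self.in_cs = False
        self.my_ts = None
        self.replies = 0

    def request_cs(self, ts):
        self.requesting, self.my_ts, self.replies = True, ts, 0
        self.broadcast(("REQUEST", ts, self.id))

    def on_request(self, ts, j):
        # defer if executing the CS, or requesting with an earlier timestamp
        defer = self.in_cs or (self.requesting and (self.my_ts, self.id) < (ts, j))
        if defer:
            self.RD[j] = 1
        else:
            self.send(j, ("REPLY", self.id))

    def on_reply(self):
        self.replies += 1
        if self.replies == self.n - 1:    # (c) enter the CS
            self.in_cs = True

    def release_cs(self):
        self.in_cs, self.requesting = False, False
        for j in range(self.n):           # (d) answer all deferred requests
            if self.RD[j]:
                self.RD[j] = 0
                self.send(j, ("REPLY", self.id))
```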

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Ricart–Agrawala Algorithm

Sites S1 and S2 each make a request for the CS

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Ricart–Agrawala Algorithm

Site S1 enters the CS

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Ricart–Agrawala Algorithm

Site S1 exits the CS and sends a REPLY message to S2's deferred request

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Ricart–Agrawala Algorithm

Site S2 enters the CS

Performance - requires 2(N − 1) messages per CS execution


Course Title: Distributed Computing BITS Pilani, Hyderabad Campus
Maekawa’s Algorithm

• quorum-based mutual exclusion algorithm


• request sets for sites (i.e., quorums) are constructed to satisfy
the following conditions:
• M1: (∀i ∀j : i ≠ j, 1 ≤ i, j ≤ N :: Ri ∩ Rj ≠ ∅)
• M2: (∀i : 1 ≤ i ≤ N :: Si ∈ Ri)
• M3: (∀i : 1 ≤ i ≤ N :: |Ri| = K for some K)
• M4: Any site Sj is contained in K number of Ri's, 1 ≤ i, j ≤ N
• Maekawa showed that N = K(K − 1) + 1
• This relation gives |Ri| = K = √N
• Uses REQUEST, REPLY and RELEASE messages
• Performance → 3√N messages per CS invocation
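A minimal Python sketch that checks conditions M1-M4 for a given family of request sets; plugging in the request sets from the problem on the next slide reproduces the violations listed there (illustrative, not part of Maekawa's algorithm itself):

```python
from itertools import combinations

def check_quorums(R: dict) -> list[str]:
    """R maps site id -> set of site ids (its quorum). Returns the violations."""
    sites = sorted(R)
    K = len(R[sites[0]])
    problems = []
    for i, j in combinations(sites, 2):                 # M1: pairwise intersection
        if not (R[i] & R[j]):
            problems.append(f"M1: R{i} and R{j} are disjoint")
    for i in sites:                                     # M2: Si in Ri
        if i not in R[i]:
            problems.append(f"M2: S{i} not in R{i}")
    for i in sites:                                     # M3: equal size K
        if len(R[i]) != K:
            problems.append(f"M3: |R{i}| != {K}")
    for j in sites:                                     # M4: each site in K quorums
        count = sum(1 for i in sites if j in R[i])
        if count != K:
            problems.append(f"M4: S{j} appears in {count} quorums, expected {K}")
    return problems

if __name__ == "__main__":
    R = {1: {1, 2, 4}, 2: {2, 3, 5}, 3: {1, 3, 7}, 4: {2, 5, 6},
         5: {1, 4, 5}, 6: {3, 4, 6}, 7: {2, 6, 7}}
    print(check_quorums(R))
```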
Course Title: Distributed Computing BITS Pilani, Hyderabad Campus
Maekawa’s Algorithm: Problem

7 sites – S1, S2, S3, S4, S5, S6 and S7, K = 3.
R1 = {S1, S2, S4}
R2 = {S2, S3, S5}
R3 = {S1, S3, S7}
R4 = {S2, S5, S6}
R5 = {S1, S4, S5}
R6 = {S3, S4, S6}
R7 = {S2, S6, S7}
R3 ∩ R4 = ∅, which violates M1.
R5 ∩ R7 = ∅, which violates M1.
S4 does not belong to R4, which violates M2.
S2 belongs to R1, R2, R4 and R7 while K = 3, which violates M4.
S7 belongs to only R3 and R7 while K = 3, which violates M4.

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Maekawa’s Algorithm

• Algorithm Steps: from Recorded Lecture


• Deadlocks: from Recorded Lecture

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Reference

• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 9,


“Distributed Computing: Principles, Algorithms, and
Systems”, Cambridge University Press, 2008.

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Course ID: SS ZG526, Title: Distributed Computing 24 BITS Pilani, Hyderabad Campus
Distributed Mutual Exclusion

Dr. Barsha Mitra


BITS Pilani CSIS Dept, BITS Pilani, Hyderabad Campus
Hyderabad Campus
Raymond’s Tree-Based Algorithm

• uses a spanning tree of the network


• each node maintains a HOLDER variable that provides information
about the placement of the privilege in relation to the node itself
• a node stores in its HOLDER variable the identity of a node that it
thinks has the privilege or leads to the node having the privilege
• for 2 nodes X and Y, if HOLDERX = Y, the undirected edge between X
and Y can be redrawn as a directed edge from X to Y
• Node containing the privilege is also known as the root node

Course Title: Distributed Computing 2 BITS Pilani, Hyderabad Campus


Raymond’s Tree-Based Algorithm

• messages between nodes traverse along undirected edges of the tree


• node needs to hold information about and communicate only to its
immediate-neighboring nodes
• HOLDERN1 = N2
• HOLDERN2 = N3
• HOLDERN3 = self
• HOLDERN4 = N3
• HOLDERN5 = N4
• HOLDERN6 = N5
• HOLDERN7 = N5
• HOLDERN8 = N6
• HOLDERN9 = N6
[Figure: spanning tree of nodes N1–N9 with each HOLDER edge directed toward the privileged node N3]

Course Title: Distributed Computing 3 BITS Pilani, Hyderabad Campus


Raymond’s Tree-Based Algorithm

[Figure: two views of the spanning tree of nodes N1–N9 with edges directed according to the HOLDER variables]

Course Title: Distributed Computing 4 BITS Pilani, Hyderabad Campus


Raymond’s Tree-Based Algorithm

Data Structures
• HOLDER
• USING
• possible values true or false
• indicates if the current node is executing the critical section
• ASKED
• possible values true or false
• indicates if node has sent a request for the privilege
• prevents the sending of duplicate requests for privilege

Course Title: Distributed Computing 5 BITS Pilani, Hyderabad Campus


Raymond’s Tree-Based Algorithm

Data Structures
• REQUEST_Q
• FIFO queue that can contain “self ” or the identities of
immediate neighbors as elements
• REQUEST_Q of a node consists of the identities of
those immediate neighbors that have requested for
privilege but have not yet been sent the privilege
• maximum size of REQUEST_Q of a node is the number
of immediate neighbors + 1 (for “self ”)
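As an illustrative sketch only (the algorithm steps themselves are covered in the recorded lecture), the per-node state described on these slides could be grouped as follows; MAX_NEIGHBORS, SELF and the circular-queue bookkeeping fields are assumptions, not part of Raymond's original description.

/* Per-node state for Raymond's tree-based algorithm (illustrative sketch). */
#include <stdbool.h>

#define MAX_NEIGHBORS 8            /* assumed upper bound on tree degree */
#define SELF (-1)                  /* marker meaning "this node itself"  */

typedef struct {
    int  holder;                   /* HOLDER: neighbor believed to hold or lead to
                                      the privilege, or SELF if this node is the root */
    bool using_cs;                 /* USING: currently executing the CS               */
    bool asked;                    /* ASKED: a REQUEST has already been sent          */
    int  request_q[MAX_NEIGHBORS + 1];  /* REQUEST_Q: FIFO of neighbor ids or SELF    */
    int  q_head, q_len;            /* circular-queue bookkeeping (assumed)            */
} RaymondNode;

int main(void) {
    RaymondNode n = { .holder = 2, .using_cs = false, .asked = false,
                      .q_head = 0, .q_len = 0 };
    (void)n;                       /* state only; message handling omitted */
    return 0;
}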

Course Title: Distributed Computing 6 BITS Pilani, Hyderabad Campus


Suzuki–Kasami’s Broadcast Algorithm
• broadcasts a REQUEST message for the token to all other sites
• site that possesses the token sends it to the requesting site
upon the receipt of its REQUEST message
• if a site receives a REQUEST message when it is executing the
CS, it sends the token only after it has completed the CS
execution

Course Title: Distributed Computing 7 BITS Pilani, Hyderabad Campus


Suzuki–Kasami’s Broadcast Algorithm
• Distinguish outdated and current REQUEST messages
• REQUEST(j, sn) where sn (sn = 1, 2…..) is a sequence number that indicates that Sj is
requesting its snth CS execution
• Si keeps an array of integers RNi[1, … ,n] where RNi[j] is the largest sequence number
received in a REQUEST message so far from Sj
• when Si receives a REQUEST(j, sn) message, it sets RNi[j] = max(RNi[j], sn)
• when Si receives a REQUEST(j,sn) message, the request is outdated if RNi[j] > sn

Example with 3 sites and RN1 = [3 4 1] (see the sketch below):
• S1 receives REQUEST(2, 5): RN1[2] = max(4, 5) = 5, so RN1 becomes [3 5 1] and the request is current
• S1 receives REQUEST(2, 3): RN1[2] = max(4, 3) = 4, so RN1 remains [3 4 1]; since RN1[2] > 3, the request is outdated
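A small sketch (illustrative) of how a site updates RN on a REQUEST and recognises the second request above as outdated; the function name on_request and the 0-based array indexing are assumptions.

/* Suzuki-Kasami: update RN on a REQUEST and test whether it is outdated. */
#include <stdio.h>

#define N 3

/* RN[j-1] holds RNi[j]; returns 1 if REQUEST(j, sn) is outdated, 0 if current */
int on_request(int RN[], int j, int sn) {
    int outdated = RN[j - 1] > sn;
    if (sn > RN[j - 1]) RN[j - 1] = sn;     /* RNi[j] := max(RNi[j], sn) */
    return outdated;
}

int main(void) {
    int RN1[N] = {3, 4, 1};
    printf("REQUEST(2,5): %s, RN1 = [%d %d %d]\n",
           on_request(RN1, 2, 5) ? "outdated" : "current", RN1[0], RN1[1], RN1[2]);
    int RN1b[N] = {3, 4, 1};                /* same starting state as in the example */
    printf("REQUEST(2,3): %s, RN1 = [%d %d %d]\n",
           on_request(RN1b, 2, 3) ? "outdated" : "current", RN1b[0], RN1b[1], RN1b[2]);
    return 0;
}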

Course Title: Distributed Computing 8 BITS Pilani, Hyderabad Campus


Suzuki–Kasami’s Broadcast Algorithm
• sites with outstanding requests for the CS are determined in the following manner:
• token consists of a queue of requesting sites, Q, and an array of integers LN[1, … ,n],
where LN[j] is the sequence number of the request which Sj executed most recently
• after executing its CS, a site Si updates LN[i] = RNi[i]
• token array LN[1, … ,n] permits a site to determine if a site has an outstanding request
for the CS
• at Si, if RNi[j] = LN[j] + 1, then Sj is currently requesting a token
• after executing the CS, a site checks this condition for all the j’s to determine all the
sites that are requesting the token and places their i.d.’s in queue Q if these i.d.’s are
not already present in Q
• finally, the site sends the token to the site whose i.d. is at the head of Q

Course Title: Distributed Computing 9 BITS Pilani, Hyderabad Campus


Suzuki–Kasami’s Broadcast Algorithm
Requesting the critical section:
(a) If requesting site Si does not have the token, then it increments
its sequence number, RNi[i], and sends a REQUEST(i, sn) message to
all other sites. (“sn” is the updated value of RNi[i].)

(b) When a site Sj receives this message, it sets RNj[i] to max(RNj[i],


sn). If Sj has the idle token, then it sends the token to Si if RNj[i] =
LN[i]+1.

Executing the critical section:


(c) Site Si executes the CS after it has received the token.

Course Title: Distributed Computing 10 BITS Pilani, Hyderabad Campus


Suzuki–Kasami’s Broadcast Algorithm

Releasing the critical section: Having finished the execution of


the CS, site Si takes the following actions:
(d) It sets LN[i] element of the token array equal to RNi[i].

(e) For every site Sj whose i.d. is not in the token queue, it
appends its i.d. to the token queue if RNi[j] = LN[j] + 1.

(f) If the token queue is non-empty after the above update, Si


deletes the top site i.d. from the token queue and sends the
token to the site indicated by the i.d.
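Steps (d)–(f) can be sketched as a single local routine operating on the token. The following is a minimal, single-process illustration (not the book's pseudocode); the Token struct and the name release_cs are assumptions.

/* Suzuki-Kasami release steps (d)-(f): update LN, enqueue outstanding
   requesters, and pick the next token holder (illustrative sketch).     */
#include <stdio.h>
#include <stdbool.h>

#define N 4

typedef struct {
    int LN[N];            /* LN[j]: sequence number of Sj's last executed request */
    int Q[N];             /* queue of requesting site ids (0-based)               */
    int q_len;
} Token;

static bool in_queue(const Token *t, int j) {
    for (int k = 0; k < t->q_len; k++) if (t->Q[k] == j) return true;
    return false;
}

/* Site i finishes its CS; RN is its local request-number array.
   Returns the id of the next token holder, or -1 if the token stays idle. */
int release_cs(Token *t, const int RN[N], int i) {
    t->LN[i] = RN[i];                                   /* step (d) */
    for (int j = 0; j < N; j++)                         /* step (e) */
        if (j != i && !in_queue(t, j) && RN[j] == t->LN[j] + 1)
            t->Q[t->q_len++] = j;
    if (t->q_len == 0) return -1;                       /* step (f) */
    int next = t->Q[0];
    for (int k = 1; k < t->q_len; k++) t->Q[k - 1] = t->Q[k];
    t->q_len--;
    return next;                                        /* send token to 'next' */
}

int main(void) {
    Token t = { .LN = {0, 0, 0, 0}, .q_len = 0 };
    int RN0[N] = {1, 1, 0, 1};      /* S0 just ran request 1; S1 and S3 are waiting */
    int next = release_cs(&t, RN0, 0);
    printf("token goes to S%d, queue length now %d\n", next, t.q_len);
    return 0;
}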

Course Title: Distributed Computing 11 BITS Pilani, Hyderabad Campus


Suzuki–Kasami’s Broadcast Algorithm

Performance
• if a site holds the token, no message is required
• if a site does not hold the token, N messages are required

Course Title: Distributed Computing 12 BITS Pilani, Hyderabad Campus


Reference

• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 9,


“Distributed Computing: Principles, Algorithms, and
Systems”, Cambridge University Press, 2008.

Course Title: Distributed Computing BITS Pilani, Hyderabad Campus


Course ID: SS ZG526, Title: Distributed Computing 14 BITS Pilani, Hyderabad Campus
SS ZG 526: Distributed Computing
Introduction: Part 1

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Course Overview
Introduction to distributed computing
Logical time
Global snapshot
Causal ordering of messages in group communication

Distributed mutual exclusion


Consensus and agreement protocols

Distributed deadlock detection

Peer-to-Peer computing
Cluster computing

Grid computing

Internet of Things (IoT)

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Text and References

T1 Ajay D. Kshemkalyani, and Mukesh Singhal “Distributed Computing:


Principles, Algorithms, and Systems”, Cambridge University Press, 2008
(Reprint 2013).

R1 Kai Hwang, Geoffrey C. Fox, and Jack J. Dongarra, “Distributed and


Cloud Computing: From Parallel processing to the Internet of Things”,
Morgan Kaufmann, 2012 Elsevier Inc.

R2 John F. Buford, Heather Yu, and Eng K. Lua, “P2P Networking and
Applications”, Morgan Kaufmann, 2009 Elsevier Inc.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Golden era in Computing

Drivers of the golden era in computing:
• Powerful multi-core processors
• General-purpose graphic processors
• Explosion of domain applications
• Superior software methodologies
• Proliferation of devices
• Wider bandwidth for communication
• Virtualization leveraging the powerful hardware

Image source: Cloud Futures 2011, Redmond

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
What is a Distributed System?

• Loosely coupled collection of autonomous computers connected by a


network running a distributed operating system to produce a single
integrated computing environment ( a virtual computer).

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Relation between software components

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Networked vs. distributed systems

Networking OS Distributed OS

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Motivation for Distributed Computing

• Resource sharing
• Access to remote resources
• Increased performance/cost ratio
• Reliability
• Availability, integrity, fault-tolerance
• Scalability
• Modularity and incremental expandability

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Few Examples …

Linux cluster
Global access to a resource using a distributed architecture

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Continued…

Tor P2P Overlay Image source: Wiki

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next session

Multiprocessor vs multicomputer systems, Distributed


systems design issues

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Introduction: Part 2

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Multiprocessor Vs Multi-computer systems

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Multiprocessors

a) A crossbar switch, b) An omega switching network

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Distributed Systems: Pros and Cons
• Advantages
– Communication and resource sharing possible
– Economy : Price-performance
– Reliability & scalability
– Potential for incremental growth
• Disadvantages
– Distribution-aware OSs and applications
– High network connectivity is essential
– Security and privacy

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Design Issues: Scaling
Technique (1): Hiding latency
• If possible, use asynchronous communication
– Not always possible if the client has nothing else to do
• Alternatively, move part of the computation to the client


SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Scaling Technique (2): Distribution aware

Example: DNS name resolution

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
More Design Issues

• Lack of Global Knowledge


• Naming
• Compatibility
• Process Synchronization
• Resource Management
• Security

Image source: www.bbc.co.uk

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next session

Distributed Communication models

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Communication Models

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
RPC: Remote Procedure Call

• Issues:
– identifying and accessing the remote procedure
– parameters
– return value
• Sun RPC
• Microsoft’s DCOM
• OMG’s CORBA
• Java RMI
• XML/RPC
• SOAP/.NET
• AJAX (Asynchronous Javascript and XML)
SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Remote Procedure Calls (RPC)

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Procedure arguments

• To reduce the complexity of the interface specification, Sun RPC includes


support for a single argument to a remote procedure.*

• Typically the single argument is a structure that contains a number of values.

• *Newer versions can handle multiple arguments.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Procedure identification

• Each procedure is identified by:


– Hostname (IP Address)
– Program identifier (32 bit integer)
– Procedure identifier (32 bit integer)
– Program version identifier (for testing and migration)

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Program identifiers

• Each remote program has a unique ID.


• Sun divided up the IDs:
– 0x00000000 - 0x1fffffff Sun
– 0x20000000 - 0x3fffffff SysAdmin
– 0x40000000 - 0x5fffffff Transient
– 0x60000000 - 0xffffffff Reserved

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SUN RPC

struct square_in {
long arg1;
};
struct square_out {
long res1;
};
program SQUARE_PROG {
version SQUARE_VERS {
square_out SQUAREPROC(square_in) = 1;
} = 1;
} = 0x13451111;

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Rpcgen

rpcgen takes a protocol description as an input file and generates C source code:
client stubs, XDR filters, a header file, and a server skeleton

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Rpcgen continued…

bash$ rpcgen square.x

produces:
– square.h header
– square_svc.c server stub
– square_clnt.c client stub
– square_xdr.c XDR conversion routines

Function names derived from IDL function names and version numbers

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Square Client: Client.c
#include "square.h"
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char **argv)
{ CLIENT *cl;
square_in in;
square_out *out;
if (argc != 3) { printf("client <localhost> <integer>"); exit(1); }
cl = clnt_create(argv[1], SQUARE_PROG, SQUARE_VERS, "tcp");
in.arg1 = atol(argv[2]);
if ((out = squareproc_1(&in, cl)) == NULL)
{ printf("Error"); exit(1); }
printf("Result %ld\n", out->res1);
exit(0);
}
SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Square server: server.c

#include "square.h"
#include <stdio.h>
square_out *squareproc_1_svc (square_in *inp, struct svc_req *rqstp)
{
static square_out outp;
outp.res1 = inp->arg1 * inp->arg1;
return (&outp);
}

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Exe creation

• gcc -o client client.c square_clnt.c square_xdr.c -lnsl

• gcc -o server server.c square_svc.c square_xdr.c -lrpcsvc -lnsl

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next module

How to model a distributed computation, and logical


clocks?

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Computation Model, and Logical Clocks

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
A distributed program

• A distributed program is composed of a set of n


asynchronous processes, p1, p2, ..., pi , ..., pn.

• The processes do not share a global memory and


communicate solely by message passing.

• The processes do not share a global clock that is


instantaneously accessible to these processes.

• Process executions and message transfers are


asynchronous.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
A model of distributed executions

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Models of Communication Networks

• There are several models of services provided by


communication networks, namely, FIFO, Non-FIFO, and
causal ordering.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Global state of a distributed system

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Consistent global state

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next session

Lamport’s logical clocks

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Mutual Exclusion (DME)

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Distributed Mutual Exclusion

What is mutual exclusion?

1. Simultaneous update and read of a directory/file

2. Can two processes send their data to a printer?

So, it is exclusive access to a shared resource or to the


critical region.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Performance of DME Algorithms

• Performance of each algorithm is measured in terms of


- no. of messages required for CS invocation
- synchronization delay (leaving & entering)
- response time (arrival, sending out, Entry & Exit)

• System throughput = 1/(SD + E), where SD is the synchronization delay and E is the average CS execution time (e.g., SD = 2 ms and E = 8 ms give a throughput of 100 CS executions per second)

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Continued…

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
A Centralized Algorithm

• A Central controller with a queue for deferring


replies.
• Request, Reply, and Release messages.
• Reliability and Performance bottleneck.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next session

Lamport's DME

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Mutual Exclusion (DME)

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Lamport’s Distributed Mutual Exclusion
Requesting the critical section.

1. When a site Si wants to enter the CS, it sends a REQUEST(T=tsi, i)


message to all the sites in its request set Ri and places the request on
request_queuei.
2. When a site Sj receives the REQUEST(tsi , i) message from site Si, it
returns a timestamped REPLY message to Si and places site Si’s request
on request_queuej.

Executing the critical section.

Site Si enters the CS when the two following conditions hold:


1. Si has received a message with timestamp larger than (tsi, i) from all
other sites.
2. Si’s request is at the top of request_queuei.
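A minimal sketch (illustrative) of the two entry conditions, assuming request_queue_i is kept sorted by timestamp and that latest[j] records the timestamp of the most recent message received from site Sj; the names can_enter and Req are assumptions.

/* Lamport's DME: test whether site i may enter the CS (illustrative sketch). */
#include <stdio.h>
#include <stdbool.h>

#define N 3                         /* sites S1..S3 */

typedef struct { long ts; int site; } Req;

static bool ts_less(Req a, Req b) {
    return a.ts < b.ts || (a.ts == b.ts && a.site < b.site);
}

/* queue[]   : request_queue_i, kept sorted by (ts, site); queue[0] is the head
   latest[j] : timestamp of the most recent message received from site Sj (1..N) */
bool can_enter(const Req queue[], int qlen, const long latest[], Req my_req) {
    if (qlen == 0 || queue[0].site != my_req.site) return false;   /* condition 2 */
    for (int j = 1; j <= N; j++)                                   /* condition 1 */
        if (j != my_req.site && !ts_less(my_req, (Req){latest[j], j}))
            return false;
    return true;
}

int main(void) {
    Req q[2] = { {1, 1}, {1, 2} };      /* request_queue1 = ((1,1),(1,2)) as in the example */
    long latest[N + 1] = {0, 0, 3, 2};  /* assumed timestamps of REPLYs from S2 and S3 */
    printf("%s\n", can_enter(q, 2, latest, (Req){1, 1}) ? "S1 may enter the CS"
                                                        : "S1 must wait");
    return 0;
}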

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Continued…
Releasing the critical section.

1. Site Si, upon exiting the CS, removes its request from the top of its request
queue and sends a timestamped RELEASE message to all the sites in its
request set.

2. When a site Sj receives a RELEASE message from site Si, it removes Si’s
request from its request queue.

3. When a site removes a request from its request queue, its own request may
become the top of the queue, enabling it to enter the CS. The algorithm
executes CS requests in the increasing order of timestamps.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Correctness
• Suppose that both Si and Sj were in CS at the same time (t).
• Then we have:

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next Session

Ricart-Agrawala, and Maekawa’s DME

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Mutual Exclusion (DME)

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Ricart-Agrawala DME

Requesting Site:
– A requesting site Pi sends a message request(ts,i) to all
sites.
Receiving Site:
– Upon reception of a request(ts,i) message, the receiving
site Pj will immediately send a timestamped reply(ts,j)
message if and only if:
• Pj is not requesting or executing the critical section OR
• Pj is requesting the critical section but sent a request
with a higher timestamp than the timestamp of Pi
– Otherwise, Pj will defer the reply message.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Maekawa’s DME

• A site requests permission only from a subset of sites.

• Request set of sites Si & Sj: Ri, Rj such that Ri and Rj will have
at-least one common site (Sk). Sk mediates conflicts between
Ri and Rj.

• A site can send only one REPLY message at a time, i.e., a site
can send a REPLY message only after receiving a
RELEASE message for the previous REPLY message.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Request set

• Conditions M1 and M2 are necessary for correctness; whereas


conditions M3 and M4 provide other desirable features to the algorithm.
• Condition M3 states that the size of the request sets of all sites must be
equal implying that all sites should have to do equal amount of work to
invoke mutual exclusion.
• Condition M4 enforces that exactly the same number of sites should
request permission from any site implying that all sites have “equal
responsibility” in granting permission to other sites.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Maekawa’s Request set
R1 = { 1, 2, 3, 4 }
R2 = { 2, 5, 8, 11 }
R3 = { 3, 6, 8, 13 }
R4 = { 4, 6, 10, 11 }
R5 = { 1, 5, 6, 7 }
R6 = { 2, 6, 9, 12 }
R7 = { 2, 7, 10, 13 }          (N = 13)
R8 = { 1, 8, 9, 10 }
R9 = { 3, 7, 9, 11 }
R10 = { 3, 5, 10, 12 }
R11 = { 1, 11, 12, 13 }
R12 = { 4, 7, 8, 12 }
R13 = { 4, 5, 9, 13 }

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next Session

Maekawa’s DME algorithm

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Mutual Exclusion (DME)

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Maekawa’s DME Algo
Requesting the critical section
1. A site Si requests access to the CS by sending REQUEST(i)
messages to all the sites in its request set Ri.

2. When a site Sj receives the REQUEST(i) message, it sends a


– REPLY(j) message to Si provided it hasn’t sent a REPLY
message to a site from the time it received the last RELEASE
message.
– Otherwise, it queues up the REQUEST for later consideration.

Executing the critical section


1. Site Si accesses the CS only after receiving REPLY messages
from all the sites in Ri .
SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Continued…

Releasing the critical section


1. After the execution of the CS, site Si sends RELEASE(i) message to all the
sites in Ri .

2. When a site Sj receives a RELEASE(i) message from site Si, it sends a


REPLY message to the next site waiting in the queue and deletes that entry
from the queue. If the queue is empty, then the site updates its state to reflect
that the site has not sent out any REPLY message.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example
R1 = {S1, S2, S3}
R2 = {S2, S4, S6}
R3 = {S3, S5, S6}
R4 = {S1, S4, S5}
R5 = {S2, S5, S7}
R6 = {S1, S6, S7}
R7 = {S3, S4, S7}

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Deadlocks
R1 = {S1, S2, S3}
R2 = {S2, S4, S6}
R3 = {S3, S5, S6}
R4 = {S1, S4, S5}
R5 = {S2, S5, S7}
R6 = {S1, S6, S7}
R7 = {S3, S4, S7}

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Deadlock handling
FAILED
A FAILED message from site Si to site Sj indicates that Si
cannot grant Sj's request because it has currently granted
permission to a site with a higher priority request. This tells Sj
that it is not merely waiting for a delayed REPLY message.

INQUIRE
An INQUIRE message from Si to Sj indicates that Si would like
to find out from Sj if it has succeeded in locking all the sites in
its request set.

YIELD
A YIELD message from site Si to Sj indicates that Si is
returning the permission to Sj (to yield to a higher priority
request at Sj ).

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next Session

Token based DME

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Computation Model, and Logical Clocks

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Logical Clocks: Scalar time and Vector time

• Causality among events in a distributed system is a


powerful concept in reasoning, analysing, and drawing
inferences about a computation.

• The knowledge of the causal precedence relation among


the events of processes helps solve a variety of problems
in distributed systems, such as distributed algorithms
design, tracking of dependent events, knowledge about the
progress of a computation, and concurrency measures.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Scalar Time
• Proposed by Lamport as an attempt to totally order events
in a distributed system.
• Time domain is the set of non-negative integers.
• The logical local clock of a process pi and its local view of
the global time are squashed into one integer variable Ci.
• Rules R1 and R2 to update the clocks are as follows:
• R1: Before executing an event (send, receive, or internal),
process pi executes the following:
Ci := Ci + d (d > 0)
In general, every time R1 is executed, d can have a
different value; however, typically d is kept at 1.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Scalar Time

• R2: Each message piggybacks the clock value of its sender at


sending time. When a process pi receives a message with
timestamp Cmsg, it executes the following actions:
– Ci := max(Ci, Cmsg)
– Execute R1
– Deliver the message

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
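A minimal sketch (illustrative) of rules R1 and R2 above with d = 1; the Process struct and function names are assumptions.

/* Lamport scalar clock: rules R1 and R2 with d = 1 (illustrative sketch). */
#include <stdio.h>

typedef struct { long C; } Process;

/* R1: executed before every send, receive, or internal event */
static void tick(Process *p) { p->C += 1; }

/* R2: executed when a message stamped C_msg is received */
static void on_receive(Process *p, long C_msg) {
    p->C = p->C > C_msg ? p->C : C_msg;   /* C_i := max(C_i, C_msg) */
    tick(p);                              /* then execute R1        */
}

int main(void) {
    Process p1 = {0}, p2 = {0};
    tick(&p1);                   /* p1 sends a message              */
    long stamp = p1.C;           /* message piggybacks C = 1        */
    tick(&p2); tick(&p2);        /* p2 performs two local events    */
    on_receive(&p2, stamp);      /* p2 receives the message         */
    printf("C1 = %ld, C2 = %ld\n", p1.C, p2.C);   /* C1 = 1, C2 = 3 */
    return 0;
}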
Scalar Time Examples

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Basic Properties

Consistency Property
• Scalar clocks satisfy the monotonicity and hence the
consistency property:
for two events ei and ej: ei → ej ⇒ C(ei) < C(ej)

Total Ordering
• Scalar clocks can be used to totally order events in a distributed
system.
• The main problem in totally ordering events is that two or more
events at different processes may have identical timestamp.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Total Ordering

• A tie-breaking mechanism is needed to order such events. A tie


is broken as follows:
– Process identifiers are linearly ordered and a tie among events with
identical scalar timestamps is broken on the basis of their process
identifiers.
– The lower the process identifier in the ranking, the higher the priority.
– The timestamp of an event is denoted by a tuple (t, i) where t is its time of
occurrence and i is the identity of the process where it occurred.
– The total order relation ≺ on two events x and y with timestamps (h, i) and
(k, j), respectively, is defined as: x ≺ y ⇔ (h < k) or (h = k and i < j)

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Properties. . .

Event counting
• If the increment value d is always 1, the scalar time has the
following interesting property: if event e has a timestamp h, then
h-1 represents the minimum logical duration, counted in units of
events, required before producing the event e;
• We call it the height of the event e.
• In other words, h-1 events have been produced sequentially
before the event e regardless of the processes that produced
these events.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Properties. . .

No Strong Consistency

• The system of scalar clocks is not strongly consistent; that is,
for two events ei and ej, C(ei) < C(ej) does not imply ei → ej
• The reason that scalar clocks are not strongly consistent is
that the logical local clock and logical global clock of a process
are squashed into one, resulting in the loss of causal
dependency information among events at different processes.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next session

Vector clocks

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Computation Model, and Logical Clocks

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Vector Time

• The system of vector clocks was developed independently


by Fidge, Mattern and Schmuck.

• In the system of vector clocks, the time domain is


represented by a set of n-dimensional non-negative integer
vectors.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Vector Time
Process pi uses the following two rules R1 and R2 to update its
clock:
• R1: Before executing an event, process pi updates its local
logical time as follows:
vti[i] := vti[i] + d (d > 0)
• R2: Each message m is piggybacked with the vector clock vt of
the sender process at sending time. On the receipt of such a
message (m,vt), process pi executes the following sequence of
actions:
– Update its global logical time as follows: 1 ≤ k ≤ n : vti[k] := max(vti[k], vt[k])
– Execute R1
– Deliver the message m
SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
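A minimal sketch (illustrative) of rules R1 and R2 above for n = 3 processes with d = 1; the array representation and function names are assumptions.

/* Vector clock rules R1 and R2 for n = 3 processes, d = 1 (illustrative sketch). */
#include <stdio.h>

#define N 3

/* R1: before executing an event at process i */
static void tick(long vt[N], int i) { vt[i] += 1; }

/* R2: on receipt of message m carrying the sender's vector clock vt_msg */
static void on_receive(long vt[N], int i, const long vt_msg[N]) {
    for (int k = 0; k < N; k++)                     /* vt_i[k] := max(vt_i[k], vt[k]) */
        if (vt_msg[k] > vt[k]) vt[k] = vt_msg[k];
    tick(vt, i);                                    /* then execute R1 */
}

int main(void) {
    long vt0[N] = {0}, vt1[N] = {0};
    tick(vt0, 0);                       /* p0 sends m, piggybacking its clock */
    long m[N]; for (int k = 0; k < N; k++) m[k] = vt0[k];
    tick(vt1, 1);                       /* a local event at p1 */
    on_receive(vt1, 1, m);              /* p1 receives m */
    printf("vt1 = [%ld %ld %ld]\n", vt1[0], vt1[1], vt1[2]);   /* [1 2 0] */
    return 0;
}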
Vector Time Example

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Vector Time
Comparing Vector Timestamps

• The following relations are defined to compare two vector
timestamps, vh and vk:
– vh ≤ vk ⇔ ∀x : vh[x] ≤ vk[x]
– vh < vk ⇔ vh ≤ vk and ∃x : vh[x] < vk[x]
– vh || vk ⇔ ¬(vh < vk) and ¬(vk < vh)

• If the process at which an event occurred is known, the test to
compare two timestamps can be simplified as follows: if events x
and y respectively occurred at processes pi and pj and are
assigned timestamps vh and vk, respectively, then
– x → y ⇔ vh[i] ≤ vk[i]
– x || y ⇔ vh[i] > vk[i] and vk[j] > vh[j]

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
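A small comparator sketch (illustrative) for the relations on the previous slide; the return-code convention is an assumption.

/* Compare two vector timestamps: returns -1 if vh < vk, 1 if vk < vh,
   0 if they are equal, and 2 if they are concurrent (illustrative sketch). */
#include <stdio.h>

#define N 3

int vt_compare(const long vh[N], const long vk[N]) {
    int le = 1, ge = 1;                 /* vh <= vk ?  vh >= vk ? */
    for (int x = 0; x < N; x++) {
        if (vh[x] > vk[x]) le = 0;
        if (vh[x] < vk[x]) ge = 0;
    }
    if (le && ge) return 0;             /* identical timestamps            */
    if (le) return -1;                  /* vh < vk: x causally precedes y  */
    if (ge) return 1;                   /* vk < vh                         */
    return 2;                           /* concurrent (vh || vk)           */
}

int main(void) {
    long a[N] = {1, 2, 0}, b[N] = {1, 3, 1}, c[N] = {2, 0, 0};
    printf("a vs b: %d\n", vt_compare(a, b));   /* -1: a < b       */
    printf("a vs c: %d\n", vt_compare(a, c));   /*  2: concurrent  */
    return 0;
}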
Vector Time
Strong Consistency
• The system of vector clocks is strongly consistent; thus, by examining the
vector timestamp of two events, we can determine if the events are causally
related.
• However, Charron-Bost showed that the dimension of vector clocks cannot be
less than n, the total number of processes in the distributed computation, for
this property to hold.
Event Counting
• If d=1 (in rule R1), then the ith component of vector clock at process pi, vti[i],
denotes the number of events that have occurred at pi until that instant.
• So, if an event e has timestamp vh, vh[j] denotes the number of events
executed by process pj that causally precede e. Clearly, Σvh[j] − 1 represents
the total number of events that causally precede e in the distributed
computation.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Efficient Implementations of Vector Clocks

• If the number of processes in a distributed computation is large,


then vector clocks will require piggybacking of huge amount of
information in messages.
• The message overhead grows linearly with the number of
processors in the system and when there are thousands of
processors in the system, the message size becomes huge even if
there are only a few events occurring in few processors.
• We discuss an efficient way to maintain vector clocks.
• Charron-Bost showed that if vector clocks have to satisfy the strong
consistency property, then in general vector timestamps must be at
least of size n, the total number of processes.
• However, optimizations are possible, and next we discuss a
technique to implement vector clocks efficiently.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Singhal-Kshemkalyani’s Differential Technique

• based on the fact that between successive message sends to the same process, only
a few entries of the vector clock at the sender process are likely to change.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next module

Global state recording and Causal ordering of


messages

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Global State Recording

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Global state recording

• Recording the global state of a distributed system on-the-fly is


an important problem.
• Why is it non-trivial?

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Uses of Global State Recording

• Recording a “consistent” state of the global


computation
– checkpointing in fault tolerance (rollback,
recovery)

• Detecting stable properties in a distributed


system via snapshots. A property is “stable”
if, once it holds in a state, it holds in all
subsequent states.

– termination detection
– deadlock detection

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Issues in recording a global state

• The following two issues need to be addressed:


I1: How to distinguish between the messages to be recorded in
the snapshot from those not to be recorded?
-Any message that is sent by a process before recording its
snapshot, must be recorded in the global snapshot.
-Any message that is sent by a process after recording its
snapshot, must not be recorded in the global snapshot.
I2: How to determine the instant when a process takes its
snapshot?
-A process pj must record its snapshot before processing a
message mij that was sent by process pi after recording its
snapshot.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Difficulties due to Non-determinism

• Deterministic Computation
• - At any point in computation there is at
most one event that can happen next.

• Non-Deterministic Computation
• - At any point in computation there can
be more than one event that can happen
next.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example

Non-deterministic

Deterministic

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Consistent / Inconsistent global states

[Figure: space-time diagrams of processes p and q exchanging messages m1, m2 and m3, illustrating cuts that give consistent and inconsistent global states]
SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Properties of the recorded global state

• The recorded global state may not correspond to


any of the global states that occurred during the
computation.

• This happens because a process can change its


state asynchronously before the markers it sent
are received by other sites and other sites record
their states.
 But the system could have passed through the recorded global
states in some equivalent executions.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next session

Chandy Lamports’ Global state recording algorithm

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Global State Recording

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Snapshot algorithms for FIFO channels

Chandy-Lamport algorithm
• The Chandy-Lamport algorithm uses a control message, called
a marker whose role in a FIFO system is to separate
messages in the channels.
• After a site has recorded its snapshot, it sends a marker, along
all of its outgoing channels before sending out any more
messages.
• A marker separates the messages in the channel into those to
be included in the snapshot from those not to be recorded in
the snapshot.
• A process must record its snapshot no later than when it
receives a marker on any of its incoming channels.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Chandy-Lamport algorithm continued…

• The algorithm can be initiated by any process by executing the


“Marker Sending Rule” by which it records its local state and
sends a marker on each outgoing channel.
• A process executes the “Marker Receiving Rule” on receiving a
marker. If the process has not yet recorded its local state, it
records the state of the channel on which the marker is
received as empty and executes the “Marker Sending Rule” to
record its local state.
• The algorithm terminates after each process has received a
marker on all of its incoming channels.
• All the local snapshots get disseminated to all other processes
and all the processes can determine the global state.
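A compact sketch (illustrative, not the textbook's pseudocode) of the marker sending and receiving rules at a single process with two incoming channels; the Proc struct and the counter used to record in-transit messages are assumptions.

/* Chandy-Lamport marker rules at one process (illustrative sketch). */
#include <stdio.h>
#include <stdbool.h>

#define C 2                         /* number of incoming channels (assumed) */

typedef struct {
    bool recorded;                  /* has this process recorded its local state? */
    bool channel_closed[C];         /* marker already received on this channel?   */
    int  channel_msgs[C];           /* messages recorded as "in transit"          */
} Proc;

static void record_local_state_and_send_markers(Proc *p) {
    p->recorded = true;             /* Marker Sending Rule: record state, then    */
    /* ... send a marker on every outgoing channel before any other message ...   */
}

static void on_marker(Proc *p, int ch) {
    if (!p->recorded) {
        p->channel_msgs[ch] = 0;    /* state of channel ch is recorded as empty   */
        record_local_state_and_send_markers(p);
    }
    p->channel_closed[ch] = true;   /* stop recording messages on this channel    */
}

static void on_message(Proc *p, int ch) {
    if (p->recorded && !p->channel_closed[ch])
        p->channel_msgs[ch]++;      /* message belongs to the channel's snapshot  */
}

int main(void) {
    Proc p = { false, {false, false}, {0, 0} };
    on_marker(&p, 0);               /* first marker: record state, channel 0 empty */
    on_message(&p, 1);              /* arrives before the marker on channel 1      */
    on_marker(&p, 1);
    printf("channel states: C0 = %d msgs, C1 = %d msgs\n",
           p.channel_msgs[0], p.channel_msgs[1]);    /* C0 = 0, C1 = 1 */
    return 0;
}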

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
State Recording Example

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Global state recording for non-FIFO channels

• A marker (like in Chandy-Lamport) cannot be used


to delineate messages.

• Here, either some degree of inhibition or


piggybacking of control information on computation
messages is to be used to capture out-of-sequence
messages.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Lai-Yang algorithm for non-FIFO channels
The Lai-Yang algorithm uses a coloring scheme on computation
messages that works as follows:
1 Every process is initially white and turns red while taking a
snapshot. The equivalent of the “Marker Sending Rule” is
executed when a process turns red.
2 Every message sent by a white (red) process is colored white
(red).
3 Thus, a white (red) message is a message that was sent before
(after) the sender of that message recorded its local snapshot.
4 Every white process takes its snapshot at its convenience, but
no later than the instant it receives a red message.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Lai-Yang algorithm continued…
5 Every white process records a history of all white messages
sent or received by it along each channel.
6 When a process turns red, it sends these histories along with its
snapshot to the initiator process that collects the global
snapshot.
7 The initiator process evaluates transit(LSi, LSj) to compute the
state of a channel Cij as given below:
SCij = {white messages sent by Pi on Cij} − {white messages
received by Pj on Cij}

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next module

Causal ordering of messages

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Causal Ordering of Messages

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Group communication

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Causal ordering of Messages

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Birman- Schiper- Stephenson (BSS) protocol

• Before broadcasting m, process Pi increments the vector time
VTPi[i] and timestamps m.
• A process Pj (j ≠ i) that receives m with timestamp VTm from Pi
delays its delivery until both:
– VTPj[i] = VTm[i] − 1 // Pj has received all previous broadcasts from Pi
– VTPj[k] ≥ VTm[k] for every k ∈ {1, 2, …, n} − {i}
// Pj has received all messages that Pi had received before sending m
• When Pj delivers m, VTPj is updated by rule IR2 of the vector clock
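A minimal sketch (illustrative) of the two delivery conditions checked at Pj; the can_deliver name and 0-based process indices are assumptions.

/* BSS protocol: can Pj deliver message m broadcast by Pi now? (illustrative sketch) */
#include <stdio.h>
#include <stdbool.h>

#define N 3

bool can_deliver(const long VT_Pj[N], const long VT_m[N], int i) {
    if (VT_Pj[i] != VT_m[i] - 1) return false;    /* all previous broadcasts from Pi seen  */
    for (int k = 0; k < N; k++)                   /* all messages Pi had seen are seen too */
        if (k != i && VT_Pj[k] < VT_m[k]) return false;
    return true;
}

int main(void) {
    long VT_Pj[N] = {1, 0, 2};
    long VT_m[N]  = {2, 0, 1};     /* second broadcast by P0 */
    printf("%s\n", can_deliver(VT_Pj, VT_m, 0) ? "deliver" : "buffer");
    long VT_m2[N] = {3, 0, 1};     /* third broadcast by P0: one is still missing */
    printf("%s\n", can_deliver(VT_Pj, VT_m2, 0) ? "deliver" : "buffer");
    return 0;
}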

SS ZG 526: Distributed Computing


Example1 of BSS

SS ZG 526: Distributed Computing


Example2 of BSS

SS ZG 526: Distributed Computing


Next session

SES protocol for Causal ordering

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Causal Ordering of Messages

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Schiper-Eggli-Sandoz (SES) protocol

• No need for broadcast messages.


• Each process maintains a vector V_P of size N - 1, N being the
number of processes in the system.
• V_P is a vector of tuple (P’,t): P’ the destination process id and t, a
vector timestamp.
• Tm: logical time of sending message ‘m’
• Tpi: present logical time at pi
• Initially, V_P is empty.

SS ZG 526: Distributed Computing


SES Continued…
Sending a Message:
– Send message M, time stamped tm, along with V_P1 to P2.
– Insert (P2, tm) into V_P1. Overwrite the previous value of
(P2,t), if any.
– (P2,tm) is not sent. Any future message carrying (P2,tm) in
V_P1 cannot be delivered to P2 until tm < Tp2.
Delivering a message
– If V_M (in the message) does not contain any pair (P2, t), it
can be delivered.
– Otherwise, a pair (P2, t) exists in V_M: if t ≮ Tp2, buffer the
message (do not deliver it yet);
– else (t < Tp2), deliver it
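A minimal sketch (illustrative) of the delivery test at destination P2, using the componentwise comparison described above; the Pair struct and can_deliver name are assumptions.

/* SES protocol: delivery test at destination process P2 (illustrative sketch). */
#include <stdio.h>
#include <stdbool.h>

#define N 3

typedef struct { int dest; long t[N]; bool valid; } Pair;   /* (P', t) entry in V_M */

/* componentwise test t < T, as used on the slide */
static bool vt_lt(const long t[N], const long T[N]) {
    bool strict = false;
    for (int k = 0; k < N; k++) {
        if (t[k] > T[k]) return false;
        if (t[k] < T[k]) strict = true;
    }
    return strict;
}

/* V_M: the vector of (destination, timestamp) pairs carried by the message */
bool can_deliver(const Pair V_M[], int len, int me, const long T_me[N]) {
    for (int k = 0; k < len; k++)
        if (V_M[k].valid && V_M[k].dest == me && !vt_lt(V_M[k].t, T_me))
            return false;               /* a pair (me, t) with t not < T_me: buffer */
    return true;
}

int main(void) {
    long T_P2[N] = {2, 1, 0};
    Pair V_M[1] = { { .dest = 2, .t = {1, 0, 0}, .valid = true } };
    printf("%s\n", can_deliver(V_M, 1, 2, T_P2) ? "deliver" : "buffer");
    return 0;
}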

SS ZG 526: Distributed Computing


Example1 of SES

SS ZG 526: Distributed Computing


Example2 of SES

SS ZG 526: Distributed Computing


Next module

Distributed mutual exclusion

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Mutual Exclusion (DME)

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Suzuki-kasami broadcast DME

Each site Si maintains an array RNi[1..n], where RNi[j] is the largest request
sequence number received so far from site Sj.

The token carries a queue Q of requesting sites and an array LN[1..n], where
LN[j] is the sequence no. of the request that site Sj executed most recently.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
The DME algorithm
Requesting the critical section.
1. If the requesting site Si does not have the token, then it increments its sequence number,
RNi [i], and sends a REQUEST(i, sn) message to all other sites. (sn is the updated value of
RNi [i].)
2. When a site Sj receives this message, it sets RNj [i] to max(RNj [i], sn). If Sj has the idle
token, it sends the token to Si if RNj [i] = LN[i] + 1.

Executing the critical section.


3. Site Si executes the CS when it has received the token.

Releasing the critical section.


Having finished the execution of the CS, site Si takes the following actions:
4. It sets LN[i] element of the token array equal to RNi [i].
5. For every site Sj whose ID is not in the token queue, it appends its ID to the token queue if
RNi [j] = LN[j] + 1.
6. If token queue is nonempty after the above update, then it deletes the top site ID from the
queue and sends the token to the site indicated by the ID.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Analysis of the DME algorithm
Correctness
Mutex is trivial.
– Theorem:
A requesting site enters the CS in finite amount of time.
– Proof:
A request enters the token queue in finite time. The queue is
in FIFO order, and there can be at most N − 1 sites ahead
of the request.

Performance
0 or N messages per CS invocation.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next Session

Raymonds tree based DME

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
SS ZG 526: Distributed Computing
Distributed Mutual Exclusion (DME)

Chittaranjan Hota
Professor, Dept. of Computer Sc. & Information Systems
BITS Pilani BITS, Pilani Hyderabad Campus, Hyderabad
Hyderabad Campus hota@hyderabad.bits-pilani.ac.in
Raymond's Tree-Based Algorithm


SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
The Algorithm
Requesting the critical section.
1. When a site wants to enter the CS, it sends a REQUEST message to the
node along the directed path to the root, provided it does not hold the token and
its request_q is empty. It then adds its request to its request_q.

2. When a site on the path receives this message, it places the REQUEST in its
request_q and sends a REQUEST message along the directed path to the root,
provided it has not already sent out a REQUEST message on its outgoing edge.

3. When the root site receives a REQUEST message, it sends the token to the
site from which it received the REQUEST message and sets its holder
variable to point at that site.

4. When a site receives the token, it deletes the top entry from its request_q,
sends the token to the site indicated in this entry, and sets its holder variable
to point at that site. If the request_q is nonempty at this point, then the site
sends a REQUEST message to the site which is pointed at by the holder
variable.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Continued…

Executing the critical section


1. A site enters the CS when it receives the token and its own entry is at the
top of its request_q. In this case, the site deletes the top entry from its
request_q and enters the CS.

Releasing the critical section


1. If its request_q is nonempty, then it deletes the top entry from its
request_q, sends the token to that site, and sets its holder variable to point at
that site.
2. If the request_q is nonempty at this point, then the site sends a REQUEST
message to the site which is pointed at by the holder variable.

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example1

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Example2

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Analysis

Proof of Correctness
Mutex is trivial.
Finite waiting: all the requests in the system form a FIFO queue and the
token is passed in that order.

Performance
O(logN) messages per CS invocation. (Average distance between two
nodes in a tree)

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
Next module

Deadlock detection in distributed systems

SS ZG 526: Distributed Computing Birla Institute of Technology and Science, Pilani (Deemed University)
