UNIT I INTRODUCTION
DEFINITION:
Figure 1.2 shows the relationships of the software components that run on each of
the computers and use the local operating system and network protocol stack for
functioning. The distributed software is also termed middleware. A distributed
execution is the execution of processes across the distributed system to
collaboratively achieve a common goal. An execution is also sometimes termed a
computation or a run.
Figure 1.2 schematically shows the interaction of this software with these system
components at each processor. Here we assume that the middleware layer does not
contain the traditional application layer functions of the network protocol stack, such
as http, mail, ftp, and telnet. Various primitives and calls to functions defined in
various libraries of the middleware layer are embedded in the user program code.
There exist several libraries to choose from to invoke primitives for the more
common functions – such as reliable and ordered multicasting – of the middleware
layer. There are several standards, such as the Object Management Group's (OMG)
common object request broker architecture (CORBA) [36] and the remote
procedure call (RPC) mechanism.
MOTIVATION
Figure 1.3 Two standard architectures for parallel systems. (a) Uniform memory
access (UMA) multiprocessor system. (b) Non-uniform memory access (NUMA)
multiprocessor. In both architectures, the processors may locally cache data from
memory.
Figure 1.4 Interconnection networks for shared memory multiprocessor systems. (a)
Omega network [4] for n = 8 processors P0–P7 and memory banks M0–M7. (b)
Butterfly network [10] for n = 8 processors P0–P7 and memory banks M0–M7.
Figure 1.4 shows two popular interconnection networks – the Omega network [4] and
the Butterfly network [10], each of which is a multi-stage network formed of 2×2
switching elements. Each 2×2 switch allows data on either of the two input wires to
be switched to the upper or the lower output wire. In a single step, however, only one
data unit can be sent on an output wire. So if the data from both the input wires is to
be routed to the same output wire in a single step, there is a collision. Various
techniques such as buffering or more elaborate interconnection designs can address
collisions.
Omega routing function: The routing function from input line i to output line j
considers only j and the stage number s, where s ∈ [0, log2 n − 1]. In a stage s
switch, if the (s+1)th MSB (most significant bit) of j is 0, the data is routed to
the upper output wire, otherwise it is routed to the lower output wire.
Let there be M = n/2 switches per stage, and let a switch be denoted by the tuple
(x, s), where x ∈ [0, M−1] and stage s ∈ [0, log2 n − 1]. The two outgoing edges
from any switch (x, s) are as follows. There is an edge from switch (x, s) to
switch (y, s+1) if (i) x = y or (ii) x XOR y has exactly one 1 bit, which is in
the (s+1)th MSB. For stage s, apply the rule above for M/2^s switches.
Butterfly routing function: In a stage s switch, if the (s+1)th MSB of j is 0, the
data is routed to the upper output wire, otherwise it is routed to the lower
output wire.
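The stage-by-stage routing decision described above can be expressed in a few lines; a minimal Python sketch assuming n = 8 (so log2 n = 3 stages) and integer destination labels (the function name is illustrative):

```python
def route_decision(j, s, n=8):
    """Switch decision at stage s when routing toward output j.

    In a stage-s switch, the data is routed to the upper output wire
    if the (s+1)th MSB of j is 0, and to the lower output wire otherwise.
    """
    stages = n.bit_length() - 1          # log2(n) stages in the network
    msb = (j >> (stages - 1 - s)) & 1    # the (s+1)th most significant bit of j
    return "upper" if msb == 0 else "lower"

# Routing to destination j = 5 (binary 101) through the three stages:
path = [route_decision(5, s) for s in range(3)]
```

Since 5 is 101 in binary, the stages examine bits 1, 0, 1 in turn, giving the decisions lower, upper, lower.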
Figure 1.5(a) shows a wrap-around 4×4 mesh. A k×k mesh contains k^2 processors,
and the maximum path length between any two processors is 2(k/2 − 1). Routing can
be done along the Manhattan grid.
Figure 1.5(b) shows a four-dimensional hypercube. A k-dimensional hypercube has
2^k processor-and-memory units [13,21]. Each such unit is a node in the hypercube,
and has a unique k-bit label. Each of the k dimensions is associated with a bit
position in the label. The labels of any two adjacent nodes are identical except
for the bit position corresponding to the dimension in which the two nodes differ.
Thus, the processors are labelled such that the shortest path between any two
processors is the Hamming distance (defined as the number of bit positions in
which the two equal-sized bit strings differ) between the processor labels.
Figure 1.5 Some popular topologies for multicomputer shared-memory machines. (a)
Wrap-around 2D-mesh, also known as torus. (b) Hypercube of dimension 4.
Example: Nodes 0101 and 1100 have a Hamming distance of 2. The shortest path
between them has length 2.
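The Hamming-distance property and the dimension-by-dimension routing it implies can be checked mechanically; a small Python sketch using integer node labels (function names are illustrative):

```python
def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which the two labels differ."""
    return bin(x ^ y).count("1")

def hypercube_route(src: int, dst: int, k: int):
    """Greedy route from src to dst in a k-dimensional hypercube:
    each hop flips one bit in which the current label differs from dst,
    so the path length equals the Hamming distance."""
    path, cur = [src], src
    for bit in range(k):
        if (cur ^ dst) & (1 << bit):   # labels differ in this dimension
            cur ^= 1 << bit            # cross that dimension
            path.append(cur)
    return path

# Nodes 0101 and 1100 differ in exactly two bit positions.
d = hamming_distance(0b0101, 0b1100)
route = hypercube_route(0b0101, 0b1100, 4)
```

Here `d` is 2 and the route makes exactly two hops, matching the example above.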
Flynn’s taxonomy
Figure 1.6 Flynn's taxonomy of SIMD, MIMD, and MISD architectures for
multiprocessor/multicomputer systems.
SIMD: Single Instruction Stream Multiple Data Stream
• E.g., visualization applications
MIMD: Multiple Instruction Stream Multiple Data Stream
Coupling
Tightly coupled systems share a (common) shared address space throughout the
system and communicate via shared data variables and control variables.
Semaphores and monitors, which were originally designed for shared memory
uniprocessors and multiprocessors, are examples. For a distributed system,
this abstraction is called distributed shared memory.
• Latencies involved in read and write operations may be high even when using
shared memory emulation, because the read and write operations are implemented
by using network-wide communication.
• In a MIMD message-passing multicomputer system, each “processor”
may be a tightly coupled multiprocessor system with shared memory.
Classification of Primitives
Synchronous (send/receive)
Nonblocking (send/receive)
Figure 1.7 A non-blocking send primitive. When the Wait call returns, at least one
of its parameters is posted.
Send primitive
◦ synchronous blocking,
◦ synchronous non-blocking,
◦ asynchronous blocking,
◦ asynchronous non-blocking.
Receive primitive
Timing diagram
Blocking Receive
non-blocking Receive
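The non-blocking send of Figure 1.7 can be mimicked with a thread and a handle object; the send call returns control immediately with a handle, and a later Wait on that handle blocks until the parameter has been posted. This is a minimal sketch (the `Handle` class, `send_nonblocking` function, and in-memory channel are all illustrative, not a real messaging API):

```python
import threading
import queue

class Handle:
    """Returned by a non-blocking send; wait() blocks until the
    operation completes, i.e. the parameter has been posted."""
    def __init__(self):
        self._done = threading.Event()
    def wait(self):
        self._done.wait()

channel = queue.Queue()   # stands in for the communication channel

def send_nonblocking(data):
    h = Handle()
    def worker():
        channel.put(data)  # copy the data out of the user buffer
        h._done.set()      # mark the handle as posted
    threading.Thread(target=worker).start()
    return h               # control returns immediately to the caller

h = send_nonblocking("m1")
h.wait()                   # like the Wait call in Figure 1.7
received = channel.get()
```

A blocking send would simply not return until the put completed; the handle is what lets the sender overlap computation with the communication.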
Processor synchrony
• Software libraries for remote method invocation (RMI) and remote object
invocation (ROI).
asynchronous executions
asynchronous execution
◦ no processor synchrony
synchronous execution
• processors are synchronized
• clock drift rate between any two processors is bounded.
• message delivery (transmission + delivery) times are such that they occur
in one logical step or round.
• known upper bound on the time taken by a process to execute a step.
System Emulations
Figure 1.11 Emulations among the principal system classes in a failure-free system.
• Secure channels, access control, key management (key generation and key
distribution), authorization, secure group management
• Scalability and modularity of algorithms, data, services
• 3D space, 1D time
• Physical time (clock) accuracy
• Logical time captures inter-process dependencies and tracks relative time
progression
• Global state observation: inherent distributed nature of system
• Concurrency measures: concurrency depends on program logic, execution
speeds within logical threads, communication speeds
Synchronization/coordination mechanisms
Mutual exclusion
Consistency models:
• Failure detectors: mechanisms that provide information about the failure of
processes of interest
• Distributed agents
• Grid computing
o Grid of shared computing resources; use idle CPU cycles
o Issues: scheduling, QoS guarantees, security of machines and jobs
• Security
o Confidentiality, authentication, availability in a distributed setting
o The actions are atomic and the actions of a process are modeled as three types
of events, namely, internal events, message send events, and message receive
events.
o For a message m, let send (m) and rec (m) denote its send and receive events,
respectively.
o A send event changes the state of the process that sends the message and the
state of the channel on which the message is sent.
o A receive event changes the state of the process that receives the message and
the state of the channel on which the message is received.
The execution of process pi produces a sequence of events ei^1, ei^2, ..., ei^x, ei^(x+1), ...
o The send and the receive events signify the flow of information between
processes and establish causal dependency from the sender process to the
receiver process.
o A relation →msg, that captures the causal dependency due to
message exchange, is defined as follows. For every message m that is
exchanged between two processes, we have
o send(m) →msg rec(m).
o In the Figure 2.1, for process p1, the second event is a message send event,
the third event is an internal event, and the fourth event is a message receive
event.
The causal precedence relation induces an irreflexive partial order on the events of
a distributed computation that is denoted as H=(H, →).
Note that the relation → is nothing but Lamport's "happens before" relation.
For any two events ei and ej, if ei → ej, then event ej is directly or transitively
dependent on event ei. (Graphically, it means that there exists a path consisting
of message arrows and process-line segments (along increasing time) in the
space-time diagram that starts at ei and ends at ej.)
For example, in Figure 2.1, event e6 has the knowledge of all other events shown in
the figure.
For any two events ei and ej, ei ↛ ej denotes the fact that event ej is not
directly or transitively dependent on event ei; that is, event ei does not
causally affect event ej. In this case, event ej is not aware of the execution of
ei or of any event executed after ei on the same process.
For any two events ei and ej, if ei ↛ ej and ej ↛ ei, then events ei and ej are
said to be concurrent, denoted as ei ∥ ej.
The relation ∥ is not transitive; that is, (ei ∥ ej) ∧ (ej ∥ ek) ⇏ ei ∥ ek.
For example, in Figure 2.1, e3^3 ∥ e2^4 and e2^4 ∥ e1^5; however, e3^3 ∦ e1^5.
For any two events ei and ej in a distributed execution, ei → ej or ej → ei, or
ei ∥ ej.
In a distributed computation, two events are logically concurrent if and only if they
do not causally affect each other.
Physical concurrency, on the other hand, has a connotation that the events occur at
the same instant in physical time.
Two or more events may be logically concurrent even though they do not occur at
the same instant in physical time.
However, if the processor speeds and message delays had been different, the
execution of these events could very well have coincided in physical time.
Whether a set of logically concurrent events coincides in physical time or not
does not change the outcome of the computation.
Therefore, even though a set of logically concurrent events may not have occurred
at the same instant in physical time, we can assume that these events occurred at
the same instant in physical time.
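Causal precedence and logical concurrency can be tested mechanically: build a directed graph whose edges are the process-order segments and the message arrows, and check reachability. A small Python sketch with hypothetical event names:

```python
from collections import defaultdict

edges = defaultdict(set)   # event -> set of immediate successors

def add_edge(a, b):
    """a -> b: either next event on the same process, or send -> receive."""
    edges[a].add(b)

def happens_before(a, b):
    """True iff there is a path of process-line segments and message
    arrows from a to b (Lamport's happens-before; irreflexive)."""
    stack, seen = list(edges[a]), set()
    while stack:
        x = stack.pop()
        if x == b:
            return True
        if x not in seen:
            seen.add(x)
            stack.extend(edges[x])
    return False

def concurrent(a, b):
    """a || b iff neither causally affects the other."""
    return not happens_before(a, b) and not happens_before(b, a)

# p1: e11 -> e12 (process order); p2: e21 -> e22 (process order);
# a message sent at e11 is received at e22.
add_edge("e11", "e12"); add_edge("e21", "e22"); add_edge("e11", "e22")
```

With this execution, `happens_before("e11", "e22")` holds via the message edge, while `e12` and `e21` are concurrent: neither can reach the other.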
In the non-FIFO model, a channel acts like a set in which the sender process adds
messages and the receiver process removes messages from it in a random order.
A system that supports the causal ordering model satisfies the following
property: CO: For any two messages mij and mkj, if send(mij) → send(mkj), then
rec(mij) → rec(mkj).
This property ensures that causally related messages destined to the same
destination are delivered in an order that is consistent with their causality
relation. Causally ordered delivery of messages implies FIFO message delivery.
(Note that CO ⊂ FIFO ⊂ Non-FIFO.)
“A collection of the local states of its components, namely, the processes and the
communication channels.”
The state of a process is defined by the contents of processor registers, stacks, local
memory, etc. and depends on the local context of the distributed application.
The state of channel is given by the set of messages in transit in the channel.
The occurrence of events changes the states of respective processes and channels.
A send event changes the state of the process that sends the message and the state
of the channel on which the message is sent.
A receive event changes the state of the process that receives the message and
the state of the channel on which the message is received.
Notations
A Channel State
The state of a channel depends upon the states of the processes it connects.
Global State
The global state of a distributed system is a collection of the local states of the
processes and the channels.
For a global state to be meaningful, the states of all the components of the distributed
system must be recorded at the same instant.
This will be possible if the local clocks at processes were perfectly synchronized or
if there were a global system clock that can be instantaneously read by the processes.
(However, both are impossible.)
A Consistent Global State
Even if the state of all the components is not recorded at the same instant, such a
state will be meaningful provided every message that is recorded as received is also
recorded as sent.
Basic idea is that a state should not violate causality – an effect should not be present
without its cause. A message cannot be received if it was not sent.
Such states are called consistent global states and are meaningful global states.
Inconsistent global states are not meaningful in the sense that a distributed system
can never be in an inconsistent state.
An Example
In Figure 2.2, the global state shown is inconsistent because the state of p2 has
recorded the receipt of message m12, whereas the state of p1 has not recorded its
send.
• A cut slices the space-time diagram, and thus the set of events in the
distributed computation, into a PAST and a FUTURE.
• The PAST contains all the events to the left of the cut and the FUTURE
contains all the events to the right of the cut.
• For a cut C, let PAST(C) and FUTURE(C) denote the set of events in the
PAST and FUTURE of C, respectively.
• Every cut corresponds to a global state and every global state can be
graphically represented as a cut in the computation's space-time diagram.
• Cuts in a space-time diagram provide a powerful graphical aid in
representing and reasoning about global states of a computation.
• In a consistent cut, every message received in the PAST of the cut was sent in
the PAST of that cut. (In Figure 2.3, cut C2 is a consistent cut.)
• All messages that cross the cut from the PAST to the FUTURE are in transit
in the corresponding consistent global state.
• A cut is inconsistent if a message crosses the cut from the FUTURE to the
PAST. (In Figure 2.3, cut C1 is an inconsistent cut.)
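The consistency condition on cuts can be checked directly; a Python sketch in which a message is a (send event, receive event) pair and a cut is given by its PAST set (the event names are illustrative):

```python
def is_consistent_cut(past_events, messages):
    """A cut is consistent iff no message crosses it from the FUTURE
    to the PAST: whenever rec(m) is in the PAST, send(m) must be too."""
    past = set(past_events)
    return all(send in past for send, rec in messages if rec in past)

messages = [("s1", "r1"), ("s2", "r2")]

# Like cut C2: every message received in the PAST was also sent in the
# PAST (s2 is in the PAST with r2 still in the FUTURE: merely in transit).
consistent = is_consistent_cut({"s1", "r1", "s2"}, messages)

# Like cut C1: r2 is in the PAST but s2 is not, so a message crosses
# from FUTURE to PAST and the cut is inconsistent.
inconsistent = is_consistent_cut({"s1", "r2"}, messages)
```

Messages whose send is in the PAST and receive in the FUTURE are exactly the messages in transit in the corresponding consistent global state.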
• An event ej could have been affected only by events ei such that ei → ej.
• In this situation, all the information available at ei could be made accessible
at ej.
• All such events ei belong to the past of ej.
Let Past(ej) denote all events in the past of ej in a computation (H, →). Then,
Past(ej) = {ei | ∀ei ∈ H, ei → ej}.
• Let Pasti(ej) be the set of all those events of Past(ej) that are on process pi.
• Pasti(ej) is a totally ordered set, ordered by the relation →i, whose maximal
element is denoted by max(Pasti(ej)).
• max(Pasti(ej)) is the latest event at process pi that affected event ej (Figure
2.4).
• Let Max_Past(ej) = ∪(∀i){max(Pasti(ej))}.
• Max_Past(ej) consists of the latest event at every process that affected event
ej.
• Past(ej) represents all events on the past light cone that affect ej.
Future Cone of an Event
• Define Futurei(ej) as the set of those events of Future(ej) that are on
process pi.
• Define min(Futurei(ej)) as the first event on process pi that is affected by
ej.
• Define Min_Future(ej) as ∪(∀i){min(Futurei(ej))}, which consists of the first
event at every process that is causally affected by event ej.
• Min_Future(ej) is referred to as the surface of the future cone of ej.
• All events at a process pi that occurred after max(Pasti(ej)) but before
min(Futurei(ej)) are concurrent with ej.
• Therefore, all and only those events of computation H that belong to the set
(H − Past(ej) − Future(ej)) are concurrent with event ej.
The sender process resumes execution only after it learns that the receiver
process has accepted the message. Thus, the sender and the receiver processes
must synchronize to exchange a message. On the other hand, the asynchronous
communication model is a non-blocking type where the sender and the receiver do
not synchronize to exchange a message.
After having sent a message, the sender process does not wait for the message to
be delivered to the receiver process.
The message is buffered by the system and is delivered to the receiver process
when it is ready to accept the message.
Logical Time
Introduction
Three ways to implement logical time: scalar time, vector time, and matrix time.
Definition
A system of logical clocks consists of a time domain T and a logical clock C,
where C is a function that maps an event e in the computation to an element in
the time domain T; that is, C : H → T, such that ei → ej ⟹ C(ei) < C(ej).
Each process pi maintains data structures that allow it the following two
capabilities:
• A local logical clock, denoted by lci, that helps process pi measure its own
progress.
• A logical global clock, denoted by gci, that is a representation of process
pi's local view of the logical global time.
The protocol ensures that a process's logical clock, and thus its view of the
global time, is managed consistently. The protocol consists of the following two
rules:
• R1: This rule governs how the local logical clock is updated by a process when
it executes an event.
• R2: This rule governs how a process updates its global logical clock to update
its view of the global time and global progress.
• Systems of logical clocks differ in their representation of logical time and also
in the protocol to update the logical clocks.
Scalar Time
• Proposed by Lamport in 1978 as an attempt to totally order events in a
distributed system.
• Time domain is the set of non-negative integers.
• The logical local clock of a process pi and its local view of the global time
are squashed into one integer variable Ci.
• Rules R1 and R2 to update the clocks are as follows:
• R1: Before executing an event (send, receive, or internal), process pi
executes the following:
Ci := Ci + d (d > 0)
• R2: Each message piggybacks the clock value of its sender at sending time.
When a process pi receives a message with timestamp Cmsg, it executes the
following actions: Ci := max(Ci, Cmsg); execute R1; deliver the message.
Basic Properties
Consistency Property
Scalar clocks satisfy the monotonicity and hence the consistency property: for
two events ei and ej, ei → ej ⟹ C(ei) < C(ej).
Total Ordering
• Process identifiers are linearly ordered, and a tie among events with identical
scalar timestamps is broken on the basis of their process identifiers.
• The lower the process identifier in the ranking, the higher the priority.
• The timestamp of an event is denoted by a tuple (t, i ) where t is its time
of occurrence and i is the identity of the process where it occurred.
• The total order relation ≺ on two events x and y with timestamps (h,i) and
(k,j), respectively, is defined as follows:
x ≺ y ⟺ (h < k) or (h = k and i < j)
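Rules R1/R2 together with the (t, i) tie-breaking rule can be sketched as follows in Python (d = 1; the class and function names are illustrative):

```python
class ScalarClock:
    """Lamport scalar clock with increment d = 1."""
    def __init__(self, pid):
        self.pid, self.c = pid, 0

    def tick(self):
        """R1: before executing any event, increment the clock;
        the event's timestamp is the tuple (t, i)."""
        self.c += 1
        return (self.c, self.pid)

    def on_receive(self, c_msg):
        """R2: take the max with the piggybacked clock value,
        then execute R1 and deliver the message."""
        self.c = max(self.c, c_msg)
        return self.tick()

def precedes(x, y):
    """Total order: (h, i) < (k, j) iff h < k, or h = k and i < j.
    Python's tuple comparison implements exactly this rule."""
    return x < y

p1, p2 = ScalarClock(1), ScalarClock(2)
t_send = p1.tick()                 # p1's send event gets timestamp (1, 1)
t_recv = p2.on_receive(t_send[0])  # p2's receive event gets (2, 2)
```

Note how the tie-break makes the order total: two events with the same scalar time, say (3, 1) and (3, 2), are ordered by process identifier.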
Event counting
• If the increment value d is always 1, the scalar time has the following
interesting property: if event e has a timestamp h, then h-1 represents the
minimum logical duration, counted in units of events, required before
producing the event e;
• We call it the height of the event e.
• In other words, h − 1 events have been produced sequentially before the
event e, regardless of the processes that produced these events.
• For example, in Figure 3.1, five events precede event b on the longest
causal path ending at b.
No Strong Consistency
• The system of scalar clocks is not strongly consistent; that is, for two
events ei and ej, C(ei) < C(ej) ⇏ ei → ej.
• For example, in Figure 3.1, the third event of process P1 has a smaller scalar
timestamp than the third event of process P2. However, the former did not
happen before the latter.
• The reason that scalar clocks are not strongly consistent is that the logical
local clock and logical global clock of a process are squashed into one,
resulting in the loss of causal dependency information among events at
different processes.
• For example, in Figure 3.1, when process P2 receives the first message
from process P1, it updates its clock to 3, forgetting that the timestamp of
the latest event at P1 on which it depends is 2.
Vector Time
In the system of vector clocks, the time domain is represented by a set of
n-dimensional non-negative integer vectors. Each process pi maintains a vector
vti[1..n], where vti[i] is the local logical clock of pi and vti[j] (for j ≠ i)
represents process pi's latest knowledge of process pj's local time.
Process pi uses the following two rules R1 and R2 to update its clock:
• R1: Before executing an event, process pi updates its local logical time as
follows:
vti [i ] := vti [i ] + d (d > 0)
• R2: Each message m is piggybacked with the vector clock vt of the sender
process at sending time. On the receipt of such a message (m, vt), process
pi executes the following sequence of actions:
o Update its global logical time as follows: for 1 ≤ k ≤ n,
vti[k] := max(vti[k], vt[k]).
o Execute R1.
o Deliver the message m.
• The timestamp of an event is the value of the vector clock of its process
when the event is executed.
• Figure 3.2 shows an example of vector clocks progress with the increment
value d=1.
Initially, a vector clock is [0, 0, 0, ...., 0].
The following relations are defined to compare two vector timestamps, vh and vk:
vh = vk ⟺ ∀x : vh[x] = vk[x]
vh ≤ vk ⟺ ∀x : vh[x] ≤ vk[x]
vh < vk ⟺ vh ≤ vk and ∃x : vh[x] < vk[x]
vh ∥ vk ⟺ ¬(vh < vk) ∧ ¬(vk < vh)
Isomorphism
Strong Consistency
Event Counting
• If d = 1 (in rule R1), then the ith component of the vector clock at process
pi, vti[i], denotes the number of events that have occurred at pi until that
instant.
• So, if an event e has timestamp vh, then vh[j] denotes the number of events
executed by process pj that causally precede e. Clearly, Σj vh[j] − 1
represents the total number of events that causally precede e in the
distributed computation.
When a process pi sends a message to a process pj, it piggybacks only those
entries of its vector clock that have changed since the last message sent to pj.
• In the worst case, every element of the vector clock has been updated at pi
since the last message to process pj, and the next message from pi to pj will
need to carry the entire vector timestamp of size n.
• However, on the average the size of the timestamp on a message will be
less than n.
• Implementation of this technique requires each process to remember the
vector timestamp in the message last sent to every other process.
• Direct implementation of this will result in O(n2) storage overhead at each
process.
• Singhal and Kshemkalyani developed a clever technique that cuts down
this storage overhead at each process to O(n). The technique works in the
following manner:
• Process pi maintains the following two additional vectors: LSi[1..n] ("Last
Sent"), where LSi[j] is the value of vti[i] when pi last sent a message to pj,
and LUi[1..n] ("Last Update"), where LUi[j] is the value of vti[i] when pi last
updated the entry vti[j]. A message then carries only the set of tuples
{(k, latest value of the kth entry)} for those entries k of the vector clock
that have changed since the last message to the same destination.
• This technique requires that the communication channels follow FIFO
discipline for message delivery.
• This method is illustrated in Figure 3.3. For instance, the second message
from p3 to p2 (which contains a timestamp {(3, 2)}) informs p2 that the third
component of the vector clock has been modified and the new value is 2.
• This is because process p3 (indicated by the third component of the vector)
has advanced its clock value from 1 to 2 since the last message sent to p2.
• This technique substantially reduces the cost of maintaining vector clocks
in large systems, especially if the process interactions exhibit temporal or
spatial localities.
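The differential piggybacking idea can be sketched as follows. This is the direct variant that remembers the whole vector last sent to each destination (the O(n^2)-storage version; the LS/LU vectors above reduce this to O(n)), and it mirrors the Figure 3.3 example in which the second message from p3 to p2 carries only the changed third component. Class and method names are illustrative:

```python
class DiffVC:
    """Vector clock that piggybacks only the entries changed since
    the last message to the same destination (a sketch of the
    Singhal-Kshemkalyani differential technique)."""
    def __init__(self, pid, n):
        self.pid, self.vt = pid, [0] * n
        self.last_sent = {}              # dest -> vector at last send

    def send_to(self, dest):
        self.vt[self.pid] += 1           # R1 for the send event
        old = self.last_sent.get(dest, [0] * len(self.vt))
        # piggyback only {entry index: new value} for changed entries
        diff = {k: v for k, (v, o) in enumerate(zip(self.vt, old))
                if v != o}
        self.last_sent[dest] = list(self.vt)
        return diff

    def receive(self, diff):
        for k, v in diff.items():
            self.vt[k] = max(self.vt[k], v)
        self.vt[self.pid] += 1           # R1 for the receive event

p3 = DiffVC(2, 3)        # process p3 is component index 2; n = 3
d1 = p3.send_to(1)       # first message to p2 carries {2: 1}
d2 = p3.send_to(1)       # only the third component changed: {2: 2}
```

The second message carries a single (entry, value) pair rather than the full vector, exactly as in the {(3, 2)} example; FIFO channels are required so the receiver applies the diffs in order.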
Matrix Time
In a system of matrix clocks, time is represented by a set of n × n matrices of
non-negative integers. A process pi maintains a matrix mti[1..n, 1..n], where
mti[i,i] is the local logical clock of pi, mti[i,j] denotes the latest knowledge
that pi has of pj's local logical clock, and mti[j,k] represents the knowledge
that pi has about pj's knowledge of pk's local logical clock.
Figure 3.4 gives an example to illustrate how matrix clocks progress in a distributed
computation. We assume d=1.
• Let us consider the following events: e, which is the xi-th event at process
pi; e1k and e2k, which are the x1k-th and x2k-th events at process pk; and
e1j and e2j, which are the x1j-th and x2j-th events at pj.
• Let mte denote the matrix timestamp associated with event e. Due to message
m4, e2k is the last event of pk that causally precedes e; therefore, we have
mte[i,k] = mte[k,k] = x2k.
• Likewise, mte[i,j] = mte[j,j] = x2j. The last event of pk known by pj, to the
knowledge of pi when it executed event e, is e1k; therefore, mte[j,k] = x1k.
• Likewise, we have mte[k,j] = x1j.
Basic Properties
• Vector mti[i, ·] contains all the properties of vector clocks.
• In addition, matrix clocks have the following property:
min_k(mti[k,l]) ≥ t implies that process pi knows that every other process
pk knows that pl's local time has progressed till t.
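The property above is what makes matrix clocks useful for discarding obsolete information: if min over k of mti[k][l] ≥ t, every process (as far as pi knows) has seen pl's progress up to t, so information about pl older than t is no longer needed at pi. A sketch with a hypothetical 3×3 matrix (row k holds what this process knows pk knows):

```python
def all_know(mt, l, t):
    """True iff min_k(mt[k][l]) >= t: every process k, to this
    process's knowledge, has seen pl's local time reach t."""
    return min(row[l] for row in mt) >= t

# Hypothetical matrix clock at pi (n = 3, pi is process 0):
mti = [
    [4, 3, 2],   # pi's own vector view of the three processes
    [3, 3, 2],   # what pi knows p1 knows
    [3, 2, 2],   # what pi knows p2 knows
]

# Every row records p2's time as at least 2, so information about
# p2 older than t = 2 is obsolete at pi and can be discarded.
gc_ok = all_know(mti, 2, 2)

# p0's own time is 4, but p1 and p2 have only seen it reach 3.
not_yet = all_know(mti, 0, 4)
```

This minimum-over-a-column test is the basis for garbage-collecting old information (e.g., obsolete messages or checkpoints) in matrix-clock-based protocols.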
Virtual Time
• A virtual time system is a paradigm for organizing and synchronizing
distributed systems.
• This section provides a description of virtual time and its implementation
using the Time Warp mechanism.
• The implementation of virtual time using the Time Warp mechanism works on
the basis of an optimistic assumption: synchronization conflicts, and thus
rollbacks, generally occur rarely.
Next, we discuss in detail virtual time and how the Time Warp mechanism is used
to implement it.
Unless the clocks in each machine have a common notion of time, time-based
queries cannot be answered.
• Clock synchronization has a significant effect on many problems
like secure systems, fault diagnosis and recovery, scheduled
operations, database systems, and real-world clock values.
• Clock synchronization is the process of ensuring that physically
distributed processors have a common notion of time.
• Due to different clock rates, the clocks at various sites may diverge with
time, and periodically a clock synchronization must be performed to correct
this clock skew in distributed systems.
• Clocks are synchronized to an accurate real-time standard like
UTC (Universal Coordinated Time).
• Clocks that must not only be synchronized with each other but also
have to adhere to physical time are termed physical clocks.
The drift of clock Ca is the second derivative of the clock value with respect
to time, namely, C″a(t). The drift of clock Ca relative to clock Cb at time t is
C″a(t) − C″b(t).
Clock Inaccuracies
Physical clocks are synchronized to an accurate real-time standard like UTC
(Universal Coordinated Time).
• However, due to the clock inaccuracy discussed above, a timer (clock) is said
to be working within its specification if 1 − ρ ≤ dC/dt ≤ 1 + ρ, where the
constant ρ is the maximum skew rate specified by the manufacturer.
Figure 3.5 illustrates the behavior of fast, slow, and perfect clocks with
respect to UTC.
Figure 3.5: The behavior of fast, slow, and perfect clocks with respect to UTC.
• Let a = T1 − T3 and b = T2 − T4. If the differential delay between the two
directions is small, the clock offset O and the round-trip delay D are
approximately given by O = (a + b)/2 and D = a − b.
• The eight most recent pairs of (Oi, Di) are retained; the value of Oi
corresponding to the minimum Di is chosen to estimate the offset.
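The one-exchange estimate can be computed as follows; this sketch uses the standard NTP labeling (T1 = client send, T2 = server receive, T3 = server send, T4 = client receive, all as local clock readings), which may assign the timestamps differently from the text's a/b convention:

```python
def offset_delay(T1, T2, T3, T4):
    """Standard NTP estimates from one request/response exchange:
    the offset of the server clock relative to the client, and the
    round-trip network delay (assumes roughly symmetric paths)."""
    offset = ((T2 - T1) + (T3 - T4)) / 2
    delay = (T4 - T1) - (T3 - T2)
    return offset, delay

# Hypothetical exchange: the server clock is 5 units ahead of the
# client and each network leg takes 1 unit. The client sends at
# T1 = 10 (server time 15); the server receives at T2 = 16 and
# replies at T3 = 16; the client receives at T4 = 12 (server time 17).
o, d = offset_delay(10, 16, 16, 12)
```

Here the estimate recovers the true offset of 5 and a round-trip delay of 2; in practice several (Oi, Di) pairs are collected and the offset paired with the smallest delay is trusted most, as noted above.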