


Lecture Notes
Dr. Tong Lai Yu, March 2010

0. Review and Overview
1. B-Trees
2. An Introduction to Distributed Systems
3. Deadlocks
4. Distributed Systems Architecture
5. Processes
6. Communication
7. Distributed OS Theories
8. Distributed Mutual Exclusions
9. Agreement Protocols
10. Distributed Scheduling
11. Distributed Resource Management
12. Recovery and Fault Tolerance
13. Security and Protection

In a few hundred years, when the history of our time

is written from a long-term perspective, it is likely

that the most important event those historians will see

is not technology, not the Internet, not e-commerce. It

is an unprecedented change in human condition. For the

first time, they will have to manage themselves.

Peter Drucker

Distributed OS Theories
1. Inherent Limitations of a Distributed System

Absence of Global clock

difficult to establish the temporal order of events
difficult to collect up-to-date information on the
state of the entire system

Absence of Shared Memory

no individual process has an up-to-date state of the entire system,
as there is no shared memory
coherent view -- all observations of different processes ( computers )
are made at the same physical time

we can obtain either a coherent but partial view of the system or
a complete but incoherent view of the system

complete view ( global state ) = local views ( local states )
+ messages in transit

difficult to obtain a coherent global state

4. Clock Synchronization

Physical Clocks


Sometimes we simply need the exact time, not just an ordering.


Coordinated Universal Time (UTC):

Based on the number of transitions per second of the cesium 133 atom
(pretty accurate).
At present, the real time is taken as the average of some 50
cesium clocks around the world.
A leap second is introduced from time to time to compensate for days
getting longer.


UTC is broadcast through short wave radio and satellite. Satellites can give

an accuracy of about ±0.5 ms.


Suppose we have a distributed system with a UTC-receiver

somewhere in it => we still have to distribute its time to each machine.

Basic principle

Every machine has a timer that generates an interrupt H times per second.

There is a clock in machine p that ticks on each timer interrupt.

Denote the value of that clock by Cp(t), where t is UTC time.

Ideally, we have that for each machine p, Cp(t) = t, or, in other
words, dC/dt = 1.
In practice: 1 - r ≤ dC/dt ≤ 1 + r, where r is the maximum drift rate.


Never let two clocks in any system differ by more than δ time units =>
synchronize at least every δ/(2r) seconds.
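The bound above can be checked with a small calculation. The sketch below is only illustrative: the function name `max_sync_interval` and the sample values (a drift rate of 1e-5 and a tolerated skew of 1 ms) are assumptions, not figures from these notes.

```python
def max_sync_interval(delta: float, r: float) -> float:
    """Longest resynchronization period (in seconds) that keeps any two
    clocks within delta seconds of each other, given maximum drift rate r."""
    if delta <= 0 or r <= 0:
        raise ValueError("delta and r must be positive")
    # Two clocks drifting in opposite directions diverge by up to 2r
    # seconds per second, so the skew reaches delta after delta / (2r) seconds.
    return delta / (2 * r)


# Assumed example values: drift rate 1e-5, tolerated skew 1 ms.
interval = max_sync_interval(delta=1e-3, r=1e-5)
print(f"resynchronize at least every {interval:.0f} s")
```

With these assumed numbers the machines would have to resynchronize roughly every 50 seconds.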

Global positioning system

Basic idea

You can get an accurate account of time as a side-effect of GPS.


Assuming that the clocks of the satellites are accurate and
synchronized:

It takes a while before a signal reaches the receiver

The receiver's clock is definitely out of sync with the satellite

Principal operation

Δr : unknown deviation of the receiver's clock.
xr , yr , zr : unknown coordinates of the receiver.
Ti : timestamp on a message from satellite i.
Δi = ( Tnow - Ti ) + Δr : measured delay of the message sent by satellite i.
Measured distance to satellite i: c × Δi ( c is the speed of light )
Real distance is
di = c × Δi - c × Δr = sqrt( ( xi - xr )² + ( yi - yr )² + ( zi - zr )² )
where xi , yi , zi are the known coordinates of satellite i.

4 satellites => 4 equations in 4 unknowns ( with Δr as one of them )

Clock Synchronization Principle

Principle I

Every machine asks a time server for the accurate time at least once

every δ/(2r) seconds (Network Time Protocol).


Okay, but you need an accurate measure of round trip delay, including

interrupt handling and processing incoming messages.

Principle II

Let the time server scan all machines periodically, calculate an

average, and inform each machine how it should adjust its time relative

to its present time.


Okay, you'll probably get every machine in sync. You don't even need

to propagate UTC time.


You'll have to take into account that setting the time back is never

allowed => smooth adjustments.
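The two principles can be sketched in a few lines. This is a simplified illustration, not NTP itself: `estimate_server_time` assumes the request and the reply take equally long (half the round trip each), and `averaging_adjustments` only computes the offsets a coordinating server would send out. Both function names and the sample readings are hypothetical.

```python
def estimate_server_time(t_sent: float, t_received: float, server_time: float) -> float:
    """Principle I: estimate the server's current time from one request/reply.

    t_sent and t_received are read from the local clock; server_time is the
    value carried in the reply. Assumes symmetric network delay.
    """
    round_trip = t_received - t_sent
    return server_time + round_trip / 2


def averaging_adjustments(clocks: dict[str, float]) -> dict[str, float]:
    """Principle II: offset each machine should apply to reach the average."""
    if not clocks:
        raise ValueError("need at least one clock reading")
    average = sum(clocks.values()) / len(clocks)
    return {name: average - value for name, value in clocks.items()}


# Assumed readings in minutes past midnight (3:00, 3:25, 2:50).
readings = {"server": 180.0, "A": 205.0, "B": 170.0}
print(averaging_adjustments(readings))
```

A negative offset (machine A here) would in practice be applied by slowing the clock down for a while rather than by stepping it back, in line with the remark on smooth adjustments.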

19. Lamport's Logical Clock


We first need to introduce a notion of ordering before we can order anything.

The happened before → relation

a → b , if a and b are events in the same process and a occurred before b

a → b , if a is the event of sending a message m in a process
and b is the event of receipt of the same message m by another process

if a → b and b → c, then a → c ( transitive )

event a causally affects b if a → b

concurrent: a || b if !( a → b ) and !( b → a )
for any two events in a system, either a → b or b → a or a || b

e11 → e12   , e12 → e22

e21 → e13   , e14 || e24


To realize the relation → we need a clock Ci at each

process Pi in the system, and adjust the clock according

to the following rules.

Ci(a) -- timestamp of event a at Pi

if a → b, then C(a) < C(b)

Condition requirements:

1. for any two events a and b in a process Pi,

if a occurs before b, then

Ci(a) < Ci(b)

2. if a is the event of sending a message m in Pi

and b is the event of receiving the same message m

at process Pj, then

Ci(a) < Cj(b)

Implementation rules:

1. two successive events in Pi

Ci = Ci + d ( d > 0 )

if a and b are two successive events in Pi and

a → b then

Ci(b) = Ci(a) + d ( d > 0 )

2. event a: sending of message m by process Pi,

timestamp of message m : tm = Ci(a)

on receipt of m, process Pj sets

Cj = max ( Cj, tm + d )    d > 0
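The two implementation rules can be sketched as a small class. This is a minimal illustration under stated assumptions: the class name `LamportClock` and its methods are hypothetical, d defaults to 1, and the receive step applies rule 1 (the receipt is itself an event) before rule 2 so that both conditions hold even when the receiver's clock is already ahead of the message.

```python
class LamportClock:
    """Logical clock of one process Pi (illustrative sketch)."""

    def __init__(self, d: int = 1, start: int = 0):
        if d <= 0:
            raise ValueError("d must be positive")
        self.d = d
        self.time = start

    def tick(self) -> int:
        """Rule 1: advance the clock by d for the next local event."""
        self.time += self.d
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned value is the timestamp tm."""
        return self.tick()

    def receive(self, tm: int) -> int:
        """Rule 1 for the receive event, then rule 2: Cj = max(Cj, tm + d)."""
        self.time = max(self.time + self.d, tm + self.d)
        return self.time


p1, p2 = LamportClock(), LamportClock()
a = p1.tick()        # local event at P1
tm = p1.send()       # P1 sends m with timestamp tm
b = p2.receive(tm)   # P2 receives m
print(a, tm, b)      # timestamps increase along the causal chain
```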

→ is irreflexive and defines a partial order among events

Totally ordering relation ( => ) can be defined by ( on top of the above )

a is any event in process Pi

b is any event in process Pj

a => b iff
either Ci(a) < Cj(b)
or Ci(a) = Cj(b) and Pi ≺ Pj ( e.g. Pi ≺ Pj if i ≤ j, to break ties )
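One way to realize this total order is to sort events by the pair (timestamp, process index). The sketch below assumes events are given as (name, timestamp, process index) tuples; the tuple layout and the sample events are made up for illustration.

```python
def total_order_key(timestamp: int, process_index: int) -> tuple[int, int]:
    """Compare by Lamport timestamp first, then by process index to break ties."""
    return (timestamp, process_index)


# Hypothetical events: (name, Lamport timestamp, process index).
events = [("b", 2, 2), ("a", 2, 1), ("c", 1, 3)]
ordered = sorted(events, key=lambda e: total_order_key(e[1], e[2]))
print([name for name, _, _ in ordered])
```

Events a and b carry the same timestamp, so the process index decides that a is placed before b.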

Limitation of Lamport's Clocks

if a → b then C(a) < C(b)

but C(a) < C(b) does not necessarily imply a → b

Positioning of Lamport's logical clocks in distributed systems:

Example: Totally Ordered Multicasting

See Figure of inconsistent database update below.

28. Vector Clocks

n = number of processes in a distributed system

Each event in process Pi ~ vector clock Ci ( integer vector of length n )

Ci = ( Ci[1], Ci[2], ..., Ci[n] )


Ci[i] ~ Pi's own logical clock

Ci[j] ~ Pi's best guess of logical time at Pj. More precisely, the time of occurrence of the last event at Pj which "happened before" the
current point in time at Pi
Ci(a) is referred to as the timestamp of event a at Pi

Comparing two vector timestamps of events a and b

Equal    ta = tb   iff   all i, ta[i] = tb[i]
Not Equal    ta ≠ tb   iff   some i,    ta[i] ≠ tb[i]
Less Than or Equal    ta ≤ tb   iff   all i,    ta[i] ≤ tb[i]
Not Less Than or Equal To    ta ≰ tb   iff   some i,    ta[i] > tb[i]
Less Than    ta < tb   iff     ( ta ≤ tb and ta ≠ tb )
Not Less Than    ta ≮ tb   iff     !( ta ≤ tb and ta ≠ tb )
Concurrent    ta || tb   iff      ta ≮ tb and tb ≮ ta
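The comparison rules translate almost directly into code. In this sketch vector timestamps are plain tuples of integers; the function names are chosen for illustration only.

```python
from typing import Sequence


def leq(ta: Sequence[int], tb: Sequence[int]) -> bool:
    """ta ≤ tb: every component of ta is at most the matching component of tb."""
    if len(ta) != len(tb):
        raise ValueError("timestamps must have the same length")
    return all(x <= y for x, y in zip(ta, tb))


def less(ta: Sequence[int], tb: Sequence[int]) -> bool:
    """ta < tb: ta ≤ tb and the two timestamps differ."""
    return leq(ta, tb) and tuple(ta) != tuple(tb)


def concurrent(ta: Sequence[int], tb: Sequence[int]) -> bool:
    """ta || tb: neither timestamp is less than the other."""
    return not less(ta, tb) and not less(tb, ta)


print(less((1, 0, 0), (2, 1, 0)))        # True
print(concurrent((2, 0, 0), (0, 1, 0)))  # True
```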

Implementation Rules:

1. two successive events a, b in process Pi:

Ci(b)[i] = Ci(a)[i] + d    ( d > 0 )

2. event a at Pi sending message m to process Pj
with receiving event b; vector timestamp tm = Ci(a) is assigned to m;
on receiving m, Pj updates Cj as follows:

all k, Cj[k] = max( Cj[k], tm[k] )


At any instant

all i, all j : Ci[i] ≥ Cj[i]

Events are causally related if ta < tb or
tb < ta

Now, a → b   iff   ta < tb
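A small simulation can make the claim a → b iff ta < tb concrete. The sketch below is one possible reading of the two rules: processes are indexed from 0 (an assumption for convenience), d defaults to 1, and a receive first counts as an event of the receiver (rule 1) and then merges the message timestamp (rule 2). The class and helper names are hypothetical.

```python
class VectorClock:
    """Vector clock of process i in a system of n processes (sketch)."""

    def __init__(self, n: int, i: int, d: int = 1):
        self.i = i
        self.d = d
        self.c = [0] * n

    def event(self) -> tuple:
        """Rule 1: advance this process's own component."""
        self.c[self.i] += self.d
        return tuple(self.c)

    def send(self) -> tuple:
        """Sending is an event; the returned vector is the timestamp tm."""
        return self.event()

    def receive(self, tm: tuple) -> tuple:
        """Rule 1 for the receive event, then rule 2: component-wise max."""
        self.c[self.i] += self.d
        self.c = [max(own, other) for own, other in zip(self.c, tm)]
        return tuple(self.c)


def less(ta: tuple, tb: tuple) -> bool:
    """ta < tb as defined for vector timestamps."""
    return all(x <= y for x, y in zip(ta, tb)) and ta != tb


p0, p1, p2 = (VectorClock(3, i) for i in range(3))
ta = p0.send()       # event a: P0 sends m
tb = p1.receive(ta)  # event b: P1 receives m, so a → b
tc = p2.event()      # event c: unrelated local event at P2
print(ta, tb, tc)
print(less(ta, tb), less(ta, tc), less(tc, ta))
```

Here ta < tb reflects a → b, while ta and tc are incomparable, which marks a and c as concurrent.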

32. Global State

no global clock, no global memory

To determine a global system state, a process p must enlist the
cooperation of other processes, which must record their own states
and send the recorded local states to p

processes cannot record their local states at precisely the same
instant unless they have access to a common clock

the global-state-detection algorithm is to be superimposed
on the underlying computation; it must run concurrently with,
but not alter, the underlying computation


Distributed system

finite set of processes

finite set of channels

process state, channel state

Example: Updating a replicated database and

leaving it in an inconsistent state.
Update 1 : Add $100 to $1000
Update 2 : Calculate interest
At San Francisco ( Update 1 first ): Add $100 to $1000, then calculate interest.
At New York ( Update 2 first ): Calculate interest on $1000, then add $100.
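The divergence is easy to reproduce numerically. The notes do not give an interest rate, so the 1% used below is purely an assumption; any non-zero rate leads to the same kind of mismatch.

```python
from decimal import Decimal

RATE = Decimal("0.01")  # assumed interest rate; not stated in the notes


def add_100(balance: Decimal) -> Decimal:
    """Update 1: deposit $100."""
    return balance + Decimal("100")


def add_interest(balance: Decimal) -> Decimal:
    """Update 2: credit interest at the assumed rate."""
    return balance * (1 + RATE)


start = Decimal("1000")
san_francisco = add_interest(add_100(start))  # Update 1, then Update 2
new_york = add_100(add_interest(start))       # Update 2, then Update 1
print(san_francisco, new_york)
```

Under the assumed rate the two replicas end at $1111 and $1110, so the copies disagree although both applied the same two updates.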

42. Some definitions

LSi -- local state of site Si ( the collection of events that have occurred at Si )

events -- send( mij ), recv( mij )

time ( x ) -- time at which state x was recorded

e.g. time ( LSi )

send ( mij ) ∈ LSi iff time ( send ( mij ) ) < time ( LSi )

recv ( mij ) ∈ LSj iff time ( recv ( mij ) ) < time ( LSj )

transit ( LSi, LSj ) =

{ mij | send( mij ) ∈ LSi ∧ recv( mij ) ∉ LSj }

i.e. messages in the channel

inconsistent ( LSi, LSj ) =

{ mij | send( mij ) ∉ LSi ∧ recv( mij ) ∈ LSj }

Global State GS = { LS1, LS2, ..., LSn }

i.e. collection of local states ( may be consistent or inconsistent )

Consistent Global State: A global state GS = { LS1, LS2, ..., LSn }
is consistent iff

all i, all j: 1 ≤ i, j ≤ n :: inconsistent( LSi, LSj ) = ∅

Transitless global state: A global state is transitless iff

all i, all j: 1 ≤ i, j ≤ n :: transit( LSi, LSj ) = ∅

Strongly consistent global state: A global state is strongly

consistent if it is consistent and transitless.
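These definitions can be turned into set operations. The sketch below assumes a particular encoding that is not part of the notes: a message is a tuple (sender, receiver, sequence number), and a local state is reduced to the sets of send and receive events it has recorded.

```python
from dataclasses import dataclass, field


@dataclass
class LocalState:
    sent: set = field(default_factory=set)      # messages whose send is in LS
    received: set = field(default_factory=set)  # messages whose receipt is in LS


def transit(gs: dict, i: int, j: int) -> set:
    """Messages from i to j recorded as sent in LSi but not received in LSj."""
    return {m for m in gs[i].sent if m[1] == j and m not in gs[j].received}


def inconsistent(gs: dict, i: int, j: int) -> set:
    """Messages from i to j recorded as received in LSj but not sent in LSi."""
    return {m for m in gs[j].received if m[0] == i and m not in gs[i].sent}


def is_consistent(gs: dict) -> bool:
    return all(not inconsistent(gs, i, j) for i in gs for j in gs if i != j)


def is_transitless(gs: dict) -> bool:
    return all(not transit(gs, i, j) for i in gs for j in gs if i != j)


def is_strongly_consistent(gs: dict) -> bool:
    return is_consistent(gs) and is_transitless(gs)


m = (1, 2, 0)  # message number 0 from site 1 to site 2
in_flight = {1: LocalState(sent={m}), 2: LocalState()}
orphan = {1: LocalState(), 2: LocalState(received={m})}
print(is_consistent(in_flight), is_transitless(in_flight))  # True False
print(is_consistent(orphan))                                # False
```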

54. Causal Ordering of Messages

if Send( M1 ) → Send( M2 )

then the recipient should receive M1 before M2

i.e. Send( M1 ) → Send( M2 )

requires Receive( M1 ) → Receive( M2 )

Figure: Violation of causal ordering of messages

Applications: database replication management, monitoring distributed computations, simplifying distributed algorithms,...

Solution idea: upon arrival of a message at a process, buffer (delay delivery) the message until the message immediately preceding it
is delivered

Birman-Schiper-Stephenson Protocol: Enforcing Causal Ordering of Messages

Assumes broadcast communication channels that do not lose or corrupt messages. ( i.e. everyone talks to everyone ). Use vector clocks to
"count" number of messages ( i.e. set d = 1 ). n processes.
Vector Time:

1. When Pi begins to execute, Ci is initialized to zeros.

2. For each event send( m ) at Pi, Ci[i] is incremented by 1.
3. Time stamp tm = Ci is sent along with m.
4. When process Pj delivers a message m from Pi,
Pj updates its vector clock:

all k ∈ {1, 2, ..n} : Cj[k] = max ( Cj[k], tm[k] )

( Note: Recv ( m ) -> Deliver ( m ) )

The Protocol:

1. Process Pi updates vector time Ci and

broadcasts message m with timestamp tm = Ci.

So Ci[i] - 1 is the number of messages sent before m.

(Note: A process increments its own entry of the vector clock
only when it sends a message.
It does not increment its own entry when receiving a message;
it merges the timestamp into its vector clock when it delivers the message. )
2. Upon receiving message m with timestamp tm, process Pj ( j ≠ i )
buffers the message until both of the following hold:

Pj has received all messages sent by Pi preceding m,

i.e. Cj[i] = tm[i] - 1

Pj has received all messages that Pi
had received before sending m,

i.e. Cj[k] ≥ tm[k]   for all k = 1, 2, .. n, k ≠ i

3. When the message is finally delivered at Pj, vector

time Cj is adjusted according to vector clock rule 2.
Do not use rule 1 here.
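The buffering and delivery conditions of the protocol can be sketched as follows. This is an illustrative model only: processes are indexed from 0, a message is a tuple (sender, timestamp, payload), and the names `can_deliver` and `BSSProcess` are hypothetical.

```python
def can_deliver(cj: list, tm: tuple, i: int) -> bool:
    """Delivery test at Pj for a message from Pi with timestamp tm."""
    next_from_sender = cj[i] == tm[i] - 1
    seen_what_sender_saw = all(cj[k] >= tm[k] for k in range(len(tm)) if k != i)
    return next_from_sender and seen_what_sender_saw


class BSSProcess:
    def __init__(self, n: int, j: int):
        self.j = j
        self.c = [0] * n     # vector clock, counts messages per sender
        self.buffer = []     # received but not yet delivered
        self.delivered = []  # payloads in delivery order

    def broadcast(self, payload: str) -> tuple:
        self.c[self.j] += 1  # only sending increments the own entry
        return (self.j, tuple(self.c), payload)

    def receive(self, msg: tuple) -> None:
        self.buffer.append(msg)
        progress = True
        while progress:      # deliver everything that has become deliverable
            progress = False
            for m in list(self.buffer):
                sender, tm, payload = m
                if can_deliver(self.c, tm, sender):
                    self.buffer.remove(m)
                    self.c = [max(a, b) for a, b in zip(self.c, tm)]
                    self.delivered.append(payload)
                    progress = True


p0, p1, p2 = (BSSProcess(3, j) for j in range(3))
m1 = p0.broadcast("m1")
p1.receive(m1)
m2 = p1.broadcast("m2")  # Send(m1) → Send(m2)
p2.receive(m2)           # arrives first, must wait for m1
p2.receive(m1)
print(p2.delivered)      # ['m1', 'm2']
```

At P2 the message m2 stays buffered until m1 has been delivered, which restores the causal order.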

Schiper-Eggli-Sandoz were able to solve the problem without broadcasting channels

57. Global-State-Detection Algorithm

Send a special message called marker

Chandy-Lamport Global State Recording Protocol ( Snapshot Algorithm )

The goal of this distributed algorithm is to capture a consistent global state. It assumes all communication channels are FIFO. It uses a
distinguished message called a marker to start the algorithm.

Pi sends marker
1. Pi records its local state
2. For each channel Cij on which Pi has not already sent a marker, Pi sends a marker before sending other messages.

Pj receives marker from Pi

1. If Pj has not recorded its state:
a) Records the state of Cij as empty
b) Sends the marker as described above ( Note: it records local state before sending out marker )

2. If Pj has already recorded its local state LSj:

a) Records the state of Cij as the sequence of messages received between the recording of LSj and the arrival of the marker on Cij
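The marker rules can be exercised in a small single-threaded simulation. Everything concrete below is an assumption made for illustration: the application is a money transfer between two processes, each channel is a FIFO queue held in a shared dictionary, and the class and method names are hypothetical.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, pid, balance, peers, network):
        self.pid = pid
        self.balance = balance      # application state
        self.peers = peers
        self.network = network      # (src, dst) -> deque, one FIFO queue per channel
        self.recorded_state = None  # LS once recorded
        self.channel_state = {}     # src -> messages recorded for the channel from src
        self.recording = set()      # incoming channels still being recorded

    def send(self, dst, amount):
        self.balance -= amount
        self.network[(self.pid, dst)].append(amount)

    def record_and_send_markers(self):
        """Marker-sending rule: record the local state, then send markers."""
        self.recorded_state = self.balance
        self.recording = set(self.peers)
        self.channel_state = {src: [] for src in self.peers}
        for dst in self.peers:
            self.network[(self.pid, dst)].append(MARKER)

    def receive(self, src):
        msg = self.network[(src, self.pid)].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self.record_and_send_markers()  # channel from src stays empty
            self.recording.discard(src)         # stop recording this channel
        else:
            self.balance += msg
            if src in self.recording:           # arrived after LS, before the marker
                self.channel_state[src].append(msg)


network = {(1, 2): deque(), (2, 1): deque()}
p1 = Process(1, 100, [2], network)
p2 = Process(2, 100, [1], network)
p2.send(1, 10)                # a transfer is in transit from P2 to P1
p1.record_and_send_markers()  # P1 initiates the snapshot
p1.receive(2)                 # the transfer arrives after LS1 was recorded
p2.receive(1)                 # marker: P2 records LS2, channel from P1 is empty
p1.receive(2)                 # marker from P2 closes the channel from P2
total = (p1.recorded_state + p2.recorded_state
         + sum(p1.channel_state[2]) + sum(p2.channel_state[1]))
print(p1.recorded_state, p2.recorded_state, p1.channel_state, total)
```

The recorded states (100 and 90) plus the recorded channel contents (10) add up to the 200 that exist in the system, which is what a consistent snapshot should show.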


In this example, all processes are connected by communications channels Cij.

Messages being sent over the channels are represented
by arrows between the processes.

Snapshot s1:

P1 records LS1, sends markers on C12 and C13
P2 receives marker from P1 on C12; it records its state LS2, records state of C12 as empty, and sends marker on C21 and C23
P3 receives marker from P1 on C13; it records its state LS3, records state of C13
as empty, and sends markers on C31 and C32.
P1 receives marker from P2 on C21; as LS1 is recorded, it records the state of C21 as empty.
P1 receives marker from P3 on C31; as LS1 is recorded, it records the state of C31 as empty.
P2 receives marker from P3 on C32; as LS2 is recorded, it records the state of C32 as empty.
P3 receives marker from P2 on C23; as LS3 is recorded, it records the state of C23 as empty.

Snapshot s2: now a message is in transit on C12 and C21.

P1 records LS1, sends markers on C12 and C13

P2 receives marker from P1 on C12 after the message from P1 arrives; it records its state LS2, records state of C12 as empty, and
sends marker on C21 and C23
P3 receives marker from P1 on C13; it records its state LS3, records state of C13 as empty, and sends markers on C31 and C32.
P1 receives marker from P2 on C21; as LS1 is recorded, and a message has arrived since LS1 was recorded, it records the state of
C21 as containing that message.
P1 receives marker from P3 on C31; as LS1 is recorded, it records the state of C31 as empty.
P2 receives marker from P3 on C32; as LS2 is recorded, it records the state of C32 as empty.
P3 receives marker from P2 on C23; as LS3 is recorded, it records the state of C23 as empty.
The recorded process states and channel states must be collected and assembled to form
the global state ( e.g. send the global state to all processes in finite time )


each process must ensure that

no marker remains forever in an incident input channel

it records its state within finite time of initiation of the algorithm

60. Cuts of a distributed Computation

Graphical representation of GS

C = { c1, c2, ... ,cn }

ci -- cut event, local state of

site ( or process ) Si at that instant

Consistent Cut:

for all Si, all Sj, there are no events ei, ej

such that

( ei → ej ) and
( ej → cj ) and
( ei ↛ ci )
    ( if such events exist, the cut is inconsistent )

i.e. every message received before a cut event

was sent before the cut event

at the sender site in the cut.

Inconsistent Cut    (C1 and C2 are not concurrent)

A cut C = { c1, c2, ... ,cn }
is a consistent cut iff no two cut events are causally related.
( i.e. every pair of cut events are
concurrent )

Time of a cut

C = { c1, c2, ... ,cn }

Ci -- vector clock of ci

TC = sup ( C1, C2, ... , Cn ), i.e.

TC[k] = max ( C1[k], C2[k], ... , Cn[k] )

if C = { c1, c2, ... ,cn } is
a cut with vector time TC, then the cut
is consistent iff

TC = ( C1[1], C2[2], ... , Cn[n] )   -------------- (1)


If C is a consistent cut, then all its events are concurrent.

Thus Ci[i] ≥ Cj[i] for all i, j and hence

TC = sup ( C1, C2, ... , Cn ) = ( C1[1], C2[2], ... , Cn[n] ).

On the other hand if (1) is true

we have Ci[i] ≥ Cj[i] for all i, j. This implies that the events ci are concurrent
and the cut is consistent.
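The consistency test on a cut reduces to comparing the component-wise maximum of the cut events' vector clocks with their diagonal. The sketch below assumes processes are indexed from 0 and that each cut event is given by its vector clock as a tuple; the function names and the sample clocks are made up for illustration.

```python
def cut_time(clocks: list) -> tuple:
    """TC = sup(C1, ..., Cn): component-wise maximum of the cut events' clocks."""
    n = len(clocks)
    return tuple(max(c[k] for c in clocks) for k in range(n))


def is_consistent_cut(clocks: list) -> bool:
    """Consistent iff TC equals (C1[1], C2[2], ..., Cn[n]), here 0-indexed."""
    diagonal = tuple(clocks[i][i] for i in range(len(clocks)))
    return cut_time(clocks) == diagonal


good_cut = [(2, 1, 0), (1, 2, 0), (0, 0, 1)]
bad_cut = [(1, 0, 0), (2, 1, 0), (0, 0, 1)]  # process 1 has seen event 2 of process 0
print(is_consistent_cut(good_cut), is_consistent_cut(bad_cut))  # True False
```

In the second cut, process 1 already knows about an event of process 0 that lies after process 0's own cut event, so some message crosses the cut backwards.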

64. Termination Detection

System Model

A process may be in either an active or an idle state.

An idle process becomes active upon receiving a computation message.
If all processes are idle and no messages are in transit => the computation has terminated.

Huang's Termination Detection Protocol:

The goal of this protocol is to detect when a distributed computation
terminates.

n    number of processes
Pi   process i; without loss of generality, let P0 be the controlling agent
Wi   weight of process Pi; initially, W0 = 1 and Wi = 0 for all other i
B(W) computation message with assigned weight W
C(W) control message sent from a process to the controlling agent with assigned weight W


An active process Pi sends a computation message to Pj:

1. Set Wi' and Wij to values such that Wi' + Wij = Wi,
Wi' > 0, Wij > 0. (Wi' is the new weight of Pi.)
2. Send B(Wij) to Pj

Pj receives a computation message B(Wij) from Pi:

1. Wj = Wj + Wij
2. If Pj is idle, Pj becomes active

Pi becomes idle by:

1. Send C(Wi) to P0 ( or to another Process )
2. Wi = 0
3. Pi becomes idle

Pi receives a control message C(W):

1. Wi = Wi + W
2. If Wi = 1, the computation has completed.

The picture shows a process P0, designated the controlling agent, with W0 = 1. It asks P1 and P2 to do some computation. It sets
W01 = 0.2
W02 = 0.3
leaving W0 = 0.5.
P2 in turn asks P3 and P4 to do some computations. It sets
W23 = 0.1
W24 = 0.1
leaving W2 = 0.1.

When P3 terminates, it sends C(W3) = C(0.1) to P2, which changes W2 to 0.1 + 0.1 = 0.2.

When P2 terminates, it sends C(W2) = C(0.2) to P0, which changes W0 to 0.5 + 0.2 = 0.7.

When P4 terminates, it sends C(W4) = C(0.1) to P0, which changes W0 to 0.7 + 0.1 = 0.8.

When P1 terminates, it sends C(W1) = C(0.2) to P0, which changes W0 to 0.8 + 0.2 = 1.

P0 thereupon concludes that the computation is finished.

Total number of messages passed: 8 (one to start each computation, one to return the weight).
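The weight bookkeeping of this example can be replayed in code. The sketch below makes two simplifying assumptions: messages are delivered instantly (so sending and receiving collapse into one step), and weights are exact fractions to avoid rounding problems with values such as 0.1. The class name `WeightedProcess` is hypothetical.

```python
from fractions import Fraction


class WeightedProcess:
    def __init__(self, name: str, weight: Fraction = Fraction(0)):
        self.name = name
        self.weight = weight
        self.active = weight > 0

    def send_computation(self, other: "WeightedProcess", w: Fraction) -> None:
        """Split off weight w (0 < w < Wi) and send B(w); the receiver becomes active."""
        if not self.active or not (0 < w < self.weight):
            raise ValueError("sender must be active and keep a positive weight")
        self.weight -= w
        other.weight += w
        other.active = True

    def become_idle(self, target: "WeightedProcess") -> None:
        """Send C(Wi) to target, set Wi = 0 and become idle."""
        target.weight += self.weight
        self.weight = Fraction(0)
        self.active = False


p0 = WeightedProcess("P0", Fraction(1))
p1, p2, p3, p4 = (WeightedProcess(f"P{k}") for k in range(1, 5))

p0.send_computation(p1, Fraction(2, 10))
p0.send_computation(p2, Fraction(3, 10))
p2.send_computation(p3, Fraction(1, 10))
p2.send_computation(p4, Fraction(1, 10))

p3.become_idle(p2)  # W2 becomes 0.2
p2.become_idle(p0)  # W0 becomes 0.7
p4.become_idle(p0)  # W0 becomes 0.8
p1.become_idle(p0)  # W0 becomes 1

print(p0.weight, p0.weight == 1)  # the controlling agent detects termination
```

Because weight is only ever moved and never created or destroyed, the total stays at 1, and P0 holding all of it again means no other process is active and no weight is in transit.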
