Lecture Notes: Distributed OS Theories

Syllabus
Dr. Tong Lai Yu, March 2010
0. Review and Overview
1. B-Trees
2. An Introduction to Distributed Systems
3. Deadlocks
4. Distributed Systems Architecture
5. Processes
6. Communication
7. Distributed OS Theories
8. Distributed Mutual Exclusions
9. Agreement Protocols
10. Distributed Scheduling
11. Distributed Resource Management
12. Recovery and Fault Tolerance
13. Security and Protection

In a few hundred years, when the history of our time
is written from a long-term perspective, it is likely
that the most important event those historians will see
is not technology, not the Internet, not e-commerce. It
is an unprecedented change in human condition. For the
first time, they will have to manage themselves.
Peter Drucker
Distributed OS Theories
1. Inherent Limitations of a Distributed System
Absence of Global clock

difficult to establish the temporal order of events
difficult to collect up-to-date information on the state of the entire system
Absence of Shared Memory

no individual process has an up-to-date state of the entire system, as there is no shared memory
coherent view -- all observations of different processes ( computers ) are made at the same physical time
we can obtain either a coherent but partial view of the system or a complete but incoherent view of the system
complete view ( global state ) = local views ( local states ) + messages in transit
difficult to obtain a coherent global state

4. Clock Synchronization
Physical Clocks
Problem
Sometimes we simply need the exact time, not just an ordering.
Solution
Coordinated Universal Time (UTC):
Based on the number of transitions per second of the cesium 133 atom (very accurate).
At present, the real time is taken as the average of some 50 cesium clocks around the world.
Introduces a leap second from time to time to compensate for days getting longer.
Note
UTC is broadcast through short wave radio and satellite. Satellites can give
an accuracy of about ±0.5 ms.
Problem
Suppose we have a distributed system with a UTC-receiver
somewhere in it => we still have to distribute its time to each machine.
Basic principle
Every machine has a timer that generates an interrupt H times per
second.
There is a clock in machine p that ticks on each timer interrupt.
Denote the value of that clock by Cp(t), where t is UTC time.

Ideally, we have that for each machine p, Cp(t) = t, or, in other words, dC/dt = 1.
In practice: 1 - r ≤ dC/dt ≤ 1 + r, where r is the maximum drift rate of the clock.
Goal
Never let two clocks in any system differ by more than δ time units. Two clocks drifting in opposite directions can diverge by up to 2r per second => synchronize at least every δ/(2r) seconds.
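The bound above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name and the sample figures (a drift bound r of 10 parts per million and a tolerance δ of 1 ms) are assumptions chosen for illustration, not values from the lecture.

```python
def max_resync_interval(delta: float, r: float) -> float:
    """Longest time (seconds) between synchronizations that keeps two clocks,
    each drifting at a rate of at most r, within delta seconds of each other."""
    if r <= 0:
        raise ValueError("drift rate r must be positive")
    # Worst case: one clock runs fast by r, the other slow by r,
    # so they move apart by 2r seconds per second.
    return delta / (2 * r)


# Assumed figures: delta = 1 ms, r = 1e-5 (10 ppm).
interval = max_resync_interval(1e-3, 1e-5)
print(f"resynchronize at least every {interval:.1f} s")
```

With these assumed figures the clocks have to be resynchronized roughly every 50 seconds.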
Global positioning system
Basic idea
You can get an accurate account of time as a side-effect of GPS.
Problem
Assuming that the clocks of the satellites are accurate and synchronized:
It takes a while before a signal reaches the receiver
The receiver's clock is definitely out of sync with the satellite
Principal operation
Δr : unknown deviation of the receiver's clock.
xr , yr , zr : unknown coordinates of the receiver.
Ti : timestamp on a message from satellite i.
Δi = ( Tnow - Ti ) + Δr : measured delay of the message sent by satellite i.
Measured distance to satellite i: c × Δi ( c is the speed of light )
Real distance is di = c × Δi - c × Δr = sqrt( (xi - xr)^2 + (yi - yr)^2 + (zi - zr)^2 ), where xi , yi , zi are the known coordinates of satellite i.
Observation
4 satellites => 4 equations in 4 unknowns ( with Δr as one of them )
Clock Synchronization Principle
Principle I
Every machine asks a time server for the accurate time at least once
every δ/(2r) seconds (Network Time Protocol).
Note
Okay, but you need an accurate measure of round trip delay, including
interrupt handling and processing incoming messages.
Principle II
Let the time server scan all machines periodically, calculate an
average, and inform each machine how it should adjust its time relative
to its present time.
Note
Okay, you'll probably get every machine in sync. You don't even need
to propagate UTC time.
Fundamental
You'll have to take into account that setting the time back is never
allowed => smooth adjustments.
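Both principles can be sketched in a few lines. The code below is an illustrative sketch, not a description of any real protocol implementation: `cristian_estimate` assumes the network delay is symmetric (half the round trip each way), `berkeley_adjustments` ignores message delays entirely, and all names and sample readings are made up for the example.

```python
def cristian_estimate(t_send: float, t_recv: float, server_time: float) -> float:
    """Principle I: estimate the current time from a server reply.
    Assumes the reply took half of the measured round trip."""
    round_trip = t_recv - t_send
    return server_time + round_trip / 2


def berkeley_adjustments(readings: dict) -> dict:
    """Principle II: the server averages all clock readings and tells each
    machine the offset to apply relative to its present time."""
    average = sum(readings.values()) / len(readings)
    return {name: average - value for name, value in readings.items()}


# Assumed readings, in minutes after midnight (3:00, 3:25 and 2:50).
adjustments = berkeley_adjustments({"server": 180.0, "a": 205.0, "b": 170.0})
print(adjustments)
```

A negative adjustment (machine "a" here) would not be applied by setting the clock back; the machine would instead run its clock slower until the offset is absorbed.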
19. Lamport's Logical Clock
Problem
We first need to introduce a notion of ordering before we can order anything.
The happened before → relation
a → b , if a and b are events in the same process and a occurred before b

a → b , if a is the event of sending a message m in a process
and b is the event of receipt of the same message m by another
process
if a → b and b → c, then a → c ( transitive )
event a causally affects b if a → b

concurrent: a || b if !( a → b ) and !( b → a )
for any two events in a system, either a → b or b → a or a || b
e11 → e12 , e12 → e22
e21 → e13 , e14 || e24
Realization
To realize the relation → we need a clock Ci at each process Pi in the system, and adjust the clock according to the following rules.
Ci(a) -- timestamp of event a at Pi
if a → b, then C(a) < C(b)
Condition requirements:
1. for any two events a and b in a process Pi,
if a occurs before b, then
Ci(a) < Ci(b)
2. if a is the event of sending a message m in Pi
and b is the event of receiving the same message m
at process Pj, then
Ci(a) < Cj(b)
Implementation rules:
1. Clock Ci is incremented between any two successive events in Pi:
Ci = Ci + d ( d > 0 )
i.e. if a and b are two successive events in Pi and a → b, then
Ci(b) = Ci(a) + d ( d > 0 )
2. If event a is the sending of message m by process Pi, then m is assigned the timestamp tm = Ci(a).
On receiving m, process Pj sets
Cj = max ( Cj, tm + d ) ( d > 0 )
→ is irreflexive and defines a partial order among events
A total ordering relation ( => ) can be defined on top of the above:
a is any event in process Pi
b is any event in process Pj
a => b iff
either Ci(a) < Cj(b)
or Ci(a) = Cj(b) and Pi ≺ Pj ( ≺ is any relation that totally orders the processes, e.g. Pi ≺ Pj if i < j, to break ties )
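The clock rules and the tie-breaking total order can be sketched as follows. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, d defaults to 1, and the receive event is first advanced as an ordinary event and then raised to at least tm + d.

```python
class LamportClock:
    """Logical clock Ci of one process Pi (illustrative names)."""

    def __init__(self, pid: int, d: int = 1):
        self.pid = pid
        self.d = d
        self.time = 0

    def tick(self) -> int:
        """Rule 1: advance the clock by d for the next local event."""
        self.time += self.d
        return self.time

    def send(self) -> int:
        """Sending is an event; its timestamp tm travels with the message."""
        return self.tick()

    def receive(self, tm: int) -> int:
        """Rule 2: the receive event gets a time later than both the
        previous local event and the message timestamp tm."""
        self.time = max(self.time + self.d, tm + self.d)
        return self.time


def totally_before(a, b) -> bool:
    """Total order on (timestamp, pid) pairs: compare timestamps first,
    then break ties with the process id."""
    return a < b


p1, p2 = LamportClock(1), LamportClock(2)
p1.tick()               # local event at P1, time 1
tm = p1.send()          # send event at P1, time 2
p2.tick()               # unrelated local event at P2, time 1
tb = p2.receive(tm)     # receive event at P2, time 3
print(tm, tb, totally_before((1, 1), (1, 2)))
```

In this run the first event at P2 has timestamp 1 and the send at P1 has timestamp 2, yet the two events are concurrent: a smaller timestamp alone does not establish causality.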
Limitation of Lamport's Clocks
if a → b then C(a) < C(b)

but C(a) < C(b) does not necessarily imply a → b
Positioning of Lamport's logical clocks in distributed systems:
Example: Totally Ordered Multicasting
See Figure of inconsistent database update below.
28. Vector Clocks
n = number of processes in a distributed system
Each event in process Pi ~ vector clock Ci ( integer vector of length n )
Ci = ( Ci[1], Ci[2], ..., Ci[n] )
Ci[i] ~ Pi's own logical clock
Ci[j] ~ Pi's best guess of logical time at Pj. More precisely, the time of occurrence of the last event at Pj which "happened before" the current point in time at Pi
Ci(a) is referred to as the timestamp of event a at Pi
Comparing two vector timestamps ta and tb of events a and b
Equal: ta = tb iff all i, ta[i] = tb[i]
Not Equal: ta ≠ tb iff some i, ta[i] ≠ tb[i]
Less Than or Equal: ta ≤ tb iff all i, ta[i] ≤ tb[i]
Not Less Than or Equal To: ta ≰ tb iff some i, ta[i] > tb[i]
Less Than: ta < tb iff ( ta ≤ tb and ta ≠ tb )
Not Less Than: ta ≮ tb iff !( ta ≤ tb and ta ≠ tb )
Concurrent: ta || tb iff ta ≮ tb and tb ≮ ta
Implementation Rules:
1. two successive events a, b in process Pi:
Ci(b)[i] = Ci(a)[i] + d ( d > 0 )
2. event a at Pi sending message m to process Pj with receiving event b; vector timestamp tm = Ci(a) is assigned to m; on receiving m, Pj updates Cj as follows:
all k, Cj[k] = max( Cj[k], tm[k] )
Assertion.
At any instant
all i, all j : Ci[i] ≥ Cj[i]
Events a and b are causally related if ta < tb or tb < ta
In particular, a → b iff ta < tb
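The comparison rules and the two implementation rules translate almost directly into code. The sketch below is illustrative only: names are hypothetical, processes are numbered from 0 rather than 1, d defaults to 1, and the receive event is counted as a local event before the component-wise maximum is taken.

```python
def vc_leq(ta, tb) -> bool:
    """ta <= tb iff every component of ta is <= the matching one of tb."""
    return all(x <= y for x, y in zip(ta, tb))


def vc_less(ta, tb) -> bool:
    """ta < tb iff ta <= tb and ta != tb."""
    return vc_leq(ta, tb) and list(ta) != list(tb)


def vc_concurrent(ta, tb) -> bool:
    """ta || tb iff neither timestamp is less than the other."""
    return not vc_less(ta, tb) and not vc_less(tb, ta)


class VectorClock:
    def __init__(self, i: int, n: int, d: int = 1):
        self.i, self.d = i, d
        self.c = [0] * n

    def tick(self):
        """Rule 1: advance this process's own component."""
        self.c[self.i] += self.d
        return list(self.c)

    def send(self):
        """The returned copy is the timestamp tm attached to the message."""
        return self.tick()

    def receive(self, tm):
        """Rule 2: count the receive event, then take the component-wise max."""
        self.tick()
        self.c = [max(own, other) for own, other in zip(self.c, tm)]
        return list(self.c)


p0, p1, p2 = (VectorClock(i, 3) for i in range(3))
ta = p0.send()         # [1, 0, 0]
tb = p1.receive(ta)    # [1, 1, 0]
tc = p2.tick()         # [0, 0, 1]
print(vc_less(ta, tb), vc_concurrent(tb, tc))
```

Here the send at P0 happened before the receive at P1, and the comparison ta < tb reports exactly that, while the unrelated event at P2 is reported as concurrent.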
32. Global State
no global clock, no global memory

To determine a global system state, a process p must enlist the cooperation of other processes that must record their states and send the recorded local states to p
processes cannot record their local states at precisely the same instant unless they have access to a common clock
the global-state-detection algorithm is to be superimposed on the underlying computation; it must run concurrently with but not alter the underlying computation
Distributed system
finite set of processes

finite set of channels
process state, channel state
Example: Updating a replicated database and leaving it in an inconsistent state.
Update 1 : Add $100 to $1000
Update 2 : Calculate interest
At San Francisco ( Update 1 first ): Add $100 to $1000, then calculate interest.
At New York ( Update 2 first ): Calculate interest on $1000, then add $100.
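A short calculation shows how the two replicas diverge. The interest rate is not given in the notes; the 1% used here is an assumption for illustration, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

RATE = Fraction(1, 100)   # assumed interest rate of 1%; not stated in the notes


def add_deposit(balance):
    """Update 1: add $100."""
    return balance + 100


def add_interest(balance):
    """Update 2: add interest at the assumed rate."""
    return balance * (1 + RATE)


san_francisco = add_interest(add_deposit(Fraction(1000)))   # Update 1, then Update 2
new_york = add_deposit(add_interest(Fraction(1000)))        # Update 2, then Update 1
print(san_francisco, new_york)
```

Under the assumed rate the San Francisco replica ends at $1111 and the New York replica at $1110, so applying the same updates in different orders leaves the copies inconsistent.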
42. Some definitions
LSi -- local state of Si ( site ), i.e. the collection of events that have occurred at Si
events -- send( mij ), recv( mij )
time ( x ) -- time at which state x was recorded, e.g. time ( LSi )
send ( mij ) ∈ LSi iff time ( send ( mij ) ) < time ( LSi )
recv ( mij ) ∈ LSj iff time ( recv ( mij ) ) < time ( LSj )
transit ( LSi, LSj ) = { mij | send( mij ) ∈ LSi ∧ recv( mij ) ∉ LSj }
i.e. messages in the channel
inconsistent ( LSi, LSj ) = { mij | send( mij ) ∉ LSi ∧ recv( mij ) ∈ LSj }
i.e. messages whose receipt is recorded but whose sending is not
Global State GS = { LS1, LS2, ..., LSn }
i.e. collection of local states ( may be consistent or inconsistent )
Consistent Global State: A global state GS = { LS1, LS2, ..., LSn } is consistent iff
all i, all j: 1 ≤ i, j ≤ n :: inconsistent( LSi, LSj ) = ∅
Transitless global state: A global state is transitless iff
all i, all j: 1 ≤ i, j ≤ n :: transit( LSi, LSj ) = ∅
Strongly consistent global state: A global state is strongly consistent iff it is consistent and transitless.
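These definitions can be tried out on small recorded states. The sketch below is an illustration only: the `LocalState` class, the message names, and the dictionary layout (site id to local state, channel (i, j) to the set of messages sent on it) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LocalState:
    sent: set = field(default_factory=set)       # messages whose send event is in LSi
    received: set = field(default_factory=set)   # messages whose recv event is in LSi


def transit(ls_i, ls_j, m_ij):
    """Messages from Si to Sj recorded as sent but not as received."""
    return {m for m in m_ij if m in ls_i.sent and m not in ls_j.received}


def inconsistent(ls_i, ls_j, m_ij):
    """Messages from Si to Sj recorded as received but not as sent."""
    return {m for m in m_ij if m not in ls_i.sent and m in ls_j.received}


def is_consistent(gs, messages):
    return all(not inconsistent(gs[i], gs[j], m) for (i, j), m in messages.items())


def is_transitless(gs, messages):
    return all(not transit(gs[i], gs[j], m) for (i, j), m in messages.items())


messages = {(1, 2): {"m12"}}    # one message sent from S1 to S2

gs_a = {1: LocalState(sent={"m12"}), 2: LocalState()}        # m12 still in the channel
gs_b = {1: LocalState(), 2: LocalState(received={"m12"})}    # received but never sent
gs_c = {1: LocalState(sent={"m12"}), 2: LocalState(received={"m12"})}

for name, gs in (("a", gs_a), ("b", gs_b), ("c", gs_c)):
    print(name, is_consistent(gs, messages), is_transitless(gs, messages))
```

State a is consistent but not transitless, state b is inconsistent, and state c is both consistent and transitless, i.e. strongly consistent.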
54. Causal Ordering of Messages
if Send( M1 ) → Send( M2 )
then the recipient should receive M1 before M2
i.e. Send( M1 ) → Send( M2 ) requires Receive( M1 ) → Receive( M2 )
Figure: Violation of causal ordering of messages
Applications: database replication management, monitoring distributed computations, simplifying distributed algorithms,...
Solution idea: upon arrival of a message at a process, buffer ( delay delivery of ) the message until the message immediately preceding it is delivered
Birman-Schiper-Stephenson Protocol: Enforcing Causal Ordering of Messages
Assumes broadcast communication channels that do not lose or corrupt messages ( i.e. everyone talks to everyone ). Uses vector clocks to "count" the number of messages ( i.e. set d = 1 ). n processes.
Vector Time:
1. When Pi begins to execute, Ci is initialized to zeros.
2. For each event send( m ) at Pi, Ci[i] is incremented by 1.
3. Time stamp tm = Ci is sent along with m.
4. When process Pj delivers a message m from Pi, Pj updates its vector clock:
all k ∈ {1, 2, ..., n} : Cj[k] = max ( Cj[k], tm[k] )
( Note: Recv ( m ) → Deliver ( m ), i.e. a message is received before it is delivered )
The Protocol:
1. Process Pi updates vector time Ci and broadcasts message m with timestamp tm = Ci.
So Ci[i] - 1 is the number of messages Pi sent before m.
( Note: A process increments its own entry of the vector clock only when it sends a message. It does not update its own entry when receiving a message; it adjusts the vector clock when it delivers the message. )
2. Process Pj ( j ≠ i ), upon receiving message m with timestamp tm, buffers the message until
all messages sent by Pi preceding m have arrived, i.e. Cj[i] = tm[i] - 1
and
Pj has received all messages that Pi had received before sending m, i.e. Cj[k] ≥ tm[k] for k = 1, 2, ..., n, k ≠ i
3. When the message is finally delivered at Pj, vector time Cj is adjusted according to vector clock rule 2 ( the component-wise max ). Do not use rule 1 ( the local increment ) here.
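The buffering condition of step 2 can be sketched as below. This is a simplified single-threaded illustration, not a networked implementation: the class name, the message tuple layout, and the 0-based process numbering are assumptions, and "receiving" is simulated by calling a method directly.

```python
class BSSProcess:
    def __init__(self, pid: int, n: int):
        self.pid = pid
        self.clock = [0] * n      # vector time Cj, counts delivered messages per sender
        self.buffer = []          # received but not yet deliverable messages
        self.delivered = []       # payloads in delivery order

    def broadcast(self, payload):
        """Step 1: increment own entry and stamp the message with a copy of Ci."""
        self.clock[self.pid] += 1
        return (self.pid, list(self.clock), payload)

    def _deliverable(self, sender, tm) -> bool:
        """Step 2: the next message from the sender, and nothing the sender
        had seen is still missing here."""
        return self.clock[sender] == tm[sender] - 1 and all(
            self.clock[k] >= tm[k] for k in range(len(tm)) if k != sender
        )

    def receive(self, msg):
        """Buffer the message, then deliver everything that has become deliverable."""
        self.buffer.append(msg)
        progress = True
        while progress:
            progress = False
            for m in list(self.buffer):
                sender, tm, payload = m
                if self._deliverable(sender, tm):
                    self.buffer.remove(m)
                    # Step 3: component-wise max only, no local increment.
                    self.clock = [max(c, t) for c, t in zip(self.clock, tm)]
                    self.delivered.append(payload)
                    progress = True


p0, p1, p2 = (BSSProcess(i, 3) for i in range(3))
m1 = p0.broadcast("m1")      # timestamp [1, 0, 0]
p1.receive(m1)               # delivered at once
m2 = p1.broadcast("m2")      # timestamp [1, 1, 0], causally after m1
p2.receive(m2)               # arrives first at P2: buffered, m1 is missing
p2.receive(m1)               # m1 delivered, which then releases m2
print(p2.delivered, p2.clock)
```

Although m2 reaches P2 before m1, P2 delivers them in causal order because m2 stays buffered until P2's entry for P0 catches up.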
The Schiper-Eggli-Sandoz protocol solves the problem without requiring broadcast channels.
57. Global-State-Detection Algorithm
Send a special message called marker
Chandy-Lamport Global State Recording Protocol ( Snapshot Algorithm )
The goal of this distributed algorithm is to capture a consistent global state. It assumes all communication channels are FIFO. It uses a distinguished message called a marker to start the algorithm.
Pi sends marker
1. Pi records its local state
2. For each channel Cij on which Pi has not already sent a marker, Pi sends a marker before sending other messages.
Pj receives marker from Pi
1. If Pj has not recorded its state:
a) Records the state of Cij as empty
b) Sends the marker as described above ( Note: it records its local state before sending out the marker )
2. If Pj has already recorded its local state LSj:
a) Records the state of Cij to be the sequence of messages received between the recording of LSj and the arrival of the marker on Cij.
Example
In this example, all processes are connected by communication channels Cij. Messages being sent over the channels are represented by arrows between the processes.
Snapshot s1:
P1 records LS1, sends markers on C12 and C13
P2 receives marker from P1 on C12; it records its state LS2, records state of C12 as empty, and sends markers on C21 and C23
P3 receives marker from P1 on C13; it records its state LS3, records state of C13 as empty, and sends markers on C31 and C32.
P1 receives marker from P2 on C21; as LS1 is recorded, it records the state of C21 as empty.
Snapshot s2: now a message is in transit on C12 and on C21.
P1 records LS1, sends markers on C12 and C13
P2 receives marker from P1 on C12 after the message from P1 arrives; it records its state LS2, records state of C12 as empty, and sends markers on C21 and C23
P3 receives marker from P1 on C13; it records its state LS3, records state of C13 as empty, and sends markers on C31 and C32.
P1 receives marker from P2 on C21; as LS1 is recorded, and a message has arrived since LS1 was recorded, it records the state of C21 as containing that message.
The recorded process states and channel states must be collected and assembled to form the global state ( e.g. send the global state to all processes in finite time ).
Termination
each process must ensure that
no marker remains forever in an incident input channel

it records its state within finite time of initiation of the algorithm
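The marker rules can be exercised in a small single-threaded simulation. This is a sketch under stated assumptions: two processes exchanging money amounts, FIFO channels modelled as `deque` objects in a shared dictionary, and hypothetical class and method names; a real system would run the processes concurrently.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, pid, balance, peers):
        self.pid = pid
        self.balance = balance        # the local state in this toy example
        self.peers = peers            # ids of the other processes
        self.recorded_state = None    # LSi once recorded
        self.channel_state = {}       # sender id -> recorded messages of that input channel
        self.recording = set()        # input channels still being recorded

    def record_and_send_markers(self, net):
        """Record the local state, then send a marker on every outgoing channel."""
        self.recorded_state = self.balance
        self.recording = set(self.peers)
        for q in self.peers:
            self.channel_state[q] = []
            net[(self.pid, q)].append(MARKER)

    def send(self, q, amount, net):
        self.balance -= amount
        net[(self.pid, q)].append(amount)

    def receive(self, sender, net):
        msg = net[(sender, self.pid)].popleft()     # FIFO channel
        if msg == MARKER:
            if self.recorded_state is None:
                # First marker: record state; this channel is recorded as empty.
                self.record_and_send_markers(net)
            self.recording.discard(sender)          # channel state is now final
        else:
            self.balance += msg
            if sender in self.recording:
                # Arrived after LSi was recorded but before the marker.
                self.channel_state[sender].append(msg)


net = {(1, 2): deque(), (2, 1): deque()}
p1, p2 = Process(1, 100, [2]), Process(2, 50, [1])

p2.send(1, 10, net)                 # 10 is in transit on C21
p1.record_and_send_markers(net)     # P1 initiates: LS1 = 100
p1.receive(2, net)                  # the 10 arrives after LS1 was recorded
p2.receive(1, net)                  # marker: P2 records LS2 = 40, C12 empty
p1.receive(2, net)                  # marker from P2 closes C21

print(p1.recorded_state, p2.recorded_state, p1.channel_state, p2.channel_state)
```

The recorded states (100 and 40) plus the recorded channel contents (10 on C21) add up to the 150 units that were in the system, which is what a consistent snapshot should preserve.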
60. Cuts of a distributed Computation
Graphical representation of GS
C = { c1, c2, ... ,cn }
ci -- cut event, local state of site ( or process ) Si at that instant
Consistent Cut:
all Si, all Sj, no ei, no ej such that
( ei → ej ) and ( ej → cj ) and ( ei ↛ ci )
( if such events exist, the cut is inconsistent )
i.e. every message received before a cut event was sent before the cut event at the sender site in the cut.
Inconsistent Cut (C1 and C2 are not concurrent)
Theorem
A cut C = { c1, c2, ... ,cn } is a consistent cut iff no two cut events are causally related ( i.e. every pair of cut events is concurrent )
Time of a cut
C = { c1, c2, ... ,cn }
Ci -- vector clock of ci
TC = sup ( C1, C2, ... , Cn ), i.e. TC[k] = max ( C1[k], C2[k], ... , Cn[k] )
Theorem
if C = { c1, c2, ... ,cn } is a cut with vector time TC, then the cut is consistent iff
TC = ( C1[1], C2[2], ... , Cn[n] ) -------------- (1)
Proof:
If C is a consistent cut, then all its events are concurrent. Thus Ci[i] ≥ Cj[i] for all i, j and hence
TC = sup ( C1, C2, ... , Cn ) = ( C1[1], C2[2], ... , Cn[n] )
On the other hand, if (1) is true we have Ci[i] ≥ Cj[i] for all i, j. This implies that the events ci are concurrent and the cut is consistent.
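The theorem gives a direct test for consistency: compute the cut time and compare it with the diagonal of the cut events' vector clocks. The sketch below uses hypothetical function names, 0-based indices, and made-up vector clocks for a three-site system.

```python
def cut_time(clocks):
    """TC[k] = max over all cut events of Ci[k]."""
    n = len(clocks)
    return [max(c[k] for c in clocks) for k in range(n)]


def is_consistent_cut(clocks) -> bool:
    """The cut is consistent iff TC = ( C1[1], C2[2], ..., Cn[n] )."""
    diagonal = [clocks[i][i] for i in range(len(clocks))]
    return cut_time(clocks) == diagonal


# Assumed vector clocks of the cut events c1, c2, c3.
consistent = [[2, 0, 0], [1, 1, 0], [0, 0, 1]]
# Here site 2 already knows of two events at site 1,
# but the cut event at site 1 includes only one.
inconsistent = [[1, 0, 0], [2, 1, 0], [0, 0, 1]]

print(is_consistent_cut(consistent), is_consistent_cut(inconsistent))
```

In the second cut, some message received before c2 was sent after c1, which shows up as TC[0] = 2 exceeding the diagonal entry 1.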
64. Termination Detection
System Model
A process may be in either an active or an idle state.
An idle process becomes active upon receiving a computation message.
If all processes are idle and no computation messages are in transit => the computation has terminated.
Huang's Termination Detection Protocol:
The goal of this protocol is to detect when a distributed computation terminates.
n processes
Pi -- process; without loss of generality, let P0 be the controlling agent
Wi -- weight of process Pi; initially, W0 = 1 and Wi = 0 for all other i.
B(W) -- computation message with assigned weight W
C(W) -- control message sent from a process to the controlling agent with assigned weight W
Protocol
an active process Pi sends a computation message to Pj

1. Set Wi' and Wij to values such that Wi' + Wij = Wi,
Wi' > 0, Wij > 0. (Wi' is the new weight of Pi.)
2. Send B(Wij) to Pj
Pj receives a computation message B(Wij) from Pi
1. Wj = Wj + Wij
2. If Pj is idle, Pj becomes active
Pi becomes idle by:

1. Send C(Wi) to P0 ( or to another Process )
2. Wi = 0
3. Pi becomes idle
Pi receives a control message C(W):

1. Wi = Wi + W
2. If Wi = 1, the computation has completed.
Example
The picture shows a process P0, designated the controlling agent, with W0 = 1. It asks P1 and P2 to do some computation. It sets
W01 = 0.2
W02 = 0.3
W0 = 0.5
P2 in turn asks P3 and P4 to do some computations. It sets
W23 = 0.1
W24 = 0.1
W2 = 0.1
When P3 terminates, it sends C(W3) = C(0.1) to P2, which changes W2 to 0.1 + 0.1 = 0.2.
The weights still held by P2 ( 0.2 ) and P4 ( 0.1 ) are returned by control messages in the same way, which raises W0 from 0.5 to 0.5 + 0.2 + 0.1 = 0.8.
When P1 terminates, it sends C(W1) = C(0.2) to P0, which changes W0 to 0.8 + 0.2 = 1.
P0 thereupon concludes that the computation is finished.
Total number of messages passed: 8 ( four computation messages to start the computations, four control messages to return the weights ).
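The weight bookkeeping of the example can be replayed in code. This is a sketch with several assumptions: message delivery is modelled as immediate, the class and method names are hypothetical, exact fractions replace the decimal weights to avoid rounding, and the order in which P2 and P4 return their weights is chosen for illustration because the example does not spell it out.

```python
from fractions import Fraction


class HuangSimulation:
    def __init__(self, n: int):
        self.w = [Fraction(0)] * n
        self.w[0] = Fraction(1)      # the controlling agent P0 starts with all the weight
        self.messages = 0

    def send_computation(self, i: int, j: int, wij):
        """Pi splits its weight and sends B(Wij) to Pj."""
        wij = Fraction(wij)
        assert 0 < wij < self.w[i], "both parts of the split must stay positive"
        self.w[i] -= wij
        self.w[j] += wij             # delivery modelled as immediate
        self.messages += 1

    def become_idle(self, i: int, to: int = 0):
        """Pi sends C(Wi) to another process (P0 by default) and becomes idle."""
        assert i != to
        self.w[to] += self.w[i]
        self.w[i] = Fraction(0)
        self.messages += 1

    def terminated(self) -> bool:
        """P0 holds the full weight again."""
        return self.w[0] == 1


sim = HuangSimulation(5)
sim.send_computation(0, 1, "0.2")
sim.send_computation(0, 2, "0.3")    # W0 = 0.5
sim.send_computation(2, 3, "0.1")
sim.send_computation(2, 4, "0.1")    # W2 = 0.1
sim.become_idle(3, to=2)             # W2 = 0.2
sim.become_idle(2)                   # W0 = 0.7 (assumed order)
sim.become_idle(4)                   # W0 = 0.8
sim.become_idle(1)                   # W0 = 1
print(sim.terminated(), sim.messages)
```

The total weight in the system stays at 1 throughout, so P0 regaining weight 1 means no process is active and no computation message is outstanding.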
