CHAPTER 10 TIME AND GLOBAL STATES
10.1 Introduction

This chapter introduces fundamental concepts and algorithms related to monitoring distributed systems as their execution unfolds, and to timing the events that occur in their executions.

Time is an important and interesting issue in distributed systems, for several reasons. First, time is a quantity we often want to measure accurately. In order to know at what time of day a particular event occurred at a particular computer it is necessary to synchronize its clock with an authoritative, external source of time. For example, an 'e-commerce' transaction involves events at a merchant's computer and at a bank's computer. It is important, for auditing purposes, that those events are timestamped accurately.

Second, algorithms that depend upon clock synchronization have been developed for several problems in distribution [Liskov 1993]. These include maintaining the consistency of distributed data (the use of timestamps to serialize transactions is discussed in Section 12.6); checking the authenticity of a request sent to a server (a version of the Kerberos authentication protocol, discussed in Chapter 7, depends on loosely synchronized clocks); and eliminating the processing of duplicate updates (see, for example, Ladin et al. [1992]).

Einstein demonstrated, in his Special Theory of Relativity, the intriguing consequences that follow from the observation that the speed of light is constant for all observers, regardless of their relative velocity. He proved from this assumption, among other things, that two events that are judged to be simultaneous in one frame of reference are not necessarily simultaneous according to observers in other frames of reference that are moving relative to it. For example, an observer on the Earth and an observer travelling away from the Earth in a spaceship will disagree on the time interval between events, the more so as their relative speed increases.

Moreover, the relative order of two events can even be reversed for two different observers. But this cannot happen if one event could have caused the other to occur. In that case, the physical effect follows the physical cause for all observers, although the time elapsed between cause and effect can vary. The timing of physical events was thus proved to be relative to the observer, and Newton's notion of absolute physical time was discredited. There is no special physical clock in the universe to which we can appeal when we want to measure intervals of time.

The notion of physical time is also problematic in a distributed system. This is not due to the effects of special relativity, which are negligible or non-existent for normal computers (unless one counts computers travelling in spaceships!). The problem is based on a similar limitation in our ability to timestamp events at different nodes sufficiently accurately to know the order in which any pair of events occurred, or whether they occurred simultaneously. There is no absolute, global time that we can appeal to. And yet we sometimes need to observe distributed systems and establish whether certain states of affairs occurred at the same time. For example, in object-oriented systems we need to be able to establish whether references to a particular object no longer exist - whether the object has become garbage (in which case we can free its memory). Establishing this requires observations of the states of processes (to find out whether they contain references) and of the communication channels between processes (in case messages containing references are in transit).

In the first half of this chapter, we examine methods whereby computer clocks can be approximately synchronized, using message passing. We go on to introduce logical clocks, including vector clocks, which are used to define an order of events without measuring the physical time at which they occurred.

In the second half, we describe algorithms whose purpose is to capture global states of distributed systems as they execute.

10.2 Clocks, events and process states

Chapter 2 presented an introductory model of interaction between the processes within a distributed system. We shall refine that model in order to help us to understand how to characterize the system's evolution as it executes, and how to timestamp the events in a system's execution that interest users. We begin by considering how to order and timestamp the events that occur at a single process.

We take a distributed system to consist of a collection $\wp$ of $N$ processes $p_i$, $i = 1, 2, \ldots, N$. Each process executes on a single processor, and the processors do not share memory (Chapter 16 considers the case of processes that share memory). Each process $p_i$ in $\wp$ has a state $s_i$ which, in general, it transforms as it executes. The process's state includes the values of all the variables within it. Its state may also include the values of any objects in its local operating system environment that it affects, such as files. We assume that processes cannot communicate with one another in any way except by sending messages through the network. So, for example, if the processes operate robot arms connected to their respective nodes in the system, then they are not allowed to communicate by shaking one another's robot hands!

As each process $p_i$ executes it takes a series of actions, each of which is either a message Send or Receive operation, or an operation that transforms $p_i$'s state - one that changes one or more of the values in $s_i$. In practice, we may choose to use a high-level description of the actions, according to the application. For example, if the processes in $\wp$ are engaged in an e-commerce application, then the actions may be ones such as 'client dispatched order message' or 'merchant server recorded transaction to log'.

We define an event to be the occurrence of a single action that a process carries out as it executes - a communication action or a state-transforming action. The sequence of events within a single process $p_i$ can be placed in a single, total ordering, which we shall denote by the relation $\rightarrow_i$ between the events. That is, $e \rightarrow_i e'$ if and only if the event $e$ occurs before $e'$ at $p_i$. This ordering is well defined, whether or not the process is multi-threaded, since we have assumed that the process executes on a single processor.

Now we can define the history of process $p_i$ to be the series of events that take place within it, ordered as we have described by the relation $\rightarrow_i$:

$$history(p_i) = h_i = \langle e_i^0, e_i^1, e_i^2, \ldots \rangle$$

Clocks ◊ We have seen how to order the events at a process but not how to timestamp them - to assign to them a date and time of day. Computers each contain their own
physical clock. These clocks are electronic devices that count oscillations occurring in a crystal at a definite frequency, and that typically divide this count and store the result in a counter register. Clock devices can be programmed to generate interrupts at regular intervals in order that, for example, timeslicing can be implemented; however, we shall not concern ourselves with this aspect of clock operation.

Figure 10.1 Skew between computer clocks in a distributed system

The operating system reads the node's hardware clock value $H_i(t)$, scales it and adds an offset so as to produce a software clock $C_i(t) = \alpha H_i(t) + \beta$ that approximately measures real, physical time $t$ for process $p_i$. In other words, when the real time in an absolute frame of reference is $t$, $C_i(t)$ is the reading on the software clock. For example, $C_i(t)$ could be the 64-bit value of the number of nanoseconds that have elapsed at time $t$ since a convenient reference time. In general, the clock is not completely accurate, and so $C_i(t)$ will differ from $t$. Nonetheless, if $C_i$ behaves sufficiently well (we shall examine the notion of clock correctness shortly), we can use its value to timestamp any event at $p_i$. Note that successive events will correspond to different timestamps only if the clock resolution - the period between updates of the clock value - is smaller than the time interval between successive events. The rate at which events occur depends on such factors as the length of the processor instruction cycle.

Clock skew and clock drift ◊ Computer clocks, like any others, tend not to be in perfect agreement (Figure 10.1). The instantaneous difference between the readings of any two clocks is called their skew. Also, the crystal-based clocks used in computers are, like any other clocks, subject to clock drift, which means that they count time at different rates, and so diverge. The underlying oscillators are subject to physical variations, with the consequence that their frequencies of oscillation differ. Moreover, even the same clock's frequency varies with temperature. Designs exist that attempt to compensate for this variation, but they cannot eliminate it. The difference in the oscillation period between two clocks might be extremely small, but the difference accumulated over many oscillations leads to an observable difference in the counters registered by two clocks, no matter how accurately they were initialized to the same value. A clock's drift rate is the change in the offset (difference in reading) between the clock and a nominal perfect reference clock per unit of time measured by the reference clock. For ordinary clocks based on a quartz crystal, this is about $10^{-6}$ seconds/second - giving a difference of 1 second every 1,000,000 seconds, or 11.6 days. The drift rate of 'high precision' quartz clocks is about $10^{-7}$ or $10^{-8}$.

Coordinated Universal Time ◊ Computer clocks can be synchronized to external sources of highly accurate time. The most accurate physical clocks use atomic oscillators, whose drift rate is about one part in $10^{13}$. The output of these atomic clocks is used as the standard for elapsed real time, known as International Atomic Time. Since 1967, the standard second has been defined as 9,192,631,770 periods of transition between the two hyperfine levels of the ground state of Caesium-133 ($Cs^{133}$).

Seconds and years and other time units that we use are rooted in astronomical time. They were originally defined in terms of the rotation of the Earth on its axis and its rotation about the Sun. However, the period of the Earth's rotation about its axis is gradually getting longer, primarily because of tidal friction; atmospheric effects and convection currents within the Earth's core also cause short-term increases and decreases in the period. So astronomical time and atomic time have a tendency to get out of step.

Coordinated Universal Time - abbreviated as UTC (from the French equivalent) - is an international standard for timekeeping. It is based on atomic time, but a so-called leap second is inserted - or, more rarely, deleted - occasionally to keep in step with astronomical time. UTC signals are synchronized and broadcast regularly from land-based radio stations and satellites covering many parts of the world. For example, in the USA, the radio station WWV broadcasts time signals on several shortwave frequencies. Satellite sources include the Global Positioning System (GPS).

Receivers are available commercially. Compared with 'perfect' UTC, the signals received from land-based stations have an accuracy in the order of 0.1-10 milliseconds, depending on the station used. Signals received from GPS are accurate to about 1 microsecond. Computers with receivers attached can synchronize their clocks with these timing signals. Computers may also receive the time to an accuracy of a few milliseconds over a telephone line, from organizations such as the National Institute for Standards and Technology in the USA.

10.3 Synchronizing physical clocks

In order to know at what time of day events occur at the processes in our distributed system $\wp$ - for example, for accountancy purposes - it is necessary to synchronize the processes' clocks $C_i$ with an authoritative, external source of time. This is external synchronization. And if the clocks $C_i$ are synchronized with one another to a known degree of accuracy, then we can measure the interval between two events occurring at different computers by appealing to their local clocks - even though they are not necessarily synchronized to an external source of time. This is internal synchronization. We define these two modes of synchronization more closely as follows, over an interval of real time $I$:

External synchronization: For a synchronization bound $D > 0$, and for a source $S$ of UTC time, $|S(t) - C_i(t)| < D$, for $i = 1, 2, \ldots, N$ and for all real times $t$ in $I$. Another way of saying this is that the clocks $C_i$ are accurate to within the bound $D$.

Internal synchronization: For a synchronization bound $D > 0$, $|C_i(t) - C_j(t)| < D$ for $i, j = 1, 2, \ldots, N$, and for all real times $t$ in $I$. Another way of saying this is that the clocks $C_i$ agree within the bound $D$.

Clocks that are internally synchronized are not necessarily externally synchronized, since they may drift collectively from an external source of time, even though they agree
with one another. However, it follows from the definitions that if the system $\wp$ is externally synchronized with a bound $D$ then the same system is internally synchronized with a bound of $2D$.

Various notions of correctness for clocks have been suggested. It is common to define a hardware clock $H$ to be correct if its drift rate falls within a known bound $\rho > 0$ (a value derived from one supplied by the manufacturer, such as $10^{-6}$ seconds/second). This means that the error in measuring the interval between real times $t$ and $t'$ ($t' > t$) is bounded:

$$(1 - \rho)(t' - t) \le H(t') - H(t) \le (1 + \rho)(t' - t)$$

Figure 10.2 Clock synchronization using a time server (process $p$ sends request $m_r$ to time server $S$, which replies with $m_t$)
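The synchronization definitions and the drift bound can be checked mechanically. The following Python sketch is illustrative only: the function names and sample readings are assumptions introduced here, and `drift_within_bound` assumes the usual two-sided form of the drift bound for a hardware clock with drift rate bound $\rho$.

```python
def externally_synchronized(source, clocks, bound):
    """True if every clock reading is within `bound` of the source reading."""
    return all(abs(source - c) < bound for c in clocks)


def internally_synchronized(clocks, bound):
    """True if every pair of clock readings agrees within `bound`."""
    return max(clocks) - min(clocks) < bound


def drift_within_bound(h_start, h_end, t_start, t_end, rho):
    """Check the assumed bound (1-rho)(t'-t) <= H(t')-H(t) <= (1+rho)(t'-t)."""
    real = t_end - t_start          # elapsed real time
    measured = h_end - h_start      # elapsed time according to the clock
    return (1 - rho) * real <= measured <= (1 + rho) * real


def seconds_until_offset(offset, drift_rate):
    """Real seconds a clock with constant `drift_rate` takes to drift by `offset`."""
    return offset / drift_rate


# Hypothetical readings taken at one instant of real time.
source, clocks, D = 100.0, [99.6, 100.4], 0.5
ext_ok = externally_synchronized(source, clocks, D)        # each clock within D of source
int_ok_D = internally_synchronized(clocks, D)              # the clocks are 0.8 apart
int_ok_2D = internally_synchronized(clocks, 2 * D)         # but within 2D of each other
days_per_second = seconds_until_offset(1.0, 1e-6) / 86400  # about 11.6 days
```

With these readings the clocks are externally synchronized with bound D but internally synchronized only with bound 2D: the two clocks sit on opposite sides of the source, which is the worst case.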
Cristian observed that while there is no upper bound on message transmission delays in an asynchronous system, the round-trip times for messages exchanged between pairs of processes are often reasonably short - a small fraction of a second. He describes the algorithm as probabilistic: the method achieves synchronization only if the observed round-trip times between client and server are sufficiently short compared with the required accuracy.

A process $p$ requests the time in a message $m_r$, and receives the time value $t$ in a message $m_t$ ($t$ is inserted in $m_t$ at the last possible point before transmission from $S$'s computer). Process $p$ records the total round-trip time $T_{round}$ taken to send the request $m_r$ and receive the reply $m_t$. It can measure this time with reasonable accuracy if its rate of clock drift is small. For example, the round-trip time should be in the order of 1-10 milliseconds on a LAN, over which time a clock with a drift rate of $10^{-6}$ seconds/second varies by at most $10^{-5}$ milliseconds.

A simple estimate of the time to which $p$ should set its clock is $t + T_{round}/2$, which assumes that the elapsed time is split equally before and after $S$ placed $t$ in $m_t$. This is normally a reasonably accurate assumption, unless the two messages are transmitted over different networks. If the value of the minimum transmission time $min$ is known or can be conservatively estimated, then we can determine the accuracy of this result as follows.

The earliest point at which $S$ could have placed the time in $m_t$ was $min$ after $p$ dispatched $m_r$. The latest point at which it could have done this was $min$ before $m_t$ arrived at $p$. The time by $S$'s clock when the reply message arrives is therefore in the range $[t + min, t + T_{round} - min]$. The width of this range is $T_{round} - 2min$, so the accuracy is $\pm(T_{round}/2 - min)$.

Variability can be dealt with to some extent by making several requests to $S$ (spacing the requests so that transitory congestion can clear) and taking the minimum value of $T_{round}$ to give the most accurate estimate. The greater is the accuracy required, the smaller is the probability of achieving it. This is because the most accurate results are those in which both messages are transmitted in a time close to $min$ - an unlikely event in a busy network.

Discussion of Cristian's algorithm ◊ As described, Cristian's method suffers from the problem associated with all services implemented by a single server, that the single time server might fail and thus render synchronization impossible temporarily. Cristian suggested, for this reason, that time should be provided by a group of synchronized time servers, each with a receiver for UTC time signals. For example, a client could multicast its request to all servers and use only the first reply obtained.

Note that a faulty time server that replied with spurious time values, or an imposter time server that replied with deliberately incorrect times, could wreak havoc in a computer system. These problems were beyond the scope of the work described by Cristian [1989], which assumes that sources of external time signals are self-checking. Cristian and Fetzer [1994] describe a family of probabilistic protocols for internal clock synchronization, each of which tolerates certain failures. Srikanth and Toueg [1987] first described an algorithm that is optimal with respect to the accuracy of the synchronized clocks, while tolerating some failures. Dolev et al. [1986] showed that if $f$ is the number of faulty clocks out of a total of $N$, then we must have $N > 3f$ if the other, correct, clocks are still to be able to achieve agreement. The problem of dealing with faulty clocks is partially addressed by the Berkeley algorithm, which is described next. The problem of malicious interference with time synchronization can be dealt with by authentication techniques.

10.3.3 The Berkeley algorithm

Gusella and Zatti [1989] describe an algorithm for internal synchronization that they developed for collections of computers running Berkeley UNIX. In it, a coordinator computer is chosen to act as the master. Unlike Cristian's protocol, this computer periodically polls the other computers whose clocks are to be synchronized, called slaves. The slaves send back their clock values to it. The master estimates their local clock times by observing the round-trip times (similarly to Cristian's technique), and it averages the values obtained (including its own clock's reading). The balance of probabilities is that this average cancels out the individual clocks' tendencies to run fast or slow. The accuracy of the protocol depends upon a nominal maximum round-trip time between the master and the slaves. The master eliminates any occasional readings associated with larger times than this maximum.

Instead of sending the updated current time back to the other computers - which would introduce further uncertainty due to the message transmission time - the master sends the amount by which each individual slave's clock requires adjustment. This can be a positive or negative value.

The algorithm eliminates readings from faulty clocks. Such clocks could have a significant adverse effect if an ordinary average was taken. The master takes a fault-tolerant average. That is, a subset of clocks is chosen that do not differ from one another by more than a specified amount, and the average is taken of readings from only these clocks.

Gusella and Zatti describe an experiment involving 15 computers whose clocks were synchronized to within about 20-25 milliseconds using their protocol. The local clocks' drift rate was measured to be less than $2 \times 10^{-5}$, and the maximum round-trip time was taken to be 10 milliseconds.

Should the master fail, then another can be elected to take over and function exactly as its predecessor. Section 11.3 discusses some general-purpose election algorithms. Note that these are not guaranteed to elect a new master in bounded time - and so the difference between two clocks would be unbounded if they were used.

10.3.4 The Network Time Protocol

Cristian's method and the Berkeley algorithm are intended primarily for use within intranets. The Network Time Protocol (NTP) [Mills 1995] defines an architecture for a time service and a protocol to distribute time information over the Internet.

NTP's chief design aims and features are as follows.

To provide a service enabling clients across the Internet to be synchronized accurately to UTC: Despite the large and variable message delays encountered in Internet communication, NTP employs statistical techniques for the filtering of timing data and it discriminates between the quality of timing data from different servers.
To provide a reliable service that can survive lengthy losses of connectivity: There are redundant servers and redundant paths between the servers. The servers can reconfigure so as to continue to provide the service if one of them becomes unreachable.

To enable clients to resynchronize sufficiently frequently to offset the rates of drift found in most computers: The service is designed to scale to large numbers of clients and servers.

To provide protection against interference with the time service, whether malicious or accidental: The time service uses authentication techniques to check that timing data originate from the claimed trusted sources. It also validates the return addresses of messages sent to it.

The NTP service is provided by a network of servers located across the Internet. Primary servers are connected directly to a time source such as a radio clock receiving UTC; secondary servers are synchronized, ultimately, with primary servers. The servers are connected in a logical hierarchy called a synchronization subnet (see Figure 10.3), whose levels are called strata. Primary servers occupy stratum 1; they are at the root. Stratum 2 servers are secondary servers that are synchronized directly with the primary servers; stratum 3 servers are synchronized with stratum 2 servers, and so on. The lowest-level (leaf) servers execute in users' workstations.

Figure 10.3 An example synchronization subnet in an NTP implementation
Note: Arrows denote synchronization control, numbers denote strata.

The clocks belonging to servers with high stratum numbers are liable to be less accurate than those with low stratum numbers, because errors are introduced at each level of synchronization. NTP also takes into account the total message round-trip delays to the root in assessing the quality of timekeeping data held by a particular server.

The synchronization subnet can reconfigure as servers become unreachable or failures occur. If, for example, a primary server's UTC source fails, then it can become a stratum 2 secondary server. If a secondary server's normal source of synchronization fails or becomes unreachable, then it may synchronize with another server.

NTP servers synchronize with one another in one of three modes: multicast, procedure-call and symmetric mode. Multicast mode is intended for use on a high-speed LAN. One or more servers periodically multicast the time to the servers running in other computers connected by the LAN, which set their clocks assuming a small delay. This mode can achieve only relatively low accuracies, but ones that nonetheless are considered sufficient for many purposes.

Procedure-call mode is similar to the operation of Cristian's algorithm, described above. In this mode, one server accepts requests from other computers, which it processes by replying with its timestamp (current clock reading). This mode is suitable where higher accuracies are required than can be achieved with multicast - or where multicast is not supported in hardware. For example, file servers on the same or a neighbouring LAN, which need to keep accurate timing information for file accesses, could contact a local server in procedure-call mode.

Finally, symmetric mode is intended for use by the servers that supply time information in LANs and by the higher levels (lower strata) of the synchronization subnet, where the highest accuracies are to be achieved. A pair of servers operating in symmetric mode exchange messages bearing timing information. Timing data are retained as part of an association between the servers that is maintained in order to improve the accuracy of their synchronization over time.

Figure 10.4 Messages exchanged between a pair of NTP peers

In all modes, messages are delivered unreliably, using the standard UDP Internet transport protocol. In procedure-call mode and symmetric mode, processes exchange pairs of messages. Each message bears timestamps of recent message events: the local times when the previous NTP message between the pair was sent and received, and the local time when the current message was transmitted. The recipient of the NTP message notes the local time when it receives the message. The four times $T_{i-3}$, $T_{i-2}$, $T_{i-1}$ and $T_i$ are shown in Figure 10.4 for the messages $m$ and $m'$ sent between servers A and B. Note that in symmetric mode, unlike Cristian's algorithm described above, there can be a non-negligible delay between the arrival of one message and the dispatch of the next. Also, messages may be lost, but the three timestamps carried by each message are nonetheless valid.
For each pair of messages sent between two servers the NTP calculates an offset $o_i$, which is an estimate of the actual offset between the two clocks, and a delay $d_i$, which is the total transmission time for the two messages. If the true offset of the clock at B relative to that at A is $o$, and if the actual transmission times for $m$ and $m'$ are $t$ and $t'$ respectively, then we have:

Figure 10.5 Events occurring at three processes
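A minimal Python sketch of the per-exchange NTP calculation, assuming the usual NTP relations. It assumes that $m$ travels from A to B (sent at $T_{i-3}$, received at $T_{i-2}$) and $m'$ from B to A (sent at $T_{i-1}$, received at $T_i$), so that $T_{i-2} = T_{i-3} + t + o$ and $T_i = T_{i-1} + t' - o$. Adding and subtracting these gives $d_i = t + t' = (T_{i-2} - T_{i-3}) + (T_i - T_{i-1})$ and $o = o_i + (t' - t)/2$, where $o_i = ((T_{i-2} - T_{i-3}) + (T_{i-1} - T_i))/2$. The function names and sample values are hypothetical.

```python
def ntp_offset_delay(t_i3, t_i2, t_i1, t_i):
    """Return (o_i, d_i) for one pair of messages.

    t_i3: A's clock when m was sent       t_i2: B's clock when m arrived
    t_i1: B's clock when m' was sent      t_i:  A's clock when m' arrived
    """
    delay = (t_i2 - t_i3) + (t_i - t_i1)         # d_i = t + t'
    offset = ((t_i2 - t_i3) + (t_i1 - t_i)) / 2  # o_i, estimate of B's offset from A
    return offset, delay


def offset_bounds(offset, delay):
    """Since t, t' >= 0, the true offset lies within o_i +/- d_i / 2."""
    return offset - delay / 2, offset + delay / 2


# Hypothetical exchange: B runs 5 units ahead of A, t = 2 and t' = 4.
o_i, d_i = ntp_offset_delay(100, 107, 110, 109)
low, high = offset_bounds(o_i, d_i)
```

In this example the estimate $o_i$ differs from the true offset by $(t' - t)/2 = 1$, which is within the bound $d_i/2 = 3$.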
of message $m_1$, and similarly $d \rightarrow f$. Combining these relations, we may also say that, for example, $a \rightarrow f$.

It can also be seen from Figure 10.5 that not all events are related by the relation $\rightarrow$. For example, $a \nrightarrow e$ and $e \nrightarrow a$, since they occur at different processes, and there is no chain of messages intervening between them. We say that events such as $a$ and $e$ that are not ordered by $\rightarrow$ are concurrent and write this $a \parallel e$.

Figure 10.6 Lamport timestamps for the events shown in Figure 10.5

The relation $\rightarrow$ captures a flow of data intervening between two events. Note, however, that in principle data can flow in ways other than by message passing. For example, if Smith enters a command to his process to send a message, then telephones Jones, who commands her process to issue another message, then the issuing of the first message clearly happened-before that of the second. Unfortunately, since no network messages were sent between the issuing processes, we cannot model this type of
(b) On receiving $(m, t)$, a process $p_i$ computes $L_i := max(L_i, t)$ and then applies LC1 before timestamping the event $receive(m)$.

Although we increment clocks by 1, we could have chosen any positive value. It can easily be shown, by induction on the length of any sequence of events relating two events $e$ and $e'$, that $e \rightarrow e' \Rightarrow L(e) < L(e')$.

Note that the converse is not true. If $L(e) < L(e')$, then we cannot infer that $e \rightarrow e'$. In Figure 10.6 we illustrate the use of logical clocks for the example given in Figure 10.5. Each of the processes $p_1$, $p_2$ and $p_3$ has its logical clock initialized to 0. The clock values given are those immediately after the event to which they are adjacent. Note that, for example, $L(b) > L(e)$ but $b \parallel e$.

VC3: $p_i$ includes the value $t = V_i$ in every message it sends.

VC4: When $p_i$ receives a timestamp $t$ in a message, it sets $V_i[j] := max(V_i[j], t[j])$, for $j = 1, 2, \ldots, N$. Taking the component-wise maximum of two vector timestamps in this way is known as a merge operation.

For a vector clock $V_i$, $V_i[i]$ is the number of events that $p_i$ has timestamped, and $V_i[j]$ ($j \ne i$) is the number of events that have occurred at $p_j$ that $p_i$ has potentially been affected by. (Process $p_j$ may have timestamped more events by this point, but no information has flowed to $p_i$ about them in messages as yet.)
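The clock rules can be sketched in Python as follows. Only LC2(b), VC3 and VC4 appear above; the sketch assumes the usual remaining rules (each clock starts at zero, a process increments its own counter before timestamping any event, and a sender piggybacks its Lamport time on each message). The class names and the replayed message pattern (b sends $m_1$, which is received at c; d sends $m_2$, which is received at f) are assumptions based on Figure 10.5.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def event(self):
        """Assumed LC1: increment before timestamping any event."""
        self.time += 1
        return self.time

    def send(self):
        """Assumed LC2(a): the send event's timestamp travels with the message."""
        return self.event()

    def receive(self, t):
        """LC2(b): take the maximum, then apply LC1 to timestamp the receive."""
        self.time = max(self.time, t)
        return self.event()


class VectorClock:
    def __init__(self, n, index):
        self.v = [0] * n      # assumed: all components start at zero
        self.index = index    # position of this process in the vector

    def event(self):
        """Assumed rule: increment own component before timestamping an event."""
        self.v[self.index] += 1
        return tuple(self.v)

    def send(self):
        """VC3: the whole vector is included in the message."""
        return self.event()

    def receive(self, t):
        """VC4: component-wise maximum (merge), then timestamp the receive."""
        self.v = [max(a, b) for a, b in zip(self.v, t)]
        return self.event()


def replay(clocks):
    """Replay the assumed events a..f on three clocks; return their timestamps."""
    p1, p2, p3 = clocks
    stamps = {}
    stamps["a"] = p1.event()
    stamps["b"] = p1.send()                 # sends m1
    stamps["c"] = p2.receive(stamps["b"])   # receives m1
    stamps["d"] = p2.send()                 # sends m2
    stamps["e"] = p3.event()
    stamps["f"] = p3.receive(stamps["d"])   # receives m2
    return stamps


lamport = replay([LamportClock() for _ in range(3)])
vector = replay([VectorClock(3, i) for i in range(3)])
```

Under these assumptions the Lamport timestamps give L(b) = 2 and L(e) = 1, so L(b) > L(e) even though b and e are concurrent.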
Figure 10.7 Vector timestamps for the events shown in Figure 10.5 ($p_1$: (1,0,0), (2,0,0); $p_2$: (2,1,0), (2,2,0); $p_3$: (0,0,1), (2,2,2))

Figure 10.8 Detecting global properties (a. Garbage collection; b. Deadlock; c. Termination)

We may compare vector timestamps as follows:

$V = V'$ iff $V[j] = V'[j]$ for $j = 1, 2, \ldots, N$
$V \le V'$ iff $V[j] \le V'[j]$ for $j = 1, 2, \ldots, N$
$V < V'$ iff $V \le V' \wedge V \ne V'$
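These comparison rules translate directly into code. In the minimal Python sketch below, the function names are assumptions, the timestamps are those of Figure 10.7 (their assignment to the event labels a to f is assumed from the figure's layout), and two distinct events are treated as concurrent when neither timestamp is less than or equal to the other.

```python
def equal(v, w):
    """V = V' iff every component matches."""
    return len(v) == len(w) and all(a == b for a, b in zip(v, w))


def leq(v, w):
    """V <= V' iff V[j] <= V'[j] for every j."""
    return len(v) == len(w) and all(a <= b for a, b in zip(v, w))


def less(v, w):
    """V < V' iff V <= V' and V != V'."""
    return leq(v, w) and not equal(v, w)


def concurrent(v, w):
    """Neither timestamp is less than or equal to the other."""
    return not leq(v, w) and not leq(w, v)


# Timestamps read from Figure 10.7; event labels are assumed.
V = {
    "a": (1, 0, 0), "b": (2, 0, 0),
    "c": (2, 1, 0), "d": (2, 2, 0),
    "e": (0, 0, 1), "f": (2, 2, 2),
}
a_before_f = less(V["a"], V["f"])            # a happened-before f
c_and_e_concurrent = concurrent(V["c"], V["e"])
```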
Let $V(e)$ be the vector timestamp applied by the process at which $e$ occurs. It is straightforward to show, by induction on the length of any sequence of events relating two events $e$ and $e'$, that $e \rightarrow e' \Rightarrow V(e) < V(e')$. Exercise 10.13 leads the reader to show the converse: if $V(e) < V(e')$, then $e \rightarrow e'$.

Figure 10.7 shows the vector timestamps of the events of Figure 10.5. It can be seen, for example, that $V(a) < V(f)$, which reflects the fact that $a \rightarrow f$. Similarly, we can tell when two events are concurrent by comparing their timestamps. For example, that $c \parallel e$ can be seen from the facts that neither $V(c) \le V(e)$ nor $V(e) \le V(c)$.

Vector timestamps have the disadvantage, compared with Lamport timestamps, of taking up an amount of storage and message payload that is proportional to $N$, the number of processes. Charron-Bost [1991] showed that, if we are to be able to tell whether or not two events are concurrent by inspecting their timestamps, then the dimension $N$ is unavoidable. However, techniques exist for storing and transmitting smaller amounts of data, at the expense of the processing required to reconstruct complete vectors. Raynal and Singhal [1996] give an account of some of these techniques. They also describe the notion of matrix clocks, whereby processes keep estimates of other processes' vector times as well as their own.

10.5 Global states

In this and the next section we shall examine the problem of finding out whether a particular property is true of a distributed system as it executes. We begin by giving the examples of distributed garbage collection, deadlock detection, termination detection and debugging.

Distributed garbage collection: An object is considered to be garbage if there are no longer any references to it anywhere in the distributed system. The memory taken up by that object can be reclaimed once it is known to be garbage. To check that an object is garbage, we must verify that there are no references to it anywhere in the system. In Figure 10.8a, process $p_1$ has two objects that both have references - one has a reference within $p_1$ itself, and $p_2$ has a reference to the other. Process $p_2$ has one garbage object, with no references to it anywhere in the system. It also has an object for which neither $p_1$ nor $p_2$ has a reference, but there is a reference to it in a message that is in transit between the processes. This shows that when we consider properties of a system, we must include the state of communication channels as well as the state of the processes.

Distributed deadlock detection: A distributed deadlock occurs when each of a collection of processes waits for another process to send it a message, and where there is a cycle in the graph of this 'waits-for' relationship. Figure 10.8b shows that each of processes $p_1$ and $p_2$ waits for a message from the other, so this system will never make progress.

Distributed termination detection: The problem here is to detect that a distributed algorithm has terminated. Detecting termination is a problem that sounds deceptively easy to solve: it seems at first only necessary to test whether each process has halted. To see that this is not so, consider a distributed algorithm executed by two processes $p_1$ and $p_2$, each of which may request values from the other. Instantaneously, we may find that a process is either active or passive - a passive process is not engaged in any activity of its own but is prepared to respond with a value requested by the other. Suppose we discover that $p_1$ is passive and that $p_2$ is passive (Figure 10.8c).
To see that we may not conclude that the algorithm has terminated, consider the following scenario: when we tested p1 for passivity, a message was on its way from p2, which became passive immediately after sending it. On receipt of the message, p1 became active again - after we found it to be passive. The algorithm had not terminated.

The phenomena of termination and deadlock are similar in some ways, but they are different problems. First, a deadlock may affect only a subset of the processes in a system, whereas all processes must have terminated. Second, process passivity is not the same as waiting in a deadlock cycle: a deadlocked process is attempting to perform a further action, for which another process waits; a passive process is not engaged in any activity.

Distributed debugging: Distributed systems are complex to debug [Bonnaire et al. 1995], and care needs to be taken in establishing what occurred during the execution. For example, Smith has written an application in which each process p_i contains a variable x_i (i = 1, 2, ..., N). The variables change as the program executes, but they are required always to be within a value δ of one another. Unfortunately, there is a bug in the program, and she suspects that under certain circumstances |x_i - x_j| > δ for some i and j, breaking her consistency constraints. Her problem is that this relationship must be evaluated for values of the variables that occur at the same time.

Each of the problems above has specific solutions tailored to it; but they all illustrate the need to observe a global state, and so motivate a general approach.

10.5.1 Global states and consistent cuts

It is possible in principle to observe the succession of states of an individual process, but the question of how to ascertain a global state of the system - the state of the collection of processes - is much harder to address.

The essential problem is the absence of global time. If all processes had perfectly synchronized clocks then we could agree on a time at which each process would record its state - the result would be an actual global state of the system. From the collection of process states we could tell, for example, whether the processes were deadlocked. But we cannot achieve perfect clock synchronization, so this method is not available to us.

So we might ask whether we can assemble a meaningful global state from local states recorded at different real times. The answer is a qualified 'yes', but in order to see this we first introduce some definitions.

Let us return to our general system ℘ of N processes p_i (i = 1, 2, ..., N), whose execution we wish to study. We said above that a series of events occurs at each process, and that we may characterize the execution of each process by its history:

\[ history(p_i) = h_i = \langle e_i^0, e_i^1, e_i^2, \ldots \rangle \]

Similarly, we may consider any finite prefix of the process's history:

\[ h_i^k = \langle e_i^0, e_i^1, \ldots, e_i^k \rangle \]

Figure 10.9 Cuts
[Figure: timelines of processes p1 and p2 against physical time, with an inconsistent cut and a consistent cut marked.]

Each event is either an internal action of the process (for example, the updating of one of its variables) or it is the sending or receipt of a message over the communication channels that connect the processes.

In principle, we can record what occurred in ℘'s execution. Each process can record the events that take place there, and the succession of states it passes through. We denote by s_i^k the state of process p_i immediately before the kth event occurs, so that s_i^0 is the initial state of p_i. We noted in the examples above that the state of the communication channels is sometimes relevant. Rather than introducing a new type of state, we make the processes record the sending or receipt of all messages as part of their state. If we find that process p_i has recorded that it sent a message m to process p_j (i ≠ j), then by examining whether p_j has received that message we can infer whether or not m is part of the state of the channel between p_i and p_j.

We can also form the global history of ℘ as the union of the individual process histories:

\[ H = h_1 \cup h_2 \cup \ldots \cup h_N \]

Mathematically, we can take any set of states of the individual processes to form a global state S = (s_1, s_2, ..., s_N). But which global states are meaningful - that is, which process states could have occurred at the same time? A global state corresponds to initial prefixes of the individual process histories. A cut of the system's execution is a subset of its global history that is a union of prefixes of process histories:

\[ C = h_1^{c_1} \cup h_2^{c_2} \cup \ldots \cup h_N^{c_N} \]

The state s_i in the global state S corresponding to the cut C is that of p_i immediately after the last event processed by p_i in the cut - e_i^{c_i} (i = 1, 2, ..., N). The set of events {e_i^{c_i} : i = 1, 2, ..., N} is called the frontier of the cut.

Consider the events occurring at processes p1 and p2 shown in Figure 10.9. The figure shows two cuts, one with frontier ⟨e_1^0, e_2^0⟩ and another with frontier ⟨e_1^2, e_2^2⟩. The leftmost cut is inconsistent. This is because at p2 it includes the receipt of the message m1, but at p1 it does not include the sending of that message. This is showing an 'effect' without a 'cause'. The actual execution never was in a global state corresponding to the process states at that frontier, and we can in principle tell this by examining the → relation between events. By contrast, the rightmost cut is consistent.
It includes both the sending and the receipt of message m1. It includes the sending but not the receipt of message m2. That is consistent with the actual execution - after all, the message took some time to arrive.

A cut C is consistent if, for each event it contains, it also contains all the events that happened-before that event:

\[ \text{For all events } e \in C,\; f \rightarrow e \Rightarrow f \in C \]

A consistent global state is one that corresponds to a consistent cut. We may characterize the execution of a distributed system as a series of transitions between global states of the system:

\[ S_0 \rightarrow S_1 \rightarrow S_2 \rightarrow \ldots \]

In each transition, precisely one event occurs at some single process in the system. This event is either the sending of a message, the receipt of a message, or an internal event. If two events happened simultaneously, we may nonetheless deem them to have occurred in a definite order - say ordered according to process identifiers. (Events that occur simultaneously must be concurrent: neither happened-before the other.) A system evolves in this way through consistent global states.

Figure 10.10 Chandy and Lamport's 'snapshot' algorithm

Marker receiving rule for process p_i
    On p_i's receipt of a marker message over channel c:
        if (p_i has not yet recorded its state) it
            records its process state now;
            records the state of c as the empty set;
            turns on recording of messages arriving over other incoming channels;
        else
            p_i records the state of c as the set of messages it has received over c since it saved its state.
        end if

Marker sending rule for process p_i
    After p_i has recorded its state, for each outgoing channel c:
        p_i sends one marker message over c
        (before it sends any other message over c).
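The consistency condition on cuts defined above can be checked mechanically. The following is a minimal Python sketch, not part of the original text. It represents a cut by the number of events of each process that it includes. Because a cut is a union of prefixes, it is already closed under each process's local ordering, so it suffices to check that every message whose receipt is in the cut also has its sending in the cut. The function name is_consistent_cut, the (process, index) event encoding and the positions assumed for m1 and m2 are illustrative assumptions chosen to mirror the two cuts described for Figure 10.9.

```python
def is_consistent_cut(cut, messages):
    """Return True if the cut contains the sending of every message it receives.

    cut      -- dict mapping process name to the number of its events in the cut
                (a prefix of that process's history)
    messages -- list of (send_event, receive_event) pairs, each event being a
                (process name, 0-based event index) tuple
    """
    def in_cut(event):
        process, index = event
        return index < cut[process]

    # Prefixes are already closed under each process's local ordering, so only
    # the send -> receive edges can break the happened-before closure.
    return all(in_cut(send) for send, receive in messages if in_cut(receive))


# Hypothetical event positions: m1 is sent at p1's event 1 and received at
# p2's event 0; m2 is sent at p2's event 2 and received at p1's event 3.
MESSAGES = [(("p1", 1), ("p2", 0)), (("p2", 2), ("p1", 3))]

left_cut = {"p1": 1, "p2": 1}    # frontier: event 0 at p1, event 0 at p2
right_cut = {"p1": 3, "p2": 3}   # frontier: event 2 at p1, event 2 at p2

print(is_consistent_cut(left_cut, MESSAGES))   # False: m1 received but not sent
print(is_consistent_cut(right_cut, MESSAGES))  # True: m2 sent but not yet received
```

Counting events per process rather than listing them keeps the prefix property true by construction, so the check reduces to one pass over the messages.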
A run is a total ordering of all the events in a global history that is consistent with each local history's ordering, →_i (i = 1, 2, ..., N). A linearization or consistent run is an ordering of the events in a global history that is consistent with this happened-before relation → on H. Note that a linearization is also a run.

Not all runs pass through consistent global states, but all linearizations pass only through consistent global states. We say that a state S' is reachable from a state S if there is a linearization that passes through S and then S'.

Sometimes we may alter the ordering of concurrent events within a linearization, and derive a run that still passes through only consistent global states. For example, if two successive events in a linearization are the receipt of messages by two processes, then we may swap the order of these two events.

10.5.2 Global state predicates, stability, safety and liveness

Detecting a condition such as deadlock or termination amounts to evaluating a global state predicate. A global state predicate is a function that maps from the set of global states of processes in the system ℘ to {True, False}. One of the useful characteristics of the predicates associated with the state of an object being garbage, of the system being deadlocked or the system being terminated is that they are all stable: once the system enters a state in which the predicate is True, it remains True in all future states reachable from that state. By contrast, when we monitor or debug an application we are often interested in non-stable predicates, such as that in our example of variables whose difference is supposed to be bounded. Even if the application reaches a state in which the bound obtains, it need not stay in that state.

We also note here two further notions relevant to global state predicates: safety and liveness. Suppose there is an undesirable property α that is a predicate of the system's global state - for example, α could be the property of being deadlocked. Let S0 be the original state of the system. Safety with respect to α is the assertion that α evaluates to False for all states S reachable from S0. Conversely, let β be a desirable property of a system's global state - for example, the property of reaching termination. Liveness with respect to β is the property that, for any linearization L starting in the state S0, β evaluates to True for some state S_L reachable from S0.

10.5.3 The 'snapshot' algorithm of Chandy and Lamport

Chandy and Lamport [1985] describe a 'snapshot' algorithm for determining global states of distributed systems, which we now present. The goal of the algorithm is to record a set of process and channel states (a 'snapshot') for a set of processes p_i (i = 1, 2, ..., N) such that, even though the combination of recorded states may never have occurred at the same time, the recorded global state is consistent.

We shall see that the state that the snapshot algorithm records has convenient properties for evaluating stable global predicates.

The algorithm records state locally at processes; it does not give a method for gathering the global state at one site. An obvious method for gathering the state is for all processes to send the state they recorded to a designated collector process, but we shall not address this issue further here.

The algorithm assumes that:

• neither channels nor processes fail; communication is reliable so that every message sent is eventually received intact, exactly once;

• channels are unidirectional and provide FIFO-ordered message delivery;

• the graph of processes and channels is strongly connected (there is a path between any two processes);
• the processes may continue their execution and send and receive normal messages

For each process p_i, let the incoming channels be those at p_i over which other processes send it messages; similarly, p_i's outgoing channels are those on which it sends messages to other processes. The essential idea of the algorithm is as follows. Each process records its state and also for each incoming channel a set of messages sent to it. The process records, for each channel, any messages that arrived after it recorded its state and before the sender recorded its own state. This arrangement allows us to record the states of processes at different times but to account for the differentials between process states in terms of messages transmitted but not yet received. If process p_i has sent a message m to process p_j, but p_j has not received it, then we account for m as belonging to the state of the channel between them.

The algorithm proceeds through use of special marker messages, which are distinct from any other messages the processes send, and which the processes may send and receive while they proceed with their normal execution. The marker has a dual role: as a prompt for the receiver to save its own state, if it has not already done so; and as a means of determining which messages to include in the channel state.

The algorithm is defined through two rules, the marker receiving rule and the marker sending rule (Figure 10.10). The marker sending rule obligates processes to send a marker after they have recorded their state, but before they send any other messages.

The marker receiving rule obligates a process that has not recorded its state to do so. In that case, this is the first marker that it has received. It notes which messages subsequently arrive on the other incoming channels. When a process that has already saved its state receives a marker (on another channel), it records the state of that channel as the set of messages it received on it since it saved its state.

Any process may begin the algorithm at any time. It acts as though it has received a marker (over a non-existent channel) and follows the marker receiving rule. Thus it records its state and begins to record messages arriving over all its incoming channels. Several processes may initiate recording concurrently in this way (as long as the markers they use can be distinguished).

We illustrate the algorithm for a system of two processes, p1 and p2, connected by two unidirectional channels, c1 and c2. The two processes trade in 'widgets'. Process p1 sends orders for widgets over c2 to p2, enclosing payment at the rate of $10 per widget. Some time later, process p2 sends widgets along channel c1 to p1. The processes have the initial states shown in Figure 10.11. Process p2 has already received an order for five widgets, which it will shortly dispatch to p1.

Figure 10.11 Two processes and their initial states
[Figure: processes p1 and p2 connected by channels c1 and c2; p1 is in state <$1000, 0> and p2 in state <$50, 2000>.]

Figure 10.12 shows an execution of the system while the state is recorded. Process p1 records its state in the actual global state S0, when p1's state is <$1000, 0>. Following the marker sending rule, process p1 then emits a marker message over its outgoing channel c2 before it sends the next application-level message: (Order 10, $100) over channel c2. The system enters actual global state S1.

Figure 10.12 The execution of the processes in Figure 10.11 (M = marker message)
1. Global state S0: p1 <$1000, 0>; c2 (empty); c1 (empty); p2 <$50, 2000>
3. Global state S2: p1 <$900, 0>; c2 (Order 10, $100), M; c1 (five widgets); p2 <$50, 1995>
4. Global state S3: p1 <$900, 5>; c2 (Order 10, $100); c1 (empty); p2 <$50, 1995>

Before p2 receives the marker, it emits an application message (five widgets) over c1 in response to p1's previous order, yielding a new actual global state S2.

Now process p1 receives p2's message (five widgets), and p2 receives the marker. Following the marker receiving rule, p2 records its state as <$50, 1995> and that of channel c2 as the empty sequence. Following the marker sending rule, it sends a marker message over c1.

When process p1 receives p2's marker message, it records the state of channel c1 as the single message (five widgets) that it received after it first recorded its state. The final actual global state is S3.

The final recorded state is p1: <$1000, 0>; p2: <$50, 1995>; c1: <(five widgets)>; c2: < >. Note that this state differs from all the global states through which the system actually passed.

Termination of the snapshot algorithm: We assume that a process that has received a marker message records its state within a finite time and sends marker messages over each outgoing channel within a finite time (even when it no longer needs to send application messages over these channels). If there is a path of communication channels and processes from a process p_i to a process p_j (j ≠ i), then it is clear on these assumptions that p_j will record its state a finite time after p_i recorded its state. Since we are assuming the graph of processes and channels to be strongly connected, it follows
that all processes will have recorded their states and the states of incoming channels a finite time after some process initially records its state.

Characterising the observed state: The snapshot algorithm selects a cut from the history of the execution. The cut, and therefore the state recorded by this algorithm, is consistent. To see this, let e_i and e_j be events occurring at p_i and p_j, respectively, such that e_i → e_j. We assert that if e_j is in the cut then e_i is in the cut. That is, if e_j occurred before p_j recorded its state, then e_i must have occurred before p_i recorded its state. This is obvious if the two processes are the same, so we shall assume that j ≠ i. Assume, for the moment, the opposite of what we wish to prove: that p_i recorded its state before e_i occurred. Consider the sequence of H messages m_1, m_2, ..., m_H (H ≥ 1), giving rise to the relation e_i → e_j. By FIFO ordering over the channels that these messages traverse, and by the marker sending and receiving rules, a marker message would have reached p_j ahead of each of m_1, m_2, ..., m_H. By the marker receiving rule, p_j would therefore have recorded its state before the event e_j. This contradicts our assumption that e_j is in the cut, and we are done.

We may further establish a reachability relation between the observed global state and the initial and final global states when the algorithm runs. Let Sys = e_0, e_1, ... be the linearization of the system as it executed (where two events occurred at exactly the same time, we order them according to process identifiers). Let S_init be the global state immediately before the first process recorded its state; let S_final be the global state when the snapshot algorithm terminates, immediately after the last state-recording action; and let S_snap be the recorded global state.

We shall find a permutation of Sys, Sys' = e'_0, e'_1, e'_2, ..., such that all three states S_init, S_snap and S_final occur in Sys', S_snap is reachable from S_init in Sys', and S_final is reachable from S_snap in Sys'. Figure 10.13 shows this situation, in which the upper linearization is Sys, and the lower linearization is Sys'.

Figure 10.13 Reachability between states in the snapshot algorithm
[Figure: the actual execution e_0, e_1, ... runs from S_init (recording begins) to S_final (recording ends); the permuted execution places the pre-snap events e'_0, e'_1, ..., e'_{R-1} before S_snap and the post-snap events e'_R, e'_{R+1}, ... after it.]

We derive Sys' from Sys by first categorising all events in Sys as pre-snap events or post-snap events. A pre-snap event at process p_i is one that occurred at p_i before it recorded its state; all other events are post-snap events. It is important to understand that a post-snap event may occur before a pre-snap event in Sys, if the events occur at different processes. (Of course no post-snap event may occur before a pre-snap event at the same process.)

We shall show how we may order all pre-snap events before post-snap events to obtain Sys'. Suppose that e_j is a post-snap event at one process, and e_{j+1} is a pre-snap event at a different process. It cannot be that e_j → e_{j+1}. For then these two events would be the sending and receiving of a message, respectively. A marker message would have to have preceded the message, making the reception of the message a post-snap event, but by assumption e_{j+1} is a pre-snap event. We may therefore swap the two events without violating the happened-before relation (that is, the resultant sequence of events remains a linearization). The swap does not introduce new process states, since we do not alter the order in which events occur at any individual process.

We continue swapping pairs of adjacent events in this way as necessary until we have ordered all pre-snap events e'_0, e'_1, e'_2, ..., e'_{R-1} prior to all post-snap events e'_R, e'_{R+1}, e'_{R+2}, ... with Sys' the resulting execution. For each process, the set of events in e'_0, e'_1, e'_2, ..., e'_{R-1} that occurred at it is exactly the set of events that it experienced before it recorded its state. Therefore the state of each process at that point, and the state of the communication channels, is that of the global state S_snap recorded by the algorithm. We have disturbed neither of the states S_init or S_final with which the linearization begins and ends. So we have established the reachability relationship.

Stability and the reachability of the observed state: The reachability property of the snapshot algorithm is useful for detecting stable predicates. In general, any non-stable predicate we establish as being True in the state S_snap may or may not have been True in the actual execution whose global state we recorded. However, if a stable predicate is True in the state S_snap then we may conclude that the predicate is True in the state S_final, since by definition a stable predicate that is True of a state S is also True of any state reachable from S. Similarly, if the predicate evaluates to False for S_snap, then it must also be False for S_init.

10.6 Distributed debugging

We now examine the problem of recording a system's global state so that we may make useful statements about whether a transitory state - as opposed to a stable state - occurred in an actual execution. This is what we require, in general, when debugging a distributed system. We gave an example above in which each of a set of processes p_i has a variable x_i. The safety condition required in this example is |x_i - x_j| ≤ δ (i, j = 1, 2, ..., N); this constraint is to be met even though a process may change the value of its variable at any time. Another example is a distributed system controlling a system of pipes in a factory where we are interested in whether all the valves (controlled by different processes) were open at some time. In these examples, we cannot in general observe the values of the variables or the states of the valves simultaneously. The challenge is to monitor the system's execution over time - to capture 'trace' information rather than a single snapshot - so that we can establish post hoc whether the required safety condition was or may have been violated.

Chandy and Lamport's snapshot algorithm collects state in a distributed fashion, and we pointed out how the processes in the system could send the state they gather to a monitor process for collection. The algorithm we shall describe (due to Marzullo and Neiger [1991]) is centralized. The observed processes send their states to a process called a monitor, which assembles globally consistent states from what it receives. We consider the monitor to lie outside the system, observing its execution.
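Before turning to the centralized approach, the distributed recording performed by the snapshot algorithm of Section 10.5.3 can be replayed in code. The following minimal Python sketch is ours rather than Chandy and Lamport's own formulation. It applies the marker rules of Figure 10.10 to the two-process widget example. The Process class, its method names, the dictionary encoding of process state and the use of one deque per channel are all assumptions made for illustration; failures, real concurrency and multiple initiators are ignored.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    """One participant in the snapshot; channels are FIFO deques shared by both ends."""

    def __init__(self, state, incoming, outgoing):
        self.state = state            # live application state
        self.incoming = incoming      # channel name -> deque of messages in transit to us
        self.outgoing = outgoing      # channel name -> deque of messages in transit from us
        self.recorded_state = None    # our state as captured by the snapshot
        self.channel_state = {}       # channel name -> finished recording for that channel
        self._recording = {}          # channel name -> messages seen since we recorded

    def record(self, marker_channel=None):
        """Record our own state, then apply the marker sending rule."""
        self.recorded_state = dict(self.state)
        for name in self.incoming:
            if name == marker_channel:
                self.channel_state[name] = []   # the marker's channel is recorded as empty
            else:
                self._recording[name] = []      # start recording the other channels
        for queue in self.outgoing.values():
            queue.append(MARKER)                # marker precedes any later message

    def send(self, channel, message):
        self.outgoing[channel].append(message)

    def receive(self, channel):
        """Take the next message off a channel, applying the marker receiving rule."""
        message = self.incoming[channel].popleft()
        if message == MARKER:
            if self.recorded_state is None:
                self.record(marker_channel=channel)
            else:
                self.channel_state[channel] = self._recording.pop(channel)
            return None
        if channel in self._recording:
            self._recording[channel].append(message)
        return message


def replay_widget_example():
    c1, c2 = deque(), deque()         # c1 carries p2 -> p1, c2 carries p1 -> p2
    p1 = Process({"money": 1000, "widgets": 0}, {"c1": c1}, {"c2": c2})
    p2 = Process({"money": 50, "widgets": 2000}, {"c2": c2}, {"c1": c1})

    p1.record()                           # p1 initiates the snapshot in state S0
    p1.state["money"] -= 100
    p1.send("c2", ("order", 10, 100))     # order for 10 widgets, enclosing $100
    p2.state["widgets"] -= 5
    p2.send("c1", ("widgets", 5))         # p2 answers p1's earlier order
    p1.state["widgets"] += p1.receive("c1")[1]
    p2.receive("c2")                      # marker: p2 records its state, c2 is empty
    p1.receive("c1")                      # marker: p1 closes its recording of c1
    return p1, p2


p1, p2 = replay_widget_example()
print(p1.recorded_state, p1.channel_state)
print(p2.recorded_state, p2.channel_state)
```

Under these assumptions the replay records p1 as <$1000, 0>, p2 as <$50, 1995>, c1 as the single widgets message and c2 as empty, which is the recorded state given in the worked example. The order message is still in transit on c2 when the replay stops.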
Our aim is to determine cases where a given global state predicate φ was definitely True at some point in the execution we observed, and cases where it was possibly True. The notion 'possibly' arises as a natural concept because we may extract a consistent global state S from an executing system and find that φ(S) is True. No single observation of a consistent global state allows us to conclude whether a non-stable predicate ever evaluated to True in the actual execution. Nevertheless, we may be interested to know whether they might have occurred, as far as we can tell by observing the execution.

The notion 'definitely' does apply to the actual execution and not to a run that we have extrapolated from it. It may sound paradoxical for us to consider what happened in an actual execution. However, it is possible to evaluate whether φ was definitely True by considering all linearizations of the observed events.

Figure 10.14 Vector timestamps and variable values for the execution of Figure 10.9
[Figure: timelines of p1 and p2 against physical time, with messages m1 and m2 and cuts C1 and C2. Events at p1: x1 = 1 at (1,0), x1 = 100 at (2,0), x1 = 105 at (3,0), x1 = 90 at (4,3). Events at p2: x2 = 100 at (2,1), x2 = 95 at (2,2), x2 = 90 at (2,3).]
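To make 'considering all linearizations' concrete, the following minimal Python sketch enumerates every ordering of a small set of events that respects happened-before, as determined from their vector timestamps. The event labels a to g are hypothetical; the timestamps are those we read from Figure 10.14, with a, b, c, d taken to be the events at p1 and e, f, g those at p2. Brute-force filtering of permutations is our own simplification and is only practical for a handful of events.

```python
from itertools import permutations


def happened_before(u, v):
    """Vector-timestamp test: u -> v iff u <= v componentwise and u != v."""
    return u != v and all(a <= b for a, b in zip(u, v))


def linearizations(events):
    """Yield each ordering of the events that is consistent with happened-before.

    events -- dict mapping an event label to its vector timestamp
    """
    ordered_pairs = [(a, b) for a in events for b in events
                     if happened_before(events[a], events[b])]
    for order in permutations(events):
        position = {label: i for i, label in enumerate(order)}
        if all(position[a] < position[b] for a, b in ordered_pairs):
            yield order


# Hypothetical labels: a-d are p1's events, e-g are p2's events.
EVENTS = {
    "a": (1, 0), "b": (2, 0), "c": (3, 0), "d": (4, 3),
    "e": (2, 1), "f": (2, 2), "g": (2, 3),
}

for order in linearizations(EVENTS):
    print(" ".join(order))
```

With these timestamps only event c is concurrent with e, f and g, so the sketch yields four linearizations, differing in where c falls relative to p2's events.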
We now define the notions of possibly φ and definitely φ for a predicate φ in terms of linearizations of H, the history of the system's execution.

possibly φ: The statement possibly φ means that there is a consistent global state S through which a linearization of H passes such that φ(S) is True.

definitely φ: The statement definitely φ means that for all linearizations L of H, there is a consistent global state S through which L passes such that φ(S) is True.

When we use Chandy and Lamport's snapshot algorithm and obtain the global state S_snap we may assert possibly φ if φ(S_snap) happens to be True. But in general evaluating possibly φ entails a search through all consistent global states derived from the observed execution. Only if φ(S) evaluates to False for all consistent global states S is it not the case that possibly φ. Note also that while we may conclude definitely (¬φ) from ¬possibly φ, we may not conclude ¬possibly φ from definitely (¬φ). The latter is the assertion that ¬φ holds at some state on every linearization: φ may hold for other states.

We now describe:

• how the process states are collected;

• how the monitor extracts consistent global states;

• how the monitor evaluates possibly φ and definitely φ in both asynchronous and synchronous systems.

Collecting the state: The observed processes p_i (i = 1, 2, ..., N) send their initial state to the monitor process initially, and thereafter from time to time, in state messages. The monitor process records the state messages from process p_i in a separate queue Q_i, for each i = 1, 2, ..., N.

The activity of preparing and sending state messages may delay the normal execution of the observed processes, but it does not otherwise interfere with it. There is no need to send the state except initially and when it changes. There are two optimizations to reduce the state-message traffic to the monitor. First, the global state predicate may depend only on certain parts of the processes' states. For example, it may depend only on the states of particular variables. So the observed processes need only send the relevant state to the monitor process. Second, they need only send their state at times when the predicate φ may become True or cease to be True. There is no point in sending changes to the state that do not affect the predicate's value.

For example, in the example system of processes p_i that are supposed to obey the constraint |x_i - x_j| ≤ δ (i, j = 1, 2, ..., N), the processes need only notify the monitor when the value of their own variable x_i changes. When they send their state, they supply the value of x_i but do not need to send any other variables.

10.6.1 Observing consistent global states

The monitor must assemble consistent global states against which it evaluates φ. Recall that a cut C is consistent if and only if for all events e in the cut C, f → e ⇒ f ∈ C.

For example, Figure 10.14 shows two processes p1 and p2 with variables x1 and x2, respectively. The events shown on the timelines (with vector timestamps) are adjustments to the values of the two variables. Initially, x1 = x2 = 0. The requirement is |x1 - x2| ≤ 50. The processes make adjustments to their variables, but 'large' adjustments cause a message containing the new value to be sent to the other process. When either of the processes receives an adjustment message from the other, it sets its variable equal to the value contained in the message.

Whenever one of the processes p1 or p2 adjusts the value of its variable (whether it is a 'small' adjustment or a 'large' one), it sends the value in a state message to the monitoring process. The latter keeps the state messages in the per-process queues for analysis. If the monitor process used values from the inconsistent cut C1 in Figure 10.14, then it would find that x1 = 1, x2 = 100, breaking the constraint |x1 - x2| ≤ 50. But this state of affairs never occurred. On the other hand, values from the consistent cut C2 show x1 = 105, x2 = 90.

In order that the monitor can distinguish consistent global states from inconsistent global states, the observed processes enclose their vector clock values with their state messages. Each queue Q_i is kept ordered in sending order, which can immediately be established by examining the ith component of the vector timestamps. Of course, the monitor process may deduce nothing about the ordering of states sent by different processes from their arrival order, because of variable message latencies. It must instead examine the vector timestamps of the state messages.

Let S = (s_1, s_2, ..., s_N) be a global state drawn from the state messages that the monitor process has received. Let V(s_i) be the vector timestamp of the state s_i received from p_i. Then it can be shown that S is a consistent global state if and only if:
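The condition itself does not survive in this copy of the text. It is commonly stated as V(s_i)[i] ≥ V(s_j)[i] for i, j = 1, 2, ..., N: no other process's recorded state reflects more of p_i's events than p_i's own recorded state does. Taking that form as an assumption, the following minimal Python sketch applies it to the two cuts discussed for Figure 10.14. The function name and the pairing of timestamps with variable values are our reading of the figure, not something the text spells out.

```python
def is_consistent_global_state(timestamps):
    """Check V(s_i)[i] >= V(s_j)[i] for every pair of processes i, j.

    timestamps -- list whose entry i is the vector timestamp of the state
                  taken from process i (0-based here, 1-based in the text)
    """
    n = len(timestamps)
    return all(timestamps[i][i] >= timestamps[j][i]
               for i in range(n) for j in range(n))


# Timestamps as we read them from Figure 10.14 (an assumption about the figure):
cut_c1 = [(1, 0), (2, 1)]   # x1 = 1,   x2 = 100
cut_c2 = [(3, 0), (2, 3)]   # x1 = 105, x2 = 90

print(is_consistent_global_state(cut_c1))  # False: p2's state reflects 2 events at p1, p1's own only 1
print(is_consistent_global_state(cut_c2))  # True
```

The check needs only the timestamps, which is why the observed processes enclose their vector clock values with each state message.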
Figure 10.15 The lattice of global states for the execution of Figure 10.14
Figure 10.16 Algorithms to evaluate possibly φ and definitely φ
requirement by contrast is to consider only those global states that the actual execution
10.7 Summary

This chapter began by describing the importance of accurate timekeeping for distributed
systems. It then described algorithms for synchronizing clocks despite the drift between
them and the variability of message delays between computers.
The degree of synchronization accuracy that is practically obtainable fulfils many
requirements but is nonetheless not sufficient to determine the ordering of an arbitrary
pair of events occurring at different computers. The happened-before relation is a partial
order on events that reflects a flow of information between them - within a process, or
via messages between processes. Some algorithms require events to be ordered in
happened-before order, for example successive updates made at separate copies of data.
Lamport clocks are counters that are updated in accordance with the happened-before
relationship between events. Vector clocks are an improvement on Lamport clocks,
because it is possible to determine by examining their vector timestamps whether two
events are ordered by happened-before or are concurrent.
We introduced the concepts of events, local and global histories, cuts, local and
global states, runs, consistent states, linearizations (consistent runs), and reachability. A
consistent state or run is one that is in accord with the happened-before relation.
We went on to consider the problem of recording a consistent global state by
observing a system's execution. Our objective was to evaluate a predicate on this state.
An important class of predicates are the stable predicates. We described the snapshot
algorithm of Chandy and Lamport, which captures a consistent global state and allows
us to make assertions about whether a stable predicate holds in the actual execution. We
went on to give Marzullo and Neiger's algorithm for deriving assertions about whether
a predicate held or may have held in the actual run. The algorithm employs a monitor
process to collect states. The monitor examines vector timestamps to extract consistent
global states, and it constructs and examines the lattice of all consistent global states.
This algorithm involves great computational complexity but is valuable for
understanding and can be of some practical benefit in real systems where relatively few
events change the global predicate's value. The algorithm has a more efficient variant
in synchronous systems, where clocks may be synchronized.

EXERCISES

10.1 Why is computer clock synchronization necessary? Describe the design requirements
for a system to synchronize the clocks in a distributed system.
page 386
10.2 A clock is reading 10:27:54.0 (hr:min:sec) when it is discovered to be 4 seconds fast.
Explain why it is undesirable to set it back to the right time at that point and show
(numerically) how it should be adjusted so as to be correct after 8 seconds has elapsed.
page 390
10.3 A scheme for implementing at-most-once reliable message delivery uses synchronized
clocks to reject duplicate messages. Processes place their local clock value (a
'timestamp') in the messages they send. Each receiver keeps a table giving, for each
sending process, the largest message timestamp it has seen. Assume that clocks are
synchronized to within 100 ms, and that messages can arrive at most 50 ms after
transmission.
(i) When may a process ignore a message bearing a timestamp $T$, if it has recorded
the last message received from that process as having timestamp $T'$?
(ii) When may a receiver remove a timestamp 175,000 (ms) from its table? (Hint: use
the receiver's local clock value.)
(iii) Should the clocks be internally synchronized or externally synchronized?
page 391
10.4 A client attempts to synchronize with a time server. It records the round-trip times and
timestamps returned by the server in the table below.
Which of these times should it use to set its clock? To what time should it set it? Estimate
the accuracy of the setting with respect to the server's clock. If it is known that the time
between sending and receiving a message in the system concerned is at least 8 ms, do
your answers change?

    Round-trip (ms)    Time (hr:min:sec)
    22                 10:54:23.674
    25                 10:54:25.450
    20                 10:54:28.342

page 391
10.5 In the system of Exercise 10.4 it is required to synchronize a file server's clock to within
±1 millisecond. Discuss this in relation to Cristian's algorithm.
page 391
10.6 What reconfigurations would you expect to occur in the NTP synchronization subnet?
page 394
10.7 An NTP server B receives server A's message at 16:34:23.480 bearing a timestamp
16:34:13.430 and replies to it. A receives the message at 16:34:15.725, bearing B's
timestamp 16:34:25.7. Estimate the offset between B and A and the accuracy of the
estimate.
page 395
10.8 Discuss the factors to be taken into account when deciding to which NTP server a client
should synchronize its clock.
page 396
10.9 Discuss how it is possible to compensate for clock drift between synchronization points
by observing the drift rate over time. Discuss any limitations to your method.
page 397
10.10 By considering a chain of zero or more messages connecting events $e$ and $e'$ and using
induction, show that $e \rightarrow e' \Rightarrow L(e) < L(e')$.
page 398
10.11 Show that $V_j[i] \le V_i[i]$.
page 399
10.12 In a similar fashion to Exercise 10.10, show that $e \rightarrow e' \Rightarrow V(e) < V(e')$.
page 400
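Exercises 10.10 and 10.12 both reason about how timestamps grow along a chain of messages. As a reminder of the rule being reasoned about, the following is a minimal sketch of a Lamport clock, assuming each process increments its counter before every event and that a receiver takes the maximum of its own value and the piggybacked value before counting the receive event. The class and method names are illustrative and do not come from the text.

```python
class LamportClock:
    """Logical clock for a single process (illustrative names)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Increment before every event at this process.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the new value travels with the message.
        return self.tick()

    def receive(self, msg_time):
        # Merge the piggybacked value, then count the receive event itself.
        self.time = max(self.time, msg_time)
        return self.tick()


p, q = LamportClock(), LamportClock()
a = p.tick()       # local event at p
b = p.send()       # p sends a message
c = q.receive(b)   # q receives it
print(a, b, c)
```

Each step along the chain strictly increases the timestamp, which is the property the induction in Exercise 10.10 relies on.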
10.13 Using the result of Exercise 10.11, show that if events $e$ and $e'$ are concurrent then
neither $V(e) \le V(e')$ nor $V(e') \le V(e)$. Hence show that if $V(e) < V(e')$ then $e \rightarrow e'$.
page 400
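The comparisons used in these exercises can be made concrete with a short sketch. It assumes the usual component-wise definitions: $V \le V'$ when every component of $V$ is no greater than the matching component of $V'$, and $V < V'$ when $V \le V'$ and $V \ne V'$. The function names are illustrative, and both timestamps are assumed to have the same length.

```python
def leq(v, w):
    """V <= W: every component of v is no greater than the matching one in w."""
    return all(x <= y for x, y in zip(v, w))


def less(v, w):
    """V < W: V <= W and the timestamps differ in at least one component."""
    return leq(v, w) and tuple(v) != tuple(w)


def relation(v, w):
    """Classify two events by their vector timestamps."""
    if tuple(v) == tuple(w):
        return "same"
    if less(v, w):
        return "happened-before"
    if less(w, v):
        return "happened-after"
    return "concurrent"


print(relation((2, 0, 0), (2, 1, 0)))
print(relation((1, 0, 0), (0, 0, 1)))
```

The first call reports happened-before and the second reports concurrent, since neither timestamp is less than or equal to the other in every component.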
10.14 Two processes P and Q are connected in a ring using two channels, and they constantly
rotate a message m. At any one time, there is only one copy of m in the system. Each
process's state consists of the number of times it has received m, and P sends m first. At
a certain point, P has the message and its state is 101. Immediately after sending m, P
initiates the snapshot algorithm. Explain the operation of the algorithm in this case,
giving the possible global state(s) reported by it.
page 405
[Figure: timelines of events at processes $p_1$ and $p_2$, with arrows between the timelines for messages; horizontal axis: Time]
10.15 The figure above shows events occurring for each of two processes, $p_1$ and $p_2$. Arrows
between processes denote message transmission.
Draw and label the lattice of consistent states ($p_1$ state, $p_2$ state), beginning with the
initial state (0,0).
page 412
10.16 Jones is running a collection of processes $p_1, p_2, \ldots, p_N$. Each process $p_i$ contains a
variable $v_i$. She wishes to determine whether all the variables $v_1, v_2, \ldots, v_N$ were ever
equal in the course of the execution.
(i) Jones' processes run in a synchronous system. She uses a monitor process to
determine whether the variables were ever equal. When should the application
processes communicate with the monitor process, and what should their messages
contain?
(ii) Explain the statement possibly ($v_1 = v_2 = \ldots = v_N$). How can Jones determine
whether this statement is true of her execution?
page 413

11 COORDINATION AND AGREEMENT

11.1 Introduction
11.2 Distributed mutual exclusion
11.3 Elections
11.4 Multicast communication
11.5 Consensus and related problems
11.6 Summary

In this chapter, we introduce some topics and algorithms related to the issue of how
processes coordinate their actions and agree on shared values in distributed systems,
despite failures. The chapter begins with algorithms to achieve mutual exclusion among
a collection of processes, so as to coordinate their accesses to shared resources. It goes
on to examine how an election can be implemented in a distributed system. That is, it
describes how a group of processes can agree on a new coordinator of their activities after
the previous coordinator has failed.
The second half examines the related problems of multicast communication.
consensus, byzantine agreement and interactive consistency. In multicast, the issue is
how to agree on such matters as the order in which messages are to be delivered.
Consensus and the other problems generalize from this: how can any collection of
processes agree on some value, no matter what the domain of the values in question? We
encounter a fundamental result in the theory of distributed systems: that under certain
conditions - including surprisingly benign failure conditions - it is impossible to
guarantee that processes will reach consensus.