10
TIME AND GLOBAL STATES

10.1 Introduction
10.2 Clocks, events and process states
10.3 Synchronizing physical clocks
10.4 Logical time and logical clocks
10.5 Global states
10.6 Distributed debugging
10.7 Summary

In this chapter, we introduce some topics related to the issue of time in distributed systems. Time is an important practical issue. For example, we require computers around the world to timestamp electronic commerce transactions consistently. Time is also an important theoretical construct in understanding how distributed executions unfold. But time is problematic in distributed systems. Each computer may have its own physical clock, but the clocks typically deviate, and we cannot synchronize them perfectly. We shall examine algorithms for synchronizing physical clocks approximately, and then go on to explain logical clocks, including vector clocks, which are a tool for ordering events without knowing precisely when they occurred.

The absence of global physical time makes it difficult to find out the state of our distributed programs as they execute. We often need to know what state process A is in when process B is in a certain state, but we cannot rely on physical clocks to know what is true at the same time. The second half of the chapter examines algorithms to determine global states of distributed computations despite the lack of global time.

10.1 Introduction

This chapter introduces fundamental concepts and algorithms related to monitoring distributed systems as their execution unfolds, and to timing the events that occur in their executions.

Time is an important and interesting issue in distributed systems, for several reasons. First, time is a quantity we often want to measure accurately. In order to know at what time of day a particular event occurred at a particular computer it is necessary to synchronize its clock with an authoritative, external source of time. For example, an e-commerce transaction involves events at a merchant's computer and at a bank's computer. It is important, for auditing purposes, that those events are timestamped accurately.

Second, algorithms that depend upon clock synchronization have been developed for several problems in distribution [Liskov 1993]. These include maintaining the consistency of distributed data (the use of timestamps to serialize transactions is discussed in Section 12.6); checking the authenticity of a request sent to a server (a version of the Kerberos authentication protocol, discussed in Chapter 7, depends on loosely synchronized clocks); and eliminating the processing of duplicate updates (see, for example, Ladin et al. [1992]).

Einstein demonstrated, in his Special Theory of Relativity, the intriguing consequences that follow from the observation that the speed of light is constant for all observers, regardless of their relative velocity. He proved from this assumption, among other things, that two events that are judged to be simultaneous in one frame of reference are not necessarily simultaneous according to observers in other frames of reference that are moving relative to it. For example, an observer on the Earth and an observer travelling away from the Earth in a spaceship will disagree on the time interval between events, the more so as their relative speed increases.

Moreover, the relative order of two events can even be reversed for two different observers. But this cannot happen if one event could have caused the other to occur. In that case, the physical effect follows the physical cause for all observers, although the time elapsed between cause and effect can vary. The timing of physical events was thus proved to be relative to the observer, and Newton's notion of absolute physical time was discredited. There is no special physical clock in the universe to which we can appeal when we want to measure intervals of time.

The notion of physical time is also problematic in a distributed system. This is not due to the effects of special relativity, which are negligible or non-existent for normal computers (unless one counts computers travelling in spaceships!). The problem is based on a similar limitation in our ability to timestamp events at different nodes sufficiently accurately to know the order in which any pair of events occurred, or whether they occurred simultaneously. There is no absolute, global time that we can appeal to. And yet we sometimes need to observe distributed systems and establish whether certain states of affairs occurred at the same time. For example, in object-oriented systems we need to be able to establish whether references to a particular object no longer exist - whether the object has become garbage (in which case we can free its memory). Establishing this requires observations of the states of processes (to find out whether they contain references) and of the communication channels between processes (in case messages containing references are in transit).

In the first half of this chapter, we examine methods whereby computer clocks can be approximately synchronized, using message passing. We go on to introduce logical clocks, including vector clocks, which are used to define an order of events without measuring the physical time at which they occurred.

In the second half, we describe algorithms whose purpose is to capture global states of distributed systems as they execute.

10.2 Clocks, events and process states

Chapter 2 presented an introductory model of interaction between the processes within a distributed system. We shall refine that model in order to help us to understand how to characterize the system's evolution as it executes, and how to timestamp the events in a system's execution that interest users. We begin by considering how to order and timestamp the events that occur at a single process.

We take a distributed system to consist of a collection ℘ of N processes pi, i = 1, 2, ..., N. Each process executes on a single processor, and the processors do not share memory (Chapter 16 considers the case of processes that share memory). Each process pi in ℘ has a state si which, in general, it transforms as it executes. The process's state includes the values of all the variables within it. Its state may also include the values of any objects in its local operating system environment that it affects, such as files. We assume that processes cannot communicate with one another in any way except by sending messages through the network. So, for example, if the processes operate robot arms connected to their respective nodes in the system, then they are not allowed to communicate by shaking one another's robot hands!

As each process pi executes it takes a series of actions, each of which is either a message send or receive operation, or an operation that transforms pi's state - one that changes one or more of the values in si. In practice, we may choose to use a high-level description of the actions, according to the application. For example, if the processes in ℘ are engaged in an e-commerce application, then the actions may be ones such as 'client dispatched order message' or 'merchant server recorded transaction to log'.

We define an event to be the occurrence of a single action that a process carries out as it executes - a communication action or a state-transforming action. The sequence of events within a single process pi can be placed in a single, total ordering, which we shall denote by the relation →i between the events. That is, e →i e' if and only if the event e occurs before e' at pi. This ordering is well defined, whether or not the process is multi-threaded, since we have assumed that the process executes on a single processor.

Now we can define the history of process pi to be the series of events that take place within it, ordered as we have described by the relation →i:

history(pi) = hi = <ei^0, ei^1, ei^2, ...>

Clocks □ We have seen how to order the events at a process but not how to timestamp them - to assign to them a date and time of day. Computers each contain their own physical clock.

Figure 10.1 Skew between computer clocks in a distributed system (two computers, each with its own clock, connected by a network)

These clocks are electronic devices that count oscillations occurring in a crystal at a definite frequency, and that typically divide this count and store the result in a counter register. Clock devices can be programmed to generate interrupts at regular intervals in order that, for example, timeslicing can be implemented; however, we shall not concern ourselves with this aspect of clock operation.

The operating system reads the node's hardware clock value Hi(t), scales it and adds an offset so as to produce a software clock Ci(t) = αHi(t) + β that approximately measures real, physical time t for process pi. In other words, when the real time in an absolute frame of reference is t, Ci(t) is the reading on the software clock. For example, Ci(t) could be the 64-bit value of the number of nanoseconds that have elapsed at time t since a convenient reference time. In general, the clock is not completely accurate, so Ci(t) will differ from t. Nonetheless, if Ci behaves sufficiently well (we shall examine the notion of clock correctness shortly), we can use its value to timestamp any event at pi. Note that successive events will correspond to different timestamps only if the clock resolution - the period between updates of the clock value - is smaller than the time interval between successive events. The rate at which events occur depends on such factors as the length of the processor instruction cycle.
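To make the relationship Ci(t) = αHi(t) + β concrete, the following Python sketch (not from the text; the class and the choice of hardware counter are illustrative assumptions) derives a software clock reading from a raw hardware counter in exactly this way:

    import time

    class SoftwareClock:
        """Software clock C(t) = alpha * H(t) + beta, derived from a raw
        hardware counter (an illustrative sketch only)."""

        def __init__(self, hardware_counter, alpha, beta=0.0):
            self.hardware_counter = hardware_counter  # callable returning ticks
            self.alpha = alpha                        # scale: seconds per tick
            self.beta = beta                          # offset in seconds

        def read(self):
            # The operating system scales the hardware reading and adds an offset.
            return self.alpha * self.hardware_counter() + self.beta

    # Example: treat the machine's monotonic nanosecond counter as H(t).
    clock = SoftwareClock(time.monotonic_ns, alpha=1e-9)
    print(clock.read())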

Clock skew and clock drift □ Computer clocks, like any others, tend not to be in perfect agreement (Figure 10.1). The instantaneous difference between the readings of any two clocks is called their skew. Also, the crystal-based clocks used in computers are, like any other clocks, subject to clock drift, which means that they count time at different rates, and so diverge. The underlying oscillators are subject to physical variations, with the consequence that their frequencies of oscillation differ. Moreover, even the same clock's frequency varies with temperature. Designs exist that attempt to compensate for this variation, but they cannot eliminate it. The difference in the oscillation period between two clocks might be extremely small, but the difference accumulated over many oscillations leads to an observable difference in the counters registered by two clocks, no matter how accurately they were initialized to the same value. A clock's drift rate is the change in the offset (difference in reading) between the clock and a nominal perfect reference clock per unit of time measured by the reference clock. For ordinary clocks based on a quartz crystal this is about 10^-6 seconds/second, giving a difference of 1 second every 1,000,000 seconds, or 11.6 days. The drift rate of 'high-precision' quartz clocks is about 10^-7 or 10^-8.

Coordinated Universal Time □ Computer clocks can be synchronized to external sources of highly accurate time. The most accurate physical clocks use atomic oscillators, whose drift rate is about one part in 10^13. The output of these atomic clocks is used as the standard for elapsed real time, known as International Atomic Time. Since 1967, the standard second has been defined as 9,192,631,770 periods of transition between the two hyperfine levels of the ground state of Caesium-133 (Cs133).

Seconds and years and other time units that we use are rooted in astronomical time. They were originally defined in terms of the rotation of the Earth on its axis and its rotation about the Sun. However, the period of the Earth's rotation about its axis is gradually getting longer, primarily because of tidal friction; atmospheric effects and convection currents within the Earth's core also cause short-term increases and decreases in the period. So astronomical time and atomic time have a tendency to get out of step.

Coordinated Universal Time - abbreviated as UTC (from the French equivalent) - is an international standard for timekeeping. It is based on atomic time, but a so-called leap second is inserted - or, more rarely, deleted - occasionally to keep in step with astronomical time. UTC signals are synchronized and broadcast regularly from land-based radio stations and satellites covering many parts of the world. For example, in the USA the radio station WWV broadcasts time signals on several shortwave frequencies. Satellite sources include the Global Positioning System (GPS).

Receivers are available commercially. Compared with 'perfect' UTC, the signals received from land-based stations have an accuracy in the order of 0.1-10 milliseconds, depending on the station used. Signals received from GPS are accurate to about 1 microsecond. Computers with receivers attached can synchronize their clocks with these timing signals. Computers may also receive the time to an accuracy of a few milliseconds over a telephone line, from organizations such as the National Institute for Standards and Technology in the USA.

10.3 Synchronizing physical clocks

In order to know at what time of day events occur at the processes in our distributed system ℘ - for example, for accountancy purposes - it is necessary to synchronize the processes' clocks Ci with an authoritative, external source of time. This is external synchronization. And if the clocks Ci are synchronized with one another to a known degree of accuracy, then we can measure the interval between two events occurring at different computers by appealing to their local clocks - even though they are not necessarily synchronized to an external source of time. This is internal synchronization. We define these two modes of synchronization more closely as follows, over an interval of real time I:

External synchronization: For a synchronization bound D > 0, and for a source S of UTC time, |S(t) - Ci(t)| < D for i = 1, 2, ..., N and for all real times t in I. Another way of saying this is that the clocks Ci are accurate to within the bound D.

Internal synchronization: For a synchronization bound D > 0, |Ci(t) - Cj(t)| < D for i, j = 1, 2, ..., N, and for all real times t in I. Another way of saying this is that the clocks Ci agree within the bound D.

Clocks that are internally synchronized are not necessarily externally synchronized, since they may drift collectively from an external source of time, even though they agree with one another.

However, it follows from the definitions that if the system ℘ is externally synchronized with a bound D, then the same system is internally synchronized with a bound of 2D.

Various notions of correctness for clocks have been suggested. It is common to define a hardware clock H to be correct if its drift rate falls within a known bound ρ > 0 (a value derived from one supplied by the manufacturer, such as 10^-6 seconds/second). This means that the error in measuring the interval between real times t and t' (t' > t) is bounded:

(1 - ρ)(t' - t) ≤ H(t') - H(t) ≤ (1 + ρ)(t' - t)

This condition forbids jumps in the value of hardware clocks (during normal operation). Sometimes we also require our software clocks to obey the condition, but a weaker condition of monotonicity may suffice. Monotonicity is the condition that a clock C only ever advances:

t' > t ⇒ C(t') > C(t)

For example, the UNIX make facility is a tool that is used to compile only those source files that have been modified since they were last compiled. The modification dates of each corresponding pair of source and object files are compared to determine this condition. If a computer whose clock was running fast set its clock back after compiling a source file but before the file was changed, the source file might appear to have been modified prior to the compilation. Erroneously, make will not recompile the source file.

We can achieve monotonicity despite the fact that a clock is found to be running fast. We need only change the rate at which updates are made to the time as given to applications. This can be achieved in software without changing the rate at which the underlying hardware clock ticks - recall that Ci(t) = αHi(t) + β, where we are free to choose the values of α and β.
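As a rough illustration of this idea - not an algorithm given in the text, and with all names invented - the sketch below amortizes a measured error over the interval until the next synchronization point by altering α and β, so that the clock seen by applications is continuous and never runs backwards (provided the interval is long enough that the adjusted rate stays positive):

    def slew(alpha, beta, h_now, reference_time, ticks_to_next_sync):
        """Return adjusted (alpha, beta) for the software clock C(H) = alpha*H + beta.
        The new parameters keep C continuous at the current hardware reading h_now
        and spread the correction towards reference_time over ticks_to_next_sync
        further hardware ticks, instead of stepping the clock."""
        c_now = alpha * h_now + beta                     # current application-visible time
        error = reference_time - c_now                   # discrepancy from the reference
        new_alpha = alpha + error / ticks_to_next_sync   # amortize the error, do not jump
        new_beta = c_now - new_alpha * h_now             # continuity: same reading at h_now
        return new_alpha, new_beta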
A hybrid correctness condition that is sometimes applied is to require that a clock obeys the monotonicity condition, and that its drift rate is bounded between synchronization points, but to allow the clock value to jump ahead at synchronization points.

A clock that does not keep to whatever correctness conditions apply is defined to be faulty. A clock's crash failure is said to occur when the clock stops ticking altogether; any other clock failure is an arbitrary failure. An example of an arbitrary failure is that of a clock with the 'Y2K bug', which breaks the monotonicity condition by registering the date after 31 December 1999 as 1 January 1900 instead of 2000; another example is a clock whose batteries are very low and whose drift rate suddenly becomes very large.

Note that clocks do not have to be accurate to be correct, according to the definitions. Since the goal may be internal rather than external synchronization, the criteria for correctness are only concerned with the proper functioning of the clock's 'mechanism', not its absolute setting.

We now describe algorithms for external synchronization and for internal synchronization.

Figure 10.2 Clock synchronization using a time server (a process p sends a request mr to the time server S, which replies with a message mt carrying its current clock reading t)

10.3.1 Synchronization in a synchronous system

We begin by considering the simplest possible case: that of internal synchronization between two processes in a synchronous distributed system. In a synchronous system, bounds are known for the drift rate of clocks, the maximum message transmission delay, and the time to execute each step of a process (see Section 2.3.1).

One process sends the time t on its local clock to the other in a message m. In principle, the receiving process could set its clock to the time t + Ttrans, where Ttrans is the time taken to transmit m between them. The two clocks would then agree (since the aim is internal synchronization, it does not matter whether the sending process's clock is accurate).

Unfortunately, Ttrans is subject to variation and is unknown. In general, other processes are competing for resources with the processes to be synchronized at their respective nodes, and other messages compete with m for the network. Nonetheless, there is always a minimum transmission time, min, that would be obtained if no other processes executed and no other network traffic existed; min can be measured or conservatively estimated.

In a synchronous system, by definition, there is also an upper bound max on the time taken to transmit any message. Let the uncertainty in the message transmission time be u, so that u = (max - min). If the receiver sets its clock to be t + min, then the clock skew may be as much as u, since the message may in fact have taken time max to arrive. Similarly, if it sets its clock to t + max, the skew may again be as large as u. If, however, it sets its clock to the half-way point, t + (max + min)/2, then the skew is at most u/2. In general, for a synchronous system, the optimum bound that can be achieved on clock skew when synchronizing N clocks is u(1 - 1/N) [Lundelius and Lynch 1984].
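A minimal sketch of this rule, assuming the bounds min and max are simply known values supplied by the caller:

    def synchronized_setting(t, minimum, maximum):
        """Value the receiver should adopt on receiving the sender's clock value t,
        in a synchronous system whose transmission delay lies in [minimum, maximum].
        Returns the setting and the worst-case skew (illustrative sketch only)."""
        u = maximum - minimum                  # uncertainty in the transmission time
        setting = t + (maximum + minimum) / 2  # the half-way point
        worst_case_skew = u / 2                # best achievable bound for two clocks
        return setting, worst_case_skew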
Most distributed systems found in practice are asynchronous: the factors leading to message delays are not bounded in their effect, and there is no upper bound max on message transmission delays. This is particularly so for the Internet. For an asynchronous system, we may say only that Ttrans = min + x, where x ≥ 0. The value of x is not known in a particular case, although a distribution of values may be measurable for a particular installation.

10.3.2 Cristian's method for synchronizing clocks

Cristian [1989] suggested the use of a time server, connected to a device that receives signals from a source of UTC, to synchronize computers externally. Upon request, the server process S supplies the time according to its clock, as shown in Figure 10.2.

Cristian observed that while there is no upper bound on message transmission delays in an asynchronous system, the round-trip times for messages exchanged between pairs of processes are often reasonably short - a small fraction of a second. He describes the algorithm as probabilistic: the method achieves synchronization only if the observed round-trip times between client and server are sufficiently short compared with the required accuracy.

A process p requests the time in a message mr, and receives the time value t in a message mt (t is inserted in mt at the last possible point before transmission from S's computer). Process p records the total round-trip time Tround taken to send the request mr and receive the reply mt. It can measure this time with reasonable accuracy if its rate of clock drift is small. For example, the round-trip time should be in the order of 1-10 milliseconds on a LAN, over which time a clock with a drift rate of 10^-6 seconds/second varies by at most 10^-5 milliseconds.

A simple estimate of the time to which p should set its clock is t + Tround/2, which assumes that the elapsed time is split equally before and after S placed t in mt. This is normally a reasonably accurate assumption, unless the two messages are transmitted over different networks. If the value of the minimum transmission time min is known or can be conservatively estimated, then we can determine the accuracy of this result as follows.

The earliest point at which S could have placed the time in mt was min after p dispatched mr. The latest point at which it could have done this was min before mt arrived at p. The time by S's clock when the reply message arrives is therefore in the range [t + min, t + Tround - min]. The width of this range is Tround - 2min, so the accuracy is ±(Tround/2 - min).

Variability can be dealt with to some extent by making several requests to S (spacing the requests so that transitory congestion can clear) and taking the minimum value of Tround to give the most accurate estimate. The greater the accuracy required, the smaller is the probability of achieving it. This is because the most accurate results are those in which both messages are transmitted in a time close to min - an unlikely event in a busy network.
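The arithmetic of the method can be sketched as follows; request_time_from_server is an invented placeholder for the exchange of mr and mt, not an interface defined in the text. Repeating the call and keeping the reply with the smallest round-trip time tightens the bound, as just described:

    import time

    def cristian_estimate(request_time_from_server, minimum=0.0):
        """Estimate the server's time in the style of Cristian's method.
        request_time_from_server() is assumed to send m_r, block, and return
        the server clock value t carried in m_t (a placeholder for this sketch)."""
        start = time.monotonic()
        t = request_time_from_server()      # server time from message m_t
        t_round = time.monotonic() - start  # measured round-trip time T_round

        estimate = t + t_round / 2          # assume the delay was split equally
        accuracy = t_round / 2 - minimum    # +/- bound when min is known
        return estimate, accuracy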
Discussion of Cristian's algorithm □ As described, Cristian's method suffers from the problem associated with all services implemented by a single server: the single time server might fail and thus render synchronization impossible temporarily. Cristian suggested, for this reason, that time should be provided by a group of synchronized time servers, each with a receiver for UTC time signals. For example, a client could multicast its request to all servers and use only the first reply obtained.

Note that a faulty time server that replied with spurious time values, or an imposter time server that replied with deliberately incorrect times, could wreak havoc in a computer system. These problems were beyond the scope of the work described by Cristian [1989], which assumes that sources of external time signals are self-checking. Cristian and Fetzer [1994] describe a family of probabilistic protocols for internal clock synchronization, each of which tolerates certain failures. Srikanth and Toueg [1987] first described an algorithm that is optimal with respect to the accuracy of the synchronized clocks, while tolerating some failures. Dolev et al. [1986] showed that if f is the number of faulty clocks out of a total of N, then we must have N > 3f if the other, correct, clocks are still to be able to achieve agreement. The problem of dealing with faulty clocks is partially addressed by the Berkeley algorithm, which is described next. The problem of malicious interference with time synchronization can be dealt with by authentication techniques.

10.3.3 The Berkeley algorithm

Gusella and Zatti [1989] describe an algorithm for internal synchronization that they developed for collections of computers running Berkeley UNIX. In it, a coordinator computer is chosen to act as the master. Unlike in Cristian's protocol, this computer periodically polls the other computers whose clocks are to be synchronized, called slaves. The slaves send back their clock values to it. The master estimates their local clock times by observing the round-trip times (similarly to Cristian's technique), and it averages the values obtained (including its own clock's reading). The balance of probabilities is that this average cancels out the individual clocks' tendencies to run fast or slow. The accuracy of the protocol depends upon a nominal maximum round-trip time between the master and the slaves. The master eliminates any occasional readings associated with larger times than this maximum.

Instead of sending the updated current time back to the other computers - which would introduce further uncertainty due to the message transmission time - the master sends the amount by which each individual slave's clock requires adjustment. This can be a positive or negative value.

The algorithm eliminates readings from faulty clocks. Such clocks could have a significant adverse effect if an ordinary average was taken, so the master takes a fault-tolerant average. That is, a subset of clocks is chosen that do not differ from one another by more than a specified amount, and the average is taken of readings from only these clocks.

Gusella and Zatti describe an experiment involving 15 computers whose clocks were synchronized to within about 20-25 milliseconds using their protocol. The local clocks' drift rates were measured to be less than 2x10^-5, and the maximum round-trip time was taken to be 10 milliseconds.

Should the master fail, then another can be elected to take over and function exactly as its predecessor. Section 11.3 discusses some general-purpose election algorithms. Note that these are not guaranteed to elect a new master in bounded time, and so the difference between two clocks would be unbounded if they were used.
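The sketch below illustrates one round of synchronization in this style. It is not the published algorithm: the way the agreeing subset is chosen here (around the median reading) is just one simple possibility, and all of the names are invented:

    def berkeley_round(master_time, slave_readings, max_round_trip, agreement_bound):
        """One master-driven round in the style of the Berkeley algorithm.
        slave_readings maps a slave id to (estimated_clock_value, round_trip_time),
        where the estimates are assumed already corrected for transmission delay
        as in Cristian's technique. Returns per-slave adjustments."""
        # Discard readings whose round-trip time exceeded the nominal maximum.
        usable = {sid: value for sid, (value, rtt) in slave_readings.items()
                  if rtt <= max_round_trip}
        usable['master'] = master_time

        # Fault-tolerant average: keep only clocks that agree with the median
        # reading to within the specified amount, and average those.
        values = sorted(usable.values())
        median = values[len(values) // 2]
        agreeing = [v for v in usable.values() if abs(v - median) <= agreement_bound]
        average = sum(agreeing) / len(agreeing)

        # Send each slave the (possibly negative) amount to adjust by,
        # rather than the new time itself.
        return {sid: average - v for sid, v in usable.items() if sid != 'master'}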
10.3.4 The Network Time Protocol

Cristian's method and the Berkeley algorithm are intended primarily for use within intranets. The Network Time Protocol (NTP) [Mills 1995] defines an architecture for a time service and a protocol to distribute time information over the Internet.

NTP's chief design aims and features are as follows:

To provide a service enabling clients across the Internet to be synchronized accurately to UTC: Despite the large and variable message delays encountered in Internet communication, NTP employs statistical techniques for the filtering of timing data and it discriminates between the quality of timing data from different servers.

To provide a reliable service that can survive lengthy losses of connectivity: There are redundant servers and redundant paths between the servers. The servers can reconfigure so as to continue to provide the service if one of them becomes unreachable.

To enable clients to resynchronize sufficiently frequently to offset the rates of drift found in most computers: The service is designed to scale to large numbers of clients and servers.

To provide protection against interference with the time service, whether malicious or accidental: The time service uses authentication techniques to check that timing data originate from the claimed trusted sources. It also validates the return addresses of messages sent to it.

Figure 10.3 An example synchronization subnet in an NTP implementation (arrows denote synchronization control, numbers denote strata)

The NTP service is provided by a network of servers located across the Internet. Primary servers are connected directly to a time source such as a radio clock receiving UTC; secondary servers are synchronized, ultimately, with primary servers. The servers are connected in a logical hierarchy called a synchronization subnet (see Figure 10.3), whose levels are called strata. Primary servers occupy stratum 1: they are at the root. Stratum 2 servers are secondary servers that are synchronized directly with the primary servers; stratum 3 servers are synchronized with stratum 2 servers, and so on. The lowest-level (leaf) servers execute in users' workstations.

The clocks belonging to servers with high stratum numbers are liable to be less accurate than those with low stratum numbers, because errors are introduced at each level of synchronization. NTP also takes into account the total message round-trip delays to the root in assessing the quality of timekeeping data held by a particular server.

The synchronization subnet can reconfigure as servers become unreachable or failures occur. If, for example, a primary server's UTC source fails, then it can become a stratum 2 secondary server. If a secondary server's normal source of synchronization fails or becomes unreachable, then it may synchronize with another server.

NTP servers synchronize with one another in one of three modes: multicast, procedure-call and symmetric mode. Multicast mode is intended for use on a high-speed LAN. One or more servers periodically multicasts the time to the servers running in other computers connected by the LAN, which set their clocks assuming a small delay. This mode can achieve only relatively low accuracies, but ones that nonetheless are considered sufficient for many purposes.

Procedure-call mode is similar to the operation of Cristian's algorithm, described above. In this mode, one server accepts requests from other computers, which it processes by replying with its timestamp (current clock reading). This mode is suitable where higher accuracies are required than can be achieved with multicast - or where multicast is not supported in hardware. For example, file servers on the same or a neighbouring LAN, which need to keep accurate timing information for file accesses, could contact a local server in procedure-call mode.

Finally, symmetric mode is intended for use by the servers that supply time information in LANs and by the higher levels (lower strata) of the synchronization subnet, where the highest accuracies are to be achieved. A pair of servers operating in symmetric mode exchange messages bearing timing information. Timing data are retained as part of an association between the servers that is maintained in order to improve the accuracy of their synchronization over time.

Figure 10.4 Messages exchanged between a pair of NTP peers (messages m and m' between servers A and B, with timestamps Ti-3, Ti-2, Ti-1 and Ti)

In all modes, messages are delivered unreliably, using the standard UDP Internet transport protocol. In procedure-call mode and symmetric mode, processes exchange pairs of messages. Each message bears timestamps of recent message events: the local times when the previous NTP message between the pair was sent and received, and the local time when the current message was transmitted. The recipient of the NTP message notes the local time when it receives the message. The four times Ti-3, Ti-2, Ti-1 and Ti are shown in Figure 10.4 for the messages m and m' sent between servers A and B. Note that in symmetric mode, unlike in Cristian's algorithm described above, there can be a non-negligible delay between the arrival of one message and the dispatch of the next. Also, messages may be lost, but the three timestamps carried by each message are nonetheless valid.

For each pair of messages sent between two servers, NTP calculates an offset oi, which is an estimate of the actual offset between the two clocks, and a delay di, which is the total transmission time for the two messages. If the true offset of the clock at B relative to that at A is o, and if the actual transmission times for m and m' are t and t' respectively, then we have:

Ti-2 = Ti-3 + t + o and Ti = Ti-1 + t' - o

This leads to:

di = t + t' = Ti-2 - Ti-3 + Ti - Ti-1

Also:

o = oi + (t' - t)/2, where oi = (Ti-2 - Ti-3 + Ti-1 - Ti)/2

Using the fact that t, t' ≥ 0, it can be shown that oi - di/2 ≤ o ≤ oi + di/2. Thus oi is an estimate of the offset, and di is a measure of the accuracy of this estimate.
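In code, the two estimates are a few lines of arithmetic over the four timestamps (a sketch; the timestamps are passed in as plain numbers):

    def ntp_offset_and_delay(t_i_minus_3, t_i_minus_2, t_i_minus_1, t_i):
        """Compute the offset estimate o_i and total delay d_i from the timestamps
        of a message pair m, m' exchanged between servers A and B:
        T_(i-3): m sent by A (A's clock),  T_(i-2): m received by B (B's clock),
        T_(i-1): m' sent by B (B's clock), T_i: m' received by A (A's clock)."""
        delay = (t_i_minus_2 - t_i_minus_3) + (t_i - t_i_minus_1)         # d_i = t + t'
        offset = ((t_i_minus_2 - t_i_minus_3) + (t_i_minus_1 - t_i)) / 2  # o_i
        return offset, delay

    # The true offset o then lies in the interval [offset - delay/2, offset + delay/2].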
NTP servers apply a data filtering algorithm to successive pairs <oi, di>, which estimates the offset o and calculates the quality of this estimate as a statistical quantity called the filter dispersion. A relatively high filter dispersion represents relatively unreliable data. The eight most recent pairs <oi, di> are retained. As with Cristian's algorithm, the value of oi that corresponds to the minimum value di is chosen to estimate o.

The value of the offset derived from communication with a single source is not necessarily used by itself to control the local clock, however. In general, an NTP server engages in message exchanges with several of its peers. In addition to data filtering applied to exchanges with each single peer, NTP applies a peer-selection algorithm. This examines the values obtained from exchanges with each of several peers, looking for relatively unreliable values. The output from this algorithm may cause a server to change the peer that it primarily uses for synchronization.

Peers with lower stratum numbers are more favoured than those in higher strata because they are 'closer' to the primary time sources. Also, those with the lowest synchronization dispersion are relatively favoured. This is the sum of the filter dispersions measured between the server and the root of the synchronization subnet. (Peers exchange synchronization dispersions in messages, allowing this total to be calculated.)

NTP employs a phase lock loop model [Mills 1995], which modifies the local clock's update frequency in accordance with observations of its drift rate. To take a simple example, if a clock is discovered always to gain time at the rate of, say, four seconds per hour, then its frequency can be reduced slightly (in software or hardware) to compensate for this. The clock's drift in the intervals between synchronization is thus reduced.

Mills quotes synchronization accuracies in the order of tens of milliseconds over Internet paths, and one millisecond on LANs.

Figure 10.5 Events occurring at three processes (events a and b occur at p1, c and d at p2, and e and f at p3; p1 sends message m1 to p2, and p2 sends message m2 to p3)

10.4 Logical time and logical clocks

From the point of view of any single process, events are ordered uniquely by times shown on the local clock. However, as Lamport [1978] pointed out, since we cannot synchronize clocks perfectly across a distributed system, we cannot in general use physical time to find out the order of any arbitrary pair of events occurring within it.

In general, we can use a scheme that is similar to physical causality, but that applies in distributed systems, to order some of the events that occur at different processes. This ordering is based on two simple and intuitively obvious points:

If two events occurred at the same process pi (i = 1, 2, ..., N), then they occurred in the order in which pi observes them - this is the order →i that we defined above.

Whenever a message is sent between processes, the event of sending the message occurred before the event of receiving the message.

Lamport called the partial ordering obtained by generalizing these two relationships the happened-before relation. It is also sometimes known as the relation of causal ordering or potential causal ordering.

We can define the happened-before relation, denoted by →, as follows:

HB1: If ∃ process pi: e →i e', then e → e'.

HB2: For any message m, send(m) → receive(m) - where send(m) is the event of sending the message, and receive(m) is the event of receiving it.

HB3: If e, e' and e'' are events such that e → e' and e' → e'', then e → e''.

Thus, if e and e' are events, and if e → e', then we can find a series of events e1, e2, ..., en occurring at one or more processes such that e = e1 and e' = en, and for i = 1, 2, ..., n-1 either HB1 or HB2 applies between ei and ei+1. That is, either they occur in succession at the same process, or there is a message m such that ei = send(m) and ei+1 = receive(m). The sequence of events e1, e2, ..., en need not be unique.

The relation → is illustrated for the case of three processes p1, p2 and p3 in Figure 10.5. It can be seen that a → b, since the events occur in this order at process p1 (a →1 b), and similarly c → d. Furthermore, b → c, since these events are the sending and reception of message m1, and similarly d → f.

Combining these relations, we may also say that, for example, a → f.

It can also be seen from Figure 10.5 that not all events are related by the relation →. For example, a ↛ e and e ↛ a, since they occur at different processes, and there is no chain of messages intervening between them. We say that events such as a and e that are not ordered by → are concurrent, and write this a || e.

The relation → captures a flow of data intervening between two events. Note, however, that in principle data can flow in ways other than by message passing. For example, if Smith enters a command to his process to send a message, then telephones Jones, who commands her process to issue another message, then the issuing of the first message clearly happened-before that of the second. Unfortunately, since no network messages were sent between the issuing processes, we cannot model this type of relationship in our system.

Another point to note is that if the happened-before relation holds between two events, then the first might or might not actually have caused the second. For example, if a server receives a request message and subsequently sends a reply, then clearly the reply transmission is caused by the request transmission. However, the relation → captures only potential causality, and two events can be related by → even though there is no real connection between them. A process might, for example, receive a message and subsequently issue another message, but one that it issues every five minutes anyway and bears no specific relation to the first message. No actual causality has been involved, but the relation → would order these events.

Logical clocks □ Lamport invented a simple mechanism by which the happened-before ordering can be captured numerically, called a logical clock. A Lamport logical clock is a monotonically increasing software counter, whose value need bear no particular relationship to any physical clock. Each process pi keeps its own logical clock, Li, which it uses to apply so-called Lamport timestamps to events. We denote the timestamp of event e at pi by Li(e), and by L(e) we denote the timestamp of event e at whatever process it occurred.

To capture the happened-before relation →, processes update their logical clocks and transmit the values of their logical clocks in messages as follows:

LC1: Li is incremented before each event is issued at process pi: Li := Li + 1.

LC2: (a) When a process pi sends a message m, it piggybacks on m the value t = Li.
(b) On receiving (m, t), a process pj computes Lj := max(Lj, t) and then applies LC1 before timestamping the event receive(m).

Although we increment clocks by 1, we could have chosen any positive value. It can easily be shown, by induction on the length of any sequence of events relating two events e and e', that e → e' ⇒ L(e) < L(e').

Note that the converse is not true. If L(e) < L(e'), then we cannot infer that e → e'. In Figure 10.6 we illustrate the use of logical clocks for the example given in Figure 10.5. Each of the processes p1, p2 and p3 has its logical clock initialized to 0. The clock values given are those immediately after the event to which they are adjacent. Note that, for example, L(b) > L(e) but b || e.

Figure 10.6 Lamport timestamps for the events shown in Figure 10.5 (a and b at p1 are timestamped 1 and 2; c and d at p2 are timestamped 3 and 4; e and f at p3 are timestamped 1 and 5)
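Rules LC1 and LC2 translate almost directly into code. The following sketch is illustrative only; the class and method names are invented:

    class LamportClock:
        """A Lamport logical clock implementing rules LC1 and LC2."""

        def __init__(self):
            self.value = 0

        def local_event(self):
            self.value += 1                  # LC1: increment before each event
            return self.value                # timestamp of this event

        def send(self):
            self.value += 1                  # LC1 for the send event
            return self.value                # LC2(a): piggyback t = L_i on the message

        def receive(self, t):
            self.value = max(self.value, t)  # LC2(b): merge with the piggybacked value
            self.value += 1                  # then apply LC1 before timestamping receive(m)
            return self.value

    # A total order can be obtained by pairing timestamps with process identifiers
    # and comparing the pairs (T_i, i), as the text describes next.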
Totally ordered logical clocks □ Some pairs of distinct events, generated by different processes, have numerically identical Lamport timestamps. However, we can create a total order on events - that is, one for which all pairs of distinct events are ordered - by taking into account the identifiers of the processes at which events occur. If e is an event occurring at pi with local timestamp Ti, and e' is an event occurring at pj with local timestamp Tj, we define the global logical timestamps for these events to be (Ti, i) and (Tj, j) respectively. And we define (Ti, i) < (Tj, j) if and only if either Ti < Tj, or Ti = Tj and i < j. This ordering has no general physical significance (because process identifiers are arbitrary), but it is sometimes useful. Lamport used it, for example, to order the entry of processes to a critical section.

Vector clocks □ Mattern [1989] and Fidge [1991] developed vector clocks to overcome the shortcoming of Lamport's clocks: the fact that from L(e) < L(e') we cannot conclude that e → e'. A vector clock for a system of N processes is an array of N integers. Each process keeps its own vector clock Vi, which it uses to timestamp local events. Like Lamport timestamps, processes piggyback vector timestamps on the messages they send to one another, and there are simple rules for updating the clocks as follows:

VC1: Initially, Vi[j] = 0, for i, j = 1, 2, ..., N.

VC2: Just before pi timestamps an event, it sets Vi[i] := Vi[i] + 1.

VC3: pi includes the value t = Vi in every message it sends.

VC4: When pi receives a timestamp t in a message, it sets Vi[j] := max(Vi[j], t[j]), for j = 1, 2, ..., N. Taking the component-wise maximum of two vector timestamps in this way is known as a merge operation.

For a vector clock Vi, Vi[i] is the number of events that pi has timestamped, and Vi[j] (j ≠ i) is the number of events that have occurred at pj that pi has potentially been affected by. (Process pj may have timestamped more events by this point, but no information has flowed to pi about them in messages as yet.)

Figure 10.7 Vector timestamps for the events shown in Figure 10.5 (a = (1,0,0) and b = (2,0,0) at p1; c = (2,1,0) and d = (2,2,0) at p2; e = (0,0,1) and f = (2,2,2) at p3)

We may compare vector timestamps as follows:

V = V' iff V[j] = V'[j] for j = 1, 2, ..., N
V ≤ V' iff V[j] ≤ V'[j] for j = 1, 2, ..., N
V < V' iff V ≤ V' ∧ V ≠ V'

Let V(e) be the vector timestamp applied by the process at which e occurs. It is straightforward to show, by induction on the length of any sequence of events relating two events e and e', that e → e' ⇒ V(e) < V(e'). Exercise 10.13 leads the reader to show the converse: if V(e) < V(e'), then e → e'.

Figure 10.7 shows the vector timestamps of the events of Figure 10.5. It can be seen, for example, that V(a) < V(f), which reflects the fact that a → f. Similarly, we can tell when two events are concurrent by comparing their timestamps. For example, that c || e can be seen from the facts that neither V(c) ≤ V(e) nor V(e) ≤ V(c).
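Rules VC1-VC4, together with the comparison just defined, can be sketched as follows (illustrative only; the names are invented):

    class VectorClock:
        """Vector clock of process i in a system of n processes (rules VC1-VC4)."""

        def __init__(self, i, n):
            self.i = i
            self.v = [0] * n                 # VC1: initially every entry is zero

        def event(self):
            self.v[self.i] += 1              # VC2: tick own entry before timestamping
            return list(self.v)              # timestamp of this event

        def send(self):
            return self.event()              # VC3: piggyback t = V_i on the message

        def receive(self, t):
            self.v = [max(a, b) for a, b in zip(self.v, t)]   # VC4: merge
            return self.event()              # then timestamp the receive event

    def leq(v, w):
        return all(a <= b for a, b in zip(v, w))   # V <= V'

    def concurrent(v, w):
        return not leq(v, w) and not leq(w, v)     # neither V(e) <= V(e') nor V(e') <= V(e)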
Vector timestamps have the disadvantage, compared with Lamport timestamps, of taking up an amount of storage and message payload that is proportional to N, the number of processes. Charron-Bost [1991] showed that, if we are to be able to tell whether or not two events are concurrent by inspecting their timestamps, then the dimension N is unavoidable. However, techniques exist for storing and transmitting smaller amounts of data, at the expense of the processing required to reconstruct complete vectors. Raynal and Singhal [1996] give an account of some of these techniques. They also describe the notion of matrix clocks, whereby processes keep estimates of other processes' vector times as well as their own.

10.5 Global states

In this and the next section we shall examine the problem of finding out whether a particular property is true of a distributed system as it executes. We begin by giving the examples of distributed garbage collection, deadlock detection, termination detection and debugging.

Figure 10.8 Detecting global properties (a. Garbage collection: objects and references at p1 and p2, with one reference in a message in transit; b. Deadlock: p1 and p2 each wait for a message from the other; c. Termination: p1 and p2 are both passive, with an activating request in transit)

Distributed garbage collection: An object is considered to be garbage if there are no longer any references to it anywhere in the distributed system. The memory taken up by that object can be reclaimed once it is known to be garbage. To check that an object is garbage, we must verify that there are no references to it anywhere in the system. In Figure 10.8a, process p1 has two objects that both have references - one has a reference within p1 itself, and p2 has a reference to the other. Process p2 has one garbage object, with no references to it anywhere in the system. It also has an object for which neither p1 nor p2 has a reference, but there is a reference to it in a message that is in transit between the processes. This shows that when we consider properties of a system, we must include the state of communication channels as well as the state of the processes.

Distributed deadlock detection: A distributed deadlock occurs when each of a collection of processes waits for another process to send it a message, and where there is a cycle in the graph of this 'waits-for' relationship. Figure 10.8b shows that each of processes p1 and p2 waits for a message from the other, so this system will never make progress.

Distributed termination detection: The problem here is to detect that a distributed algorithm has terminated. Detecting termination is a problem that sounds deceptively easy to solve: it seems at first only necessary to test whether each process has halted. To see that this is not so, consider a distributed algorithm executed by two processes p1 and p2, each of which may request values from the other. Instantaneously, we may find that a process is either active or passive - a passive process is not engaged in any activity of its own but is prepared to respond with a value requested by the other. Suppose we discover that p1 is passive and that p2 is passive (Figure 10.8c).

To see that we may not conclude that the algorithm has terminated, consider the following scenario: when we tested p1 for passivity, a message was on its way from p2, which became passive immediately after sending it. On receipt of the message, p1 became active again - after we found it to be passive. The algorithm had not terminated.

The phenomena of termination and deadlock are similar in some ways, but they are different problems. First, a deadlock may affect only a subset of the processes in a system, whereas all processes must have terminated. Second, process passivity is not the same as waiting in a deadlock cycle: a deadlocked process is attempting to perform a further action, for which another process waits; a passive process is not engaged in any activity.

Distributed debugging: Distributed systems are complex to debug [Bonnaire et al. 1995], and care needs to be taken in establishing what occurred during the execution. For example, Smith has written an application in which each process pi contains a variable xi (i = 1, 2, ..., N). The variables change as the program executes, but they are required always to be within a value δ of one another. Unfortunately, there is a bug in the program, and she suspects that under certain circumstances |xi - xj| > δ for some i and j, breaking her consistency constraints. Her problem is that this relationship must be evaluated for values of the variables that occur at the same time.

Each of the problems above has specific solutions tailored to it; but they all illustrate the need to observe a global state, and so motivate a general approach.

10.5.1 Global states and consistent cuts

It is possible in principle to observe the succession of states of an individual process, but the question of how to ascertain a global state of the system - the state of the collection of processes - is much harder to address.

The essential problem is the absence of global time. If all processes had perfectly synchronized clocks, then we could agree on a time at which each process would record its state - the result would be an actual global state of the system. From the collection of process states we could tell, for example, whether the processes were deadlocked. But we cannot achieve perfect clock synchronization, so this method is not available to us.

So we might ask whether we can assemble a meaningful global state from local states recorded at different real times. The answer is a qualified 'yes', but in order to see this we first introduce some definitions.

Let us return to our general system ℘ of N processes pi (i = 1, 2, ..., N), whose execution we wish to study. We said above that a series of events occurs at each process, and that we may characterize the execution of each process by its history:

history(pi) = hi = <ei^0, ei^1, ei^2, ...>

Similarly, we may consider any finite prefix of the process's history:

hi^k = <ei^0, ei^1, ..., ei^k>

Each event is either an internal action of the process (for example, the updating of one of its variables) or it is the sending or receipt of a message over the communication channels that connect the processes.

In principle, we can record what occurred in ℘'s execution. Each process can record the events that take place there, and the succession of states it passes through. We denote by si^k the state of process pi immediately before the kth event occurs, so that si^0 is the initial state of pi. We noted in the examples above that the state of the communication channels is sometimes relevant. Rather than introducing a new type of state, we make the processes record the sending or receipt of all messages as part of their state. If we find that process pi has recorded that it sent a message m to process pj (i ≠ j), then by examining whether pj has received that message we can infer whether or not m is part of the state of the channel between pi and pj.

We can also form the global history of ℘ as the union of the individual process histories:

H = h0 ∪ h1 ∪ ... ∪ hN-1

Mathematically, we can take any set of states of the individual processes to form a global state S = (s1, s2, ..., sN). But which global states are meaningful - that is, which process states could have occurred at the same time? A global state corresponds to initial prefixes of the individual process histories. A cut of the system's execution is a subset of its global history that is a union of prefixes of process histories:

C = h1^c1 ∪ h2^c2 ∪ ... ∪ hN^cN

The state si in the global state S corresponding to the cut C is that of pi immediately after the last event processed by pi in the cut - ei^ci (i = 1, 2, ..., N). The set of events {ei^ci : i = 1, 2, ..., N} is called the frontier of the cut.

Figure 10.9 Cuts (events at p1 and p2, with messages m1 and m2; one cut is inconsistent, the other consistent)

Consider the events occurring at processes p1 and p2 shown in Figure 10.9. The figure shows two cuts, one with frontier <e1^0, e2^0> and another with frontier <e1^2, e2^2>. The leftmost cut is inconsistent. This is because at p2 it includes the receipt of the message m1, but at p1 it does not include the sending of that message. This is showing an 'effect' without a 'cause'. The actual execution never was in a global state corresponding to the process states at that frontier, and we can in principle tell this by examining the → relation between events. By contrast, the rightmost cut is consistent.

It includes both the sending and the receipt of message m1. It includes the sending but not the receipt of message m2. That is consistent with the actual execution - after all, the message took some time to arrive.

A cut C is consistent if, for each event it contains, it also contains all the events that happened-before that event:

For all events e ∈ C, f → e ⇒ f ∈ C
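Because a cut already contains, for each process, a prefix of that process's history, this condition amounts to checking that the cut contains no receipt of a message whose sending it omits. A sketch, using an invented representation of cuts and messages:

    def is_consistent(cut, messages):
        """cut[i] is the number of events of process i included in the cut
        (events are counted from 1). messages is a list of ((i, s), (j, r))
        pairs, meaning that event s of process i is send(m) and event r of
        process j is the matching receive(m). Illustrative sketch only."""
        for (i, s), (j, r) in messages:
            receive_included = r <= cut[j]
            send_included = s <= cut[i]
            if receive_included and not send_included:
                return False        # an 'effect' without its 'cause'
        return True

    # The inconsistent cut of Figure 10.9 includes the receipt of m1 at p2 but not
    # its sending at p1, so a check of this kind would reject it.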
turns on recording of messages arriving over other incoming channels:
A consistelll global stale is one that corresponds to a consistent cut. We may else
characterize the execution of a distributed system as a series of transitions between Pi records the state of c as the set of messages it has received over c
global states of the system: since it saved its state.
end !f
5 0 -75 1 -+5 2 -7.
Marker sending rule for process Pi
In each transition. precisely one event occurs at some single process in the system. This After Pi has recorded its state, for each outgoing channel c:
event is either the sending of a message. the receipt of a message. or an internal eyent. Pi sends one marker message over c
If two events happened simultaneously. we may nonetheless deem them to haye (before it sends any other me.<;sagc over c).
occurred in a definite order ~ say ordered according to process identifiers. (Events th;:n
OCCur simultaneously must be COncurrent: neither happened-before the other.) A system
evolves in this \~-ay through consistent global stares.
A 1"1111 is a total ordering of all the events in a global history that is consistent \\ ith So be the original state of the system. S(!fery with respect to a is the assenion that a
each local history" s ordering, -t i (i = 1, 2, .... .Y 1. A. lincari:;.arioIJ or COl1siSfenf rim is eyaluates to False for all states 5 reachable from SO, Conversely, let !3 be a desirable
an ordering of the events in a global history that is .:onsistent with this happened-before propeny of a system's global state - for example, the property of reaching termination.
relation -t on H. Note that a linearization is also a run. LiI'eness with respect to !3 is the propen.Y that. for any linc,,1.rization Lstarting in the state
Not all runs pass through consistent global states. but all linearizations pass only SQ. !3 e\'aluates to True for some state 5 L reachable from SQ.
through consistent global states. We say that a stat~ 5' is reachable from a state 5 if there
is a linearization that passes through 5 and then 5'.
Sometimes we may alter the ordering of concurrent events within a linearization. 10.5.3 The 'snapshot' algorithm of Chandy and Lamport
and derive a run that still passes through only consistent global states. For exampk. if Chandy and Lamport! 1985J describe a 'snapshot" algorithm for derennining global
{\\'O successiw eyents in a linearization are the re.:eipt of messages by two proce,>se~.
states of distributed systems. which we now present. The goal of the algorithm is to
then \ve may s\\-ap the order of these two events. record a set of process and channel states (a 'snapshot') for a set of processes Pi
(i ::: i. 2.. ,X) such that. even though the combination of recorded states may never
have occurred at the same time, the recorded global Slate is consistent.
10.5.2 Global state predicates, stability. safety and liveness
We shall see that the state that the snapshot algorithm records has convenient
Detecting a condition such as deadlock or termim:tion amounts to evaluating a globed properties for e\'a!uating stable global predicates.
srate predicarc. A global state predicate is a funclion that maps from the set of global The algorithm records state localiy at processes: it does not give a method for
states of processes in the system SO to {True, Fw'-,<,). One of the useful characteri<..tics gathering the global state at one site. An obvious method for gathering the state is for all
of the predicates associated with the state of an obje.:t being garbage. of the system being processes to send the state they recorded to a designated collector process, but we shall
deadlocked or the system being terminated is thm they arc all srablc: once the sy"wm not address this issue further hcre.
enlers a state in which the predicate is True, it rerm:ins True in all future states reachable The algorithm assumes that:
from that state. By contrast. \\-hen we monitor Of debug an application we are often
interested in non-stable predicates. such as that 10 our example of variables \\ hose neither channels nor processes tail: communication is reliable so that every
difference is supposed to be bounded. Even if the application reaches a state in \,.-hich message sent is eventually recein'd intact. exactly once:
the bound obtains, it need not stay in that state.
channels are unidirectional and pro\'ide FIFO-ordered message delivery:
We also note here two further notions re!e\ant to global state predicates: safet~
and liveness. Suppose there is an undesirable property (J. that is a predicate of the the graph of processes and channels is strongly connected (there is a path bet\veen
system's global state ~ for example, a could be the property of being deadlocked. Let any two processes):
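The two rules of Figure 10.10 translate almost directly into code. The sketch below is ours, written in Python under the assumptions just listed: the class name, the MARKER value and the send callback are illustrative choices rather than part of the algorithm's definition, and the separate problem of gathering the recorded values at a collector process is left aside, as in the text.

    MARKER = object()   # a message value distinct from every application message

    class SnapshotProcess:
        """One observed process; send(channel, message) is supplied by the system."""

        def __init__(self, incoming, outgoing, send):
            self.incoming = list(incoming)    # names of incoming channels
            self.outgoing = list(outgoing)    # names of outgoing channels
            self.send = send
            self.state = None                 # application state, maintained elsewhere
            self.recorded_state = None        # process state saved for the snapshot
            self.channel_state = {}           # channel name -> recorded messages
            self.recording = set()            # incoming channels still being recorded

        def initiate_snapshot(self):
            # Acts as though a marker arrived over a non-existent channel.
            self._record_state_and_send_markers()
            self.recording = set(self.incoming)

        def receive(self, channel, message):
            if message is MARKER:
                self._marker_receiving_rule(channel)
            elif channel in self.recording:
                # An application message that arrived after this process recorded
                # its own state: it belongs to the recorded state of the channel.
                self.channel_state[channel].append(message)

        def _marker_receiving_rule(self, channel):
            if self.recorded_state is None:
                self._record_state_and_send_markers()
                self.channel_state[channel] = []                 # state of c: empty
                self.recording = set(self.incoming) - {channel}  # record the others
            else:
                # Stop recording c: its state is whatever arrived since this
                # process saved its own state.
                self.recording.discard(channel)

        def _record_state_and_send_markers(self):
            self.recorded_state = self.state
            for c in self.incoming:
                self.channel_state.setdefault(c, [])
            # Marker sending rule: one marker per outgoing channel, before any
            # further application message is sent on that channel.
            for c in self.outgoing:
                self.send(c, MARKER)

A test harness that delivers messages in FIFO order per channel can drive receive() directly; once every incoming channel has left the recording set, the process holds its recorded state and one recorded message sequence per incoming channel.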

For each process p_i, let the incoming channels be those at p_i over which other processes send it messages; similarly, p_i's outgoing channels are those on which it sends messages to other processes. The essential idea of the algorithm is as follows. Each process records its state and also, for each incoming channel, a set of messages sent to it. The process records, for each channel, any messages that arrived after it recorded its state and before the sender recorded its own state. This arrangement allows us to record the states of processes at different times but to account for the differentials between process states in terms of messages transmitted but not yet received. If process p_i has sent a message m to process p_j, but p_j has not received it, then we account for m as belonging to the state of the channel between them.

The algorithm proceeds through use of special marker messages, which are distinct from any other messages the processes send, and which the processes may send and receive while they proceed with their normal execution. The marker has a dual role: as a prompt for the receiver to save its own state, if it has not already done so; and as a means of determining which messages to include in the channel state.

The algorithm is defined through two rules, the marker receiving rule and the marker sending rule (Figure 10.10). The marker sending rule obligates processes to send a marker after they have recorded their state, but before they send any other messages.

The marker receiving rule obligates a process that has not recorded its state to do so. In that case, this is the first marker that it has received. It notes which messages subsequently arrive on the other incoming channels. When a process that has already saved its state receives a marker (on another channel), it records the state of that channel as the set of messages it has received on it since it saved its state.

Any process may begin the algorithm at any time. It acts as though it has received a marker (over a non-existent channel) and follows the marker receiving rule. Thus it records its state and begins to record messages arriving over all its incoming channels. Several processes may initiate recording concurrently in this way (as long as the markers they use can be distinguished).

Figure 10.11 Two processes and their initial states
    [p1: account = $1000, widgets = 0; p2: account = $50, widgets = 2000. Channel c2 carries messages from p1 to p2 and channel c1 carries messages from p2 to p1; both channels are initially empty.]

Figure 10.12 The execution of the processes in Figure 10.11
    [Four actual global states. 1. S0: p1 <$1000, 0>, p2 <$50, 2000>, c1 and c2 empty. 2. S1: p1 <$900, 0>, p2 <$50, 2000>, c2 carries M followed by (Order 10, $100), c1 empty. 3. S2: p1 <$900, 0>, p2 <$50, 1995>, c2 carries M and (Order 10, $100), c1 carries (five widgets). 4. S3: p1 <$900, 5>, p2 <$50, 1995>, c2 carries (Order 10, $100), c1 empty. (M = marker message.)]

We illustrate the algorithm for a system of two processes, p1 and p2, connected by two unidirectional channels, c1 and c2. The two processes trade in 'widgets'. Process p1 sends orders for widgets over c2 to p2, enclosing payment at the rate of $10 per widget. Some time later, process p2 sends widgets along channel c1 to p1. The processes have the initial states shown in Figure 10.11. Process p2 has already received an order for five widgets, which it will shortly dispatch to p1.

Figure 10.12 shows an execution of the system while the state is recorded. Process p1 records its state in the actual global state S0, when p1's state is <$1000, 0>. Following the marker sending rule, process p1 then emits a marker message over its outgoing channel c2 before it sends the next application-level message, (Order 10, $100), over channel c2. The system enters actual global state S1.

Before p2 receives the marker, it emits an application message (five widgets) over c1 in response to p1's previous order, yielding a new actual global state S2.

Now process p1 receives p2's message (five widgets), and p2 receives the marker. Following the marker receiving rule, p2 records its state as <$50, 1995> and that of channel c2 as the empty sequence. Following the marker sending rule, it sends a marker message over c1.

When process p1 receives p2's marker message, it records the state of channel c1 as the single message (five widgets) that it received after it first recorded its state. The final actual global state is S3.

The final recorded state is p1: <$1000, 0>; p2: <$50, 1995>; c1: <(five widgets)>; c2: < >. Note that this state differs from all the global states through which the system actually passed.
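Although the recorded state never actually occurred, a quick check shows that it is a plausible state of the trading system: it conserves both the money and the widgets in circulation. The few lines below simply restate the figures from the example; the dictionary layout is our own.

    # Recorded snapshot: process states plus the messages recorded on the channels.
    p1 = {"account": 1000, "widgets": 0}
    p2 = {"account": 50, "widgets": 1995}
    c1 = [("widgets", 5)]        # the message (five widgets) recorded on channel c1
    c2 = []                      # channel c2 was recorded as empty

    total_dollars = p1["account"] + p2["account"]
    total_widgets = p1["widgets"] + p2["widgets"] + sum(n for _, n in c1)

    assert total_dollars == 1050    # $1000 + $50, as in the initial state of Figure 10.11
    assert total_widgets == 2000    # 0 + 2000 widgets in the initial state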

Termination of the snapshot algorithm ◊ We assume that a process that has received a marker message records its state within a finite time and sends marker messages over each outgoing channel within a finite time (even if it no longer needs to send application messages over these channels). If there is a path of communication channels and processes from a process p_i to a process p_j (j ≠ i), then it is clear on these assumptions that p_j will record its state a finite time after p_i recorded its state. Since we are assuming the graph of processes and channels to be strongly connected, it follows that all processes will have recorded their states and the states of incoming channels a finite time after some process initially records its state.

Characterizing the observed state ◊ The snapshot algorithm selects a cut from the history of the execution. The cut, and therefore the state recorded by this algorithm, is consistent. To see this, let e_i and e_j be events occurring at p_i and p_j, respectively, such that e_i → e_j. We assert that if e_j is in the cut then e_i is in the cut. That is, if e_j occurred before p_j recorded its state, then e_i must have occurred before p_i recorded its state. This is obvious if the two processes are the same, so we shall assume that j ≠ i. Assume, for the moment, the opposite of what we wish to prove: that p_i recorded its state before e_i occurred. Consider the sequence of H messages m1, m2, ..., mH (H ≥ 1) giving rise to the relation e_i → e_j. By FIFO ordering over the channels that these messages traverse, and by the marker sending and receiving rules, a marker message would have reached p_j ahead of each of m1, m2, ..., mH. By the marker receiving rule, p_j would therefore have recorded its state before the event e_j. This contradicts our assumption that e_j is in the cut, and we are done.

We may further establish a reachability relation between the observed global state and the initial and final global states when the algorithm runs. Let Sys = e0, e1, ... be the linearization of the system as it executed (where two events occurred at exactly the same time, we order them according to process identifiers). Let S_init be the global state immediately before the first process recorded its state; let S_final be the global state when the snapshot algorithm terminates, immediately after the last state-recording action; and let S_snap be the recorded global state.

We shall find a permutation of Sys, Sys' = e'0, e'1, e'2, ..., such that all three states S_init, S_snap and S_final occur in Sys', S_snap is reachable from S_init in Sys', and S_final is reachable from S_snap in Sys'. Figure 10.13 shows this situation, in which the upper linearization is Sys and the lower linearization is Sys'.

Figure 10.13 Reachability between states in the snapshot algorithm
    [The actual execution e0, e1, ... runs from S_init, where recording begins, to S_final, where recording ends; the permuted linearization Sys' orders the pre-snap events e'0, e'1, ..., e'(R-1) before the post-snap events e'R, e'(R+1), ..., and passes through S_snap.]

We derive Sys' from Sys by first categorizing all events in Sys as pre-snap events or post-snap events. A pre-snap event at process p_i is one that occurred at p_i before it recorded its state; all other events are post-snap events. It is important to understand that a post-snap event may occur before a pre-snap event in Sys, if the events occur at different processes. (Of course, no post-snap event may occur before a pre-snap event at the same process.)

We shall show how we may order all pre-snap events before post-snap events to obtain Sys'. Suppose that e_j is a post-snap event at one process, and e_(j+1) is a pre-snap event at a different process. It cannot be that e_j → e_(j+1), for then these two events would be the sending and receiving of a message, respectively. A marker message would have to have preceded the message, making the reception of the message a post-snap event, but by assumption e_(j+1) is a pre-snap event. We may therefore swap the two events without violating the happened-before relation (that is, the resultant sequence of events remains a linearization). The swap does not introduce new process states, since we do not alter the order in which events occur at any individual process.

We continue swapping pairs of adjacent events in this way as necessary until we have ordered all pre-snap events e'0, e'1, ..., e'(R-1) prior to all post-snap events e'R, e'(R+1), e'(R+2), ..., with Sys' the resulting execution. For each process, the set of events in e'0, e'1, ..., e'(R-1) that occurred at it is exactly the set of events that it experienced before it recorded its state. Therefore the state of each process at that point, and the state of the communication channels, is that of the global state S_snap recorded by the algorithm. We have disturbed neither of the states S_init or S_final with which the linearization begins and ends, so we have established the reachability relationship.

Stability and the reachability of the observed state ◊ The reachability property of the snapshot algorithm is useful for detecting stable predicates. In general, any non-stable predicate we establish as being True in the state S_snap may or may not have been True in the actual execution whose global state we recorded. However, if a stable predicate is True in the state S_snap, then we may conclude that the predicate is True in the state S_final, since by definition a stable predicate that is True of a state S is also True of any state reachable from S. Similarly, if the predicate evaluates to False for S_snap, then it must also be False for S_init.

10.6 Distributed debugging

We now examine the problem of recording a system's global state so that we may make useful statements about whether a transitory state - as opposed to a stable state - occurred in an actual execution. This is what we require, in general, when debugging a distributed system. We gave an example above in which each of a set of processes p_i has a variable x_i. The safety condition required in this example is |x_i - x_j| ≤ δ (i, j = 1, 2, ..., N); this constraint is to be met even though a process may change the value of its variable at any time. Another example is a distributed system controlling a system of pipes in a factory, where we are interested in whether all the valves (controlled by different processes) were open at some time. In these examples, we cannot in general observe the values of the variables or the states of the valves simultaneously. The challenge is to monitor the system's execution over time - to capture 'trace' information rather than a single snapshot - so that we can establish post hoc whether the required safety condition was or may have been violated.

Chandy and Lamport's snapshot algorithm collects state in a distributed fashion, and we pointed out how the processes in the system could send the state they gather to a monitor process for collection. The algorithm we shall describe (due to Marzullo and Neiger [1991]) is centralized. The observed processes send their states to a process called a monitor, which assembles globally consistent states from what it receives. We consider the monitor to lie outside the system, observing its execution.

Our aim is to determine cases where a given global state predicate φ was definitely True at some point in the execution we observed, and cases where it was possibly True. The notion 'possibly' arises as a natural concept because we may extract a consistent global state S from an executing system and find that φ(S) is True. No single observation of a consistent global state allows us to conclude whether a non-stable predicate ever evaluated to True in the actual execution. Nevertheless, we may be interested to know whether it might have done, as far as we can tell by observing the execution.

The notion 'definitely' does apply to the actual execution and not to a run that we have extrapolated from it. It may sound paradoxical for us to consider what happened in an actual execution. However, it is possible to evaluate whether φ was definitely True by considering all linearizations of the observed events.

We now define the notions of possibly φ and definitely φ for a predicate φ in terms of linearizations of H, the history of the system's execution.

possibly φ: The statement possibly φ means that there is a consistent global state S through which a linearization of H passes such that φ(S) is True.

definitely φ: The statement definitely φ means that, for all linearizations L of H, there is a consistent global state S through which L passes such that φ(S) is True.

When we use Chandy and Lamport's snapshot algorithm and obtain the global state S_snap, we may assert possibly φ if φ(S_snap) happens to be True. But in general evaluating possibly φ entails a search through all consistent global states derived from the observed execution. Only if φ(S) evaluates to False for all consistent global states S is it not the case that possibly φ. Note also that while we may conclude definitely(¬φ) from ¬(possibly φ), we may not conclude ¬(possibly φ) from definitely(¬φ). The latter is the assertion that ¬φ holds at some state on every linearization; φ may hold for other states.

We now describe:

• how the process states are collected;
• how the monitor extracts consistent global states;
• how the monitor evaluates possibly φ and definitely φ, in both asynchronous and synchronous systems.

Collecting the state ◊ The observed processes p_i (i = 1, 2, ..., N) send their initial state to the monitor process initially, and thereafter from time to time, in state messages. The monitor process records the state messages from process p_i in a separate queue Q_i, for each i = 1, 2, ..., N.

The activity of preparing and sending state messages may delay the normal execution of the observed processes, but it does not otherwise interfere with it. There is no need to send the state except initially and when it changes. There are two optimizations to reduce the state-message traffic to the monitor. First, the global state predicate may depend only on certain parts of the processes' states - for example, only on the states of particular variables - so the observed processes need only send the relevant state to the monitor process. Second, they need only send their state at times when the predicate φ may become True or cease to be True. There is no point in sending changes to the state that do not affect the predicate's value.

For example, in the example system of processes p_i that are supposed to obey the constraint |x_i - x_j| ≤ δ (i, j = 1, 2, ..., N), the processes need only notify the monitor when the value of their own variable x_i changes. When they send their state, they supply the value of x_i but do not need to send any other variables.

10.6.1 Observing consistent global states

The monitor must assemble consistent global states against which it evaluates φ. Recall that a cut C is consistent if and only if, for all events e in the cut C, f → e ⇒ f ∈ C.

Figure 10.14 Vector timestamps and variable values for the execution of Figure 10.9
    [p1's events carry vector timestamps (1,0), (2,0), (3,0) and (4,3) and set x1 to 1, 100, 105 and 90 respectively; p2's events carry timestamps (2,1), (2,2) and (2,3) and set x2 to 100, 95 and 90; the messages m1 (from p1 to p2) and m2 (from p2 to p1) carry the 'large' adjustments; cut C1 is inconsistent and cut C2 is consistent.]

For example, Figure 10.14 shows two processes p1 and p2 with variables x1 and x2, respectively. The events shown on the timelines (with vector timestamps) are adjustments to the values of the two variables. Initially, x1 = x2 = 0. The requirement is |x1 - x2| ≤ 50. The processes make adjustments to their variables, but 'large' adjustments cause a message containing the new value to be sent to the other process. When either of the processes receives an adjustment message from the other, it sets its variable equal to the value contained in the message.

Whenever one of the processes p1 or p2 adjusts the value of its variable (whether it is a 'small' adjustment or a 'large' one), it sends the value in a state message to the monitoring process. The latter keeps the state messages in per-process queues for analysis. If the monitor process used values from the inconsistent cut C1 in Figure 10.14, then it would find that x1 = 1 and x2 = 100, breaking the constraint |x1 - x2| ≤ 50. But this state of affairs never occurred. On the other hand, values from the consistent cut C2 show x1 = 105, x2 = 90.

In order that the monitor can distinguish consistent global states from inconsistent global states, the observed processes enclose their vector clock values with their state messages. Each queue Q_i is kept ordered in sending order, which can immediately be established by examining the ith component of the vector timestamps. Of course, the monitor process may deduce nothing about the ordering of states sent by different processes from their arrival order, because of variable message latencies. It must instead examine the vector timestamps of the state messages.

Let S = (s_1, s_2, ..., s_N) be a global state drawn from the state messages that the monitor process has received. Let V(s_i) be the vector timestamp of the state s_i received from p_i. Then it can be shown that S is a consistent global state if and only if:

    V(s_i)[i] ≥ V(s_j)[i]  for i, j = 1, 2, ..., N     (Condition CGS)

This says that the number of p_i's events known at p_j when it sent s_j is no more than the number of events that had occurred at p_i when it sent s_i. In other words, if one process's state depends upon another (according to happened-before ordering), then the global state also encompasses the state upon which it depends.

In summary, we now possess a method whereby the monitor process may establish whether a given global state is consistent, using the vector timestamps kept by the observed processes and piggybacked on the state messages that they send to it.
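Condition CGS is mechanical to check. The sketch below assumes that the monitor has already paired each candidate local state with the vector timestamp (a list of N integers) carried by its state message; the function and variable names are ours.

    def satisfies_cgs(vector_timestamps):
        """vector_timestamps[i] is V(s_i), the vector timestamp attached to the
        local state chosen for the i-th process (0-based indexing).  Returns
        True iff V(s_i)[i] >= V(s_j)[i] for all i, j - that is, condition CGS."""
        n = len(vector_timestamps)
        return all(vector_timestamps[i][i] >= vector_timestamps[j][i]
                   for i in range(n) for j in range(n))

    # Two pairings drawn from Figure 10.14.  Pairing p1's state timestamped (2,0)
    # with p2's state timestamped (2,1) is consistent.  Pairing p1's state at (1,0)
    # with p2's state at (2,1) corresponds to the inconsistent cut C1: p2 already
    # knows of two of p1's events, while the chosen state of p1 reflects only one.
    print(satisfies_cgs([[2, 0], [2, 1]]))   # True
    print(satisfies_cgs([[1, 0], [2, 1]]))   # False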
Figure 10.15 The lattice of global states for the execution of Figure 10.14
    [Levels 0 to 7 of the lattice, with S00 at level 0 and S10 at level 1; S_ij denotes the global state after i events at process 1 and j events at process 2.]

Figure 10.15 shows the lattice of consistent global states corresponding to the execution of the two processes in Figure 10.14. This structure captures the relation of reachability between consistent global states. The nodes denote global states, and the edges denote possible transitions between these states. The global state S00 has both processes in their initial state; S10 has p2 still in its initial state and p1 in the next state in its local history. The state S01 is not consistent, because of the message m1 sent from p1 to p2, so it does not appear in the lattice.

The lattice is arranged in levels with, for example, S00 in level 0 and S10 in level 1. In general, S_ij is in level (i + j). A linearization traverses the lattice from any global state to any global state reachable from it on the next level - that is, in each step some process experiences one event. For example, S22 is reachable from S20, but S22 is not reachable from S30.

The lattice shows us all the linearizations corresponding to a history. It is now clear in principle how a monitor process should evaluate possibly φ and definitely φ. To evaluate possibly φ, the monitor process starts at the initial state and steps through all consistent states reachable from that point, evaluating φ at each stage. It stops when φ evaluates to True. To evaluate definitely φ, the monitor process must attempt to find a set of states through which all linearizations must pass, and at each of which φ evaluates to True. For example, if φ(S30) and φ(S21) in Figure 10.15 are both True then, since all linearizations pass through these states, definitely φ holds.

Figure 10.16 Algorithms to evaluate possibly φ and definitely φ

    1. Evaluating possibly φ for global history H of N processes
       L := 0;
       States := { (s1^0, s2^0, ..., sN^0) };
       while (φ(S) = False for all S ∈ States)
         L := L + 1;
         Reachable := { S' : S' reachable in H from some S ∈ States ∧ level(S') = L };
         States := Reachable
       end while
       output "possibly φ";

    2. Evaluating definitely φ for global history H of N processes
       L := 0;
       if (φ(s1^0, s2^0, ..., sN^0)) then States := { } else States := { (s1^0, s2^0, ..., sN^0) };
       while (States ≠ { })
         L := L + 1;
         Reachable := { S' : S' reachable in H from some S ∈ States ∧ level(S') = L };
         States := { S ∈ Reachable : φ(S) = False }
       end while
       output "definitely φ"

10.6.2 Evaluating possibly φ

To evaluate possibly φ, the monitor process must traverse the lattice of reachable states, starting from the initial state (s1^0, s2^0, ..., sN^0). The algorithm is shown in Figure 10.16. The algorithm assumes that the execution is infinite. It may easily be adapted for a finite execution.

The monitor process may discover the set of consistent states in level L + 1 reachable from a given consistent state in level L by the following method. Let S = (s_1, s_2, ..., s_N) be a consistent state. Then a consistent state in the next level reachable from S is of the form S' = (s_1, s_2, ..., s_i', ..., s_N), which differs from S only by containing the next state (after a single event) of some process p_i. The monitor can find all such states by traversing the queues of state messages Q_i (i = 1, 2, ..., N). The state S' is reachable from S if and only if:

    for j = 1, 2, ..., N, j ≠ i:  V(s_j)[j] ≥ V(s_i')[j]

This condition comes from condition CGS above and from the fact that S was already a consistent global state. A given state may in general be reached from several states at the previous level, so the monitor process should take care to evaluate the consistency of each state only once.
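For a finite observed execution, the traversal described by Figure 10.16 can be sketched as follows. The sketch assumes that the monitor has received, for each process, its complete sequence of (state, vector timestamp) pairs in sending order; the helper names are ours, and for brevity it re-checks the full condition CGS rather than the cheaper incremental test just given. The definitely φ variant (algorithm 2) differs only in retaining, at each level, the reachable states for which φ is False and reporting success once that set becomes empty.

    def satisfies_cgs(clocks):
        # Condition CGS: V(s_i)[i] >= V(s_j)[i] for all i, j.
        n = len(clocks)
        return all(clocks[i][i] >= clocks[j][i] for i in range(n) for j in range(n))

    def possibly(phi, queues):
        """queues[i]: the list of (state, vector_timestamp) pairs sent by the i-th
        observed process, in sending order.  Walks the lattice of consistent
        global states level by level, as in algorithm 1 of Figure 10.16."""
        n = len(queues)
        states = {tuple(0 for _ in range(n))}   # level 0: each process's first state
        while states:
            # Evaluate phi on every consistent global state at the current level.
            for idx in states:
                if phi(*[queues[i][idx[i]][0] for i in range(n)]):
                    return True
            # Build the next level: advance exactly one process by one state and
            # keep only consistent combinations.  Using a set means that a state
            # reachable from several predecessors is considered only once.
            next_level = set()
            for idx in states:
                for i in range(n):
                    if idx[i] + 1 < len(queues[i]):
                        cand = idx[:i] + (idx[i] + 1,) + idx[i + 1:]
                        if satisfies_cgs([queues[k][cand[k]][1] for k in range(n)]):
                            next_level.add(cand)
            states = next_level
        return False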

10.6.3 Evaluating definitely φ

To evaluate definitely φ, the monitor process again traverses the lattice of reachable states a level at a time, starting from the initial state (s1^0, s2^0, ..., sN^0). The algorithm (shown in Figure 10.16) again assumes that the execution is infinite but may easily be adapted for a finite execution. It maintains the set States, which contains those states at the current level that may be reached on a linearization from the initial state by traversing only states for which φ evaluates to False. As long as such a linearization exists, we may not assert definitely φ: the execution could have taken this linearization, and φ would be False at every stage along it. If we reach a level for which no such linearization exists, we may conclude definitely φ.

Figure 10.17 Evaluating definitely φ
    [A lattice of global states in levels 0 to 5; each state is marked F (φ(S) = False) or T (φ(S) = True), and the linearization that passes only through states for which φ is False is marked in bold.]

In Figure 10.17, at level 3 the set States consists of only one state, which is reachable by a linearization on which all states are False (marked in bold lines). The only state considered at level 4 is the one marked 'F'. (The state to its right is not considered, since it can only be reached via a state for which φ evaluates to True.) If φ evaluates to True in the state at level 5, then we may conclude definitely φ. Otherwise, the algorithm must continue beyond this level.

Cost ◊ The algorithms we have just described are combinatorially explosive. Suppose that k is the maximum number of events at a single process. Then the algorithms we have described entail O(k^N) comparisons (the monitor process compares the states of each of the N observed processes with one another).

There is also a space cost to these algorithms of O(kN). However, we observe that the monitor process may delete a message containing state s_i from queue Q_i when no other item of state arriving from another process could possibly be involved in a consistent global state containing s_i. That is, when:

    V(s_j^last)[i] > V(s_i)[i]  for j = 1, 2, ..., N, j ≠ i

where s_j^last is the last state that the monitor process has received from process p_j.

10.6.4 Evaluating possibly φ and definitely φ in synchronous systems

The algorithms we have given so far work in an asynchronous system: we have made no timing assumptions. But the price paid for this is that the monitor may examine a consistent global state S = (s_1, s_2, ..., s_N) for which any two local states s_i and s_j occurred an arbitrarily long time apart in the actual execution of the system. Our requirement, by contrast, is to consider only those global states that the actual execution could in principle have traversed.

In a synchronous system, suppose that the processes keep their physical clocks internally synchronized within a known bound, and that the observed processes provide physical timestamps as well as vector timestamps in their state messages. Then the monitor process need consider only those consistent global states whose local states could possibly have existed simultaneously, given the approximate synchronization of the clocks. With good enough clock synchronization, these will number many fewer than all globally consistent states.

We now give an algorithm to exploit synchronized clocks in this way. We assume that each observed process p_i (i = 1, 2, ..., N) and the monitor process, which we shall call p_0, keeps a physical clock C_i (i = 0, 1, ..., N). These are synchronized to within a known bound D > 0; that is, at the same real time:

    |C_i(t) - C_j(t)| < D  for i, j = 0, 1, ..., N

The observed processes send both their vector time and physical time with their state messages to the monitor process. The monitor process now applies a condition that not only tests for consistency of a global state S = (s_1, s_2, ..., s_N), but also tests whether each pair of states could have happened at the same real time, given the physical clock values. In other words, for i, j = 1, 2, ..., N:

    V(s_i)[i] ≥ V(s_j)[i]  and  s_i and s_j could have occurred at the same real time.

The first clause is the condition that we used earlier. For the second clause, note that p_i is in the state s_i from the time it first notifies the monitor process, C_i(s_i), to some later local time L_i(s_i), say, when the next state transition occurs at p_i. For s_i and s_j to have obtained at the same real time we thus have, allowing for the bound on clock synchronization:

    C_i(s_i) - D ≤ C_j(s_j) ≤ L_i(s_i) + D  - or vice versa (swapping i and j)

The monitor process must calculate a value for L_i(s_i), which is measured against p_i's clock. If the monitor process has received a state message for p_i's next state s_i', then L_i(s_i) is C_i(s_i'). Otherwise, the monitor process estimates L_i(s_i) as C_0 - max + D, where C_0 is the monitor's current local clock value and max is the maximum transmission time for a state message.

10.7 Summary

This chapter began by describing the importance of accurate timekeeping for distributed systems. It then described algorithms for synchronizing clocks despite the drift between them and the variability of message delays between computers.

The degree of synchronization accuracy that is practically obtainable fulfils many requirements but is nonetheless not sufficient to determine the ordering of an arbitrary pair of events occurring at different computers. The happened-before relation is a partial order on events that reflects a flow of information between them - within a process, or via messages between processes. Some algorithms require events to be ordered in happened-before order, for example successive updates made at separate copies of data. Lamport clocks are counters that are updated in accordance with the happened-before relationship between events. Vector clocks are an improvement on Lamport clocks, because it is possible to determine by examining their vector timestamps whether two events are ordered by happened-before or are concurrent.

We introduced the concepts of events, local and global histories, cuts, local and global states, runs, consistent states, linearizations (consistent runs) and reachability. A consistent state or run is one that is in accord with the happened-before relation.

We went on to consider the problem of recording a consistent global state by observing a system's execution. Our objective was to evaluate a predicate on this state. An important class of predicates are the stable predicates. We described the snapshot algorithm of Chandy and Lamport, which captures a consistent global state and allows us to make assertions about whether a stable predicate holds in the actual execution. We went on to give Marzullo and Neiger's algorithm for deriving assertions about whether a predicate held or may have held in the actual run. The algorithm employs a monitor process to collect states. The monitor examines vector timestamps to extract consistent global states, and it constructs and examines the lattice of all consistent global states. This algorithm involves great computational complexity but is valuable for understanding, and it can be of some practical benefit in real systems where relatively few events change the global predicate's value. The algorithm has a more efficient variant in synchronous systems, where clocks may be synchronized.

EXERCISES

10.1 Why is computer clock synchronization necessary? Describe the design requirements for a system to synchronize the clocks in a distributed system.  page 386

10.2 A clock is reading 10:27:54.0 (hr:min:sec) when it is discovered to be 4 seconds fast. Explain why it is undesirable to set it back to the right time at that point, and show (numerically) how it should be adjusted so as to be correct after 8 seconds have elapsed.  page 390

10.3 A scheme for implementing at-most-once reliable message delivery uses synchronized clocks to reject duplicate messages. Processes place their local clock value (a 'timestamp') in the messages they send. Each receiver keeps a table giving, for each sending process, the largest message timestamp it has seen. Assume that clocks are synchronized to within 100 ms, and that messages can arrive at most 50 ms after transmission.
(i) When may a process ignore a message bearing a timestamp T, if it has recorded the last message received from that process as having timestamp T'?
(ii) When may a receiver remove a timestamp 175,000 (ms) from its table? (Hint: use the receiver's local clock value.)
(iii) Should the clocks be internally synchronized or externally synchronized?  page 391

10.4 A client attempts to synchronize with a time server. It records the round-trip times and timestamps returned by the server in the table below. Which of these times should it use to set its clock? To what time should it set it? Estimate the accuracy of the setting with respect to the server's clock. If it is known that the time between sending and receiving a message in the system concerned is at least 8 ms, do your answers change?

    Round-trip (ms)     Time (hr:min:sec)
    22                  10:54:23.674
    25                  10:54:25.450
    20                  10:54:28.342
                                                               page 391

10.5 In the system of Exercise 10.4 it is required to synchronize a file server's clock to within ±1 millisecond. Discuss this in relation to Cristian's algorithm.  page 391

10.6 What reconfigurations would you expect to occur in the NTP synchronization subnet?  page 394

10.7 An NTP server B receives server A's message at 16:34:23.480 bearing a timestamp 16:34:13.430 and replies to it. A receives the message at 16:34:15.725, bearing B's timestamp 16:34:25.7. Estimate the offset between B and A and the accuracy of the estimate.  page 395

10.8 Discuss the factors to be taken into account when deciding to which NTP server a client should synchronize its clock.  page 396

10.9 Discuss how it is possible to compensate for clock drift between synchronization points by observing the drift rate over time. Discuss any limitations to your method.  page 397

10.10 By considering a chain of zero or more messages connecting events e and e' and using induction, show that e → e' ⇒ L(e) < L(e').  page 398

10.11 Show that V_j[i] ≤ V_i[i].  page 399

10.12 In a similar fashion to Exercise 10.10, show that e → e' ⇒ V(e) < V(e').  page 400

10.13 Using the result of Exercise 10.11, show that if events e and e' are concurrent then neither V(e) ≤ V(e') nor V(e') ≤ V(e). Hence show that if V(e) < V(e') then e → e'.  page 400

10.14 Two processes P and Q are connected in a ring using two channels, and they constantly rotate a message m. At any one time, there is only one copy of m in the system. Each process's state consists of the number of times it has received m, and P sends m first. At a certain point, P has the message and its state is 101. Immediately after sending m, P initiates the snapshot algorithm. Explain the operation of the algorithm in this case, giving the possible global state(s) reported by it.  page 405

    [Figure for Exercise 10.15: event timelines for the two processes p1 and p2, with arrows between the timelines denoting message transmissions.]

10.15 The figure above shows events occurring for each of two processes, p1 and p2. Arrows between processes denote message transmission. Draw and label the lattice of consistent states (p1 state, p2 state), beginning with the initial state (0, 0).  page 412

10.16 Jones is running a collection of processes p1, p2, ..., pN. Each process p_i contains a variable v_i. She wishes to determine whether all the variables v1, v2, ..., vN were ever equal in the course of the execution.
(i) Jones' processes run in a synchronous system. She uses a monitor process to determine whether the variables were ever equal. When should the application processes communicate with the monitor process, and what should their messages contain?
(ii) Explain the statement possibly (v1 = v2 = ... = vN). How can Jones determine whether this statement is true of her execution?  page 413

COORDINATION AND AGREEMENT

11.1 Introduction
11.2 Distributed mutual exclusion
11.3 Elections
11.4 Multicast communication
11.5 Consensus and related problems
11.6 Summary

In this chapter, we introduce some topics and algorithms related to the issue of how processes coordinate their actions and agree on shared values in distributed systems, despite failures. The chapter begins with algorithms to achieve mutual exclusion among a collection of processes, so as to coordinate their accesses to shared resources. It goes on to examine how an election can be implemented in a distributed system. That is, it describes how a group of processes can agree on a new coordinator of their activities after the previous coordinator has failed.

The second half examines the related problems of multicast communication, consensus, byzantine agreement and interactive consistency. In multicast, the issue is how to agree on such matters as the order in which messages are to be delivered. Consensus and the other problems generalize from this: how can any collection of processes agree on some value, no matter what the domain of the values in question? We encounter a fundamental result in the theory of distributed systems: that under certain conditions - including surprisingly benign failure conditions - it is impossible to guarantee that processes will reach consensus.
