You are on page 1of 10
Fault-Tolerant Clock Synchronization in Distributed Systems Parameswaran Ramanathan, University of Wisconsin Kang G. Shin, University of Michigan Ricky W. Butler, NASA Langley Research Center igi computers have become es Desa itera ysams-nckar seas, Ate en syn Compt ‘cred ufc sem Conon ‘alse appetite ema or inaximumeiy agheomane frm compat conrlle Ts rae tne meats sing ecm Sipe conser fate ese ap. sonst poi fare oreo ier an psd be han 1 per r-hour sion feese 2 eon Clerfulredanin Semcon co ‘ecw fsck singe regimens, trol nt or sin dvi {ono conputer comatose tn Styune, Ad ho ecg a ese Stun nr cet falas i tae een sop fo ‘Sram sub fre mde. Te coc Srtronzaton pele shown ai Uraclancexample The gue thowss the ode sem nw eth od ‘ownclod. Thecheloaesyatonrd ty adningcahotbemetanel hehe Cctber 1990 Software algorithms are suitable only where loose synchronization is acceptable, and hardware algorithms are expensive. A hybrid scheme achieves reasonably tight synchronization and is cost-effective. lock values. This “initvely correct” al- rithm works ine as long as all the clocks fare consstent in thei behavior, a8 Figure Ia illustrates. However, if one clock is faulty and misinformsthe other wo clocks, the two nonfauty clocks cannot be sy cuonize. For example, in Figure 1b the Faulty clock B reports incorecty to clocks And ©. Asaresul clocks A and C do ot ‘make ay corections becuse both tehave asi they are the median clock. Lampestand Melia Smith were thet 4 study the thee-lock synchronization problem in the presence of arbitrary fault ‘behavior They coined he term Byzantine fault to refer to the fault mode! in which 3 faulty clock can exhibit eitray behavior including, bat no limited ta, misrepresent. ing its valve other clocks inthe system, ‘They showed that inthe presence of By2- anise faults, no algorithm can guarantee ‘Synchronization ofthe nonfat clocks in 1 three-node system. They also showed that 3 [clocks are sufficient tense synchronization ofthe nonaulty clock in the presence of m Byzantine faults. This condition Iter proved not only sufiient butalsonecessry fo ensuring synchroni zation inthe presence of Byzantine faults Since he inital sty by Lampen and Mo ami the problem of eek syctroiza- tom nthe presence of Byzanine fas as ren st extersvely by sever tert archers. All his tention is mainly he 2 @ » Figure 1, Example of a Byzantine fault in lock synchronization: (a ‘cause many aplcaionsrequire aconsstent view of imeaerosallndes ofaistibuted system, andthiscanbe schievedon though lock synchronization, Solutions proposed inthe literature on clock syachronzain take citer «soft ware ort hardware approach. The software approach 8 flexible and economies, but ditional messages must be exchanged soley for synchronization. 4 Because they depend on message exchanges, the worst cease skews guaranteed by most of these Solutions are greater than the ference betwoen he maximum and minimum mes sage tans delay between any two nodes in the system. The bardware approach, on the ter hand, uses special hardware at each rodeo achieveatightsynehronization with ‘minimaltime overhead However thecost of addtional hardware precludes this ap proachinlarge distiuted systems unless ‘ery tight synchronization is essential Hardware solutions also require a separate network of clocks tht ie diferent from the Interconnection actwork between the nodes ofthe distibuted system > Because of these liitations nthe sft. ‘ware and hardware approaches, rescarch- cers have begun investigating a hybri ap proach, A hardwaresasssted software Synchronization scheme harteenpropored that strikes a hance between the hard= ware requirement at each node and the clock skews attainable? Another recently Proposed approach is probabilistic clock ynchroniztion, in which the worst-case skews can be made as small x desired. o However, depending on he desired worst- ‘ease skews thee isa nonacro probability ‘of loss of synchronization. Also. this ap- proach may induce very high afi the system. “Tis article compares and contrasts ex isting fatt-tolerant clock synchronization algorithms. The worst-case cock skews puaranied by representative algorithms ae compared, long with oer important aspets such as time, message, and cost lvernead imposed by the algorithms Special emphasis is given to more recent developments such as hardware-assised software syachronization, probabilistic lock synchconizaton, an agorithms for synchronizing lage, paially connected Aistibated systems Preliminary concepts First we define some ofthe concepts common to most clock synchronization algorithms and inroduce the notation that will be used throughout the article (see Sidebar on next page). We begin with the oton of a lock Definition 1; Time that is diretly ob- servableinsomeparicuarcock sealed fits clock time, This contrasts with the term rel time, whichis measred in an sssumed Newtonian time frame that is fot divetly abservable leis convenient o define the local clock clocks nonfauty;(b) faulty clock at node B. at anode as a mapping from rea ime to flock time. Ia etber words, let C be ‘mapping from rea time wo clock ie, then (Ci) Prmeans that when the eal times, the clock time ata particular nedeis 7. We ‘lopt the convention of using lowercase Towers to denote quanies that sepreseat teal time and uppercase lees to denote quantities that representclock time. Figure 2illusratestheconcept of acloc function! ‘mapping. A perfect clock sone in whicha unit of elock time elapse forever uit of real ime, f more o les than one uni of lock time elapses for every unit of real lime, the clock i said to be fast or slo, respectively Since a properly functioning clock is ‘monotonic increasing function, its inverse function swell defined. Let o7)=C-4) 1 denote this inverse funtion, In the literature some of the results in clock synchronization are formulated using the lock funtion, while ethers ae formula ‘sing the inverse Fonction, We wil we Sulscrips to distinguish betweea the dit Fete lock inthe system, For example G0) will denote the clock at node p, snd 0 the clock at node g. A clock is eon sidered nonfat if there is bound onthe “mount of deviation frm rea time for any ven finite ime inverval. In oer words, ‘even noataulty clocks donot always main tain prfet time Definition 2: A clock © is said to be nonfaulty during the real-ime interval [ijvnlifitisamonotonic efferent COMPUTER funetion on (7, 73] where oT) 1 2and forall 7 € (TF decry _ ar forsomeconstantp.Theconstan pissaid tobethe drfrateof the nonfaulty lock Instead of the above definition, some synchronization algorithms are defineé ‘sing one ofthe following near equivalent ‘definitions fora nonfaulty clock Definition 23: A clock ¢ is sid to be nonfaulty if there exists 2 constant ps ‘Such tha for 1 and Ps _LIV=CW Cy Definition 2b: A clock ¢ is sid to be ronfuly i there exits a constant py Such that fr f, and us)=CoH) < lea, ee en ” ‘The neat equivalence of these defi tions follows from the Taylor series ex: pansion of (1+ pI Us pr'=l-pspt—plept Foratypical value of p= 1° the second ‘order and high-order terms canbe ignored, thus implying p= py = 2 The above notion of clock does not imply any particular implementation. In fact some synchronization algorithms deal with hardware clocks, while others deal ‘ith logical clocks. Hardware clocks are theactal elk pulsesthat control circuitry timing: logical clocks are values derived from hardware clocks, For instance, a log- eal clock might be the value ofa counter that is ineemented once every predeter mined number of pulses ofthe hardware clock, In either case, we can talk about synchronization in tems of the absact notion ofthe mapping function from clock time to realtime The main difference wi ‘be inthe granularity oF skew, of synchro- Definition "Twoclockscy andes resid tobe Synchronized at aclock ime Fit and only if eT) ~ ex(T0'5 8. A set of locks re saidtobe wel schronizedit land only sf any two nonfauty clocks this set are Synchronized for some specified constat 8 October 1990 Notations used in this article Gl Clock tie at node p when the ral ine 1 GAT) Real tine when the dock ine at node pis T (Masi dit at of al locks in he system {5 _Maximum skew betwean any two nontsuty clocks inthe system ‘© Upper bound on he read etor 1m Maximum numberof aut clocks in tho system Total numberof oc inthe sytem A _-Resynctrorizaton ital U Upper bound on messace ranst delay Ag Nodes perception ots skew with respect fo clock at nade @ Because of he nonzero dit rates of al clocks asetof clocks doesnot remain wel synchronized without some periodic re ‘ynhronization. Thismeaasthat the nodes (of a dstbuted system must periodically tesynchronige thee Incl lacks to main tain global time base aross the emtite system, Synchronization eulres each node to read he other nodes clack value, The ‘ctl mechanism used by a node to read ‘other locks differs fom one algorithm to ‘another In hardware algorithms the clock Signal fom eaeh of the other nodes (or an appropriate ssc of nodes) is an input 1 thesynchronization circuitry at each node. In software algorithms cach node citer Figure 2. The clock funetion. broudcasts its clock value 10 all nodes at specified Individually to requesting nodes Regardless ofthe actol reading moch- _readngeachoter'sclockis bounded Since anism, a node can obain only an approx- the actual errors hat oceur fer from one mate view oftsskew with expecttoother algorithm to another, we wil seve them nodes in the system. Errors occur mainly further as we describe each algerithn, because it tukes a finite and unpredictable The meat which node decides o read amount of time o deliver aclock signal or the clocks at other nodes slo depends on clock message from ane nade another. the algorithm under consideration. In In hardware algorithms, erors are due hardware algorithms, syncronization cit rainy to the unpredictable propagation —cuiry continuously monitors the frequen delays for clock signals, whereas in Soft- cy and phase ofall elocks. On the basis of ‘vat algoriths errors aredue to variation theinpt signals th icity also updates im the message transit delays. Most ofthe the local clock continuously. Hardware synchronization algorithms discussed in algorithms can therefore he classified as {his article are based on the assumption Continuous-update algorithms. Software ‘haiftwonades re nonfauly,theextorin synchronization algorithms, on the other Fastcockrogen Cock ime ‘Slow-clock region Realtime 3s hand, are usually discreteupdate algo rithms thtie,coretion a a ocalelock is Computed at discrete ime iatevals. How ver software algorithms may dfferin he ‘vay they apply this corction othe local clock. In some software synchronization Slgorithins the correction is applied in stantaneousl, whereas inodhersitisspread vera time interval. “Tne time at which the conection is computed is determined by each node on the basisafitsownclock. The time interval between successive corrections is called ‘he resynchronization interval, denoted R, land Is usually & constant koown 10 all nodes, 17° denotes the ime a which the *ystembegan its operation ten the timeof the ih esynchroization 18°" = 7 +i, ‘Thotime interval R'=[T,T°* soften called the ith resyachroization interval Indiscrete-updatealgoitims, wecan view the local clock of # node ab a series of funtion one foreach esynehronization fnterval. Thats the clock at node p inthe (+ 1th resynhrontzaion i given by ON = ore BP whore represents the changeinp's clock Sinoe the start ofthe system. Software synchronization The base ies of sofware synchronize tion algorithm thar cach node has & logical clock that provides atime base for all activities on that node. This logical ck is derived from the hardwace clock ‘on that node, though usualy has a much larger granularity than the hardware clock ‘The algorithm executed by each node for synchronizing the logical clocks can be ‘iewed as clock process invoked atthe tend of every resnchronization interval. “This lock proces is esponsible fr per. cially reading the clock valves at other fds and then adjusting the correspond- {ng Yocal lock value Informally, any softwaresyachronization lgoithin must satisfy the following 180 ‘conditions Agreement: The shew between all non faulty cocks inthe system i bounded ‘Accuracy: The logical clocks Keep up with real tie, “The first condition states that there is & consistent view of time across all nodes in the system, The second condition states 6 that this view of time is consistent with ‘what is happenig inthe environment. The synchronization algorithms differ in the way these two conditions are specified, as wells inthe way the elock processes read the clock values at other nodes, In some algorithms, a clock process requests and teecives «clock vale from cach of the ‘other nodes, while in other aigorihms the clock process broadeasts its local clock ‘al when i thinks iis ime for aresyn “chronization, The other nodes receive the broadcast message and sete clock value forcorrectng themselves ata later ime. In citherease, the skew perceived by arecei ing clock process differs from the actual skew that ests between the to clocks, because ofthe eros in reading the clocks. These erors ae due mainly to unpredic le variation inthe communication delay between the two nodes. This notion is formalized inthe following dieusson Let and qbe any two nonfaulty nodes in the system. Lat T be the clock time of node q wen the clock process at node ¢ Inatesubroadcastof the local clock value Th other woud, let the clock process at rode q initiate « broadcast of the local Clock value areal te eT). This mes: “age willreachp aftersometime delay. say (Q. That is, the broadcast message is de jivered to node pat eal time c(T) +0. At that instant the clock times at nodes pan gare Gyle4{T) + Oban Cle T) + Qt Spectively” That is, the actual skew exist ing between p and ga the instant when receives this messages given by Cle) +0)—CyteT)-+0) However. node phas to way of determining Cy(c(T) + 0) ex- fly, The best it can dois estimate the ‘delay Qby. say, 1+ 0,andcomputethe skew a8 follows: yal O-Cuean +O “Thismeansthatheerrorinthe perceives shew atpis T+ 6-cyedn +0 ‘This errs small f we an obtain a good estimate of the communication delay andi the clock drifis ofp and are comparable ‘spony daring the communication delay All synchronization algorithms discussed ‘inthis article are based on the assumption that if pad g ae nontalty. then és bounded From he above discussion we can con “clude thatthe read error is nol serious problem ifthe aigovithm is used 10 syn- ‘hone a fly connected system. How ever itcanbe a serious limitation in Future Alistted systoms because system size is ‘increasing, andi is diffiult to fully con rect all the nas in a large distributed System, This problemas heen specially fMdressed by some of the recent algo its" Software synchronization slgorithnscan be classified as convergence algorithms ‘with averaging, convergence without averaging. or cons Tihs (see Figure), Te leaves othe wee in igure ae the representative software synchronization algordums surveyed inthis acl (For additonal eading reer 10 Sehneier.) ‘Convergence-averaging algorithms. CConvergence-averaping algorithms are based oa the following idea: The clock process at each node broadcasts 2"resyne™ message when the loci time equals 1+ [RS for some integer and» parameter S to be determined by the algorithm. After troadcasting the clock valve, the clock process was fr $time unis. During his Waiting prio, the clock process collects the resyne messages broadcast by other nodes, For cach rsync message, he clock process records the time, according 19 its lock, when the message was received, At the end of the waiting petod, the clock process estimates the skew of its clock tvith respect 1 etch ofthe other nes on the bass ofthe times at which it received resyne messages, then computes 3 ful Tolerant average of the estimated skews tnd uses itt correct the local clock before the star of the ext eesychronization in- terval “The convergence-averaging algorithms propose inthe iterate dfer aly in the fel olerant averaging function used to compute the conection. tn algorithm CONV," each node uses the arithmetic mean of the estimated skews as its correction. However 1 lint the impact of falty lacks om the mean, the estimated skew With respect to each node is compared ‘east threshold and skews greater han thethresholdaresettozero before comput ing the mean, In conta, with the algo tithm in Lundelivs-Weleh and Lynch* Aenesfonh refered to a agocthm LL}. txehnode limits the impacto alt clocks by fit discarding the m highest and m lowest estimated skews and then using the midpoint of the remaining skews as its ‘orreton, where mis the maximum num ‘ero faulty clocks ro be tolerated ‘One mainlimitation ofthe convergence sveraging algorithms is that they are in COMPUTER tended for a fully connected network of nodes. Asa result the algorithms are not casily scalable, Inaddtion, they all eequte Knovtn upper bounds on clock read error and iil synchronization of the clocks ‘The need for intial synchronization i not serious problem, because there ae sever- A algorithms to ensure it? However the 'sssumption about the bound onthe clock read eroris serious problem Because the ‘worst-case skews guaranteed by these con ‘vergence-averaging algorithms ae usually reater than the bound on the ead error Tn addition othe read eror the worst case skews depend on the time interval ‘between the resynchroniztins, the clock rfrates, he numberof faults to be tole. ated, the total numberof clocks in the System, andthe duration ofthe time ite ‘al during which te clock processes read the clock values at other nodes. However, ‘among al these factor, the ead etror has the greatest impact onthe worst-case skew. [Lamporand Melia Smith sow the worst ‘ase skew for algorithm CNV to be sax [INN ~ Se + pO + 210N = my Ns] 8+ pb where y isthe marimar skew after the Initial synchronization, Vi the teal num ber of clocks inthe system, ms the max ‘imum numberof faults tobe tolerated, $8 the duration of the time interval during ‘which clock processes read the clock va: tes other nodes, and is the assumed ‘bound on the read error. Smita in Lan- olay. Weleh and Lynch? the worstease ‘skew of algorithm LL was shown tobe Beer TB + U+ de) +4924 pKBeU) where isthe maximum message transit port, Shostak, and Pease provide more ‘etal on these requirements." [Note that these algorithms require no sssumption about initial synchronization, [Nor do they require divect connection ‘between al the nodes, although the algo- rithms described by Lamport and Melia Smith do require sucha conection. “The following example should clarify she basi ides ofthese algorithns. Cons cera fournode system witha maximum of fone Byzantine faut. In an interactive onssteney algorithm. a node p weld convey ils private value 1o other nodes {sing the follwing scheme: Node p would seu tsvalve every other node, which in tum would relay the value tothe wo re- raining nodes. That i, each node would receive three “copies” of p's value: one Airey fom p andthe other two from the totter two nodes. Each node would then ‘timate p's valu tobe the median ofthe ‘values inthe thre copies. This scheme can ‘be maified for lock synchronization by leting each node independently use the interaciveconssteney algorithm ioconsey itgclock valueto othernodes.Attheend af four appliations of the interactive con sistency algorithm one foreach node) the local elock at each node could be adjusted to equal the mean of the other nodes" lock vales. “This clock synchronization scheme ean befurtherestended io situations with more than one Byzantine fal just ike theme sctive consistency algoritim."” However, ‘his wil require atleast + 1 rounds of message exchange, whore m isthe maxi mum number of faults tobe tolerated, and this implies moce overhead. The worst caseskew guaranteed by consisency-based gorithms, however, is usually smaller ‘than those guaranteed bythe convergence ‘sc algorithms. For example. ifthe ot number f clocks i the system equals 3m + Is then he worst-case skew of algorithm COM in Lamport and Melia-Smih was shown tbe (6m + e+ Cm + 3p + pk ss opposclto Grr 2)e ++ 2)95 (3m {Pip foralgorthm CNY, ‘wheree.S.and ‘areas deseribed earlier under “Conver- ence averaging goth.” “The number of messages required for consiateney-based algorithms ean be = ‘doced by using digital signatures 10 a themiate the messages, This 1 te of, algorithm CSM in Lamport and Mellin Smith, Algorithms with authentication ‘quire fewer messages and fewer nodes to Tolerate agiven numberof Faults They also sohieve tighter skew than algorihms ht do not use authentication Probabilistic synchronization, The ‘main fitation ofthealgorithms discussed ove ie hat he worst-case skew depends heavily onthe manimmum read errr. This can bea serious problem in fuaze distrib ted yates beets read esa Likely to increase with system size, To remedy this problem, Cristian proposed a probse bilistic synchronization scheme in which the worst-case skews canbe made as small 1 desired. © However, the overead im posed by the syeronization algorithm Increases drsticllyaswereducetheskew. Furthermore, unlike other algorithms we Ihave discussed, this algorithm dows, not guarantee synchronization with probail- fy ne. fac, there a nonzero probabil ‘coMPUTER ity of loss of synchronization, and this probability increases with adecreaseinthe (esined show Cristian's ea is 49 assume that the probebility distribution of message wansit Aelay is Known and let each node make several allem o ead the other clocks. “Alter each attempt the node can calculate ‘the maximum error that might occur if he clock value obtained in that arempt were ‘wet determinethe conection. Byretrying foften enough. 2 node can read the ether locks to any given precision with prob ability a5 elose 10 one as desired. This ‘scheme i particularly suitable orsystems having 4 master-slave arrangement in Which one clock has been designated or lected master andthe other clocks act ss slaves "More speifically.each noe periica: ly sends a message to the maser node requesting its clock value, When the mas: ter receives this ques, i responds with rmessage saying “The time is 7." When the response reaches the requesting node. it ‘calculates the tual round tip delay 2c cording its own clock ID isthe round ip delay as measured by anode p and if 1 the master node, then Cristian proved at Faodespand are nonauly. the local time at node q when p receives the re sponse message lies i the interval 17+ Ungll =P T+ Bit +20) ~Unall +91 where Un sthe minimum message transit olay between p and q. Therefore the ‘maximum read eto s DU +2p)~ 2U ne 1 follows fom this eel tha tthe round tip delay is small (los 102 thensoisthereadrrr. Cristian’ salgoritim ‘nes this observationtolimitthe readeror Each node in allowed to read the master's clock repeatedly util the round wip delay {s such tha the maximum read eror is below agiventhreshold,Oncea node reads fhe master's lock othe desired precision, insets ts clock to that value ‘One limitation of this approach is the needtorestetthe numberof eadatemps Since these are directly related 10 the ‘overhead impose by the algorithm. This Implies that anode may notalvays be abe to read the master's clock tothe desired precision and hence could result in loss of fchroniation. A second limitation is that it no al flerant, since itis aot any to detect the master’s fllre. Fur ‘ermore lgoritin to designate or elect, fnew master are Fairly comple and tine consuming October 1980 Hardware synchronization ‘The principle of hardware synehroniea- tion algorithms is that of phase-locked loop. The hardware elockateachnodeisan output ofthe voltage-contlled oscillator ‘The voltage applied othe oscillator comes from a phase detector whose output is proportional tothe phase error betweenthe ‘haseofitsclock Ihe ourputofthe voltage ‘ontolled oscillator itiscontolling) anda feference signal generated by using the father clocks i the system. Thus, by a> justing the frequency of each individual ‘ockto he reference signal, the clocks can vas be Kept i ock-step with epee to fone anther. Algorithms. One of she fist hardware synchronization algoithans resilient to Byzantine falls was proposed originally ‘by TB. Smith This algorithm was devel ‘oped to synchronize the processors ofthe faulolerntmuliprocesor(FTMP)"The FTMP uses only four clocks and was ex: pected o tolerate a maximum of one By? Ente fault, so. Smith's algorithm seas Specifically developed for that imation In thealgorthm, each clock observes the other three clocks continuously and determines ts phase diflerence with each of them. ‘These phase differences are arranged in an ascending order, and the clock corre- ‘sponding fo the median phate diference is ‘electedas the eference signal, Eachelock ses @ phase locked loop to synchronize Stel with the reference signal selected ‘The worstcate shew in this algorithm ependvon the characterises athe pase locked loop. Smith did not characterize this worst-case skew, but experimental results on the FTMP have shown that the skews are around 50 nanoseconds, This Should be compared withthe skews in the ‘oftware synchronization algoitims, which ae onthe order of microseconds. Suro ‘ingly, this algorithm does not easily gen- tralge to situation im whieh more than fone Byzantine faultmustbe tolerated. This Is cause witha median select algorithm its possible fr Byzantine failures to par- tition the set of elocks int two oF mare Separate “cigues” that ae intemal 5) ‘hronized but are not synchronized with other cliques [Asan illstrstion, consider a system of seven clocks. Sucha system shouldbe abe tomask the failure of two clocks, Suppose thatthe Five noafaulty nodes (cocks) are nated a c,d, and the faulty nodes (clocks) are named andy the tans: ‘mission delays between clocks are negh sible, then the order of asval ofthe clock Signals from the nonfaulty nodes shoald be the same at all the nodes. Ifthe faulty clocks behave eratcally their order may bbe seen differently by different nodes. ‘Consider the following ordering of signals sas seen by the various nodes inthe system: Order seen by a: xyabede Order seen by bs xyabede Order seen by: abedexy Order seen by ds abedexy Order seen by: abedexy Now suppose each node uses themedian selecalgorthimtoselectareerence signal tosynehroize ts own clock, Inthe above ‘example this would mean that each node Selecta th third clock notcountng its ow”, 2s the reference signal thts, nodes dyande will synehronizetwnodes ya, d. van respectively. Asa result there are two nonsyachronizingcigues. a, 6) and esdvel, By noneynchronizingcligues we ‘moan that aand bare synchronized to each cotherand Se hen this Function for selecting the ‘reference signal will ensure that all the ronfaulty noes remain synchronized re- tires of ow he faulty nodes behaves Forthesitutionabove,!= 7,m=2, and “Thus, o synchronizes toh b synchronizes toc, csynchronizes tod synchronizes 10 fenande synchronizes toc. Thissequenceof Synchronization among the nodes results Ina single clique containing all five nom. faulty nodes. ‘The problem with Krishna, Shin, and Builer’s scheme is tht selection of the reference sia is based onthe position of the local clock inthe peeived sequence » : : i — om : i oo : io Figure 4, Percentage of reduction over a fully connected network versus system size. ofthe incoming clock signals. This com plicates the hardware ateach node because thchardwaremustonderihe incoming clock signals and dynamically choose reference signal by identifying the lca clocks p- ‘Stionin he ordered seauence Toovercome {his problem, Vasanhavada and Marines ‘proposed a sight variation of the above Scheme, usagtworeference sgnalsinstead fone In their scheme ifs not necessary 10 generate the complete sequence of al in oming signals. estead its sufciet to Feith n+ hand the (Nor 1th clocks inthe temporal sequence. This is casy to doin hardware Because we can ‘count the clock signals that have arived tnd record the identity ofthe (m+ hand the (Wm Ith clock signals. The aver agephase diffrence othe leal clock with respectiothese wo cock signalsisusedto correct the leat clock in the subsequent cycles, Eliminating major limitations in hardware synchronization. The main advantage of the above hardware sy ‘hroniation algoelihms i that the worst ace skews are several orders of magnitude “Smaller than those in software synehroni- ‘ationalgonithns, However these hardware "Synchronization algorithms have two ma: Jorlimitatons. Telesis thatthe hardware Syashronization algorithms as dseebed hove reguirea fully connected network of ‘locks. Because ofthe many inerconnec: 0 tions in a fully connected netork, this requirement means thatthe reliability of ‘ychronization would be determined by the failure rats of these imerconnetions rather than by the failure rat ofthe cocks. Furthermore, the large numb of intercon nections would cause problems of fain fad fan-out The second limitation is thatthe above hardware synchronization algois are based on the assumption that the tans mmission dolays are negligible compared ‘with therparameters relevant ote algo- rithms. Ina large system, the physical ep aration berween a pai of clocks can be ‘ough 10 result in non-negligible trans ‘mission delays, significant difernce in the transmission delays between two pairs ‘of monfauly clocks ean change the order in ‘hich a nonfulty clock perceives ether nonfat clocks and ths aectheseection fof the efeence signal “These twa limitation an be eliminated feom any ofthe above algorithms with the hep of he recent developments surveyed belo Inerconnection problem. Shin and Rix smanathan recenily proposed an intercon nection strategy for syhronizng a large Aistibaed system using any ofthe above hardware synchronization algorithms.® “Theirimerconnectionstategy equiresonly 20-30 pereent ofthe foal numberof in terconnectionsina ly connected network. "The idea behind their strategy’ 16 10 synchronize the locks inthe system attwo Afferent eves. The clocks ae fist part tioned into sever clsters. Each clock then synchronizes itself not only with r- spect alte clocks in its own cluster but lo with one cock from each of the other listers, Avaresaltofthis tual coupling hetweea the clusters, the clusters remain synchronized with respoct to each other, tnd the system as ¢ whole remains well synchronized. ‘Shin and Ramanathan formlate the problem of paritoning the clocks into clusters asa small integer programming problem. The objective ist minimize the {otal number of interconnections subject 0 the conseaiar thatthe fault tolerance re ‘girement is satisfied. For each clock, the Solution to the intoger-programming ebm identities al othe cocks that it Should monitor. Shin and Remanathan® ‘show that thisieterconnectionstategy does not result in loss of synchronization and the worst-case skew increases by a factor of three 3b compared with the skew in a fully connected nerwork of clocks. ‘The reduction inthe total number of Imtercomnections over fully connected network as a function of system size is shown in Figured I follows fom this plot thatthe percentage of reduction inseases ‘wih system size this is precisely the sit ation in Which the number of intercon rections is. serious problem Transmission delay problem. The of fectsofwansmission delay inphase-locked oops have been studied extensively nthe communication re but ntindhe presence ‘of Byzantine fauks, Sinan Ramanathan” ‘how that itis easy to incorporate ideas From the communication are 0 take into account the presence of Byzantine faults tnd aon-acligble transmission delays. Figure illustrates heirunderying de and shows the hardware required at node to selermine the enact phase diference between its lock andthe clock at node Instead of one phase detector for every ‘ode pita in othe algrithms, this solu tion requires two phase detectors and an _verager The inpts to the fist phase de tector PD), arethe clock sina from nodes fad . Since the clock signal fom node j ‘encounter delay in teaching node tthe hse diferene detected by PD, doesnot ‘epresent the truephaediferencebetween the to clcks, “The inputs tothe second phase detector PD, are the elock signal rom node and the clock signal from node thats recuned ftomnode Inotherwords,teclock signal COMPUTER ‘atnode jis monitoring istetamed rode ‘tobe compared with the signa rom rede 4. Because ofthe transmission delays, ne ‘hase difference detected by PDsalso does ‘ot represen the ire phase difference be tween the te clocks, However, Shin end Ramanathan prove thatthe average ofthe ‘outputs of PD) and PDs, ¢- is proportional to the exact phase diference hetwoen the clocks at nodes andj regardless of the transmission delays, This tuo phase diference can then be used for any hardware synehronization a tort, Thedsadvantagesof his approach aretheneed fortwophase detectors instead fof one andthe doubling ofthe nunber of Imercoanections. However, when this So- ation is combined with Shin and Rs rmanathan’s interconnection scheme, e {seta viable solution even fora very large sistibuted system, Hybrid synchronization Earlier wenoted thatthe main drawback ‘of software synchronization algorithms is thatthe worst-case skews are atleast as largeasthe variation inthe messagetransit delay sn the system. This canbe serious problem ina large distnbuted system be ‘ace this variation canbe substantial, Por example, insome systems worst-case mes sage transit delays lage ss 100 iesthe ‘mean message transit delay have been ob- served Inother words. the worsteaseskews insoch systems would e at east this large Ifa software clock synchronization algo: ‘ith were adopted. A the ater extere ‘hardware algorithms achieve very tight skews with minimal overhead. The skews inhardwarealgorihsare typically onthe onder of fens of nanoseconds opposed to tens of millseconds in the software algorithms. However, thehardwueschemes ae prohibitively expensive, especially in large distributed systems. To remove the inherent limitations of both the hirdvare and software approaches, Ramanathan, Kandlr, and Shin recently proposed ahybid synchronization scheme that sinkes a balance between the clock skew tainaleandthehardware require tment Theirschemeispariculaly suitable forlarge. parialy connected homogeneous Aistibuted systems. with point-to-point ‘nerconnection topologies Such 35 hyper= cubes or meshes, ‘Thishybvidschemeissinilartoalgoihm ‘ENV. As in CNV, each node has 2 clock proces thats responsible for maintaining thetimeonthatnode, Ata specified timein October 1990) Figure 5. Elimination of transmission delay effects. the resyachronization interval. @ clock proces broadeass the local clock value to other clock processes. The broadcast gorithm is such that al lock presses ‘receive multiple copes of the clock mes sige through node-disjoint paths. ‘The number of copies aaed in the broadcast algorithm depends on the maximum num: bor of faults toe tolerated and the fale ‘model forthe sytem. When a clock peo cessreceivesaclock message sen some father clock process, it records the time (according tits local clock at which the message was received. Then, inaccordance withthe Broadcast algorithm it relays the ‘message to oer clock processes Before relaying the message. a clock process appends tothe message the ime lapsed according 1 its own clock) since receipt ofthe message, At the end ofthe resynichronization interval itcomputesthe skews between the local clock and the lock of the source node foreach copy i hasrecevedTheclock process then eects the (m+ ihlargestvalucasanestimate of the skew between the two clocks, The a¥- ‘rage ofthe estimated skews overall nodes isused.as the correction tothe local lock Asin CNV. aminimum of 3m +1 nodes required to tolerate m Byzantine faults. Ramanathan, Kandla and Shin's algo rithm has severaladvantages over the algo ritimsdisevsed eater Fist, the sending of clock valus by diferent nodes occurs "hroughout the eesynchroniation interval rather than during the last S time units of the imerval. For hypercube and mesh a- hectares, they show thatthe esychro- ization interval in their algorth is 30 layge that there isa most one node broad casting its lock vale a any given time. This prevens abrupt degradation in the message deliver times, whichoccurs when All noes in the system broadcast clock ‘values almost simultaneously. Second the only hardware quirement a each nde is For time-stanping of lock messages. The ‘hae probably mestimportant advantage isha the worst-case skews ae about 0 0 Thre orders of mage tiger tan the skews insofar schemes, Furthermore be cause ofthe haware suppor the werst tase skews ae insensitive to variations in message transit delay inthe system, To compare the hybrid synchronization scheme with the variovs software and hardware synchronization schemes, con sor the worstease skew tis algoitim ‘uaranees Ramanathan Kandrand Shin ‘Show that his worst-case skew is man { 2M = + 2pNU) + 2me es ft ve Phe w+ pw) where the notation isthe same as ha used for characterizing sigorhm CNV under “Convergence-averaging algorithms." The slight difference between the above ex pression andthe corresponding expression for algonthm CNV is that fr algorithm ‘CNV the ead error incorporated the ef fects of message transit delays between any two ofthe system nodes. This is be ‘use algorithm CNV was intended fora fully connected system in wbich the mes sage transit clays are very smal. In com 4 trast the algorithm inthe hybrid scheme is imtended fora partially connected system in which the message transit delay can be fairly large. Therefore, the etfect of mes sage rani delay (Un the skew hasbeen Separated from the effect of ether read cross "The Key conclusion we can daw fom the above expression fehat the worst-case skew is not an explicit function of U usin most software synchronization algorithms uta function of p » U. AS a resul, for typical values of p= 10° and € = 20 mi croseconds, Ramanathan, Kan and Shin show thatthe worst-case skew in a S12 rode hypercube withm =? sess than 200 microseconds even when the maximum message transit delays areas large 25 50 milliseconds, These skews ae stil much larger than those achieved by hardvare synchronization algorithms, butihcostof Inardwate syncronization fora system this large would be exorbitant lock synchronization is an impor- ( tantmatterinany faul-olerastdis- sbuted sytem and has been es tensively studied in recent years. Unfori- ately, the various solutions proposed are dificult to compare because they ae pre Seed under diferent notations ands ‘imptions. Additonal diicuty ares be ‘atseof ight ciferences inte assumptions toad bythe different synchronization a forth. In this article we classified and resented several software and hardware Eynchronization algorithms as well 35 2 hybrid synchronization algorithm, all us: ing a consistent aoution. We also ident Fed the assumptions eachalgoithmmakes. ‘Software algorithms require nodes 10 ‘exchange and adjust thei individual clock ‘ales periodically. Since the lock values tre exchanged via message passing. the time overhead thevealgorithmsimpose can ‘substantial especially ia ight synehro- ization sesired. They re terefore stable for applications that ean tolerate a loose synchronization between the system nodes. Hardware algorithms, on he other hand, uve special hantware to ensure avery tight synchronization, Although the overhead they impose onthe system i minimal the handwate algorithm are too expensive to ‘efor ll but small systems “Thehybri scheme is cost-effective and achieves a reasonably Hight sypehroiza tion. es also stable for synchronizing large, parallyonnectedsysemsandhence isthe most viable scheme for future dis tebuted systems. 2 Acknowledgments “Tae work reporedin ths ale mas sap. in prthy NASA ude rant No. AG 296 adie Die of val Reseurchanderconmt NO NonorsseSRs3h References 1 A Lamport ang PM Mati Sn PACH oA 32,80. Ian AS. 9p 5278 Simp Principe of isbued Compan ‘ACM. New Yon 8p. 9108 J Landis Welch abd N. Lyn" New ‘hroniation infrmation and Comput 4 TK Santand, Toue “Opti Ch aly 1987. pp. 626685, 5. CM. Kish, KG. Shin, and RW. Bat “Enoring aut olerincs of Pose Locked (Cocks TEBE Tron Computers, Vol € SAINO, 8 A 98S, pp. 782.75. 6, KG. Shin an, Ramanatan, “Clock Sy Shteminte reser of Malou Fat IEEE Tran, Computers. ol. 35. No Jan. 98 pp 2 2. K.G, Shin and P. Ramanathan, “Teas thon Delssn Harare Clack Speco Jaton IEEE Trans Computers VOLC 37 Nocti, Now 198K pp. 1651867 N. Vana and PN. Marios "S33 ‘Gvoniaton of Fat Tolert Cocks nthe Fresnce of Maou Falter: TERE Tran. Computers NOC No.4 Age 1988. 9, F Ramanstin, BD. Kanu, and KG. Shin “Hardware Asied Software Clock Synchronization fr Homogeneous Ds Cae No ape 990; pp Stas 10, FCs, “Probab Clock Spats ato, Tech Report) 6832 (62350 18M ‘Nien Resuth Contr, Sop. 188 11, FB, Seti,“ Paradigm for Relish {ioc Sychroioaton Tech Report TR 5:75, Compute Scene Dept Corel Uae ihaca, NY. Fe. 1986. 12. L Lamport R Shostak and M. Pease “The Syeamine Gers Problem: ACH Tran. Programming LanguayevandSien. Va. UR aly 19 pp 382-0 13, AL Hopking, TB, Smith and 14, Lai “FMP a iighly Reliable Fat Toe SMupocenor or Aira Proc IEEE ‘Vole No 10,00 1998 pp 221-1240. Parameswaran Ramanathan i an asst ‘fear of clectcal and computer copie. Rpatite Cnivemy of Wisorsn, Main Fro 19841080 he wasarevearchasssantin the Department of Ecce Eninering tad outer Science atthe Univers of Mich fa, how Aor sete ave feud on feueoleatcomptingstbute yea time ommpatne. ISI desig, and computer ach Ramanahanreceivedthe8 Teh degre from tie Tndan Taste of Tecnology, Bombay. Indi in 980 and MSE and PAD dress rom ‘he University ot Michigin, Ane Arbon 1986 ‘Kang 6 Shin ined he University of Mihi profewor of elena engine ad com Teagus ar caren allo 19-ote hea ‘oat mesh mlm to vate vain tierce snag reas othe es of Sine veal one compan ‘Shinectved BS mele cginering from Seoul Naval Univesity, South Koes nsnceone from Comet Univesity. ac, ‘Now Yoko 19Y6and 1978 expects Hels {rember of te IEEE Compute Sk tence Retearch enc Hs rseure meres evn desgn and val of ill Smuts syste eed fo ghee pp argos scene his BA te maemo {i and bis MS in computer sence 97H, trom he Univers of Vigini Inuit tte sors can beaded 0 ratuneseanm Rana, bp of rs te Conptet Eager, Unc o Wi Shean Macon Se engieering Ble 5 Schason De Madson, Wi 5708.09 ‘COMPUTER

You might also like