
March 18-22, 2013 IIITDM-Jabalpur “Dependable Computing”

Distributed systems
Takashi  Nanya   Canon,  Inc.

Issues in distributed systems
•  Distributed Systems
–  Include an arbitrary number of system processes and user processes
–  Consist of two or more processor/memory modules
–  Processes communicate with each other by message passing (with no shared memory)
–  Global control for inter-process communication and system management
–  Variable delays exist in communication between processes


•  Purpose: Fault tolerance, Performance enhancement, Extensibility, Resource sharing
•  All information describing the global system state (process state and data) must be maintained so that all participating processes have a consistent and identical view of the global state
•  Issues
–  Clock Synchronization, Mutual Exclusion, Concurrency control, Multiple copy update, Error recovery



Clock Synchronization
•  Distributed systems are asynchronous in nature
•  Variable delays in computing and communication
•  Possible inconsistency in processes recognizing the temporal ordering of event occurrences
[Figure: event timelines of processes P1, P2 and P3]

•  P2 perceives “P1 -> P2”, while P3 perceives “P2 -> P1” !

Atomic action (1)
•  Main problem in Distributed Systems: Maintaining Consistency
•  Basic concept for solution: Atomic Action
•  To realize the Atomic Action (consistency control), processes need to have a common agreement on the following:
–  Temporal ordering of event occurrences in the system
–  Global system state and state transition
•  Possibility that the agreement may be impaired due to delays in inter-process communication and faults in nodes and/or links => Clock synchronization, Byzantine agreement


Atomic action (2)
•  A method of process structuring that allows the writer of a procedure to secure the benefits of atomicity, i.e. indivisibility, noninterference, strict sequencing
•  Basic notion to solve consistency problems in distributed systems
•  Generalized notion of transactions for database concurrency control
•  Several definitions:
–  1) An action is atomic if the process performing it is not aware of the existence of any other active process (can detect no spontaneous state change) and no other process is aware of the activity of this process (its state changes are concealed) during the time the process is performing the action
–  2) An action is atomic if the process performing it does not communicate with other processes while it is executing the action
–  3) Actions are atomic if they can be considered, so far as other processes are concerned, to be indivisible and instantaneous, such that the effects on the system are as if they were interleaved as opposed to concurrent

Nested structure of Atomic Actions
•  Defined relatively at any level of process structure
•  An atomic action at a level consists of two or more atomic actions at a lower level
[Figure: nested atomic actions A to G across processes P1, P2 and P3]


Logical Clock (1)
L. Lamport: “Time, clocks, and the ordering of events in a distributed system”, C.ACM, Vol.21, No.7, pp.558-565 (1978)
•  System: a collection of distinct processes, each of which consists of a sequence of events
•  Event a “happened before” event b (denoted by a -> b):
–  If a and b are in the same process, and a comes before b, then a->b
–  If a is the sending of a message by one process and b is the receipt of the same message by another process, then a->b
–  If a->b and b->c, then a->c



•  (Logical) Clock Ci for each process Pi is defined to be a function which assigns a number Ci(a) to any event a in that process
•  (Correct) Clock condition: For any events a, b, if a->b then C(a)<C(b)
•  The clock condition is satisfied if the following two conditions hold:
–  C1: If a and b are in process Pi, and a comes before b, then Ci(a)<Ci(b)
–  C2: If a is the sending of a message by process Pi, and b is the receipt of the same message by process Pj, then Ci(a)<Cj(b)


Logical Clock (2)
•  Assume that the processes are algorithms and the events represent certain actions during their execution
•  Process Pi’s clock is represented by a register Ci, so that Ci(a) is the value contained by Ci during the event a
•  Conditions C1 and C2, and therefore the Clock Condition, are satisfied if the following implementation rules are followed:
–  IR1: Each process Pi increments Ci between any two successive events
–  IR2: (a) If event a is the sending of a message m by process Pi, then message m contains a timestamp Tm = Ci(a). (b) Upon receiving a message m, process Pj sets Cj greater than or equal to its present value and greater than Tm
•  Hence, the simple implementation rules guarantee a correct system of logical clocks
•  A system of clocks satisfying the Clock Condition can be used to place a total ordering on the set of all system events
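The implementation rules IR1 and IR2 can be sketched in a few lines of Python. This is only an illustration: the class name `LamportClock` and its method names are assumptions made for the example, not part of the lecture.

```python
class LamportClock:
    """Register Ci of one process (hypothetical helper for illustration)."""

    def __init__(self):
        self.value = 0

    def tick(self):
        # IR1: increment Ci between any two successive events.
        self.value += 1
        return self.value

    def send(self):
        # IR2(a): sending is an event; the message carries Tm = Ci(a).
        return self.tick()

    def receive(self, tm):
        # IR2(b): move Cj to a value >= its present value and > Tm.
        self.value = max(self.value, tm) + 1
        return self.value


p1, p2 = LamportClock(), LamportClock()
a = p1.tick()        # local event in P1
tm = p1.send()       # P1 sends m with timestamp Tm
c = p2.receive(tm)   # P2 receives m
d = p2.tick()        # later local event in P2
print(a, tm, c, d)   # 1 2 3 4
```

The send event precedes the receive event, and the clock values respect that order, which is exactly condition C2.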


Total ordering of events
•  Define a relation “=>” as follows: “If a is an event in process Pi and b is an event in process Pj, then a=>b if and only if either (i) Ci(a)<Cj(b) or (ii) Ci(a)=Cj(b) and Pi<<Pj, where << is any arbitrary total ordering of processes”
•  Then, relation “=>” is a total ordering, i.e. relation “=>” is a way of completing the “happened before” partial ordering to a total ordering



Total ordering is useful in implementing a distributed system !
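As a small illustration of the relation “=>”, events can be compared as (clock value, process id) pairs. The sketch assumes that the process ordering << is simply the numeric order of process ids; the event data is made up for the example.

```python
# Each event: (clock value Ci(a), process id i, label). Illustrative data only.
events = [(2, 1, "b"), (1, 2, "x"), (2, 0, "y"), (1, 1, "a")]

# a => b iff Ci(a) < Cj(b), or the clocks are equal and Pi << Pj.
ordered = sorted(events, key=lambda e: (e[0], e[1]))
print([label for _, _, label in ordered])  # ['a', 'x', 'y', 'b']
```

Ties on the clock value (here a/x and y/b) are broken by the process id, so no two events are left unordered.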

Mutual Exclusion Problem
•  Find an algorithm for granting a single shared resource to processes that satisfies the following three conditions:
–  (I) A process which has been granted the resource must release it before it can be granted to another process.
–  (II) Different requests for the resource must be granted in the order in which they are made.
–  (III) If every process which is granted the resource eventually releases it, then every request is eventually granted
•  This is a non-trivial problem. A central scheduling process will not work!

Distributed algorithm for M.E.
1.  To request the resource, process Pi sends the message “Tm:Pi requests resource” to every other process, and puts that message on its request queue, where Tm is the timestamp of the message
2.  When process Pj receives the message “Tm:Pi requests resource”, it places it on its request queue and sends a (timestamped) acknowledgment message to Pi
3.  To release the resource, process Pi removes any “Tm:Pi requests resource” message from its request queue, and sends a (timestamped) “Pi releases resource” message to every other process
4.  When process Pj receives a “Pi releases resource” message, it removes any “Tm:Pi requests resource” message from its request queue
5.  Process Pi is granted the resource when the following two conditions are satisfied:
(i) There is a “Tm:Pi requests resource” message in its request queue which is ordered before any other request in its queue by the relation “=>”
(ii) Pi has received a message from every other process timestamped later than Tm
Note that conditions (i) and (ii) of rule 5 are tested locally by Pi
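The five rules can be exercised with a small single-threaded simulation. This is a sketch under stated assumptions, not the lecture's implementation: messages are delivered in FIFO order through one in-memory queue, each process has at most one outstanding request, and the names `Process`, `deliver_all`, `REQ`, `ACK` and `REL` are invented for the example.

```python
from collections import deque


class Process:
    def __init__(self, pid, n, net):
        self.pid, self.net = pid, net
        self.clock = 0
        self.queue = set()  # request queue of (Tm, requester pid)
        # Latest timestamp received from each other process.
        self.last_seen = {j: 0 for j in range(n) if j != pid}

    def _broadcast(self, kind):
        self.clock += 1
        for j in self.last_seen:
            self.net.append((j, self.pid, kind, self.clock))
        return self.clock

    def request(self):  # rule 1
        self.queue.add((self._broadcast("REQ"), self.pid))

    def release(self):  # rule 3
        self.queue = {r for r in self.queue if r[1] != self.pid}
        self._broadcast("REL")

    def on_message(self, sender, kind, tm):
        self.clock = max(self.clock, tm) + 1  # logical clock rule IR2(b)
        self.last_seen[sender] = tm
        if kind == "REQ":  # rule 2
            self.queue.add((tm, sender))
            self.clock += 1
            self.net.append((sender, self.pid, "ACK", self.clock))
        elif kind == "REL":  # rule 4
            self.queue = {r for r in self.queue if r[1] != sender}

    def granted(self):  # rule 5, tested locally
        mine = [r for r in self.queue if r[1] == self.pid]
        if not mine:
            return False
        first = min(self.queue) == mine[0]  # condition (i)
        heard = all(t > mine[0][0] for t in self.last_seen.values())  # (ii)
        return first and heard


def deliver_all(net, procs):
    while net:
        dest, sender, kind, tm = net.popleft()
        procs[dest].on_message(sender, kind, tm)


net = deque()
procs = [Process(i, 3, net) for i in range(3)]
procs[0].request()
procs[1].request()  # concurrent request with the same timestamp
deliver_all(net, procs)
first_holder = [p.pid for p in procs if p.granted()]
procs[0].release()
deliver_all(net, procs)
second_holder = [p.pid for p in procs if p.granted()]
print(first_holder, second_holder)  # [0] [1]
```

Both requests carry timestamp 1, so the tie is broken by process id, and P1 is granted the resource only after it has seen the release from P0.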

Anomalous behavior
•  Logical clock based on the relation “=>” may cause “anomalous behavior”

[Figure: Computers A, B and C with requests “TA: Req A” and “TB: Req B”, event a, and a phone call]

TB < TA can happen while actually a -> b

•  This can happen because the system has no way of knowing the actual precedence information a->b, which is based on the phone message external to the system
•  => we need a system of physical clocks

Physical clock
•  Ci(t) denotes the reading of clock Ci at physical time t
•  For Ci to be a true physical clock, the following must be satisfied:
–  PC1: There exists a constant κ << 1 such that for all i, |dCi(t)/dt - 1| < κ
–  PC2: There exists a constant ε such that for all i, j, |Ci(t) - Cj(t)| < ε
•  To prevent anomalous behavior, for a number µ that is less than the shortest transmission time for interprocess messages, it must be ensured that, for any i, j, Ci(t+µ) - Cj(t) > 0
•  Combining the above with PC1 implies that Ci(t+µ) - Ci(t) > (1-κ)µ
•  Using PC2, it then holds that Ci(t+µ) - Cj(t) > 0 if ε/(1-κ) ≤ µ
•  Let m be a message sent at physical time t and received at t’, and let the minimum transmission delay µm for m be known to the process that receives m
•  Assuming PC1, PC2 can be ensured by the following Implementation Rules:
–  IR1’: For each i, if Pi does not receive a message at physical time t, then Ci is differentiable at t and dCi(t)/dt > 0
–  IR2’: (a) If Pi sends a message m at physical time t, then m contains a timestamp Tm = Ci(t). (b) Upon receiving a message m at time t’, process Pj sets Cj(t’) equal to MAX{Cj(t’-0), Tm+µm}
•  To synchronize physical clocks, a process only needs to know its own clock reading and the timestamps of messages it receives
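A minimal numeric sketch of rule IR2’(b) and of the condition ε/(1-κ) ≤ µ follows. The function names and the sample numbers are assumptions chosen only to illustrate the formulas above.

```python
def on_receive(cj_now, tm, mu_m):
    """IR2'(b): new reading of Cj on receiving a message stamped Tm,
    where mu_m is the known minimum transmission delay."""
    return max(cj_now, tm + mu_m)


def no_anomaly(eps, kappa, mu):
    """Sufficient condition for Ci(t+mu) - Cj(t) > 0."""
    return eps / (1 - kappa) <= mu


# The receiver's clock is behind Tm + mu_m, so it jumps forward.
print(on_receive(100.0, 99.5, 1.0))  # 100.5
# The receiver's clock is already ahead, so it is left unchanged.
print(on_receive(100.0, 98.0, 1.0))  # 100.0
print(no_anomaly(eps=0.001, kappa=1e-6, mu=0.002))  # True
```

The clock is never set back: the rule only ever advances Cj, which keeps IR1’ (a positive rate) meaningful between message receipts.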



Byzantine Generals Problem
L. Lamport, et al.: “The Byzantine generals problem”, ACM Trans. Prog. Lang. Syst., Vol.4, No.3, pp.382-401 (1982)


•  The problem of coping with a situation in which one or more faulty components of a system send conflicting information to different parts of the system
•  A group of generals of the Byzantine army is camped with their troops around an enemy city
•  Communicating with one another only by messenger, the generals must agree upon a common battle plan
•  However, some of the generals may be traitors trying to prevent the loyal generals from reaching agreement
•  [Byzantine Generals Problem]: A commanding general must send an order to his n-1 lieutenant generals such that
–  IC1: All loyal lieutenants obey the same order
–  IC2: If the commanding general is loyal, then every loyal lieutenant obeys the order he sends

Impossibility for n<3m+1
•  With oral messages, no solution for fewer than 3m+1 generals can cope with m traitors
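[Figure, scenario 1: the Commander sends “attack” to Lieutenant 1 and to Lieutenant 2; Lieutenant 2 (traitor) tells Lieutenant 1 “he said ‘retreat’”]

[Figure, scenario 2: the Commander sends “attack” to Lieutenant 1 and “retreat” to Lieutenant 2; Lieutenant 2 tells Lieutenant 1 “he said ‘retreat’”]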


There is no way for Lieutenant 1 to distinguish between the two scenarios

Oral message for n≥3m+1
•  An oral message is one whose contents are completely under the control of the sender, so a traitorous sender can transmit any possible message
•  Assumptions
–  A1: Every message that is sent is delivered correctly
–  A2: The receiver of a message knows who sent it
–  A3: The absence of a message can be detected
•  We inductively define the Oral Message algorithm OM(m), for all non-negative integers m, by which a commander sends an order to n-1 lieutenants
•  OM(m) solves the Byzantine Generals Problem for 3m+1 or more generals in the presence of at most m traitors
•  We consider the case in which the only possible decisions are “attack” or “retreat”
•  The algorithm is described in terms of Lieutenants “obtaining a value” rather than “obeying an order”

Oral Message algorithm OM(m)
•  Algorithm OM(0)
–  (1) The commander sends his value to every lieutenant
–  (2) Each lieutenant uses the value he receives from the commander, or uses the value RETREAT if he receives no value
•  Algorithm OM(m), m>0
–  (1) The commander sends his value to every lieutenant
–  (2) For each i, let vi be the value Lieutenant i receives from the commander, or else be RETREAT if he receives no value. Lieutenant i acts as the commander in OM(m-1) to send the value vi to each of the n-2 other lieutenants
–  (3) For each i, and each j≠i, let vj be the value Lieutenant i received from Lieutenant j in step (2) (using OM(m-1)), or else RETREAT if he received no such value. Lieutenant i uses the value majority(v1, v2, …, vn-1)
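The recursion in OM(m) can be sketched as follows. The sketch makes several assumptions that are not in the slides: a traitor's behaviour is given by a hypothetical function `lie(sender, receiver, value)`, every message is delivered (so the “no value” case never arises), and `majority` falls back to RETREAT when no value has a strict majority.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"


def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else RETREAT


def om(m, commander, lieutenants, value, traitors, lie):
    """Return {lieutenant: value obtained} after running OM(m)."""
    # Step (1): the commander sends his value; a traitor may send anything.
    received = {
        i: lie(commander, i, value) if commander in traitors else value
        for i in lieutenants
    }
    if m == 0:
        return received  # OM(0), step (2)
    # Step (2): each lieutenant j relays his value with OM(m-1).
    relayed = {
        j: om(m - 1, j, [i for i in lieutenants if i != j],
              received[j], traitors, lie)
        for j in lieutenants
    }
    # Step (3): majority over the own value and the relayed values.
    return {
        i: majority([received[i]]
                    + [relayed[j][i] for j in lieutenants if j != i])
        for i in lieutenants
    }


def lie(sender, receiver, value):
    # Hypothetical adversary: tells odd-numbered receivers "attack".
    return ATTACK if receiver % 2 else RETREAT


# Loyal commander 0, traitorous Lieutenant 3 (n = 4, m = 1).
case1 = om(1, 0, [1, 2, 3], ATTACK, {3}, lie)
print({i: case1[i] for i in (1, 2)})  # {1: 'attack', 2: 'attack'}
# Traitorous commander, all lieutenants loyal.
case2 = om(1, 0, [1, 2, 3], ATTACK, {0}, lie)
print(case2)  # {1: 'attack', 2: 'attack', 3: 'attack'}
```

In the first run the loyal lieutenants obtain the commander's order (IC2); in the second they all obtain the same value even though the commander sent conflicting orders (IC1).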



Algorithm OM(1)
[Figure: a Commander and Lieutenants 1, 2 and 3 exchanging values; one general is marked as the traitor]

Lieutenant 2 obtains the correct value v = majority(v, v, x)

[Figure: the Commander and Lieutenants 1, 2 and 3 exchanging the values x, y and z]

All lieutenants obtain the same value majority(x, y, z)

Signed message
•  If the traitors’ ability to lie can be restricted, an algorithm exists to cope with m traitors for any number (≥ m+2) of generals
•  A4 (Additional assumption):
–  (a) A loyal general’s signature cannot be forged, and any alteration of the contents of his signed messages can be detected
–  (b) Anyone can verify the authenticity of a general’s signature
–  (No assumption is made about a traitorous general’s signature. His signature is allowed to be forged by another traitor, thereby permitting collusion among the traitors)
•  The commander sends a signed order to each of his lieutenants
•  Each lieutenant then adds his signature to that order and sends it to the other lieutenants, who add their signatures and send it to others, and so on
•  Let x:i denote the value x signed by General i. Thus, x:i:j denotes the value x signed by i, and then that value x:i signed by j
•  Let General 0 be the commander
•  Each lieutenant i maintains a set Vi of properly signed orders he has received so far


Signed message algorithm SM(m)
•  Initially Vi = ∅
•  (1) The commander signs and sends his value to every lieutenant
•  (2) For each i:
–  (A) If Lieutenant i receives a message of the form v:0 from the commander and he has not yet received any order, then
   •  (i) he lets Vi equal {v};
   •  (ii) he sends the message v:0:i to every other lieutenant
–  (B) If Lieutenant i receives a message of the form v:0:j1: … :jk and v is not in the set Vi, then
   •  (i) he adds v to Vi;
   •  (ii) if k<m, then he sends the message v:0:j1: … :jk:i to every lieutenant other than j1, … , jk
•  (3) For each i: When Lieutenant i will receive no more messages, he obeys the order choice(Vi)
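A compact simulation of SM(m) is sketched below under simplifying assumptions that go beyond the slides: signatures are modelled as a tuple of signer ids (so forgery is impossible by construction), all lieutenants follow the protocol, cases (A) and (B) are merged into one “value not yet in Vi” test, and `choice` is one admissible choice function picked for the example.

```python
from collections import deque

RETREAT = "retreat"


def choice(orders):
    # A single order is obeyed; none or conflicting orders give RETREAT.
    return next(iter(orders)) if len(orders) == 1 else RETREAT


def sm(m, n, commander_sends):
    """Run SM(m) for generals 0..n-1, where General 0 is the commander.

    commander_sends maps each lieutenant to the value the commander
    signs and sends to him (a traitor may send different values).
    """
    V = {i: set() for i in range(1, n)}
    # Step (1): each message is (receiver, value, signature chain).
    inbox = deque((i, v, (0,)) for i, v in commander_sends.items())
    while inbox:
        i, v, chain = inbox.popleft()  # FIFO: shorter chains arrive first
        if v in V[i]:
            continue
        V[i].add(v)  # steps (2)(A)(i) and (2)(B)(i)
        k = len(chain) - 1  # number of lieutenant signatures so far
        if k == 0 or k < m:  # steps (2)(A)(ii) and (2)(B)(ii)
            for j in range(1, n):
                if j != i and j not in chain:
                    inbox.append((j, v, chain + (i,)))
    # Step (3): no more messages will arrive.
    return {i: choice(orders) for i, orders in V.items()}


traitor_commander = sm(1, 3, {1: "attack", 2: "retreat"})
loyal_commander = sm(1, 3, {1: "attack", 2: "attack"})
print(traitor_commander)  # {1: 'retreat', 2: 'retreat'}
print(loyal_commander)    # {1: 'attack', 2: 'attack'}
```

With a traitorous commander both lieutenants end up with the same set {“attack”, “retreat”} and therefore make the same choice, which is the agreement property IC1 for three generals.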

[Figure: the Commander sends “attack”:0 to Lieutenant 1 and “retreat”:0 to Lieutenant 2; Lieutenant 1 sends “attack”:0:1 and Lieutenant 2 sends “retreat”:0:2]

Lieutenants 1 and 2 obey the order choice({“attack”, “retreat”}) and know the commander is a traitor because of his signature on two different orders

[Figure: the Commander sends “attack”:0 to Lieutenant 1 and to Lieutenant 2; Lieutenant 1 sends “attack”:0:1 and Lieutenant 2 sends “attack”:0:x]

Lieutenant 1 obeys the order choice({“attack”})

Concurrency Control
•  Even if mutual exclusion is realized at the basic action level of processes, an inconsistent state may appear when two or more processes try to access the same database
•  Example: Two clients A and B may wish to send $10 and $20, respectively, to a common account independently of one another:


•  Making READ and WRITE individually atomic is not enough! What will happen if the READ and WRITE commands are executed in the order, for example, A1, B1, A2, B2?
•  Transaction: a sequence of READ and WRITE commands sent by a client to the file system
•  Concurrency control (Serializability control): Executing multiple transactions that occur simultaneously as serializable atomic actions
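The order A1, B1, A2, B2 can be replayed in a few lines. The sketch assumes that A1 and B1 are the READ commands of clients A and B and that A2 and B2 are their WRITE commands; the starting balance of 100 is also an assumption made for the example.

```python
def run(schedule, start=100):
    account = start
    local = {}                      # each client's private copy
    deposit = {"A": 10, "B": 20}    # amounts sent by clients A and B
    for client, op in schedule:
        if op == "READ":
            local[client] = account
        else:  # WRITE: store the privately updated value
            account = local[client] + deposit[client]
    return account


serial = [("A", "READ"), ("A", "WRITE"), ("B", "READ"), ("B", "WRITE")]
interleaved = [("A", "READ"), ("B", "READ"), ("A", "WRITE"), ("B", "WRITE")]
print(run(serial))       # 130
print(run(interleaved))  # 120
```

In the interleaved order B overwrites the account with a value computed from a stale read, so A's $10 is lost even though every single READ and WRITE was atomic.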



Serial execution (T1, then T2):
–  T1: READ Y; Y=Y-20; WRITE Y; READ Z; Z=Z+20; WRITE Z
–  T2: READ X; X=X-10; WRITE X; READ Y; Y=Y+10; WRITE Y

Serializable execution (T3 and T4 interleaved):
–  READ Y; READ X; Y=Y-20; X=X-10; WRITE Y; WRITE X; READ Z; READ Y; Z=Z+20; Y=Y+10; WRITE Z; WRITE Y

X=20, Y=40, Z=60; X+Y+Z is preserved

Non-serializable execution (T5 interleaved with a second transaction):
–  READ X; X=X-10; WRITE X; READ Y; Y=Y-20; READ Y; WRITE Y; Y=Y+10; WRITE Y; READ Z; Z=Z+20; WRITE Z

2-phase locking
•  A lock is an access privilege on a data item, which is granted to a particular transaction so that only one transaction can access the data item at a time
•  When a transaction tries to access a data item, it must lock the item before accessing it, and unlock it on finishing the access
•  In order for 2-phase locking to guarantee consistency, each transaction
–  Does not lock a data item that is already locked
–  Locks a data item before accessing it
–  Unlocks all the data items before finishing the transaction
–  Once having unlocked a data item, does not acquire any more locks
•  Each transaction is divided into two phases, i.e. a growing phase and a shrinking phase. The number of locked items increases monotonically in the growing phase and decreases monotonically in the shrinking phase
•  2-phase locking makes all the transactions serializable, i.e. atomic actions!
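The locking rules can be enforced mechanically, as in the following sketch. The class `TwoPhaseTransaction` and the shared dictionary used as a lock table are hypothetical names for illustration; waiting is modelled simply by `lock` returning False.

```python
class TwoPhaseTransaction:
    def __init__(self, name, lock_table):
        self.name = name
        self.table = lock_table   # shared: data item -> holder name
        self.shrinking = False    # becomes True after the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2-phase rule violated: lock after unlock")
        if self.table.get(item) not in (None, self.name):
            return False          # already locked by another transaction
        self.table[item] = self.name
        return True

    def unlock(self, item):
        self.shrinking = True     # the growing phase is over
        if self.table.get(item) == self.name:
            del self.table[item]


table = {}
t1 = TwoPhaseTransaction("T1", table)
t2 = TwoPhaseTransaction("T2", table)
t1.lock("Y")
t1.lock("Z")
blocked = t2.lock("Y")        # Y is held by T1
t1.unlock("Y")
acquired = t2.lock("Y")       # now free
try:
    t1.lock("X")              # T1 is already in its shrinking phase
    violation = False
except RuntimeError:
    violation = True
print(blocked, acquired, violation)  # False True True
```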




•  Every transaction is given a timestamp when it occurs
•  Every request for accessing a data item is given its transaction’s timestamp
•  If there is a conflict among requests for accessing a data item, the earliest one is granted according to the order of timestamps
•  Algorithm for the scheduler at each site:
–  For each data item X, the scheduler records the largest timestamp W(X) of WRITE requests and the largest timestamp R(X) of READ requests that have been processed
–  For a READ request with timestamp T, if T<W(X), the scheduler rejects the READ request. Otherwise, it outputs the READ request and sets R(X) to MAX(R(X), T).
–  For a WRITE request with timestamp T, if T<MAX(R(X), W(X)), the scheduler rejects the WRITE request. Otherwise, it outputs the WRITE request and sets W(X) to T
•  If a READ request or WRITE request is rejected, the requesting transaction is aborted, assigned a new larger timestamp and restarted
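The scheduler rules translate almost directly into code. In this sketch the class name `TimestampScheduler` is an assumption, timestamps are positive integers, and an item that has never been accessed is treated as having R(X) = W(X) = 0.

```python
class TimestampScheduler:
    def __init__(self):
        self.R = {}  # largest READ timestamp processed per item
        self.W = {}  # largest WRITE timestamp processed per item

    def read(self, x, t):
        if t < self.W.get(x, 0):
            return False                      # reject the READ request
        self.R[x] = max(self.R.get(x, 0), t)
        return True

    def write(self, x, t):
        if t < max(self.R.get(x, 0), self.W.get(x, 0)):
            return False                      # reject the WRITE request
        self.W[x] = t
        return True


s = TimestampScheduler()
results = [
    s.write("X", 5),  # accepted, W(X) = 5
    s.read("X", 3),   # rejected, 3 < W(X)
    s.read("X", 7),   # accepted, R(X) = 7
    s.write("X", 6),  # rejected, 6 < R(X)
    s.write("X", 8),  # accepted, W(X) = 8
]
print(results)  # [True, False, True, False, True]
```

A rejected request would then cause its transaction to be aborted and restarted with a larger timestamp, which the sketch leaves to the caller.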



Multiple Copy Update
•  Multiple copies of a complete database distributed for higher reliability/availability requirements must be kept consistent
–  Commit: make all the updates made by a transaction permanent
–  Abort: roll back (or undo) a transaction to ensure that no effect of the transaction remains in the database


•  Commit control for replicated database:
–  Ensures that either a transaction is committed by every site or aborted by every site (all or nothing)
–  Involves a) commit control for a single transaction, and b) serialization of concurrent transactions


2-phase commit protocol
•  Given a designated coordinator node, the commit control for a single transaction can be realized by the following 2-phase commit protocol:
–  Commit-request phase: The coordinator node sends a “query to commit” message to all the other nodes. Each node replies to the coordinator with an “agree to commit” message if the transaction succeeded, or an “abort” message if the transaction failed
–  Commit phase: If the coordinator receives “agree to commit” from all the other nodes, it sends them a “commit” message; otherwise it sends a “roll back” message to all the nodes
•  Access control of replicated database including serialization of concurrent transactions
–  L. Svobodova: “Attaining resilience in distributed systems”, Chapter 5 of Dependability of Resilient Computers (Ed. by T. Anderson), BSP Professional Books, 1989
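The two phases of the commit protocol can be sketched as follows. This is a failure-free illustration only: the `Node` class, its method names and the in-process function calls standing in for messages are assumptions, and coordinator crashes and timeouts are not modelled.

```python
class Node:
    def __init__(self, name, succeeded):
        self.name = name
        self.succeeded = succeeded   # did the local transaction succeed?
        self.state = "pending"

    def query_to_commit(self):
        return "agree to commit" if self.succeeded else "abort"

    def finish(self, decision):
        self.state = "committed" if decision == "commit" else "rolled back"


def two_phase_commit(nodes):
    # Commit-request phase: collect one reply from every node.
    replies = [node.query_to_commit() for node in nodes]
    # Commit phase: commit only if all nodes agreed.
    if all(reply == "agree to commit" for reply in replies):
        decision = "commit"
    else:
        decision = "roll back"
    for node in nodes:
        node.finish(decision)
    return decision


ok = [Node("site1", True), Node("site2", True)]
bad = [Node("site1", True), Node("site2", False)]
print(two_phase_commit(ok))   # commit
print(two_phase_commit(bad))  # roll back
print([n.state for n in bad])  # ['rolled back', 'rolled back']
```

A single “abort” reply is enough to roll back every site, which gives the all-or-nothing behaviour required of commit control.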

Error recovery
•  When a transient fault or a process abort occurs, affected processes are rolled back to a point (checkpoint) prior to the occurrence of the fault
•  Checkpointing: recording a snapshot of the entire state of a process at a moment, as needed to restart the process from that point
[Figure: timelines of Process A (checkpoint CA) and Process B (checkpoint CB), with a communication between them and a failure]
•  CA: checkpoint for process A, CB: checkpoint for process B
•  If the communication line intersects the line that links CA and CB, there will be an inconsistency in the system state when the failed process is rolled back to the checkpoint CA
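Whether a communication intersects the line linking two checkpoints can be checked mechanically. The sketch below assumes that all event times are measured on one common timeline and that a message is described by a hypothetical tuple (sender, send time, receiver, receive time); neither convention comes from the slides.

```python
def crossing_messages(messages, checkpoints):
    """Return the messages that cross the line linking the checkpoints.

    A message crosses the line if exactly one of its two endpoints
    (send, receive) lies before the checkpoint of its own process.
    """
    return [
        m for m in messages
        if (m[1] < checkpoints[m[0]]) != (m[3] < checkpoints[m[2]])
    ]


checkpoints = {"A": 5, "B": 6}           # times of CA and CB
messages = [
    ("A", 2, "B", 3),   # sent and received before both checkpoints
    ("A", 4, "B", 7),   # sent before CA, received after CB
    ("B", 8, "A", 9),   # sent and received after both checkpoints
]
crossing = crossing_messages(messages, checkpoints)
print(crossing)  # [('A', 4, 'B', 7)]
```

An empty result means the pair of checkpoints can be used together for rollback without the inconsistency described above.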

Domino effects
•  If processes establish their checkpoints independently of each other, the domino effect can occur
[Figure: timelines of Processes A, B and C starting at SA, SB and SC, with checkpoints CA1, CA2, CB1, CB2, CC1 and CC2; a second panel shows the same timelines with the additional checkpoint CB3]

A recovery line is created if CB3 is additionally established!