Proposed by Koo and Toueg. A coordinated checkpointing and recovery technique that takes a consistent set of checkpoints and avoids the domino effect and livelock problems during recovery. It has two parts: the checkpointing algorithm and the recovery algorithm.
THE CHECKPOINTING ALGORITHM
Assumptions
a) Processes communicate by exchanging messages through communication channels.
b) Channels are FIFO.
c) End-to-end protocols (such as sliding window) are assumed to cope with message loss due to rollback recovery and communication failures.
d) Communication failures do not partition the network.

Two kinds of checkpoints are kept on stable storage: permanent and tentative.
1. Permanent checkpoint- A local checkpoint at a process that is part of a consistent global checkpoint.
2. Tentative checkpoint- A temporary checkpoint that becomes a permanent checkpoint when the algorithm terminates successfully.
Processes roll back only to their permanent checkpoints.

Furthermore, the checkpointing algorithm assumes that a single process invokes the algorithm and that no site in the distributed system fails during its execution.

ALGORITHM:
Phase 1-
1) Initiating process Pi takes a tentative checkpoint and requests that all the processes take tentative checkpoints.
2) Each process informs Pi whether it succeeded in taking a tentative checkpoint.
3) If Pi learns that all processes have taken tentative checkpoints, Pi decides that all tentative checkpoints should be made permanent.
4) Otherwise, Pi decides that all tentative checkpoints should be discarded.
Phase 2-
1) Pi propagates its decision to all processes.
2) On receiving the message from Pi, all processes act accordingly.
3) No process sends messages after taking a tentative checkpoint until phase 2 is completed.

Characteristics: all or none of the processes take permanent checkpoints; there is no record of a message being received but not sent.

THE ROLL BACK RECOVERY ALGORITHM
Assumptions-
a) A single process invokes the algorithm.
b) Checkpointing and rollback recovery are not concurrently invoked.
Algorithm-
Phase 1:
1) Process Pi checks whether all processes are willing to restart from their previous checkpoints.
2) A process may reply "no" if it is already participating in a checkpointing or recovery process initiated by some other process.
3) If all processes are willing to restart from their previous checkpoints, Pi decides that they should restart.
4) Otherwise, Pi decides that all the processes continue with their normal activities. (Pi may attempt recovery at a later time.)
Phase 2:
1) Pi propagates its decision to all processes.
2) On receiving Pi's decision, the processes act accordingly.

Properties: all or none of the processes restart from checkpoints; after rollback, all processes resume in a consistent state.
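
Both algorithms share the same two-phase shape: collect a yes/no reply from every process, then propagate a single all-or-nothing decision. A minimal Python sketch of that shape follows. It is an illustration only, not the authors' implementation: it runs in a single program with no real message passing, and the names Process, can_checkpoint, busy, coordinated_checkpoint and coordinated_rollback are hypothetical. The initiator Pi is simply the caller, and it is included in the list of processes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    name: str
    state: int = 0                    # stand-in for the application state
    permanent: int = 0                # last permanent checkpoint
    tentative: Optional[int] = None   # tentative checkpoint, if one is pending
    can_checkpoint: bool = True       # hypothetical switch to simulate a failed attempt
    busy: bool = False                # already in another checkpoint/recovery session
    sending_blocked: bool = False     # True between tentative checkpoint and phase 2

    def take_tentative(self) -> bool:
        """Phase 1 of checkpointing: try to take a tentative checkpoint."""
        if not self.can_checkpoint or self.busy:
            return False
        self.tentative = self.state
        self.sending_blocked = True   # assumed to cover application messages only
        return True

    def finish_checkpoint(self, commit: bool) -> None:
        """Phase 2 of checkpointing: make the tentative checkpoint permanent or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.sending_blocked = False

    def willing_to_restart(self) -> bool:
        """Phase 1 of recovery: reply "no" if busy with another session."""
        return not self.busy

    def finish_rollback(self, restart: bool) -> None:
        """Phase 2 of recovery: restart from the permanent checkpoint or carry on."""
        if restart:
            self.state = self.permanent


def coordinated_checkpoint(processes: List[Process]) -> bool:
    # Phase 1: every process (initiator included) attempts a tentative checkpoint.
    replies = [p.take_tentative() for p in processes]
    decision = all(replies)           # permanent only if everyone succeeded
    # Phase 2: propagate the decision; all processes act on it.
    for p in processes:
        p.finish_checkpoint(decision)
    return decision


def coordinated_rollback(processes: List[Process]) -> bool:
    # Phase 1: ask every process whether it is willing to restart.
    replies = [p.willing_to_restart() for p in processes]
    decision = all(replies)           # restart only if everyone agreed
    # Phase 2: propagate the decision; all processes act on it.
    for p in processes:
        p.finish_rollback(decision)
    return decision
```

In this sketch a single refusal in phase 1 causes every tentative checkpoint to be discarded (or every restart to be skipped), which mirrors the all-or-none property stated above. A real implementation would also need the messaging, stable storage and timeout handling that the sketch leaves out.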