
Session : Feb-Jun 2011

FAULT TOLERANCE & FAULT TOLERANCE ARCHITECTURES In Critical Systems Development


Under the guidance of Mr. Manjunath C.R., Asst. Prof., SBMJCE
By Sadiya Farheen, 10MT6ECS10, SBMJCE, Jain University


FAULT TOLERANCE
In critical situations, software systems must be fault tolerant. Fault tolerance is required where there are high availability requirements or where system failure costs are very high. Fault tolerance means that the system can continue in operation in spite of software failure.


FAULT TOLERANCE ACTIONS

- Fault detection
- Damage assessment
- Fault recovery
- Fault repair


FAULT DETECTION
The first stage of fault tolerance is to detect that a fault (an erroneous system state) has occurred or will occur.
Ex. Insulin pump software:

// The dose of insulin to be delivered must always be greater
// than zero and less than some defined maximum single dose
insulin_dose >= 0 & insulin_dose <= insulin_reservoir_contents
// The total amount of insulin delivered in a day must be less
// than or equal to a defined daily maximum dose
cumulative_dose <= maximum_daily_dose
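The two state constraints above could be checked in code before a dose is committed. The following is a minimal sketch, not the pump's actual software: the class names `InsulinDoseChecker` and `DoseLimitException`, the integer dose units and the use of an unchecked exception are all assumptions made for illustration.

```java
// Hypothetical exception type; unchecked only to keep the sketch short.
class DoseLimitException extends RuntimeException {
    DoseLimitException(String message) { super(message); }
}

// Hypothetical checker that enforces both constraints from the slide.
class InsulinDoseChecker {
    private final int maximumDailyDose;  // assumed to be in insulin units
    private int cumulativeDose = 0;

    InsulinDoseChecker(int maximumDailyDose) {
        this.maximumDailyDose = maximumDailyDose;
    }

    // Validates the requested dose, then records it as delivered.
    void deliver(int insulinDose, int reservoirContents) {
        // insulin_dose >= 0 & insulin_dose <= insulin_reservoir_contents
        if (insulinDose < 0 || insulinDose > reservoirContents) {
            throw new DoseLimitException("single dose out of range: " + insulinDose);
        }
        // cumulative_dose <= maximum_daily_dose
        if (cumulativeDose + insulinDose > maximumDailyDose) {
            throw new DoseLimitException("daily maximum would be exceeded");
        }
        cumulativeDose += insulinDose;  // state changes only after both checks pass
    }

    int cumulativeDose() { return cumulativeDose; }
}
```

With a daily maximum of 10, a request for 4 units is accepted and a following request for 7 units is rejected, leaving the cumulative dose at 4.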


Types of fault detection


Preventative fault detection
- The fault detection mechanism is initiated before the state change is committed.

Retrospective fault detection


- The fault detection mechanism is initiated after the system state has been changed.


Implementation of preventative fault detection


class PositiveEvenInteger {
    int val = 0;

    PositiveEvenInteger(int n) throws NumericException {
        if (n < 0 || n % 2 == 1)
            throw new NumericException();
        else
            val = n;
    } // PositiveEvenInteger

    public void assign(int n) throws NumericException {
        if (n < 0 || n % 2 == 1)
            throw new NumericException();
        else
            val = n;
    } // assign

    int toInteger() {
        return val;
    } // toInteger

    boolean equals(PositiveEvenInteger n) {
        return (val == n.val);
    } // equals
} // PositiveEven


Damage Assessment
Analyse the system state to judge the extent of corruption caused by a system failure. The assessment must check what parts of the state space have been affected by the failure. It is generally based on validity functions that are applied to the state elements to check that each value is within an allowed range.


interface CheckableObject {
    public boolean check();
}

class RobustArray {
    // Checks that all the objects in an array of objects
    // conform to some defined constraint
    boolean[] checkState;
    CheckableObject[] theRobustArray;

    RobustArray(CheckableObject[] theArray) {
        checkState = new boolean[theArray.length];
        theRobustArray = theArray;
    } // RobustArray

    public void assessDamage() throws ArrayDamagedException {
        boolean hasBeenDamaged = false;
        for (int i = 0; i < this.theRobustArray.length; i++) {
            if (!theRobustArray[i].check()) {
                checkState[i] = true;
                hasBeenDamaged = true;
            } else
                checkState[i] = false;
        }
        if (hasBeenDamaged)
            throw new ArrayDamagedException();
    } // assessDamage
} // RobustArray


Damage assessment techniques

- Checksums
- Pointers
- Watchdog timers
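As an illustration of the checksum technique, a stored checksum can be compared with a freshly computed one to judge whether data has been damaged. This is a minimal sketch using the standard `java.util.zip.CRC32` class; the `ChecksummedRecord` class and its `corruptByte` helper are hypothetical names introduced only for the demonstration.

```java
import java.util.zip.CRC32;

// Hypothetical record that keeps a CRC-32 checksum alongside its data.
class ChecksummedRecord {
    private final byte[] data;
    private final long storedChecksum;

    ChecksummedRecord(byte[] data) {
        this.data = data.clone();
        this.storedChecksum = checksumOf(this.data);
    }

    static long checksumOf(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    // Flips one bit to simulate corruption (demonstration only).
    void corruptByte(int index) {
        data[index] ^= 0x01;
    }

    // Damage assessment: the data is damaged if the checksums differ.
    boolean isDamaged() {
        return checksumOf(data) != storedChecksum;
    }
}
```

A checksum only reports that damage has occurred; it does not say which element is wrong, so it is normally paired with a recovery step.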


Fault recovery and repair


Forward recovery
- Apply repairs to a corrupted system state.

Backward recovery
- Restore the system state to a known safe state.

Forward recovery is usually application specific - domain knowledge is required to compute possible state corrections. Backward error recovery is simpler. Details of a safe state are maintained and this replaces the corrupted system state.


Forward recovery
Corruption of data coding
- Error coding techniques which add redundancy to coded data can be used for repairing data corrupted during transmission.

Redundant pointers
- When redundant pointers are included in data structures (e.g. two-way lists), a corrupted list or filestore may be rebuilt if a sufficient number of pointers are uncorrupted.
- Often used for database and file system repair.
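The redundant-pointer idea can be sketched for a two-way list: if the forward links are intact, the backward links can be rebuilt from them. This is an illustrative sketch only; `ListNode`, `TwoWayList` and `repairBackwardLinks` are hypothetical names, and the assumption that the `next` links are the uncorrupted ones is made for simplicity.

```java
// Hypothetical node of a two-way (doubly linked) list.
class ListNode {
    final int value;
    ListNode next;
    ListNode prev;

    ListNode(int value) { this.value = value; }
}

class TwoWayList {
    ListNode head;
    ListNode tail;

    void append(int value) {
        ListNode node = new ListNode(value);
        if (head == null) {
            head = node;
            tail = node;
        } else {
            tail.next = node;
            node.prev = tail;
            tail = node;
        }
    }

    // Forward recovery: assumes the next links are uncorrupted and
    // rebuilds every prev link (and tail) from them.
    // Returns the number of prev links that had to be repaired.
    int repairBackwardLinks() {
        int repaired = 0;
        ListNode previous = null;
        for (ListNode current = head; current != null; current = current.next) {
            if (current.prev != previous) {
                current.prev = previous;
                repaired++;
            }
            previous = current;
        }
        tail = previous;
        return repaired;
    }
}
```

The same walk could be run in the other direction to rebuild `next` links from intact `prev` links.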


Backward recovery
Transactions are a frequently used method of backward recovery. Changes are not applied until computation is complete. If an error occurs, the system is left in the state preceding the transaction. Periodic checkpoints allow the system to 'roll back' to a correct state.
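Checkpoint and roll-back can be sketched as follows. Note that this sketch takes a copy of the state before the update and restores it on failure, which is a variant of the deferred-update scheme described above rather than the same mechanism. `CheckpointedStore` and its methods are hypothetical names, and treating any `RuntimeException` as the error signal is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key-value state with a single saved checkpoint.
class CheckpointedStore {
    private Map<String, Integer> state = new HashMap<>();
    private Map<String, Integer> checkpoint = new HashMap<>();

    void put(String key, int value) { state.put(key, value); }

    Integer get(String key) { return state.get(key); }

    // Saves a copy of the current state as the known safe state.
    void checkpoint() { checkpoint = new HashMap<>(state); }

    // Backward recovery: replaces the current state with the saved copy.
    void rollback() { state = new HashMap<>(checkpoint); }

    // Runs an update; if it fails, the state preceding it is restored.
    boolean runTransaction(Runnable update) {
        checkpoint();
        try {
            update.run();
            return true;
        } catch (RuntimeException e) {
            rollback();
            return false;
        }
    }
}
```

If an update changes a value and then throws, `runTransaction` returns `false` and the value reads as it did before the update started.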


Safe sort procedure


A sort operation monitors its own execution and assesses if the sort has been correctly executed. It maintains a copy of its input so that if an error occurs, the input is not corrupted. The procedure is based on identifying and handling exceptions. This is possible here because the condition for a valid sort is known. However, in many cases it is difficult to write validity checks.


class SafeSort {
    static void sort(int[] intarray, int order) throws SortError {
        int[] copy = new int[intarray.length];
        // copy the input array
        for (int i = 0; i < intarray.length; i++)
            copy[i] = intarray[i];
        try {
            Sort.bubblesort(intarray, intarray.length, order);
            // check that the result really is ordered as requested
            if (order == Sort.ascending) {
                for (int i = 0; i <= intarray.length - 2; i++)
                    if (intarray[i] > intarray[i + 1])
                        throw new SortError();
            } else {
                for (int i = 0; i <= intarray.length - 2; i++)
                    if (intarray[i + 1] > intarray[i])
                        throw new SortError();
            }
        } // try block
        catch (SortError e) {
            // restore the original input before reporting the failure
            for (int i = 0; i < intarray.length; i++)
                intarray[i] = copy[i];
            throw new SortError("Array not sorted");
        } // catch
    } // sort
} // SafeSort

Fault tolerant architecture


Defensive programming cannot cope with faults that involve interactions between the hardware and the software. Where systems have high availability requirements, a specific architecture designed to support fault tolerance may be required. This must tolerate both hardware and software failure.


Hardware fault tolerance


Triple Modular Redundancy (TMR) to cope with hardware failure
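In TMR as it is commonly described, three replicated units receive the same input and a voter passes on the majority output, so a single faulty unit is masked. The voter logic can be sketched in software as below; `TmrVoter` is a hypothetical name, and real TMR voters are hardware components, so this only illustrates the decision rule.

```java
// Hypothetical software model of a TMR output comparator (voter).
class TmrVoter {
    // Returns the value produced by at least two of the three units.
    // Throws if all three outputs disagree, since no majority exists.
    static int vote(int a, int b, int c) {
        if (a == b || a == c) {
            return a;
        }
        if (b == c) {
            return b;
        }
        throw new IllegalStateException(
            "no majority among " + a + ", " + b + ", " + c);
    }
}
```

The scheme tolerates one failed unit at a time; if two units fail in the same way, the voter would select the wrong value.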


Software analogies to TMR


N-version programming
- The same specification is implemented in a number of different versions by different teams. All versions compute simultaneously and the majority output is selected using a voting system.
- This is the most commonly used approach, e.g. in many models of the Airbus commercial aircraft.

Recovery blocks
- A number of explicitly different versions of the same specification are written and executed in sequence.
- An acceptance test is used to select the output to be transmitted.
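The recovery block scheme can be sketched as a loop that runs each version in turn and returns the first result that passes the acceptance test. This is a minimal sketch under stated assumptions: `RecoveryBlock` and `AcceptanceTest` are hypothetical names, the versions are modelled as integer functions, and a version that throws is simply treated as having failed.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical acceptance test: judges a result against its input.
interface AcceptanceTest {
    boolean accept(int input, int result);
}

class RecoveryBlock {
    // Executes the versions in sequence and returns the first result
    // that passes the acceptance test.
    static int execute(int input, AcceptanceTest test, IntUnaryOperator... versions) {
        for (IntUnaryOperator version : versions) {
            try {
                int result = version.applyAsInt(input);
                if (test.accept(input, result)) {
                    return result;
                }
            } catch (RuntimeException e) {
                // this version failed; fall through to the next alternative
            }
        }
        throw new IllegalStateException("all versions failed the acceptance test");
    }
}
```

For an integer square root, for example, the acceptance test could check that `result * result <= input < (result + 1) * (result + 1)`; a faulty first version is then rejected and the next version's output is used. Unlike N-version programming, only as many versions run as are needed to obtain an acceptable result.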


N-version programming


Recovery blocks


Key points
- Exceptions are used to support error management in dependable systems.
- The four aspects of program fault tolerance are fault detection, damage assessment, fault recovery and fault repair.
- N-version programming and recovery blocks are alternative approaches to fault-tolerant architectures.


QUERIES??


THANK YOU FOLKS!!!!
