Sadiya Farheen
FAULT TOLERANCE
In critical situations, software systems must be fault tolerant. Fault tolerance is required where there are high availability requirements or where system failure costs are very high. Fault tolerance means that the system can continue in operation in spite of software failure.
FAULT DETECTION
The first stage of fault tolerance is to detect that a fault (an erroneous system state) has occurred or will occur.
Ex. Insulin pump software:
// The dose of insulin to be delivered must always be greater
// than zero and less than some defined maximum single dose
insulin_dose >= 0 & insulin_dose <= insulin_reservoir_contents
// The total amount of insulin delivered in a day must be less
// than or equal to a defined daily maximum dose
cumulative_dose <= maximum_daily_dose
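As a minimal sketch, the state constraints above could be checked at run time before a dose is delivered. The class name DoseChecker, its method names and the values used to call it are hypothetical, chosen only for illustration; the comparisons follow the constraint expressions as listed.

```java
// Hypothetical run-time check of the insulin pump state constraints.
// Class and method names are illustrative assumptions, not part of the original system.
class DoseChecker {
    // True if the single dose is non-negative and does not exceed the reservoir contents.
    static boolean singleDoseValid(int insulinDose, int insulinReservoirContents) {
        return insulinDose >= 0 && insulinDose <= insulinReservoirContents;
    }

    // True if the cumulative daily dose is within the defined daily maximum.
    static boolean dailyDoseValid(int cumulativeDose, int maximumDailyDose) {
        return cumulativeDose <= maximumDailyDose;
    }

    // Fault detection: throws as soon as either constraint is violated.
    static void checkState(int insulinDose, int insulinReservoirContents,
                           int cumulativeDose, int maximumDailyDose) {
        if (!singleDoseValid(insulinDose, insulinReservoirContents)) {
            throw new IllegalStateException("single dose outside allowed range");
        }
        if (!dailyDoseValid(cumulativeDose, maximumDailyDose)) {
            throw new IllegalStateException("daily maximum dose exceeded");
        }
    }
}
```

A caller would invoke DoseChecker.checkState(...) before each delivery and treat the exception as a detected fault.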
class PositiveEvenInteger {   // class header and field reconstructed from context
    int val = 0;

    // Rejects negative or odd values by throwing NumericException
    public void assign (int n) throws NumericException {
        if (n < 0 | n % 2 == 1)
            throw new NumericException ();
        else
            val = n;
    } // assign

    int toInteger () { return val; } // toInteger

    boolean equals (PositiveEvenInteger n) { return (val == n.val); } // equals
} // PositiveEven
Damage Assessment
Analyse the system state to judge the extent of the corruption caused by a system failure. The assessment must check which parts of the state space have been affected by the failure. It is generally based on validity functions that can be applied to the state elements to assess whether their values are within an allowed range.
public void assessDamage () throws ArrayDamagedException {
    boolean hasBeenDamaged = false;
    for (int i = 0; i < this.theRobustArray.length; i++) {
        if (!theRobustArray [i].check ()) {
            checkState [i] = true;
            hasBeenDamaged = true;
        }
        else
            checkState [i] = false;
    }
    if (hasBeenDamaged)
        throw new ArrayDamagedException ();
} // assessDamage
} // RobustArray
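The validity functions mentioned above can be very simple. The following is a minimal sketch of a range-based validity function on a single state element, together with a helper that reports which elements fail it. RangeCheckedValue and DamageAssessor are hypothetical names introduced for illustration and are not taken from the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical state element carrying its own allowed range.
class RangeCheckedValue {
    private final int min;
    private final int max;
    int value; // package-visible so that a corrupted value can be simulated

    RangeCheckedValue(int min, int max, int value) {
        this.min = min;
        this.max = max;
        this.value = value;
    }

    // Validity function: true if the value lies within the allowed range.
    boolean check() {
        return value >= min && value <= max;
    }
}

class DamageAssessor {
    // Returns the indices of all state elements whose validity check fails.
    static List<Integer> damagedIndices(RangeCheckedValue[] state) {
        List<Integer> damaged = new ArrayList<>();
        for (int i = 0; i < state.length; i++) {
            if (!state[i].check()) {
                damaged.add(i);
            }
        }
        return damaged;
    }
}
```

Returning the list of damaged indices, rather than only a flag, tells the recovery step which parts of the state need attention.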
Backward recovery
- Restore the system state to a known safe state.
Forward recovery is usually application specific - domain knowledge is required to compute possible state corrections. Backward error recovery is simpler. Details of a safe state are maintained and this replaces the corrupted system state.
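Backward recovery as described here can be sketched as keeping a copy of the last known safe state and swapping it in when corruption is detected. CheckpointedState and its methods are hypothetical names; a real system would also have to decide when a state is safe enough to record.

```java
// Hypothetical holder for system state that also maintains a known safe copy.
class CheckpointedState {
    private int[] current;
    private int[] safeCopy;

    CheckpointedState(int[] initial) {
        current = initial.clone();
        safeCopy = initial.clone();
    }

    // Live state; callers may modify it (or, in a fault, corrupt it).
    int[] state() {
        return current;
    }

    // Record the current state as the known safe state.
    void checkpoint() {
        safeCopy = current.clone();
    }

    // Backward recovery: replace the current state with the safe copy.
    void restore() {
        current = safeCopy.clone();
    }
}
```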
Forward recovery
Corruption of data coding
- Error coding techniques which add redundancy to coded data can be used for repairing data corrupted during transmission.
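One of the simplest codes that adds redundancy is a triple-repetition code, sketched below. It is used here only as an illustration (the slides do not name a particular code): each byte is stored three times and decoded by a bitwise majority vote, which repairs any corruption confined to one of the three copies. RepetitionCode is a hypothetical class name.

```java
// Triple-repetition code: a deliberately simple error-correcting code.
class RepetitionCode {
    // Each data byte is written three times in a row.
    static byte[] encode(byte[] data) {
        byte[] coded = new byte[data.length * 3];
        for (int i = 0; i < data.length; i++) {
            coded[3 * i] = data[i];
            coded[3 * i + 1] = data[i];
            coded[3 * i + 2] = data[i];
        }
        return coded;
    }

    // Each data bit is recovered as the majority of its three copies.
    static byte[] decode(byte[] coded) {
        byte[] data = new byte[coded.length / 3];
        for (int i = 0; i < data.length; i++) {
            int a = coded[3 * i];
            int b = coded[3 * i + 1];
            int c = coded[3 * i + 2];
            data[i] = (byte) ((a & b) | (a & c) | (b & c));
        }
        return data;
    }
}
```

Practical systems normally use more space-efficient codes; the repetition code is chosen here because its repair step fits in one line.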
Redundant pointers
- When redundant pointers are included in data structures (e.g. two-way lists), a corrupted list or filestore may be rebuilt if a sufficient number of pointers are uncorrupted.
- Often used for database and file system repair.
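For a two-way list, the repair can be sketched as follows: if the forward pointers are intact, every backward pointer can be rebuilt by walking the list once. Node and ListRepair are hypothetical names, and the sketch assumes that only the backward pointers have been corrupted.

```java
// Hypothetical doubly linked list node; 'prev' is the redundant pointer.
class Node {
    int value;
    Node next;
    Node prev;

    Node(int value) {
        this.value = value;
    }
}

class ListRepair {
    // Builds a doubly linked list from the given values and returns its head.
    static Node build(int... values) {
        Node head = null;
        Node tail = null;
        for (int v : values) {
            Node n = new Node(v);
            if (head == null) {
                head = n;
            } else {
                tail.next = n;
                n.prev = tail;
            }
            tail = n;
        }
        return head;
    }

    // Forward recovery: rebuilds every 'prev' pointer from the 'next' chain.
    // Assumes the 'next' chain itself is uncorrupted.
    static void repairPrev(Node head) {
        Node previous = null;
        for (Node n = head; n != null; n = n.next) {
            n.prev = previous;
            previous = n;
        }
    }
}
```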
Backward recovery
Transactions are a frequently used method of backward recovery. Changes are not applied until the computation is complete. If an error occurs, the system is left in the state preceding the transaction. Periodic checkpoints allow the system to roll back to a correct state.
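The transaction idea can be sketched as running the computation against a private working copy and committing it only on success. TransactionalStore and transact are hypothetical names, and treating any RuntimeException as the error signal is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical key-value store with all-or-nothing updates.
class TransactionalStore {
    private Map<String, Integer> committed = new HashMap<>();

    int get(String key) {
        return committed.getOrDefault(key, 0);
    }

    // Runs 'work' on a working copy. The copy replaces the committed state
    // only if 'work' completes; on error the committed state is untouched.
    boolean transact(Consumer<Map<String, Integer>> work) {
        Map<String, Integer> working = new HashMap<>(committed);
        try {
            work.accept(working);
        } catch (RuntimeException e) {
            return false; // state is still the one preceding the transaction
        }
        committed = working;
        return true;
    }
}
```

Because the committed map is never modified in place, no explicit undo step is needed when a transaction fails.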
class SafeSort {
    static void sort (int [] intarray, int order) throws SortError {
        int [] copy = new int [intarray.length];
        // copy the input array
        for (int i = 0; i < intarray.length; i++)
            copy [i] = intarray [i];
        try {
            Sort.bubblesort (intarray, intarray.length, order);
            // braces keep the else attached to the test on order
            if (order == Sort.ascending) {
                for (int i = 0; i <= intarray.length - 2; i++)
                    if (intarray [i] > intarray [i + 1])
                        throw new SortError ();
            }
            else {
                for (int i = 0; i <= intarray.length - 2; i++)
                    if (intarray [i + 1] > intarray [i])
                        throw new SortError ();
            }
        } // try block
        catch (SortError e) {
            // backward recovery: restore the original array contents
            for (int i = 0; i < intarray.length; i++)
                intarray [i] = copy [i];
            throw new SortError ("Array not sorted");
        } // catch
    } // sort
} // SafeSort
Recovery blocks
- A number of explicitly different versions of the same specification are written and executed in sequence.
- An acceptance test is used to select the output to be transmitted.
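The two points above can be sketched as a loop over alternates that stops at the first result passing the acceptance test. RecoveryBlock, AcceptanceTest and run are hypothetical names; restricting inputs and outputs to int is a simplification for the example.

```java
import java.util.List;
import java.util.function.IntUnaryOperator;

// Hypothetical acceptance test: judges an output against the input it was computed from.
interface AcceptanceTest {
    boolean accept(int input, int output);
}

class RecoveryBlock {
    // Executes the alternates in sequence and returns the first accepted output.
    static int run(int input, List<IntUnaryOperator> alternates, AcceptanceTest test) {
        for (IntUnaryOperator alternate : alternates) {
            try {
                int output = alternate.applyAsInt(input);
                if (test.accept(input, output)) {
                    return output;
                }
            } catch (RuntimeException e) {
                // This alternate failed outright; try the next one.
            }
        }
        throw new IllegalStateException("no alternate passed the acceptance test");
    }
}
```

For example, with an integer square root as the specification, a faulty primary version that returns input / 2 is rejected by the test output * output <= input < (output + 1) * (output + 1), and the next alternate is tried.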
N-version programming
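In N-version programming, independently developed versions of the same specification are given the same input and a voter selects the majority output. The following is a minimal sketch of such a voter. NVersionVoter and vote are hypothetical names, and the versions are run one after another here for simplicity, although in practice they would typically run in parallel.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntUnaryOperator;

class NVersionVoter {
    // Runs every version on the same input and returns the output
    // produced by a strict majority of the versions.
    static int vote(int input, List<IntUnaryOperator> versions) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (IntUnaryOperator version : versions) {
            counts.merge(version.applyAsInt(input), 1, Integer::sum);
        }
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            if (entry.getValue() * 2 > versions.size()) {
                return entry.getKey();
            }
        }
        throw new IllegalStateException("no majority output");
    }
}
```

With three versions, a single faulty version is outvoted by the other two; if all three disagree, the voter reports a failure rather than guessing.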
Key points
Exceptions are used to support error management in dependable systems. The four aspects of program fault tolerance are failure detection, damage assessment, fault recovery and fault repair. N-version programming and recovery blocks are alternative approaches to fault-tolerant architectures.