Sadiya Farheen

Session : Feb-Jun 2011
FAULT TOLERANCE & FAULT TOLERANCE ARCHITECTURES In Critical Systems Development

Under the guidance of Mr. Manjunath C.R., Asst. Prof., SBMJCE
By Sadiya Farheen, 10MT6ECS10, SBMJCE, Jain University
FAULT TOLERANCE
In critical situations, software systems must be fault tolerant. Fault tolerance is required where there are high availability requirements or where system failure costs are very high. Fault tolerance means that the system can continue in operation in spite of software failure.
FAULT TOLERANCE ACTIONS
Fault detection
Damage assessment
Fault recovery
Fault repair
FAULT DETECTION
The first stage of fault tolerance is to detect that a fault (an erroneous system state) has occurred or will occur.
Ex. Insulin pump software:
// The dose of insulin to be delivered must always be greater
// than zero and less than some defined maximum single dose
insulin_dose >= 0 & insulin_dose <= insulin_reservoir_contents
// The total amount of insulin delivered in a day must be less
// than or equal to a defined daily maximum dose
cumulative_dose <= maximum_daily_dose
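The two constraints above could be checked in code along the following lines. This is only a minimal sketch: the class name `InsulinDoseCheck`, the method `isSafeDose` and the use of `int` units are assumptions, and the daily check is applied to the total that would result after the proposed dose, which is one possible reading of the second constraint.

```java
// Hypothetical sketch; only the constraint variables come from the slide.
class InsulinDoseCheck {

    /**
     * Returns true only if both state constraints hold for the proposed dose.
     * The daily limit is tested against the total after delivery (an assumption).
     */
    static boolean isSafeDose(int insulinDose, int insulinReservoirContents,
                              int cumulativeDose, int maximumDailyDose) {
        // insulin_dose >= 0 & insulin_dose <= insulin_reservoir_contents
        boolean singleDoseOk =
                insulinDose >= 0 && insulinDose <= insulinReservoirContents;
        // cumulative_dose <= maximum_daily_dose, evaluated for the new total
        boolean dailyTotalOk = cumulativeDose + insulinDose <= maximumDailyDose;
        return singleDoseOk && dailyTotalOk;
    }

    public static void main(String[] args) {
        System.out.println("3 units safe?  " + isSafeDose(3, 100, 20, 25));
        System.out.println("10 units safe? " + isSafeDose(10, 100, 20, 25));
    }
}
```

Keeping the check in a single validity function means the same test can be run before a dose is committed or afterwards on the recorded state.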
Types of fault detection

Preventative fault detection
- The fault detection mechanism is initiated before the state change is committed.
Retrospective fault detection

- The fault detection mechanism is initiated after the system state has been changed.
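The difference between the two types can be illustrated with a small sketch. Everything in it is hypothetical (the class `DetectionTimingDemo`, the `level` state variable and its bound `MAX_LEVEL`); it only shows where the validity check sits relative to the state change.

```java
// Hypothetical sketch contrasting the two detection strategies.
class DetectionTimingDemo {
    static final int MAX_LEVEL = 100; // assumed upper bound on the state variable

    private int level = 0;

    /** Validity function: the level must stay within [0, MAX_LEVEL]. */
    private static boolean isValid(int value) {
        return value >= 0 && value <= MAX_LEVEL;
    }

    /** Preventative: the proposed value is checked before the state is changed. */
    boolean addPreventative(int amount) {
        int proposed = level + amount;
        if (!isValid(proposed)) {
            return false;      // the state is never touched
        }
        level = proposed;
        return true;
    }

    /** Retrospective: the state is changed first, then checked and restored if invalid. */
    boolean addRetrospective(int amount) {
        int previous = level;
        level += amount;       // state change happens first
        if (!isValid(level)) {
            level = previous;  // fault detected after the change, so undo it
            return false;
        }
        return true;
    }

    int level() {
        return level;
    }

    public static void main(String[] args) {
        DetectionTimingDemo demo = new DetectionTimingDemo();
        System.out.println(demo.addPreventative(60));   // accepted
        System.out.println(demo.addPreventative(60));   // rejected before the change
        System.out.println(demo.addRetrospective(60));  // rejected after the change
        System.out.println(demo.level());
    }
}
```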
Implementation of preventative fault detection

// Application-defined exception signalling an invalid numeric value
class NumericException extends Exception { }

class PositiveEvenInteger {
    int val = 0;

    // The value is checked before it is stored, so an invalid
    // state is never committed
    PositiveEvenInteger (int n) throws NumericException {
        if (n < 0 || n % 2 == 1)
            throw new NumericException ();
        else
            val = n;
    } // PositiveEvenInteger

    public void assign (int n) throws NumericException {
        if (n < 0 || n % 2 == 1)
            throw new NumericException ();
        else
            val = n;
    } // assign

    int toInteger () {
        return val;
    } // toInteger

    boolean equals (PositiveEvenInteger n) {
        return (val == n.val);
    } // equals
} // PositiveEvenInteger
Damage Assessment
Analyse the system state to judge the extent of the corruption caused by a system failure. The assessment must check which parts of the state space have been affected by the failure. It is generally based on validity functions that can be applied to the state elements to assess whether their values are within an allowed range.
interface CheckableObject {
    public boolean check ();
}

// Application-defined exception raised when damage is found
class ArrayDamagedException extends Exception { }

class RobustArray {
    // Checks that all the objects in an array of objects
    // conform to some defined constraint
    boolean [] checkState;
    CheckableObject [] theRobustArray;

    RobustArray (CheckableObject [] theArray) {
        checkState = new boolean [theArray.length];
        theRobustArray = theArray;
    } // RobustArray

    public void assessDamage () throws ArrayDamagedException {
        boolean hasBeenDamaged = false;
        for (int i = 0; i < this.theRobustArray.length; i++) {
            if (! theRobustArray [i].check ()) {
                // checkState [i] is true when element i is damaged
                checkState [i] = true;
                hasBeenDamaged = true;
            }
            else
                checkState [i] = false;
        }
        if (hasBeenDamaged)
            throw new ArrayDamagedException ();
    } // assessDamage
} // RobustArray
Damage assessment techniques
Checksums
Redundant pointers
Watchdog timers
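As an illustration of the first technique, a checksum stored alongside a data record can be recomputed later to assess whether the record has been damaged. The sketch below uses the standard `java.util.zip.CRC32` class; the class name `ChecksumDamageCheck` and the sample record are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch: damage assessment by comparing checksums.
class ChecksumDamageCheck {

    /** Computes the CRC-32 checksum of the given bytes. */
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Data is considered damaged if its checksum no longer matches the stored one. */
    static boolean isDamaged(byte[] data, long storedChecksum) {
        return checksum(data) != storedChecksum;
    }

    public static void main(String[] args) {
        byte[] record = "dose=3;time=0800".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(record);   // saved when the record is written
        System.out.println("damaged before corruption: " + isDamaged(record, stored));
        record[5] ^= 0x01;                // simulate a single-bit corruption
        System.out.println("damaged after corruption:  " + isDamaged(record, stored));
    }
}
```

A checksum only reveals that damage has occurred; it does not by itself say which part of the record is wrong or how to repair it.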
Fault recovery and repair

Forward recovery
- Apply repairs to a corrupted system state.
Backward recovery
- Restore the system state to a known safe state.
Forward recovery is usually application specific - domain knowledge is required to compute possible state corrections. Backward error recovery is simpler. Details of a safe state are maintained and this replaces the corrupted system state.
Forward recovery
Corruption of data coding
- Error coding techniques which add redundancy to coded data can be used for repairing data corrupted during transmission.
Redundant pointers
- When redundant pointers are included in data structures (e.g. two-way lists), a corrupted list or filestore may be rebuilt if a sufficient number of pointers are uncorrupted.
- Often used for database and file system repair.
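The redundant-pointer idea can be sketched for a two-way list as follows. The sketch assumes that the forward (`next`) pointers are intact and uses them to rebuild any corrupted backward (`prev`) pointers; the class and method names are hypothetical.

```java
// Hypothetical sketch: forward recovery of a two-way list from redundant pointers.
class RedundantPointerRepair {

    /** Node of a two-way list; prev is redundant given an intact chain of next pointers. */
    static class Node {
        final int value;
        Node next;
        Node prev;
        Node(int value) { this.value = value; }
    }

    /** Builds a two-way list from the given values and returns its head (null if empty). */
    static Node build(int... values) {
        Node head = null;
        Node tail = null;
        for (int v : values) {
            Node n = new Node(v);
            if (head == null) {
                head = n;
            } else {
                tail.next = n;
                n.prev = tail;
            }
            tail = n;
        }
        return head;
    }

    /**
     * Rebuilds every prev pointer from the next pointers, which are assumed intact.
     * Returns the number of pointers that had to be repaired.
     */
    static int repairPrevPointers(Node head) {
        int repaired = 0;
        Node expectedPrev = null;
        for (Node n = head; n != null; n = n.next) {
            if (n.prev != expectedPrev) {
                n.prev = expectedPrev;
                repaired++;
            }
            expectedPrev = n;
        }
        return repaired;
    }

    public static void main(String[] args) {
        Node head = build(1, 2, 3, 4);
        head.next.next.prev = null;   // simulate corruption of one backward pointer
        System.out.println("pointers repaired: " + repairPrevPointers(head));
    }
}
```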
Backward recovery
Transactions are a frequently used method of backward recovery. Changes are not applied until the computation is complete. If an error occurs, the system is left in the state preceding the transaction. Periodic checkpoints allow the system to 'roll back' to a correct state.
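Both mechanisms can be sketched in a few lines. In the hypothetical `CheckpointedLedger` below, a transaction works on a private copy and commits only if every entry passes an assumed validity rule (amounts must be positive), while `saveCheckpoint` and `rollBack` keep and restore a known safe state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: backward recovery with transactions and checkpoints.
class CheckpointedLedger {
    private List<Integer> entries = new ArrayList<>();
    private List<Integer> checkpoint = new ArrayList<>();

    /** Saves a copy of the current state as the known safe state. */
    void saveCheckpoint() {
        checkpoint = new ArrayList<>(entries);
    }

    /** Discards the current state and restores the last checkpoint. */
    void rollBack() {
        entries = new ArrayList<>(checkpoint);
    }

    /**
     * Applies a batch as one transaction. The work is done on a copy, so if any
     * entry is invalid the ledger is left in the state preceding the transaction.
     */
    boolean applyTransaction(List<Integer> batch) {
        List<Integer> working = new ArrayList<>(entries);
        for (int amount : batch) {
            if (amount <= 0) {
                return false;      // assumed validity rule violated; nothing committed
            }
            working.add(amount);
        }
        entries = working;         // commit only after the whole batch succeeded
        return true;
    }

    int total() {
        int sum = 0;
        for (int amount : entries) {
            sum += amount;
        }
        return sum;
    }

    public static void main(String[] args) {
        CheckpointedLedger ledger = new CheckpointedLedger();
        List<Integer> good = new ArrayList<>();
        good.add(5);
        good.add(10);
        ledger.applyTransaction(good);
        ledger.saveCheckpoint();
        List<Integer> bad = new ArrayList<>();
        bad.add(7);
        bad.add(-1);
        System.out.println("bad batch committed: " + ledger.applyTransaction(bad));
        System.out.println("total: " + ledger.total());
    }
}
```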
Safe sort procedure

A sort operation monitors its own execution and assesses whether the sort has been executed correctly. It maintains a copy of its input so that, if an error occurs, the input is not corrupted. The approach is based on identifying and handling exceptions. It is possible in this case because the condition for a valid sort is known. However, in many cases it is difficult to write validity checks.
// Sort (with bubblesort and the ascending constant) and SortError
// are assumed to be defined elsewhere
class SafeSort {
    static void sort (int [] intarray, int order) throws SortError {
        int [] copy = new int [intarray.length];
        // copy the input array
        for (int i = 0; i < intarray.length; i++)
            copy [i] = intarray [i];
        try {
            Sort.bubblesort (intarray, intarray.length, order);
            // braces keep the else paired with the test on order
            if (order == Sort.ascending) {
                for (int i = 0; i <= intarray.length - 2; i++)
                    if (intarray [i] > intarray [i + 1])
                        throw new SortError ();
            }
            else {
                for (int i = 0; i <= intarray.length - 2; i++)
                    if (intarray [i + 1] > intarray [i])
                        throw new SortError ();
            }
        } // try block
        catch (SortError e) {
            // restore the original input before reporting the failure
            for (int i = 0; i < intarray.length; i++)
                intarray [i] = copy [i];
            throw new SortError ("Array not sorted");
        } // catch
    } // sort
} // SafeSort
Fault tolerant architecture

Defensive programming cannot cope with faults that involve interactions between the hardware and the software. Where systems have high availability requirements, a specific architecture designed to support fault tolerance may be required. This must tolerate both hardware and software failure.
Hardware fault tolerance

Triple Modular Redundancy (TMR) to cope with hardware failure
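In TMR, three replicated units receive the same input and an output comparator selects the value on which at least two of them agree. The comparator is normally a hardware component; the sketch below only models its voting logic in software, and the names `TmrVoter`, `vote` and `faultyUnit` are assumptions.

```java
// Hypothetical sketch of the 2-out-of-3 voting logic used in TMR.
class TmrVoter {

    /**
     * Returns the value produced by at least two of the three units.
     * Throws IllegalStateException if all three outputs disagree.
     */
    static int vote(int a, int b, int c) {
        if (a == b || a == c) {
            return a;
        }
        if (b == c) {
            return b;
        }
        throw new IllegalStateException("no majority among " + a + ", " + b + ", " + c);
    }

    /** Index (0 to 2) of the unit that disagrees with the majority, or -1 if all agree. */
    static int faultyUnit(int a, int b, int c) {
        int majority = vote(a, b, c);
        if (a != majority) {
            return 0;
        }
        if (b != majority) {
            return 1;
        }
        if (c != majority) {
            return 2;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("voted output: " + vote(7, 9, 7));
        System.out.println("suspect unit: " + faultyUnit(7, 9, 7));
    }
}
```

A single failed unit is masked by the vote; two simultaneous failures that disagree leave no majority and have to be reported instead.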
Software analogies to TMR

N-version programming
- The same specification is implemented in a number of different versions by different teams. All versions compute simultaneously and the majority output is selected using a voting system.
- This is the most commonly used approach, e.g. in many models of the Airbus commercial aircraft.
Recovery blocks
- A number of explicitly different versions of the same specification are written and executed in sequence.
- An acceptance test is used to select the output to be transmitted.
N-version programming
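The voting scheme can be sketched as follows. For simplicity the versions are run one after another rather than simultaneously, and each version is modelled as an `IntUnaryOperator`; the class `NVersionExecutor` and the three sample versions of an absolute-value function are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of N-version execution with majority voting.
class NVersionExecutor {

    /**
     * Runs every version on the same input and returns the output produced
     * by a strict majority of them. Throws if no output has a majority.
     */
    static int run(List<IntUnaryOperator> versions, int input) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (IntUnaryOperator version : versions) {
            int output = version.applyAsInt(input);
            votes.merge(output, 1, Integer::sum);   // count one vote for this output
        }
        for (Map.Entry<Integer, Integer> entry : votes.entrySet()) {
            if (entry.getValue() * 2 > versions.size()) {
                return entry.getKey();
            }
        }
        throw new IllegalStateException("no majority output");
    }

    /** Three independently written versions of abs(); the third is deliberately faulty. */
    static List<IntUnaryOperator> sampleVersions() {
        List<IntUnaryOperator> versions = new ArrayList<>();
        versions.add(Math::abs);
        versions.add(x -> x < 0 ? -x : x);
        versions.add(x -> x);                       // wrong for negative inputs
        return versions;
    }

    public static void main(String[] args) {
        System.out.println("voted result: " + run(sampleVersions(), -4));
    }
}
```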
Recovery blocks
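A recovery block can be sketched as a loop over alternates guarded by an acceptance test. In this hypothetical example the alternates compute an integer square root, the first one is deliberately faulty, and the alternates are pure functions, so there is no state to restore before the next alternate is tried; all names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of a recovery block: alternates run in sequence
// until one passes the acceptance test.
class RecoveryBlock {

    /** Decides whether an output is acceptable for the given input. */
    interface AcceptanceTest {
        boolean accepts(int input, int output);
    }

    static int execute(List<IntUnaryOperator> alternates, AcceptanceTest test, int input) {
        for (IntUnaryOperator alternate : alternates) {
            try {
                int output = alternate.applyAsInt(input);
                if (test.accepts(input, output)) {
                    return output;             // first acceptable output is transmitted
                }
            } catch (RuntimeException e) {
                // a crashing alternate counts as a failed one; try the next
            }
        }
        throw new IllegalStateException("all alternates failed the acceptance test");
    }

    /** Acceptance test for an integer square root r of n: r*r <= n < (r+1)*(r+1). */
    static boolean isIntegerSqrt(int n, int r) {
        return r >= 0 && (long) r * r <= n && (long) (r + 1) * (r + 1) > n;
    }

    static List<IntUnaryOperator> sampleAlternates() {
        List<IntUnaryOperator> alternates = new ArrayList<>();
        alternates.add(n -> n / 2);                 // primary: deliberately faulty
        alternates.add(n -> (int) Math.sqrt(n));    // secondary
        return alternates;
    }

    public static void main(String[] args) {
        int result = execute(sampleAlternates(), RecoveryBlock::isIntegerSqrt, 17);
        System.out.println("accepted output: " + result);
    }
}
```

Unlike N-version programming, only as many versions are executed as are needed to obtain an acceptable result, but the approach depends on being able to write a meaningful acceptance test.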
Key points
Exceptions are used to support error management in dependable systems.
The four aspects of program fault tolerance are fault detection, damage assessment, fault recovery and fault repair.
N-version programming and recovery blocks are alternative approaches to fault-tolerant architectures.
QUERIES??
THANK YOU FOLKS!!!!