
LECTURE-1: Overview

Prof. Anand Mohan
Electronics Engineering, IT-BHU
Fault Tolerant System Correctly Performs
Specified Task in Presence of Hardware
Failures / Software Errors

Fault Tolerance System’s Attribute to Achieve
Fault Tolerant Operation
Easily Testable System Simple & Straightforward
Verification of Correct Operation
Approaches for Design of Dependable Systems
 Fault avoidance: Preventing fault occurrence in the
operational system; limits introduction of faults
during system construction
 Fault prevention: Attempts to eliminate the possibility
of faults creeping in
 Fault tolerance: Masking effect of fault
(System meets specifications in
the presence of faults )
 Fault removal: Reduces presence / number /
seriousness of faults
 Fault forecasting: Estimates present number /
future incidence &
consequences of faults
 Although fault forecasting can be used for
fault avoidance, fault prevention, fault
removal & fault tolerance, it is
difficult to ensure that the system shall not
develop faults

 Hardware / Software / Networks can’t be
totally free from failures

 Therefore fault tolerant design becomes an
important system attribute for dependability
System Design
Process

Three Barriers Created due to Design Techniques
Fault Tolerant Systems
“To find fault is easy;
to do better may be difficult”
-- Plutarch

A “fault-tolerant system” is one that continues to perform
at desired level of service in spite of failures in some
components
 Fault Tolerance System Property to Recover
from Partial Failure

Performance affected by Partial Failure but should not
stop System Operation

Fault tolerance measures attempt to minimize the impact of
partial failure on the operation & performance of the
system
Essence of fault tolerant system is Dependability

Fault tolerance is a QoS measure to be achieved with
minimal user involvement
Attributes of Dependable Systems:
 Availability
 Reliability
 Safety
 Maintainability
 Testability

QoS = Availability + Reliability + Safety + Maintainability
+ Testability + Performability
 Availability : System is ready to be used
immediately
 Reliability : System can run consistently without
complete failure
 Safety : If system temporarily fails to operate
correctly, nothing catastrophic
happens
(e.g. stopping nuclear reaction on leakage detection)

Measure of “fail-safe” capability: probability of system’s
correct performance; else discontinuing function
with overall safety
 Maintainability: Ease in repairing a failed system

 Testability : Measure of ease in testing certain
attributes of HW/SW system
Goal of Fault Tolerant Design
Fulfilling Performability P(L, t) & Dependability
requirements of a system

Performability P(L, t): Probability of system performance
at or above some level ‘L’ at time ‘t’
(e.g. reduced throughput / reduced available memory)

Graceful Degradation: System ability to automatically
degrade its performance level to compensate for HW failures
/ software errors; closely related to Performability
Failure Modes:
 Fail-Action (Fail-Safe) – Protective system initiates
protective action on fault occurrence

 Fail-No-Action (Fail-to-Danger) – No protective action
can be taken upon occurrence of fault
Fault Response:
 Covert Faults – Hidden or non-self revealing faults
• Difficult to detect ( no fault response)
• Could result in a fail-to-danger situation
 Overt Faults – Obvious or self-revealing faults
• Overt faults may result in unnecessary shutdown if system
does not possess fault tolerance
• Fail-safe designs ensure safe state of system operation
Failure Models
Crash failures: Device / Component simply halts, but
behaves correctly before halting
Omission failures: Device / Component fails to respond
(send / receive a message)
Timing failures: The output of a component is correct,
but lies outside a specified real-time
interval (e.g. performance failure: too slow)

Response failures: The output of a component is
incorrect (& the error can not be
accounted to another component)
Problem with crash failures: Clients cannot distinguish
between a crashed component &
one that is just a bit slow

 Value failure: Wrong value is produced
 State transition failure: Execution of service by
component brings it into a wrong state

 Arbitrary (Byzantine) failures: A component may
produce arbitrary output and exhibits arbitrary timing
failures

Fail-stop: The component exhibits crash failures, but
its failure can be detected (either through
announcement or timeouts)
Fail-silent: The component exhibits omission or
crash failures; clients cannot tell what
went wrong

Fault Tolerance – Standards
 Safety Instrumented System (SIS): Instrumentation or controls
that are responsible for bringing a process to a safe state
in the event of a failure
 Safety Integrity Level (SIL): Statistical representation of the
availability of a SIS at the time of a process demand
Fault Tolerant Strategies
Fault tolerance is achieved through
Redundancy in hardware, software, information,
and/or time (computations)

Static / Dynamic / Hybrid Redundancy

or
Fault Masking, i.e. preventing introduction
of errors in a system due to faults

Reconfiguration: process of
eliminating faulty component and restoring
system to some operational state
Reconfiguration Steps
Fault detection: Recognizing fault occurrence

Often required before any recovery
procedure can be initiated

Fault location: Locating faulty component / module

Required for initiating appropriate
recovery process

Fault containment: Isolating fault and preventing
propagation of its effects
throughout system

Fault recovery: Process of remaining operational /
regaining operational status via
reconfiguration
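The four reconfiguration steps can be sketched as a small control loop. This is a minimal Python sketch, not the lecture's own design; the component names, spare pool, and `health_check` predicate are illustrative assumptions:

```python
# Hypothetical sketch of the four reconfiguration steps:
# detection -> location -> containment -> recovery.

def reconfigure(components, spares, health_check):
    """Return the new operational component set, or None if recovery fails."""
    # Fault detection: recognize that some fault has occurred
    if all(health_check(c) for c in components):
        return components          # no fault, nothing to do
    # Fault location: identify which components are faulty
    faulty = [c for c in components if not health_check(c)]
    # Fault containment: isolate the faulty components from the system
    operational = [c for c in components if c not in faulty]
    # Fault recovery: restore operational status via spares, if available
    while faulty and spares:
        operational.append(spares.pop())
        faulty.pop()
    return operational if not faulty else None

# Usage: component "B" has failed and one spare "S1" is available
result = reconfigure(["A", "B", "C"], ["S1"], lambda c: c != "B")
```

Note that recovery here is pure reconfiguration with spares; a system may instead remain operational in a degraded mode when no spare exists.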
Fault Tolerance Necessity

 Long-Life Systems: Satellites, Unmanned / Manned
Space Probes, Space Shuttles
 High-Availability Systems: Telephone Switching System,
Transaction Processing & Banking (Stock Market / ATM)
 Hazardous Environments / Production Control: Chemical
industry (methyl isocyanate: Bhopal gas tragedy; nitric acid!)
 Critical Computation: Aircraft Controllers, Life Support /
Diagnostic Medical Systems
 Maintenance Postponement: Expensive Maintenance
(e.g. Space Applications)

Challenger / Columbia catastrophe: could FT design help??
Prevention from Fault Consequences

 Training operations / maintenance personnel
on protective system operation
 Simulated emergency training both initial
and refresher
 Review of protective system adequacy when
modules / units are changed, considering
performance history
 Design verification through both qualitative
and quantitative review exercises
Lecture 2

Reliability & Related Issues
Reliability ⇒ Probability of Failure-Free Performance
Under Stated Conditions & Specified Time
Period
Reliability ⇒ 0.9…9 (‘i’ nines), Where ‘i’ is the no. of 9s
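The "number of nines" reading of a reliability or availability figure can be computed from the unreliability; a small sketch (the `nines` helper is a hypothetical name, not from the lecture):

```python
import math

def nines(r):
    """Number of leading 9s in a reliability figure, e.g. 0.999 -> 3."""
    return int(round(-math.log10(1 - r)))

# 0.999 has three nines; 0.99999 is the classic "five nines"
print(nines(0.999))    # -> 3
print(nines(0.99999))  # -> 5
```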

Factors Affecting Reliability:
 Design ⇒ Reliability Included ⇒ Design Parameter

Top-level Reliability Requirements Allocated to Subsystems

 Environment ⇒ (Temperature, Location & Velocity) ⇒ Difference
in Manufacturing & Operational Environmental Conditions
 Components ⇒ Using High Quality Components ⇒
Increased System Cost
RELIABILITY ENHANCEMENT TECHNIQUES:

High Quality Components : e.g. Low Tolerance Active /
Passive Components
Quality Control Procedures : Adhering to High Quality
Assembly Standards
Worst Case Design : Design Considering Worst Case
Parameters

Above Procedures ⇒ Effective But Costly

Cost-Effective Approach ⇒ Redundancy
REDUNDANCY ⇒Used for “Masking Effects of Faults”
Does Not Require High Quality Components

Uses Standard Components ⇒ Redundant / Reconfigurable
Self-Repairing Systems

REDUNDANT & RECONFIGURABLE ARCHITECTURE

Redundancy is an Important Design Technique

Automobile Brake Light Might Use 2 Light Bulbs; if
One Fails, Other is Available
Reliability: R(t) = S(t) / N (Surviving Components)

Unreliability: Q(t) = F(t) / N (Failed Components)

Reliability + Unreliability = One, i.e. R(t) + Q(t) = 1

S(t) No. of Surviving components
F(t) No. of Failed components
N No. of Identical components
FAILURE/HAZARD RATE ⇒ Number of Failures Per Unit
Time Compared to Number of Surviving
Components

Failure / Hazard Rate: Z(t) = (dF(t)/dt) / S(t) = λ
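Under the constant-failure-rate assumption, Z(t) = (dF(t)/dt) / S(t) can be estimated from survivor counts over a small interval. A sketch with made-up numbers (the function name and figures are illustrative):

```python
def hazard_rate(survivors_start, survivors_end, dt):
    """Estimate Z(t) = (dF/dt) / S(t) over a small interval dt."""
    failures = survivors_start - survivors_end   # dF over the interval
    return (failures / dt) / survivors_start     # rate relative to survivors

# 1000 components on test, 10 fail over 100 hours:
z = hazard_rate(1000, 990, 100.0)
print(z)  # about 1e-4 failures per hour
```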
Estimating Useful Life Period

BATH-TUB CURVE ⇒ Failure Rates of Components Under
Normal Conditions

[Figure: Bath-tub curve, failure rate vs. time]
 Early life / burn-in period: component(s) & interconnection defect(s)
 Useful life period: constant failure rate (λ) (random failures)
 Wear-out period / end-of-life period: ageing effect
EXPONENTIAL FAILURE LAW ⇒ Relates Reliability &
Failure Rate

R(t) = exp(-λt)

For small λt: R(t) ≈ 1 - λt

(λ expressed as percentage failures per 1000 hours,
or failures per hour)
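The exponential law and its small-λt approximation can be checked numerically; λ and t below are arbitrary example values, not figures from the lecture:

```python
import math

lam = 1e-4          # failure rate, failures per hour (example value)
t = 100.0           # operating time, hours (example value)

r_exact = math.exp(-lam * t)   # R(t) = exp(-λt)
r_approx = 1 - lam * t         # R(t) ≈ 1 - λt for small λt

# Both are ~0.99; the approximation error here is about 5e-5
print(r_exact, r_approx)
```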

SYSTEM FAILURE RATE ⇒ System Failures ⇒ Component
Failures

If there are ‘K’ types of components, each with failure rate λk:

λov = Σ Nk λk

where there are Nk components of the k-th type
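Summing component failure rates, λov = Σ Nk·λk, as a short sketch; the component counts and per-component rates are illustrative examples:

```python
# (count N_k, per-component failure rate λ_k) per component type — example values
components = [
    (10, 1e-6),   # e.g. 10 passive components
    (4,  5e-6),   # e.g. 4 ICs
    (2,  2e-5),   # e.g. 2 connectors
]

lam_ov = sum(n * lam for n, lam in components)   # λov = Σ Nk·λk
print(lam_ov)  # about 7e-5 failures per hour
```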
RELIABILITY ⇒Depends on Operating Conditions &
Time Of Operation
Not Suitable for Realistic Conditions

MEAN TIME BETWEEN FAILURE ⇒Area Underneath
Reliability Curve or Average Time a
System will Run Between Failures
MTBF = ∫₀^∞ R(t) dt = 1/λ (hours)

For small λt (since 1 – R(t) ≈ λt):

MTBF ≈ t / (1 – R(t))
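MTBF = ∫₀^∞ R(t) dt = 1/λ can be verified by numerically integrating R(t) = exp(-λt); the λ value, step size, and cutoff horizon below are arbitrary choices for the sketch:

```python
import math

lam = 1e-3                      # example failure rate (failures/hour)
dt = 1.0                        # integration step (hours)
horizon = 100_000               # cutoff approximating t -> infinity

# Riemann sum of R(t) = exp(-λt) from 0 to the horizon
mtbf = sum(math.exp(-lam * k * dt) * dt for k in range(horizon))

print(mtbf, 1 / lam)            # both ~1000 hours
```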
RELIABILITY CURVE

[Figure: Reliability R(t) vs. time t; reliability decreases with
increasing time, from 1.0 at t = 0 through 1 MTBF, 2 MTBF, 3 MTBF]

MAINTAINABILITY: M(t) ⇒ Probability of Isolating & Repairing
a Fault within Time ‘t’

M(t) = 1 – exp(-µt) = 1 – exp(-t / MTTR)

where µ = Repair Rate and MTTR = Mean Time to Repair
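M(t) = 1 – exp(-t/MTTR) gives the probability that a failed unit is repaired within time t; a sketch with an example MTTR value:

```python
import math

def maintainability(t, mttr):
    """Probability of completing repair within time t: M(t) = 1 - exp(-t/MTTR)."""
    return 1 - math.exp(-t / mttr)

mttr = 4.0  # mean time to repair, hours (example value)
print(maintainability(4.0, mttr))   # within one MTTR: ~0.63
print(maintainability(12.0, mttr))  # within three MTTRs: ~0.95
```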
SYSTEM REPAIR TIME (for Maintainability)

Passive Repair Time (Traveling Time to Site)
+ Active Repair Time (Actual Repair Time)

= Time Between Occurrence & Awareness of Failure
+ Time to Detect Fault & Isolate Faulty Component
+ Time to Replace Faulty Component
+ Time to Verify Correct System Operation

Repair Time Reduction  Replaceable Components With
Self-Test Feature
AVAILABILITY: A(t) ⇒ Function of Time ⇒ Defined as
Probability of Correct System Operation & its Availability to
Perform its Functions at Time Instant “t”
(time instant instead of time interval as in Reliability)

TYPES of AVAILABILITY

Inherent Availability
 Depends on inherent design
 Theoretical maximum value

Operational Availability
 Depends on inherent design,
availability of spare parts,
maintenance policy
 Highly Available Systems ⇒ May Have Frequent
Inoperability Periods of Extremely Short Duration
 Availability ⇒ Depends Upon Frequency of
Inoperability & Quickness of Repairing
 Availability ⇒ Important Design Goal

Providing Services as Often as Possible is the System’s
Primary Purpose

 Highly Available Systems Possess ⇒ High Probability
of Performing Correctly at Desired Instant

High Availability Applications ⇒Time Shared Computing /
Transaction Processing Systems
Availability = System up time / (System up time + System down time)

= System up time / (System up time + (No. of Failures × MTTR))

= System up time / (System up time + (System up time × λ × MTTR))

= 1 / (1 + (λ × MTTR))

= MTBF / (MTBF + MTTR)    since λ = 1/MTBF

If MTTR Decreases ⇒ Availability Increases ⇒ Economical System
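The steady-state formula A = MTBF / (MTBF + MTTR) = 1 / (1 + λ·MTTR) from the derivation above, with example figures (the MTBF and MTTR values are illustrative):

```python
def availability(mtbf, mttr):
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

mtbf = 1000.0   # hours between failures (example value)
mttr = 10.0     # hours to repair (example value)

a = availability(mtbf, mttr)
print(a)  # ~0.990, i.e. about "two nines"

# Equivalent form 1 / (1 + λ·MTTR) with λ = 1/MTBF
lam = 1 / mtbf
assert abs(a - 1 / (1 + lam * mttr)) < 1e-12
```

Halving MTTR at fixed MTBF raises availability, which is the tradeoff the slides note between reliability and maintainability.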
Fault Tolerance Improves System’s Availability ⇒ Spare /
Standby Components
INHERENT AVAILABILITY: Ai = MTBF / (MTBF + MTTR)    (maximum)

If MTBF [i.e. R(t)] is high compared to MTTR ⇒ high Availability;
for low MTBF we require better Maintainability (i.e.,
short MTTR) to obtain same Availability ⇒
Tradeoff between Reliability & Maintainability to
obtain same Availability
OPERATIONAL AVAILABILITY: Ao = MTBM / (MTBM + MDT)

MTBM Mean Time Between Maintenance
MDT Mean Down Time
Figure 1: Operational Availability Relationships
Ao = MTBM / (MTBM + MDT)

[Figure: Mean time between maintenance (hrs) vs. mean down time
(hrs), with constant-availability curves for A = 85%, 90%, 95%,
99%, 99.9%]

For fixed % availability: Reliability improves as time between
maintenance actions increases; Maintainability improves as time
to repair decreases
SAFETY: S (t) ⇒ Probability of Correct System Performance;
else Discontinuing Function with Overall Safety
(of other systems or people)

Measure of “Fail-Safe” Capability of a System

PERFORMABILITY: P(L, t) ⇒ Probability that System’s
Performance will be at, or Above, Some Level “L”
at the Instant of Time ‘t’

Lowered Performance (e.g. Multiprocessor) ⇒
Reduced Throughput or Reduced Available Memory
DIFFERENCE BETWEEN RELIABILITY & PERFORMABILITY :

Reliability ⇒ All the Functions are Performed Correctly
Performability ⇒ Subset of the Functions are Performed
Correctly

GRACEFUL DEGRADATION ⇒ System’s Ability to
Automatically Decrease its Level of Performance to
Compensate for Hardware Failures & Software Errors

Fault Tolerance can Provide Graceful Degradation & Improve
Performability by Eliminating Failed Hardware and Software
from System
TEST ⇒Determining Existence & Quality of Certain
Attributes within a System ⇒Designing Test to
Verify Processor’s Throughput

TESTABILITY ⇒Ability to Test for Certain Attributes
within a System

Can be Improved ⇒Including Testing as Integral
Part of the Design Parameter
• Measure of Testability is Ease in Testing
• Testability ⇒ Related to Maintainability ⇒Minimizing
Time to Identify & Locate Problems
Purpose of Testing:

 Does System Work ?
 Does System Possess Complete Capability ?

DEPENDABILITY ⇒ QoS Provided by a Particular System

Reliability , Availability , Safety , Maintainability , Performability
& Testability are Measures to Quantify System Dependability
SERIES SYSTEM ⇒ Each subsystem must function

[1] → [2] → … → [N]

Overall Reliability: Rov = ∏ Ri ; i = 1 to N

For identical subsystems, Ri = R:

Rov = R^N (decreases rapidly with N)    MTBF = 1 / (Nλ) (decreases by factor N)

Need highly reliable individual subsystems
PARALLEL SYSTEMS ⇒ One subsystem is sufficient

[1] ∥ [2] ∥ … ∥ [N]

Rov = 1 - ∏ (1 - Ri) ; i = 1 to N

For identical subsystems, Ri = R:

Rov = 1 - (1 - R)^N    MTBF = (1/λ) Σ (1/j) ; j = 1 to N
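The series and parallel formulas for identical subsystems can be sketched directly; R = 0.9 and n = 3 below are example values:

```python
def series_reliability(r, n):
    """Rov = R**n — all n subsystems must work (series)."""
    return r ** n

def parallel_reliability(r, n):
    """Rov = 1 - (1 - R)**n — one working subsystem suffices (parallel)."""
    return 1 - (1 - r) ** n

def parallel_mtbf(lam, n):
    """MTBF = (1/λ) · Σ 1/j for j = 1..n, for n identical parallel units."""
    return sum(1 / j for j in range(1, n + 1)) / lam

r, n = 0.9, 3
print(series_reliability(r, n))    # 0.9**3 = 0.729: worse than one unit
print(parallel_reliability(r, n))  # 1 - 0.1**3 = 0.999: better than one unit
```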
DIFFERENT INTERCONNECTIONS

Parallel-to-Series: parallel pairs (A ∥ C, B ∥ D) connected in
series; used when primary failure mode is an open circuit

Series-to-Parallel: series chains (A–B, C–D) connected in
parallel; used when primary failure mode is a short circuit

Always Rps > Rsp
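The Rps > Rsp claim can be checked numerically for the 2×2 case with identical components of reliability R (R = 0.9 is an example value):

```python
def r_parallel_to_series(r):
    """Two parallel pairs in series: (1 - (1-R)^2)^2."""
    pair = 1 - (1 - r) ** 2      # reliability of one parallel pair
    return pair ** 2             # two pairs in series

def r_series_to_parallel(r):
    """Two series chains in parallel: 1 - (1 - R^2)^2."""
    chain = r ** 2               # reliability of one series chain
    return 1 - (1 - chain) ** 2  # two chains in parallel

r = 0.9
print(r_parallel_to_series(r))   # 0.9801
print(r_series_to_parallel(r))   # 0.9639
```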
REFERENCES:

Fault Tolerant Digital System Design, by Parag K. Lala

Design and Analysis of Fault-Tolerant Digital Systems, by
Barry W. Johnson

http://www.barringer1.com/jul01prb.htm

http://www.relex.com/resources/maintpred.asp

http://en.wikipedia.org/wiki/Mean_time_between_failure