UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Fault Tolerant Computing
ECE 655, Part 1: Introduction
C. M. Krishna, Fall 2006
Copyright 2006 Koren & Krishna

Prerequisites
♦ Basic courses in
  ∗ Digital Design
  ∗ Hardware Organization/Computer Architecture
  ∗ Probability


Fault Tolerance - Basic Definition
♦ Fault-tolerant systems - ideally, systems capable of executing their tasks correctly regardless of either hardware failures or software errors
♦ In practice, we can never guarantee the flawless execution of tasks under all circumstances
♦ We therefore limit ourselves to the types of failures and errors that are more likely to occur


Need For Fault Tolerance
♦ 1. Life-critical applications
♦ 2. Harsh environments
♦ 3. Highly complex systems


Need For Fault Tolerance - Life-Critical Applications
♦ Life-critical applications such as: aircraft, nuclear reactors, chemical plants, and medical equipment
♦ A malfunction of a computer in such applications can lead to catastrophe
♦ Their probability of failure must be extremely low, possibly one in a billion per hour of operation


Need For Fault Tolerance - Harsh Environments
♦ A computing system operating in a harsh environment is subjected to
  ∗ electromagnetic disturbances
  ∗ particle hits and the like
♦ A very large number of failures means that the system will not produce useful results unless some fault tolerance is incorporated


Need For Fault Tolerance - Highly Complex Systems
♦ Complex systems consist of millions of devices
♦ Every physical device has a certain probability of failure
♦ A very large number of devices implies that the likelihood of failures is high
♦ Such a system will experience faults at a frequency that renders it useless


Fault Tolerance Measures
♦ It is important to have proper yardsticks - measures - by which to gauge the effect of fault tolerance
♦ A measure is a mathematical abstraction which expresses only some subset of the object's nature
♦ Measures?


Traditional Measures
♦ Assumption: The system can be in one of two states: "up" or "down"
♦ Examples: A lightbulb is either good or burned out; a wire is either connected or broken
♦ Two traditional measures: Reliability and Availability
♦ Reliability, R(t): the probability that the system is up during the whole interval [0,t], given that it was up at time 0
♦ Availability, A(t): the fraction of time that the system is up during the interval [0,t]
♦ Point Availability, Ap(t): the probability that the system is up at time t
♦ A related measure is MTTF - Mean Time To Failure - the average time the system remains up before it goes down and has to be repaired or replaced
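A minimal simulation sketch (illustrative, not part of the original slides; the model and all parameter values are assumptions): a single repairable unit alternating exponentially distributed up-times (failure rate lam) and repair times (rate mu), used to estimate R(t) and point availability Ap(t).

import random

# Assumed parameters: failure rate, repair rate, observation time, trial count
lam, mu, t, trials = 0.01, 0.1, 50.0, 50_000
up_at_t = never_failed = 0

for _ in range(trials):
    clock, up, failed_once = 0.0, True, False
    while clock < t:
        clock += random.expovariate(lam if up else mu)  # time to next state change
        if clock < t:
            up = not up
            failed_once = failed_once or not up
    up_at_t += up                 # system up at time t -> counts toward Ap(t)
    never_failed += not failed_once   # no failure in [0,t] -> counts toward R(t)

print("R(t)  =", never_failed / trials)   # theory: exp(-lam*t) = 0.6065
print("Ap(t) =", up_at_t / trials)        # point availability at time t
print("MTTF  =", 1 / lam, "(mean of the exponential up-time)")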

Need For More Measures
♦ The assumption of the system being in state "up" or "down" is very limiting
♦ Example: A processor with one of its several hundred million gates stuck at logic value 0, and the rest functional, may produce an affected output only once in every 25,000 hours of use
♦ The processor is not fault-free, but cannot be defined as being "down"
♦ More general measures than the traditional reliability and availability are needed

More General Measures
♦ Capacity Reliability - the probability that the system capacity (as measured, for example, by throughput) at time t exceeds some given threshold at that time
♦ Another extension - consider everything from the perspective of the application
♦ This approach was taken to define the measure known as Performability


Computational Capacity
♦ Example: N processors in a gracefully degrading system
♦ The system recovers from processor failures and remains useful as long as at least one processor is operational
♦ Let Pi = Prob {exactly i processors are operational}
♦ R(t) = Σ (i=1..N) Pi
♦ Let c be the computational capacity of a processor (e.g., the number of fixed-size tasks it can execute)
♦ Computational capacity of i processors: Ci = i • c
♦ Computational capacity of the system: Σ (i=1..N) Ci • Pi
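A minimal Python sketch (not part of the original slides; it assumes each processor fails independently with single-processor reliability r(t) = e^(-λt), so the Pi are binomial) computing Pi, R(t), and the system's computational capacity as defined above. All numeric values are illustrative.

from math import comb, exp

def capacity_measures(N, lam, c, t):
    r = exp(-lam * t)                       # single-processor reliability at time t
    # Pi = Prob{exactly i processors operational} (binomial, by independence)
    P = [comb(N, i) * r**i * (1 - r)**(N - i) for i in range(N + 1)]
    R = sum(P[1:])                          # R(t): at least one processor up
    cap = sum(i * c * P[i] for i in range(N + 1))   # Σ Ci * Pi, with Ci = i*c
    return R, cap

R, cap = capacity_measures(N=4, lam=1e-4, c=100, t=1000.0)
print(f"R(t) = {R:.6f}, expected capacity = {cap:.2f} tasks")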


Performability
♦ The application is used to define "accomplishment levels" L1, L2, ..., Ln
♦ Each represents a level of quality of service delivered by the application
♦ Example: Li indicates i system crashes during the mission time period T
♦ Performability is a vector (P(L1), P(L2), ..., P(Ln)), where P(Li) is the probability that the computer functions well enough to permit the application to reach up to accomplishment level Li
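A small sketch under a purely assumed model (not from the slides): if, as in the example above, Li means exactly i system crashes during the mission time T, and crashes are modeled as a Poisson process with an assumed rate mu, the performability vector can be computed as follows.

from math import exp, factorial

def performability(mu, T, n_levels):
    # P(Li) = Prob{exactly i crashes in [0, T]} under the Poisson assumption
    return [exp(-mu * T) * (mu * T)**i / factorial(i) for i in range(n_levels)]

print(performability(mu=1e-3, T=100.0, n_levels=4))   # (P(L0), ..., P(L3))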


Network Connectivity Measures
♦ Focus on the network that connects the processors
♦ Classical node and line connectivity - the minimum number of nodes and lines, respectively, that have to fail before the network becomes disconnected
♦ The measure indicates how vulnerable the network is to disconnection
♦ A network disconnected by the failure of just one (critically positioned) node is potentially more vulnerable than another which requires several nodes to fail before it becomes disconnected
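A minimal sketch using the networkx library (illustrative, not part of the original slides; the example topologies are assumptions): computing classical node and line (edge) connectivity for two small interconnection networks.

import networkx as nx

G = nx.cycle_graph(6)                 # a 6-node ring network
print(nx.node_connectivity(G))        # 2: two nodes must fail to disconnect a ring
print(nx.edge_connectivity(G))        # 2: two links must fail to disconnect a ring

star = nx.star_graph(5)               # hub-and-spoke: one critically positioned hub
print(nx.node_connectivity(star))     # 1: losing the hub disconnects the network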


Connectivity - Examples
[Figure omitted: example networks illustrating node and line connectivity]

Network Resilience Measures
♦ Classical connectivity distinguishes between only two network states: connected and disconnected
♦ It says nothing about how the network degrades as nodes fail, before it becomes disconnected
♦ Two possible resilience measures:
  ∗ Average node-pair distance
  ∗ Network diameter - the maximum node-pair distance
♦ Both are calculated given the probability of node and/or link failure
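A minimal Monte Carlo sketch (an assumed method, not the slides'): estimating the average node-pair distance and diameter of the surviving network when each node fails independently with probability p. Trials in which the survivors are disconnected are skipped here; the topology and parameters are assumptions.

import random
import networkx as nx

def resilience(G, p, trials=2000):
    dists, diams = [], []
    for _ in range(trials):
        survivors = [v for v in G if random.random() > p]   # node failures
        H = G.subgraph(survivors)
        if len(H) > 1 and nx.is_connected(H):               # keep connected outcomes
            dists.append(nx.average_shortest_path_length(H))
            diams.append(nx.diameter(H))
    return sum(dists) / len(dists), sum(diams) / len(diams)

avg_d, avg_diam = resilience(nx.hypercube_graph(4), p=0.05)
print(f"avg node-pair distance = {avg_d:.2f}, avg diameter = {avg_diam:.2f}")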


More Network Measures
♦ What happens upon network disconnection?
♦ A network that splits into one large connected component and several small pieces may still be able to function, vs. a network that splinters into a large number of small pieces
♦ Another measure of a network's resilience to failure: the probability distribution of the size of the largest component upon disconnection
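A minimal sketch (assumed failure model, not from the slides): the empirical distribution of the largest-component size, conditioned on the network actually being disconnected, when each node fails independently with probability p.

from collections import Counter
import random
import networkx as nx

def largest_component_dist(G, p, trials=5000):
    counts = Counter()
    for _ in range(trials):
        H = G.subgraph([v for v in G if random.random() > p])
        comps = list(nx.connected_components(H))
        if len(comps) > 1:                        # disconnected outcomes only
            counts[max(len(c) for c in comps)] += 1
    total = sum(counts.values())
    return {size: n / total for size, n in sorted(counts.items())} if total else {}

print(largest_component_dist(nx.grid_2d_graph(4, 4), p=0.2))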


Redundancy
♦ Redundancy is at the heart of fault tolerance
♦ Redundancy can be defined as the incorporation of extra parts in the design of a system, in such a way that its function is not impaired in the event of a failure
♦ We will study four forms of redundancy:
  ∗ 1. Hardware redundancy
  ∗ 2. Software redundancy
  ∗ 3. Information redundancy
  ∗ 4. Time redundancy


Hardware Redundancy
♦ Extra hardware is added to override the effects of a failed component
♦ Static Hardware Redundancy - for immediate masking of a failure
  ∗ Example: Use three processors (instead of one), each performing the same function; the majority output of these processors can override the wrong output of a single faulty processor
♦ Dynamic Hardware Redundancy - spare components are activated upon the failure of a currently active component
♦ Hybrid Hardware Redundancy - a combination of static and dynamic redundancy techniques
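A minimal sketch of the static-redundancy example above (illustrative, not from the original slides): triple modular redundancy with a majority voter masking the wrong output of one faulty replica.

def majority(a, b, c):
    # Bitwise majority vote: each output bit takes the value produced by
    # at least two of the three replicated modules.
    return (a & b) | (b & c) | (a & c)

good = 0b1011
faulty = 0b0011                           # one replica returns a corrupted result
print(bin(majority(good, good, faulty)))  # 0b1011: the single fault is masked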

Software Redundancy
♦ Software redundancy is provided by having multiple teams of programmers write different versions of the software for the same function
♦ The hope is that such diversity will ensure that not all the copies fail on the same set of input data


Information and Time Redundancy
♦ Information redundancy: provided by adding bits to the original data bits so that an error in the data bits can be detected and even corrected
♦ Error-detecting and error-correcting codes have been developed and are widely used
♦ Information redundancy often requires hardware redundancy to process the additional bits
♦ Time redundancy: provided by having additional time during which a failed execution can be repeated
♦ Most failures are transient - they go away after some time
♦ If enough slack time is available, the failed unit can recover and redo the affected computation
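A minimal sketch of information redundancy (illustrative, not from the original slides): a single even-parity bit, the simplest error-detecting code, detects any single-bit error at the cost of one extra bit.

def add_parity(bits):
    # Append a check bit making the total number of 1s even.
    return bits + [sum(bits) % 2]

def parity_ok(word):
    return sum(word) % 2 == 0

word = add_parity([1, 0, 1, 1])
assert parity_ok(word)
word[2] ^= 1                 # a transient fault flips one bit
print(parity_ok(word))       # False: the error is detected (but not located)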

Hardware Faults Classification
♦ Three types of faults:
♦ Transient Faults - disappear after a relatively short time
  ∗ Example - a memory cell whose contents are changed spuriously due to some electromagnetic interference
  ∗ Overwriting the memory cell with the right content will make the fault go away
♦ Permanent Faults - never go away; the component has to be repaired or replaced
♦ Intermittent Faults - cycle between active and benign states
  ∗ Example - a loose connection


Failure Rate
♦ The rate at which a component suffers faults depends on its age, the ambient temperature, any voltage or physical shocks that it suffers, and the technology
♦ The dependence on age is usually captured by the bathtub curve:
[Figure omitted: the bathtub curve - failure rate as a function of component age]


Bathtub Curve
♦ When components are very young, the failure rate is quite high: there is a good chance that some units with manufacturing defects slipped through quality control and were released
♦ As time goes on, these defective units are weeded out, and a unit spends the bulk of its life showing a fairly constant failure rate
♦ As the unit becomes very old, aging effects start to take over, and the failure rate rises again


Empirical Formula for λ - the Failure Rate
♦ λ = πL πQ (C1 πT πV + C2 πE)
  ∗ πL: Learning factor (how mature the technology is)
  ∗ πQ: Manufacturing process Quality factor (from 0.25 to 20.00)
  ∗ πT: Temperature factor (from 0.1 to 1000), proportional to exp(-Ea/kT), where Ea is the activation energy in electron-volts associated with the technology, k is the Boltzmann constant, and T is the temperature in Kelvin
  ∗ πV: Voltage stress factor for CMOS devices (from 1 to 10, depending on the supply voltage and the temperature); does not apply to other technologies (set to 1)
  ∗ πE: Environment shock factor: from about 0.4 (air-conditioned environment) to 13.0 (harsh environment)
  ∗ C1, C2: Complexity factors; functions of the number of gates on the chip and the number of pins in the package
  ∗ Further details: the MIL-HDBK-217E handbook
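A minimal sketch evaluating the empirical formula above (not from MIL-HDBK-217E; every factor value, the proportionality constant for πT, and the output units are assumptions chosen only for illustration).

from math import exp

BOLTZMANN_EV = 8.617e-5          # Boltzmann constant in eV/K

def failure_rate(pi_L, pi_Q, C1, C2, pi_V, pi_E, Ea, T_kelvin, scale=1.0):
    # pi_T is proportional to exp(-Ea/kT); 'scale' is the (assumed)
    # proportionality constant for the technology.
    pi_T = scale * exp(-Ea / (BOLTZMANN_EV * T_kelvin))
    return pi_L * pi_Q * (C1 * pi_T * pi_V + C2 * pi_E)

# Illustrative numbers only (all assumed):
lam = failure_rate(pi_L=1.0, pi_Q=2.0, C1=0.01, C2=0.005,
                   pi_V=1.0, pi_E=0.4, Ea=0.4, T_kelvin=323.0, scale=1e6)
print(f"lambda = {lam:.4f} failures per 10^6 hours (assumed units)")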

Environment Impact
♦ Devices operating in space, which is replete with energy-charged particles and can subject devices to severe temperature swings, can be expected to fail more often
♦ The same holds for computers in automobiles (high temperature and vibration) and in industrial applications


Faults vs. Errors
♦ A fault can be either a hardware defect or a software/programming mistake
♦ By contrast, an error is a manifestation of a fault
♦ For example, consider an adder circuit one of whose output lines is stuck at 1. This is a fault, but not (yet) an error. The fault causes an error when the adder is used and the result on that line is supposed to have been a 0, rather than a 1


Propagation of Faults and Errors
♦ Both faults and errors can spread through the system
  ∗ If a chip shorts out power to ground, it may cause nearby chips to fail as well
♦ Errors can spread because the output of one processor is frequently used as input by other processors
  ∗ Adder example: the erroneous result of the faulty adder can be fed into further calculations, thus propagating the error


Containment Zones
♦ To limit such situations, designers incorporate containment zones into systems
♦ These are barriers that reduce the chance that a fault or error in one zone will propagate to another
  ∗ A fault-containment zone can be created by providing an independent power supply to each zone; the designer tries to electrically isolate one zone from another
  ∗ An error-containment zone can be created by using redundant units and voting on their output


Time to Failure - Analytic Model
♦ Consider the following model:
  ∗ N identical components, all operational at time t = 0
  ∗ Each component remains operational until it is hit by a failure
  ∗ All failures are permanent and occur in each component independently of the other components
  ∗ We first concentrate on a single component
♦ T - the lifetime of one component - the time until it fails
♦ T is a random variable
♦ f(t) - the density function of T
♦ F(t) - the cumulative distribution function of T


Probabilistic Interpretation
♦ F(t) - the probability that the component fails at or before time t
  ∗ F(t) = Prob (T ≤ t)
♦ f(t) - the momentary rate of failure
  ∗ f(t)dt = Prob (t ≤ T ≤ t+dt)
♦ Like any density function (defined for t ≥ 0):
  ∗ f(t) ≥ 0 for all t ≥ 0
  ∗ ∫[0,∞] f(t) dt = 1
♦ These functions are related through
  ∗ f(t) = dF(t)/dt
  ∗ F(t) = ∫[0,t] f(s) ds

Reliability and Failure (Hazard) Rate
♦ The reliability of a single component:
  ∗ R(t) = Prob (T > t) = 1 - F(t)
♦ The failure probability of a component at time t, p(t) - the conditional probability that the component fails at time t, given that it has not failed before:
  ∗ p(t) = Prob (t ≤ T ≤ t+dt | T ≥ t) = Prob (t ≤ T ≤ t+dt) / Prob (T ≥ t) = f(t)dt / (1 - F(t))
♦ The failure rate (or hazard rate) of a component at time t, h(t), is defined as p(t)/dt:
  ∗ h(t) = f(t) / (1 - F(t))
♦ Since dR(t)/dt = -f(t), we get h(t) = -(1/R(t)) dR(t)/dt


Constant Failure Rate
♦ If the failure rate is constant over time, h(t) = λ, then
  ∗ dR(t)/dt = -λ R(t),  R(0) = 1
♦ The solution of this differential equation is
  ∗ R(t) = e^(-λt)
  ∗ f(t) = λ e^(-λt)
  ∗ F(t) = 1 - e^(-λt)
♦ A constant failure rate is obtained if and only if T, the lifetime of the component, has an Exponential distribution
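A minimal numeric sketch (not from the slides; the value of λ is assumed): for the Exponential lifetime the hazard rate h(t) = f(t)/(1 - F(t)) is indeed the constant λ, independent of t.

from math import exp

lam = 0.002
for t in (10.0, 100.0, 1000.0):
    f = lam * exp(-lam * t)          # density f(t)
    F = 1 - exp(-lam * t)            # CDF F(t)
    print(t, f / (1 - F))            # always 0.002, independent of t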

Mean Time to Failure
♦ MTTF - the expected value of the lifetime T:
  ∗ MTTF = E[T] = ∫[0,∞] t f(t) dt
♦ Since dR(t)/dt = -f(t),
  ∗ MTTF = -∫[0,∞] t (dR(t)/dt) dt = [-t R(t)] evaluated from 0 to ∞ + ∫[0,∞] R(t) dt
♦ -t R(t) = 0 both at t = 0 and as t → ∞, since R(∞) = 0
♦ Therefore, MTTF = ∫[0,∞] R(t) dt
♦ If the failure rate is a constant λ, then R(t) = e^(-λt) and
  ∗ MTTF = ∫[0,∞] e^(-λt) dt = 1/λ
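A minimal sketch (not from the slides; λ is an assumed value): checking MTTF = ∫[0,∞] R(t) dt = 1/λ numerically for the Exponential case, using scipy quadrature.

from math import exp
from scipy.integrate import quad

lam = 0.002
mttf, _ = quad(lambda t: exp(-lam * t), 0, float("inf"))
print(mttf, 1 / lam)     # both = 500.0 hours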


Weibull Distribution - Introduction
♦ In most calculations of reliability, a constant failure rate λ is assumed or, equivalently, an Exponential distribution for the component lifetime T
♦ There are cases in which this simplifying assumption is inappropriate
♦ Example - during the "infant mortality" and "wear-out" phases of the bathtub curve
♦ In such cases, the Weibull distribution for the lifetime T is often used in reliability calculations


Weibull Distribution - Equation
♦ The Weibull distribution has two parameters, λ and β
♦ The density function of the component lifetime T:
  ∗ f(t) = λ β t^(β-1) e^(-λ t^β)
♦ The failure rate for the Weibull distribution:
  ∗ h(t) = λ β t^(β-1)
♦ The failure rate h(t) is decreasing with time for β < 1, constant for β = 1, and increasing with time for β > 1 - appropriate for the infant mortality, middle, and wear-out phases of the bathtub curve, respectively
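A minimal sketch (not from the slides; the parameter values are assumed): evaluating the Weibull hazard rate h(t) = λ β t^(β-1) in the three regimes of β.

lam = 0.01
for beta in (0.5, 1.0, 2.0):         # infant mortality, constant, wear-out
    h = [lam * beta * t**(beta - 1) for t in (1.0, 10.0, 100.0)]
    print(beta, [f"{x:.4f}" for x in h])   # decreasing, constant, increasing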


MTTF for the Weibull Distribution
♦ The reliability for the Weibull distribution is
  ∗ R(t) = e^(-λ t^β)
♦ The MTTF for the Weibull distribution is
  ∗ MTTF = Γ(1/β) / (β λ^(1/β))
♦ where Γ(x) is the Gamma function
♦ The special case β = 1 is the Exponential distribution with a constant failure rate
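A minimal sketch (not from the slides; λ and β are assumed values): the Weibull MTTF formula checked against direct numeric integration of R(t) = e^(-λ t^β).

from math import exp
from scipy.integrate import quad
from scipy.special import gamma

lam, beta = 0.01, 2.0
closed_form = gamma(1 / beta) / (beta * lam**(1 / beta))
numeric, _ = quad(lambda t: exp(-lam * t**beta), 0, float("inf"))
print(closed_form, numeric)          # both = 8.862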

