
# UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering

**Fault Tolerant Computing**

ECE 655, Part 1: Introduction

C. M. Krishna, Fall 2006

ECE655/Krishna Part.1 .1

Copyright 2006 Koren & Krishna

Prerequisites

♦ Basic courses in
  ∗ Digital Design
  ∗ Hardware Organization/Computer Architecture
  ∗ Probability


**Fault Tolerance - Basic Definition**

♦ Fault-tolerant systems - ideally, systems capable of executing their tasks correctly regardless of either hardware failures or software errors
♦ In practice - we can never guarantee the flawless execution of tasks under all circumstances
♦ We limit ourselves to the types of failures and errors that are more likely to occur


**Need For Fault Tolerance**

♦ 1. Life-critical applications
♦ 2. Harsh environments
♦ 3. Highly complex systems


**Need For Fault Tolerance - Life-Critical Applications**

♦ Life-critical applications such as aircraft, nuclear reactors, chemical plants, and medical equipment
♦ A malfunction of a computer in such applications can lead to catastrophe
♦ Their probability of failure must be extremely low, possibly one in a billion per hour of operation


**Need for Fault Tolerance - Harsh Environment**

♦ A computing system operating in a harsh environment is subjected to
  ∗ electromagnetic disturbances
  ∗ particle hits and the like
♦ A very large number of failures means the system will not produce useful results unless some fault tolerance is incorporated


**Need For Fault Tolerance - Highly Complex Systems**

♦ Complex systems consist of millions of devices
♦ Every physical device has a certain probability of failure
♦ A very large number of devices implies that the likelihood of failures is high
♦ Without fault tolerance, the system would experience faults at a frequency that renders it useless


**Fault Tolerance Measures**

♦ It is important to have proper yardsticks - measures - by which to gauge the effect of fault tolerance
♦ A measure is a mathematical abstraction that expresses only some subset of the object's nature
♦ Which measures are appropriate?


Traditional Measures

♦ Assumption: The system can be in one of two states: "up" or "down"
♦ Examples: A lightbulb is either good or burned out; a wire is either connected or broken
♦ Two traditional measures: Reliability and Availability
♦ Reliability, R(t): the probability that the system is up during the entire interval [0,t], given that it was up at time 0
♦ Availability, A(t): the fraction of time that the system is up during the interval [0,t]
♦ Point Availability, Ap(t): the probability that the system is up at time t
♦ A related measure is MTTF - Mean Time To Failure - the average time the system remains up before it goes down and has to be repaired or replaced
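As a small sketch of how these measures could be estimated from observed behavior (the up-interval log format and helper names below are hypothetical, not from the course):

```python
def availability(up_intervals, t):
    """A(t): fraction of [0, t] during which the system was up,
    given a list of (start, end) up intervals (hypothetical log format)."""
    return sum(min(end, t) - start for start, end in up_intervals if start < t) / t

def mttf(up_durations):
    """MTTF: average time the system stays up before each failure,
    estimated from a list of observed up durations."""
    return sum(up_durations) / len(up_durations)
```

For example, a system up during [0, 4] and [6, 9] has availability 0.7 over [0, 10].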


**Need For More Measures**

♦ The assumption of the system being either "up" or "down" is very limiting
♦ Example: A processor with one of its several hundred million gates stuck at logic value 0, and the rest functional, may affect the output of the processor only once in every 25,000 hours of use
♦ The processor is not fault-free, but cannot be defined as being "down"
♦ More general measures than the traditional reliability and availability are needed


**More General Measures**

♦ Capacity Reliability - the probability that the system capacity (as measured, for example, by throughput) at time t exceeds some given threshold at that time
♦ Another extension - consider everything from the perspective of the application
♦ This approach was taken to define the measure known as Performability


Computational Capacity

♦ Example: N processors in a gracefully degrading system
♦ The system recovers from failures of processors and is useful as long as at least one processor remains operational
♦ Let Pi = Prob{i processors are operational}
♦ R(t) = Σ(i=1 to N) Pi
♦ Let c be the computational capacity of a single processor (e.g., the number of fixed-size tasks it can execute)
♦ Computational capacity of i processors: Ci = i · c
♦ Computational capacity of the system: Σ(i=1 to N) Ci · Pi
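A minimal sketch of these two sums, under the added assumption (not stated on the slide) that processors fail independently with a common per-processor reliability r, so that Pi is binomial:

```python
from math import comb

def processor_state_probs(n, r):
    """P_i = Prob{exactly i of n processors are operational},
    assuming independent failures with per-processor reliability r."""
    return [comb(n, i) * r**i * (1 - r)**(n - i) for i in range(n + 1)]

def system_reliability(n, r):
    """R = sum of P_i over i = 1..n (at least one processor up)."""
    return sum(processor_state_probs(n, r)[1:])

def expected_capacity(n, r, c):
    """Expected computational capacity: sum of C_i * P_i with C_i = i*c."""
    P = processor_state_probs(n, r)
    return sum(i * c * P[i] for i in range(n + 1))
```

For instance, with 4 processors of reliability 0.9, the system reliability is 1 − 0.1⁴ and the expected capacity is 4 · 0.9 · c.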


Performability

♦ The application is used to define "accomplishment levels" L1, L2, ..., Ln
♦ Each represents a level of quality of service delivered by the application
♦ Example: Li indicates i system crashes during the mission time period T
♦ Performability is a vector (P(L1), P(L2), ..., P(Ln)), where P(Li) is the probability that the computer functions well enough to permit the application to reach up to accomplishment level Li


**Network Connectivity Measures**

♦ Focus on the network that connects the processors
♦ Classical Node and Line Connectivity - the minimum number of nodes and lines, respectively, that have to fail before the network becomes disconnected
♦ The measure indicates how vulnerable the network is to disconnection
♦ A network disconnected by the failure of just one (critically positioned) node is potentially more vulnerable than one that requires several nodes to fail before it becomes disconnected


Connectivity - Examples


**Network Resilience Measures**

♦ Classical connectivity distinguishes between only two network states: connected and disconnected
♦ It says nothing about how the network degrades as nodes fail before it becomes disconnected
♦ Two possible resilience measures:
  ∗ Average node-pair distance
  ∗ Network diameter - the maximum node-pair distance
♦ Both are calculated given the probability of node and/or link failure
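For a fixed (already-degraded) topology, both quantities can be computed by breadth-first search; this sketch assumes an undirected, connected graph given as an adjacency mapping:

```python
from collections import deque

def pairwise_distances(adj):
    """BFS from every node of an undirected graph {node: set_of_neighbors};
    returns a dict mapping (u, v) to hop distance."""
    dist = {}
    for src in adj:
        seen = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    q.append(v)
        for v, d in seen.items():
            dist[(src, v)] = d
    return dist

def resilience_measures(adj):
    """Average node-pair distance and network diameter
    (maximum node-pair distance) of a connected network."""
    d = [v for (a, b), v in pairwise_distances(adj).items() if a != b]
    return sum(d) / len(d), max(d)
```

On a 4-node ring, for example, the average node-pair distance is 4/3 and the diameter is 2.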


**More Network Measures**

♦ What happens upon network disconnection?
♦ A network that splits into one large connected component and several small pieces may still be able to function, vs. a network that splinters into a large number of small pieces
♦ Another measure of a network's resilience to failure: the probability distribution of the size of the largest component upon disconnection


Redundancy

♦ Redundancy is at the heart of fault tolerance
♦ Redundancy can be defined as the incorporation of extra parts in the design of a system in such a way that its function is not impaired in the event of a failure
♦ We will study four forms of redundancy:
  ∗ 1. Hardware redundancy
  ∗ 2. Software redundancy
  ∗ 3. Information redundancy
  ∗ 4. Time redundancy


Hardware Redundancy

♦ Extra hardware is added to override the effects of a failed component
♦ Static Hardware Redundancy - for immediate masking of a failure
♦ Example: Use three processors (instead of one), each performing the same function. The majority output of these processors can override the wrong output of a single faulty processor
♦ Dynamic Hardware Redundancy - spare components are activated upon the failure of a currently active component
♦ Hybrid Hardware Redundancy - a combination of static and dynamic redundancy techniques
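The three-processor example above is triple modular redundancy (TMR); a minimal sketch of its majority voter:

```python
def tmr_vote(a, b, c):
    """Triple modular redundancy voter: return the majority of three
    replicated outputs, masking any single faulty unit."""
    if a == b or a == c:
        return a
    return b  # either b == c, or all three disagree (fault not maskable)
```

A single faulty replica is simply outvoted: `tmr_vote(5, 5, 9)` returns 5.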


Software Redundancy

♦ Software redundancy is provided by having multiple teams of programmers write different versions of software for the same function
♦ The hope is that such diversity will ensure that not all the copies will fail on the same set of input data


**Information and Time Redundancy**

♦ Information redundancy: provided by adding bits to the original data bits so that an error in the data bits can be detected and even corrected
♦ Error-detecting and error-correcting codes have been developed and are being used
♦ Information redundancy often requires hardware redundancy to process the additional bits
♦ Time redundancy: provided by having additional time during which a failed execution can be repeated
♦ Most failures are transient - they go away after some time
♦ If enough slack time is available, the failed unit can recover and redo the affected computation
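The simplest instance of information redundancy is a single even-parity bit, sketched below (this detects any odd number of flipped bits, but corrects nothing):

```python
def add_parity(bits):
    """Even-parity encoding: append one bit so the total number of 1s
    is even - a minimal form of information redundancy."""
    return bits + [sum(bits) % 2]

def check_parity(codeword):
    """Return True iff the codeword has even parity (no detected error)."""
    return sum(codeword) % 2 == 0
```

Flipping any single bit of a valid codeword makes `check_parity` fail, illustrating single-error detection.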


**Hardware Faults Classification**

♦ Three types of faults:
♦ Transient Faults - disappear after a relatively short time
  ∗ Example - a memory cell whose contents are changed spuriously due to some electromagnetic interference
  ∗ Overwriting the memory cell with the right content will make the fault go away
♦ Permanent Faults - never go away; the component has to be repaired or replaced
♦ Intermittent Faults - cycle between active and benign states
  ∗ Example - a loose connection


Failure Rate

♦ The rate at which a component suffers faults depends on its age, the ambient temperature, any voltage or physical shocks that it suffers, and the technology
♦ The dependence on age is usually captured by the bathtub curve:


Bathtub Curve

♦ When components are very young, the failure rate is quite high: there is a good chance that some units with manufacturing defects slipped through manufacturing quality control and were released
♦ As time goes on, these units are weeded out, and the unit spends the bulk of its life showing a fairly constant failure rate
♦ As it becomes very old, aging effects start to take over, and the failure rate rises again


Empirical Formula for λ - Failure Rate

♦ λ = πL πQ (C1 πT πV + C2 πE)
  ∗ πL: Learning factor (how mature the technology is)
  ∗ πQ: Manufacturing process Quality factor (0.25 to 20.00)
  ∗ πT: Temperature factor (from 0.1 to 1000), proportional to exp(-Ea/kT), where Ea is the activation energy in electron-volts associated with the technology, k is the Boltzmann constant, and T is the temperature in Kelvin
  ∗ πV: Voltage stress factor for CMOS devices (from 1 to 10, depending on the supply voltage and the temperature); does not apply to other technologies (set to 1)
  ∗ πE: Environment shock factor: from about 0.4 (air-conditioned environment) to 13.0 (harsh environment)
  ∗ C1, C2: Complexity factors; functions of the number of gates on the chip and the number of pins in the package
  ∗ Further details: MIL-HDBK-217E handbook
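As a sketch, the formula above translates directly into code; note that the factor values used in the test are placeholders for illustration, not values taken from MIL-HDBK-217E:

```python
from math import exp

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, in eV/K

def failure_rate(pi_L, pi_Q, C1, pi_T, pi_V, C2, pi_E):
    """lambda = pi_L * pi_Q * (C1*pi_T*pi_V + C2*pi_E)."""
    return pi_L * pi_Q * (C1 * pi_T * pi_V + C2 * pi_E)

def temperature_factor(Ea_eV, T_kelvin, scale):
    """pi_T is proportional to exp(-Ea/(k*T)); the proportionality
    constant `scale` is technology-dependent (hypothetical here)."""
    return scale * exp(-Ea_eV / (BOLTZMANN_EV * T_kelvin))
```

The exponential form captures the familiar effect that the failure rate grows rapidly with operating temperature.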

Environment Impact

♦ Devices operating in space, which is replete with energy-charged particles and can subject devices to severe temperature swings, can be expected to fail more often
♦ The same holds for computers in automobiles (high temperature and vibration) and in industrial applications


Faults Vs. Errors

♦ A fault can be either a hardware defect or a software/programming mistake
♦ By contrast, an error is a manifestation of a fault
♦ For example, consider an adder circuit one of whose output lines is stuck at 1. This is a fault, but not (yet) an error. The fault causes an error when the adder is used and the result on that line is supposed to be a 0, rather than a 1


**Propagation of Faults and Errors**

♦ Both faults and errors can spread through the system
  ∗ If a chip shorts out power to ground, it may cause nearby chips to fail as well
♦ Errors can spread because the output of one processor is frequently used as input by other processors
  ∗ Adder example: the erroneous result of the faulty adder can be fed into further calculations, thus propagating the error


Containment Zones

♦ To limit such propagation, designers incorporate containment zones into systems
♦ These are barriers that reduce the chance that a fault or error in one zone will propagate to another
  ∗ A fault-containment zone can be created by providing an independent power supply to each zone
  ∗ The designer tries to electrically isolate one zone from another
  ∗ An error-containment zone can be created by using redundant units and voting on their outputs


**Time to Failure - Analytic Model**

♦ Consider the following model:
  ∗ N identical components, all operational at time t = 0
  ∗ Each component remains operational until it is hit by a failure
  ∗ All failures are permanent and occur in each component independently of the other components
  ∗ We first concentrate on one component
♦ T - the lifetime of one component - the time until it fails
♦ T is a random variable
♦ f(t) - the density function of T
♦ F(t) - the cumulative distribution function of T


Probabilistic Interpretation

♦ F(t) - the probability that the component fails at or before time t
  ∗ F(t) = Prob(T ≤ t)
♦ f(t) - the momentary rate of failure
  ∗ f(t)dt = Prob(t ≤ T ≤ t+dt)
♦ Like any density function (defined for t ≥ 0):
  ∗ f(t) ≥ 0 for all t ≥ 0
  ∗ ∫₀^∞ f(t) dt = 1
♦ The two functions are related through
  ∗ f(t) = dF(t)/dt
  ∗ F(t) = ∫₀^t f(s) ds

**Reliability and Failure (Hazard) Rate**

♦ The reliability of a single component: R(t) = Prob(T > t) = 1 - F(t)
♦ The failure probability of a component at time t, p(t) - the conditional probability that the component fails at time t, given that it has not failed before
  ∗ p(t) = Prob(t ≤ T ≤ t+dt | T ≥ t) = Prob(t ≤ T ≤ t+dt) / Prob(T ≥ t) = f(t)dt / (1 - F(t))
♦ The failure rate (or hazard rate) of a component at time t, h(t), is defined as p(t)/dt
  ∗ h(t) = f(t) / (1 - F(t))
♦ Since dR(t)/dt = -f(t), we get h(t) = -(1/R(t)) dR(t)/dt


**Constant Failure Rate**

♦ If the failure rate is constant over time, h(t) = λ, then
  ∗ dR(t)/dt = -λ R(t), with R(0) = 1
♦ The solution of this differential equation is
  ∗ R(t) = e^(-λt)
  ∗ f(t) = λ e^(-λt)
  ∗ F(t) = 1 - e^(-λt)
♦ A constant failure rate is obtained if and only if T, the lifetime of the component, has an exponential distribution
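A quick numerical check (a sketch, not from the slides) that the exponential lifetime indeed has a constant hazard rate h(t) = f(t)/R(t) = λ:

```python
from math import exp

def reliability(lam, t):
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    return exp(-lam * t)

def hazard(lam, t, dt=1e-6):
    """Numerical hazard rate h(t) = f(t) / R(t), with f(t) approximated
    by a forward difference of F(t) = 1 - R(t)."""
    f = (reliability(lam, t) - reliability(lam, t + dt)) / dt
    return f / reliability(lam, t)
```

Evaluating `hazard` at different times returns (approximately) the same λ, confirming the constant-rate property.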


**Mean Time to Failure**

♦ MTTF - the expected value of the lifetime T
  ∗ MTTF = E[T] = ∫₀^∞ t f(t) dt
♦ Since dR(t)/dt = -f(t), integrating by parts gives
  ∗ MTTF = -∫₀^∞ t (dR(t)/dt) dt = [-t R(t)]₀^∞ + ∫₀^∞ R(t) dt
♦ -t R(t) = 0 when t = 0 and as t → ∞, since R(∞) = 0
♦ Therefore, MTTF = ∫₀^∞ R(t) dt
♦ If the failure rate is a constant λ, then R(t) = e^(-λt) and
  ∗ MTTF = ∫₀^∞ e^(-λt) dt = 1/λ
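The identity MTTF = ∫₀^∞ R(t) dt can be checked numerically; this sketch truncates the integral at a t_max where R(t) is negligible:

```python
from math import exp

def mttf_from_reliability(R, t_max, n=200_000):
    """MTTF = integral of R(t) dt over [0, t_max], trapezoidal rule
    (t_max must be chosen so that R(t_max) is essentially 0)."""
    h = t_max / n
    total = 0.5 * (R(0.0) + R(t_max))
    total += sum(R(i * h) for i in range(1, n))
    return total * h

lam = 0.25
mttf = mttf_from_reliability(lambda t: exp(-lam * t), t_max=200.0)
# For a constant failure rate, this should approach 1/lam = 4.0
```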


**Weibull Distribution - Introduction**

♦ In most calculations of reliability, a constant failure rate λ is assumed - or, equivalently, the Exponential distribution for the component lifetime T
♦ There are cases in which this simplifying assumption is inappropriate
♦ Example - during the "infant mortality" and "wear-out" phases of the bathtub curve
♦ In such cases, the Weibull distribution for the lifetime T is often used for reliability calculations


**Weibull Distribution - Equation**

♦ The Weibull distribution has two parameters, λ and β
♦ The density function of the component lifetime T:
  ∗ f(t) = λ β t^(β-1) e^(-λ t^β)
♦ The failure rate for the Weibull distribution is
  ∗ h(t) = λ β t^(β-1)
♦ The failure rate h(t) is decreasing with time for β < 1, increasing with time for β > 1, and constant for β = 1 - appropriate for the infant-mortality, wear-out, and middle phases, respectively
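A small sketch of these two functions, with the three β regimes described above checked in the test:

```python
from math import exp

def weibull_hazard(lam, beta, t):
    """Weibull failure rate: h(t) = lambda * beta * t**(beta - 1)."""
    return lam * beta * t ** (beta - 1)

def weibull_reliability(lam, beta, t):
    """Weibull reliability: R(t) = exp(-lambda * t**beta)."""
    return exp(-lam * t ** beta)
```

Setting β = 1 recovers the exponential case: h(t) = λ for all t.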


**MTTF for the Weibull Distribution**

♦ The reliability for the Weibull distribution is
  ∗ R(t) = e^(-λ t^β)
♦ The MTTF for the Weibull distribution is
  ∗ MTTF = Γ(1/β) / (β λ^(1/β))
  where Γ(x) is the Gamma function
♦ The special case β = 1 is the Exponential distribution with a constant failure rate
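The closed form can be cross-checked against a numerical integral of R(t) (a sketch; t_max is assumed large enough that R(t_max) is negligible):

```python
from math import exp, gamma

def weibull_mttf(lam, beta):
    """Closed form: MTTF = Gamma(1/beta) / (beta * lam**(1/beta))."""
    return gamma(1.0 / beta) / (beta * lam ** (1.0 / beta))

def weibull_mttf_numeric(lam, beta, t_max=100.0, n=100_000):
    """MTTF = integral of R(t) = exp(-lam * t**beta), trapezoidal rule."""
    h = t_max / n
    R = lambda t: exp(-lam * t ** beta)
    return h * (0.5 * (R(0.0) + R(t_max)) + sum(R(i * h) for i in range(1, n)))
```

For β = 1 the closed form reduces to 1/λ, matching the exponential MTTF derived earlier.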
