Fault Tolerant Design of Digital Systems
4.1 The Importance of Fault Tolerance
Fault-tolerant design can dramatically improve system availability and, because the system fails less often, substantially reduce maintenance costs.
4.2 Basic Concepts of Fault Tolerance
4.3 Static Redundancy
• Also known as “masking redundancy”
• Two major techniques employed:
1. Triple modular redundancy
2. Use of error correcting codes
4.3.1 Triple Modular Redundancy (TMR)
• Could be expanded to NMR (N-modular redundancy).
• An NMR system can tolerate up to n module failures, where n = (N-1)/2.
• In general, in an NMR system N is an odd number.
CEN 491 • Fault-Tolerance

The reliability equation of an NMR system is:

$$R_{NMR} = \sum_{i=0}^{n} \binom{N}{i} (1 - R_M)^{i} \, R_M^{(N-i)}$$

For the TMR case, N = 3 and n = 1:

$$\begin{aligned}
R_{TMR} &= \sum_{i=0}^{1} \binom{3}{i} (1 - R_M)^{i} \, R_M^{(3-i)} \\
        &= \binom{3}{0} (1 - R_M)^{0} R_M^{3} + \binom{3}{1} (1 - R_M)^{1} R_M^{2} \\
        &= R_M^{3} + 3 (1 - R_M) R_M^{2} \\
        &= R_M^{3} + 3 R_M^{2} - 3 R_M^{3} \\
        &= 3 R_M^{2} - 2 R_M^{3}
\end{aligned}$$

Note:

$$\binom{n}{r} = \frac{n!}{r!\,(n-r)!}, \qquad \binom{n}{0} = 1, \qquad \binom{n}{1} = n$$

• Another way to calculate R_TMR:

R_TMR = Probability of all three modules functioning + Probability of any two modules functioning

$$R_{TMR} = R_M^{3} + 3 R_M^{2} (1 - R_M) = 3 R_M^{2} - 2 R_M^{3}$$

Exercise: Evaluate R_TMR if R_M = 0.4, 0.5 and 0.6.
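The NMR reliability sum and the TMR closed form above can be cross-checked numerically. The following is a minimal sketch, not part of the original notes: the function names are hypothetical, the voter is assumed to be perfect (it does not appear in the formula), and the value R_M = 0.9 is chosen only for illustration.

```python
from math import comb


def nmr_reliability(n_modules: int, r_m: float) -> float:
    """Reliability of an N-modular-redundant system (voter assumed perfect).

    The system survives while at most n = (N - 1) / 2 modules have failed.
    """
    if n_modules < 1 or n_modules % 2 == 0:
        raise ValueError("N is assumed to be a positive odd number")
    n_tolerated = (n_modules - 1) // 2
    # Sum over i = number of failed modules, from 0 up to n.
    return sum(
        comb(n_modules, i) * (1.0 - r_m) ** i * r_m ** (n_modules - i)
        for i in range(n_tolerated + 1)
    )


def tmr_reliability(r_m: float) -> float:
    """Closed form for N = 3: 3*R_M^2 - 2*R_M^3."""
    return 3.0 * r_m**2 - 2.0 * r_m**3


if __name__ == "__main__":
    r_m = 0.9  # illustrative module reliability, not a value from the notes
    print(nmr_reliability(3, r_m), tmr_reliability(r_m))
```

Both calls should agree up to floating-point rounding, since the closed form is just the N = 3 sum written out.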
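Reliability & MTBF & Failure rate

For a constant failure rate λ:

$$R_M = e^{-\lambda t}$$

$$MTBF_M = \int_0^\infty R_M \, dt = \int_0^\infty e^{-\lambda t} \, dt = \left[ -\frac{1}{\lambda} e^{-\lambda t} \right]_0^\infty = 0 + \frac{1}{\lambda} = \frac{1}{\lambda}$$

Thus, for TMR, where $R_{TMR} = 3e^{-2\lambda t} - 2e^{-3\lambda t}$:

$$\begin{aligned}
MTBF_{TMR} &= \int_0^\infty \left( 3e^{-2\lambda t} - 2e^{-3\lambda t} \right) dt \\
           &= \left[ -\frac{3}{2\lambda} e^{-2\lambda t} + \frac{2}{3\lambda} e^{-3\lambda t} \right]_0^\infty \\
           &= \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{9 - 4}{6\lambda} = \frac{5}{6\lambda}
\end{aligned}$$

NOTE: $\frac{5}{6\lambda} < \frac{1}{\lambda}$. We should look for a more useful parameter than MTBF.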
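The two MTBF integrals can be confirmed numerically. This is a rough sketch under stated assumptions: the function names, the trapezoidal rule, the truncation of the infinite integral at 50 mean lives, and the value λ = 2 are all illustrative choices, not part of the notes.

```python
from math import exp


def simplex_reliability_at(t: float, lam: float) -> float:
    """Single module with constant failure rate: R_M(t) = e^(-lam*t)."""
    return exp(-lam * t)


def tmr_reliability_at(t: float, lam: float) -> float:
    """TMR with identical modules: 3e^(-2*lam*t) - 2e^(-3*lam*t)."""
    return 3.0 * exp(-2.0 * lam * t) - 2.0 * exp(-3.0 * lam * t)


def mtbf_numeric(reliability, lam: float,
                 horizon_in_mean_lives: float = 50.0,
                 steps: int = 100_000) -> float:
    """Trapezoidal estimate of the integral of reliability(t, lam) over [0, horizon].

    The upper limit replaces infinity; the tail beyond it is assumed negligible.
    """
    upper = horizon_in_mean_lives / lam
    h = upper / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(upper, lam))
    for k in range(1, steps):
        total += reliability(k * h, lam)
    return total * h


if __name__ == "__main__":
    lam = 2.0  # illustrative failure rate
    print(mtbf_numeric(simplex_reliability_at, lam), 1.0 / lam)
    print(mtbf_numeric(tmr_reliability_at, lam), 5.0 / (6.0 * lam))
```

Each printed pair should match closely, with the TMR figure smaller than the single-module figure.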
Other Parameters for evaluating system reliability

• Reliability Improvement Factor (RIF):

$$RIF = \frac{1 - R_N}{1 - R_R}$$

where 1 − R_R is the probability of failure of the redundant system and 1 − R_N is the probability of failure of the non-redundant system.

• Mission Time Improvement Factor (MTIF):

$$MTIF = \frac{T_R}{T_N} \quad \text{at } R_f$$

where R_f is some predetermined reliability (e.g. 0.99 or 0.90), while T_R and T_N are the times at which the system reliabilities R_R(t) and R_N(t), respectively, fall to the value R_f.

The reliability of the voter element

If the voter has the reliability $e^{-\lambda_v t}$, then the reliability of the TMR becomes:

$$R_{TMR} = e^{-\lambda_v t} \left( 3e^{-2\lambda t} - 2e^{-3\lambda t} \right)$$

If $\lambda_v \ge \lambda$, the reliability of the system is less than that of the original system for all t. Thus, we have to improve the reliability of the voter.

$$R_{sys} = (R_M R_V)^3 + 3 (R_M R_V)^2 (1 - R_M R_V)$$

where R_V is the reliability of the voter.
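As an illustration of the RIF, the MTIF and the voter effect, the sketch below compares a TMR system with a single (non-redundant) module under a constant failure rate. All function names are hypothetical, the bisection search is just one convenient way to find T_R, and the numeric values are illustrative assumptions.

```python
from math import exp, log


def simplex(t: float, lam: float) -> float:
    """Non-redundant module: R_N(t) = e^(-lam*t)."""
    return exp(-lam * t)


def tmr(t: float, lam: float) -> float:
    """TMR with a perfect voter: R_R(t) = 3e^(-2*lam*t) - 2e^(-3*lam*t)."""
    return 3.0 * exp(-2.0 * lam * t) - 2.0 * exp(-3.0 * lam * t)


def tmr_with_voter(t: float, lam: float, lam_v: float) -> float:
    """TMR whose single voter has reliability e^(-lam_v*t)."""
    return exp(-lam_v * t) * tmr(t, lam)


def rif(r_nonredundant: float, r_redundant: float) -> float:
    """Reliability Improvement Factor: (1 - R_N) / (1 - R_R)."""
    return (1.0 - r_nonredundant) / (1.0 - r_redundant)


def time_to_reach(reliability, r_f: float) -> float:
    """Time at which a decreasing reliability function falls to r_f (bisection)."""
    if not 0.0 < r_f < 1.0:
        raise ValueError("r_f must lie strictly between 0 and 1")
    lo, hi = 0.0, 1.0
    while reliability(hi) > r_f:  # widen the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reliability(mid) > r_f:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def mtif(r_f: float, lam: float) -> float:
    """Mission Time Improvement Factor T_R / T_N of TMR over a single module."""
    t_r = time_to_reach(lambda t: tmr(t, lam), r_f)
    t_n = -log(r_f) / lam  # closed form for the single module
    return t_r / t_n


if __name__ == "__main__":
    lam = 1.0  # illustrative failure rate
    print("RIF at t = 0.1:", rif(simplex(0.1, lam), tmr(0.1, lam)))
    print("MTIF at R_f = 0.99:", mtif(0.99, lam))
    print("voter with lam_v = lam:", tmr_with_voter(0.1, lam, lam), "<", simplex(0.1, lam))
```

In this setup the MTIF is greater than 1 for a high R_f such as 0.99 and drops to 1 at R_f = 0.5, where the TMR and single-module reliability curves cross.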
The major advantages of the TMR scheme

Major advantages of the TMR are:
1. The fault-masking action occurs immediately; both temporary and permanent faults are masked.
2. No separate fault detection is necessary before masking.
3. The conversion from a non-redundant system to a TMR system is straightforward.

4.4 Dynamic redundancy

• A system with dynamic redundancy consists of several modules but with only one operating at a time. If a fault is detected in the operating module, it is switched out and replaced by a spare.
• It requires consecutive actions of fault detection and fault recovery.
• A dynamic redundant system with S spares has a reliability:

$$R = 1 - (1 - R_m)^{(S+1)}$$

where R_m is the reliability of each module, active or spare, in the system. This reliability function is obtained assuming that the fault detection and the switchover mechanism are perfect.
• The reliability R is an increasing function of the number of spare modules.
• However, the use of too many spares may have a detrimental effect on the system reliability.
• Losq has shown that for every dynamic redundant system there exists a finite best number of spares for a given mission time:
a) When the mission time is extremely short ⇒ one spare is best.
b) When the mission time is less than one-tenth of the simplex (i.e. non-redundant) mean-life ⇒ five spares or fewer is the best.
• In general, dynamic redundant systems can be divided into two categories:
(a) Cold-standby system.
(b) Hot-standby system.
• The detection of a fault in the individual modules of a dynamic system can be achieved by using one of the following techniques:
1. Periodic tests:
   o Offline.
   o Disadvantage: cannot detect temporary faults unless they occur while the module is tested.
2. Self-checking circuits: provide a very cost-effective method of fault detection.
3. Watchdog timers: timer, checkpoints.
• Reconfiguration: switching out the faulty element and selecting the system output to come from one of the alternative modules.
• Retry: so that a module will not be removed because of a temporary fault.
• Self-repair: the replacement is invisible to the user and the system continues its operation uninterrupted.
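Two pieces of the dynamic-redundancy discussion can be sketched in code: the reliability of a system with S spares, R = 1 − (1 − R_m)^(S+1), and the retry-then-reconfigure recovery sequence. This is an illustrative sketch only. Every name in it (`standby_reliability`, `ModuleFault`, `make_module`, `run_with_spares`) is hypothetical, a module is modelled as a plain function that raises an exception when a fault is detected, and fault detection and switchover are assumed perfect, as in the formula.

```python
from typing import Callable, Sequence, Tuple


def standby_reliability(r_m: float, spares: int) -> float:
    """R = 1 - (1 - R_m)^(S + 1), assuming perfect detection and switchover."""
    if spares < 0:
        raise ValueError("number of spares must be non-negative")
    return 1.0 - (1.0 - r_m) ** (spares + 1)


class ModuleFault(Exception):
    """Signals that fault detection flagged the module's current attempt."""


def make_module(value: int, faulty_calls: int) -> Callable[[], int]:
    """Hypothetical module: its first `faulty_calls` calls fail, later calls return `value`."""
    calls = 0

    def module() -> int:
        nonlocal calls
        calls += 1
        if calls <= faulty_calls:
            raise ModuleFault("fault detected")
        return value

    return module


def run_with_spares(modules: Sequence[Callable[[], int]],
                    retries: int = 1) -> Tuple[int, int]:
    """Return (output, index of the module that produced it).

    Retry: a faulty attempt is repeated up to `retries` times, so a temporary
    fault does not remove the module. Reconfiguration: once the retries are
    exhausted, the output is taken from the next spare instead.
    """
    for index, module in enumerate(modules):
        for _attempt in range(retries + 1):
            try:
                return module(), index
            except ModuleFault:
                continue  # try the same module again
    raise RuntimeError("all modules, active and spare, have failed")


if __name__ == "__main__":
    for s in range(4):
        print(s, standby_reliability(0.9, s))
    # A temporary fault is absorbed by one retry; module 0 stays in service.
    print(run_with_spares([make_module(42, 1), make_module(7, 0)], retries=1))
```

Because the formula assumes perfect detection and switchover, `standby_reliability` always grows with the number of spares; it does not model the detrimental effect of too many spares noted in the bullets above.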