
ECE 753 - FAULT-TOLERANT COMPUTING


(Spring 2010-2011)

Examination
CLOSED BOOK Kewal K. Saluja

Date: April 13, 2011 Location: Room 3024 Engineering Hall Time: 7:15 PM Duration: 90 Minutes

No       1        2        3            4            5       6    7             8              TOTAL
PROBLEM  General  Testing  Reliability  Reliability  System  ECC  Cyclic Codes  Checkpointing
POINTS   11       13       8            14           13      18   15            8              100
SCORE

Show your work carefully for both full and partial credit. You will be given credit only for what appears on your exam.

Last Name (Please print):
First Name:
ID Number:

SOLUTION

ECE 753: Fault-Tolerant Computing

1. (11 points) General questions

(a) (2 points) Define the term fault secure.
Answer: The circuit continues to give correct results even in the presence of a fault, or it produces a non-code word for correct code inputs.

(b) (2 points) Why is self-testing an important property of self-checking circuits?
Answer: This property assures that a circuit can be tested using only valid code words as inputs. Thus, it provides the ability to test a circuit on-line (during normal operation) without introducing extraneous, non-code words.

(c) (2 points) Define the term checkpointing latency.
Answer: The total time needed to save the checkpoint information on a stable store.

(d) (1 point) What is an orphan message?
Answer: A message that has been received but has not been sent, i.e. the message does not have a parent (sender) yet.

(e) (2 points) What is the difference between reliability and availability?
Answer: Reliability does not allow going back to an operational state once the system fails, whereas availability allows such transitions in the system state.

(f) (2 points) Name two methods that can reduce the number of signatures in the context of the watchdog technique.
Answer: Branch address hashing; path signatures.

Spring 2010-11 (LEC: Saluja)

2. (13 points) Testing and test generation

Consider the combinational circuit of Figure 1.

[Figure 1 diagram not reproduced. Recoverable labels: primary inputs A, B, C, D; lines a, b, d, e; outputs Out1, Out2.]

Figure 1: A circuit for testing

(a) (9 points) Two faults, f1 and f2, in a circuit are said to be equivalent if the circuit function at each output in the presence of fault f1 is identical to the function at that output in the presence of fault f2. Answer the following; you must show your work.

i. (3 points) Is the fault at line c stuck-at 1 equivalent to the fault at line e stuck-at 1?
Answer: These two faults are equivalent. For both faults Out2 is always 1, and Out1 is not affected by either fault.

ii. (3 points) Is the fault at line a stuck-at 1 equivalent to the fault at line b stuck-at 1?
Answer: These two faults are not equivalent. The fault a stuck-at 1 does not affect the output Out2, whereas line b affects Out2 and is not redundant with respect to Out2.

iii. (3 points) Is the fault at line b stuck-at 0 equivalent to the fault at line d stuck-at 0?
Answer: These two faults are not equivalent. The output Out1 is not the same for these two faults, because line d stuck-at 0 does not affect Out1 whereas line b stuck-at 0 makes Out2 equal to 0.

(b) (2 points) Based on the structure of the circuit, what is the lower bound on the maximum number of variables that would need to be assigned binary values to detect a fault on line b?
Answer: Two, because assigning A and B will excite the fault as well as sensitize it to Out1.

(c) (2 points) Based on the structure of the circuit, what is the lower bound on the maximum number of variables that would need to be assigned binary values to detect a fault on line c?
Answer: Four. This fault can be detected only on Out2, and hence we may have to assign all four variables when relying only on structural information.


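3. (8 points) Reliability modeling (Reliability Block Diagram)

The reliability equation of a system with 5 modules, A, B, C, D and E, is given below:

$$R_{system} = R_C\,[1 - (1 - R_A)(1 - R_B)]\,[1 - (1 - R_D)(1 - R_E)] + (1 - R_C)\,[1 - (1 - R_A R_D)(1 - R_B R_E)]$$

Draw the reliability block diagram of the system based on the above expression.

Answer: Figure 2 below.

[Figure 2 diagram not reproduced. The expression describes a bridge network: one path through A then D, a second path through B then E, and module C connecting the midpoints of the two paths.]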

Figure 2: Non series-parallel system satisfying reliability equation.
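The expression can be sanity-checked against a brute-force enumeration of module states. The sketch below assumes the block diagram is a bridge network (paths A-D and B-E, with C linking the two midpoints); the function names and the sample reliabilities are illustrative and do not come from the exam.

```python
from itertools import product

def system_works(a, b, c, d, e):
    """Assumed bridge structure: success paths A-D, B-E, A-C-E and B-C-D."""
    return bool((a and d) or (b and e) or (a and c and e) or (b and c and d))

def reliability_by_enumeration(r):
    """Add up the probability of every up/down combination in which the system works."""
    total = 0.0
    for states in product((0, 1), repeat=5):
        if system_works(*states):
            p = 1.0
            for up, ri in zip(states, r):
                p *= ri if up else 1.0 - ri
            total += p
    return total

def reliability_by_formula(ra, rb, rc, rd, re):
    """The expression from the problem statement, conditioned on module C."""
    c_up = (1 - (1 - ra) * (1 - rb)) * (1 - (1 - rd) * (1 - re))
    c_down = 1 - (1 - ra * rd) * (1 - rb * re)
    return rc * c_up + (1 - rc) * c_down

# Arbitrary illustrative module reliabilities in the order A, B, C, D, E.
sample = (0.9, 0.8, 0.7, 0.6, 0.95)
print(reliability_by_enumeration(sample), reliability_by_formula(*sample))
```

If the two printed values agree for several choices of module reliabilities, the assumed bridge structure is consistent with the given expression.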


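4. (14 points) Reliability modeling (Markov Model)

A fault-tolerant system is to be modeled using a Markov model with three states, namely O, T and R. The following set of differential equations is obtained from the Markov model. [The rate symbols were lost in extraction; they are written here as $\lambda$, $\lambda_1$ and $\lambda_2$.]

$$\frac{d}{dt}\,p_O(t) = -\lambda\,p_O(t) + \lambda_2\,p_R(t)$$
$$\frac{d}{dt}\,p_T(t) = \lambda\,p_O(t) - \lambda_1\,p_T(t)$$
$$\frac{d}{dt}\,p_R(t) = \lambda_1\,p_T(t) - \lambda_2\,p_R(t)$$

(a) (3 points) Write the A matrix, i.e. the equations in matrix form.
Answer:

$$\frac{d}{dt}\begin{bmatrix} p_O(t) \\ p_T(t) \\ p_R(t) \end{bmatrix} = \begin{bmatrix} -\lambda & 0 & \lambda_2 \\ \lambda & -\lambda_1 & 0 \\ 0 & \lambda_1 & -\lambda_2 \end{bmatrix} \begin{bmatrix} p_O(t) \\ p_T(t) \\ p_R(t) \end{bmatrix}$$

(b) (5 points) Draw the Markov model of the system.
Answer: See Figure 3.

[Figure 3 diagram not reproduced. From the equations, the chain has transitions O to T at rate $\lambda$, T to R at rate $\lambda_1$, and R to O at rate $\lambda_2$.]

Figure 3: Markov chain of a fault tolerant system

(c) (6 points) The reliability of the above system is obtained by solving the Markov model, and it is:

$$Rel(t) = \frac{\lambda}{\lambda - \lambda_1}\,e^{-\lambda_1 t} - \frac{\lambda_1}{\lambda - \lambda_1}\,e^{-\lambda t}$$

What is the MTTF of the system? Use any method you like to solve for this.

Answer: To obtain the MTTF of the system one can compute

$$MTTF = \int_0^{\infty} Rel(t)\,dt$$

On integration and simplification you will find

$$MTTF = \frac{1}{\lambda} + \frac{1}{\lambda_1}$$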
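The integration step can be written out term by term. This is a sketch that assumes the reliability is a difference of two decaying exponentials with distinct rates, written here as $\lambda$ and $\lambda_1$ (the symbol names are an assumption), and that uses $\int_0^\infty e^{-at}\,dt = 1/a$ for $a > 0$.

```latex
\begin{aligned}
\mathrm{MTTF} &= \int_0^\infty \mathrm{Rel}(t)\,dt
  = \frac{\lambda}{\lambda-\lambda_1}\int_0^\infty e^{-\lambda_1 t}\,dt
  - \frac{\lambda_1}{\lambda-\lambda_1}\int_0^\infty e^{-\lambda t}\,dt \\
&= \frac{\lambda}{(\lambda-\lambda_1)\,\lambda_1}
  - \frac{\lambda_1}{(\lambda-\lambda_1)\,\lambda}
  = \frac{\lambda^2-\lambda_1^2}{(\lambda-\lambda_1)\,\lambda\,\lambda_1} \\
&= \frac{\lambda+\lambda_1}{\lambda\,\lambda_1}
  = \frac{1}{\lambda}+\frac{1}{\lambda_1}
\end{aligned}
```

The result is the sum of the mean sojourn times in the two operational states, which is what one would expect when the states are visited one after the other before failure.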


5. (13 points) System level diagnosis

(a) (6 points) One-step t-fault diagnosable system

It is known that a system of n units in which no two units test each other is one-step t-fault diagnosable if and only if every unit is tested by at least t other units. I claim that the following system with 6 units, shown in Figure 4, is one-step 2-fault diagnosable even though some units test each other.

[Figure 4 diagram not reproduced. Recoverable labels: units 1 through 6.]

Figure 4: A one-step 2-fault diagnosable system

Prove the above claim by providing convincing argument(s).

Answer: In the above system, if the link from node 2 to 3, or from node 3 to 2, or both links are removed, the resulting system of 6 units is such that no two nodes test each other. In this modified system every node is tested by two other nodes, so it meets the necessary and sufficient condition for one-step 2-fault diagnosability. Restoring the removed links only adds test outcomes, which a diagnosis procedure is free to ignore, so the system with the additional links is also one-step 2-fault diagnosable.


(b) (7 points) Single loop system

Consider a single loop system consisting of 10 units v1, v2, ..., v10. In this system unit v1 tests v2, v2 tests v3, etc., and finally v10 tests v1. This system is sequentially 4-fault diagnosable. For the following syndrome, identify as many faulty units as you can and provide reasoning for your answer. The syndrome is written with the outcome of the test v1 → v2 as the first element.

Tester    v1  v2  v3  v4  v5  v6  v7  v8  v9  v10
Outcome    1   1   1   1   1   1   1   0   0    1

Answer: In this system we can diagnose all the faulty units in one step as follows:

Step 1: Unit v1 must be faulty. If it is not, then v2, v8, v9 and v10 must be faulty. In addition, one unit from each of the pairs (v3,v4) and (v5,v6) must also be faulty, because if the test outcome between a pair of units is 1, then at least one of the units of that pair must be faulty. That would give at least six faulty units, which exceeds four. Hence v1 is faulty.

Step 2: From each of the pairs (v2,v3), (v4,v5), (v6,v7) at least one unit must be faulty. Together with v1 this makes a total of four faulty units in the system. Now we assert that from the pair (v6,v7) the unit v7 must be faulty. If not, then in addition to v6 being faulty v8 must also be faulty, and that would make the total number of faulty units at least 5. Hence v7 is faulty and v6 is not.

Step 3: Using an argument similar to Step 2 on the pairs (v2,v3) and (v4,v5), and the fact that v6 is not faulty, conclude that v5 must be faulty.

Step 4: The same argument finally leads to unit v3 being faulty.

Hence the above syndrome leads to the conclusion that units v1, v3, v5, v7 are faulty.
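The conclusion can be cross-checked by exhaustive search. The sketch below assumes the usual test model in which a fault-free tester always reports the true status of the unit it tests, while a faulty tester may report anything; it enumerates every fault set of at most 4 units and keeps those consistent with the syndrome. The variable and function names are illustrative.

```python
from itertools import combinations

N, T = 10, 4
# syndrome[i] is the outcome of unit v(i+1) testing the next unit on the loop.
syndrome = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]

def consistent(faulty):
    """A fault-free tester must report 1 exactly when the tested unit is faulty;
    outcomes produced by faulty testers are unconstrained (assumed model)."""
    for i, outcome in enumerate(syndrome):
        tester = i + 1
        tested = (i + 1) % N + 1
        if tester not in faulty and outcome != (tested in faulty):
            return False
    return True

# Every fault set with at most T units that could have produced the syndrome.
candidates = [
    set(units)
    for size in range(T + 1)
    for units in combinations(range(1, N + 1), size)
    if consistent(set(units))
]
print(candidates)
```

A single surviving candidate means the syndrome identifies the fault set uniquely under the stated fault bound.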


6. (18 points) Error detection and correction coding

(a) (6 points) A (9,6) linear block code with six information bits, i1, i2, i3, i4, i5, i6, and three parity bits, p1, p2, p3, uses the following three equations to realize the three parity bits:

p1 = even parity over the even-numbered information bits
p2 = even parity over the odd-numbered information bits
p3 = even parity over all information bits

Write its generator matrix in systematic form in the following table. For your convenience I have completed the first row and the first column of the table.

G =
     i1  i2  i3  i4  i5  i6  p1  p2  p3
      1   0   0   0   0   0   0   1   1
      0   _   _   _   _   _   _   _   _
      0   _   _   _   _   _   _   _   _
      0   _   _   _   _   _   _   _   _
      0   _   _   _   _   _   _   _   _
      0   _   _   _   _   _   _   _   _

Answer: The completed generator matrix is given below.

G =
     i1  i2  i3  i4  i5  i6  p1  p2  p3
      1   0   0   0   0   0   0   1   1
      0   1   0   0   0   0   1   0   1
      0   0   1   0   0   0   0   1   1
      0   0   0   1   0   0   1   0   1
      0   0   0   0   1   0   0   1   1
      0   0   0   0   0   1   1   0   1
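The rows of the generator matrix follow mechanically from the three parity equations: row k is the codeword of the information word that has a single 1 in position k. A minimal sketch of that construction follows; the function and variable names are illustrative.

```python
K = 6  # number of information bits i1..i6

def parity_bits(info):
    """Parity bits for a list [i1, ..., i6], following the three parity equations."""
    p1 = sum(info[1::2]) % 2  # even-numbered bits i2, i4, i6
    p2 = sum(info[0::2]) % 2  # odd-numbered bits i1, i3, i5
    p3 = sum(info) % 2        # all information bits
    return [p1, p2, p3]

# Systematic generator matrix: identity part followed by the parity part.
G = []
for row in range(K):
    unit = [1 if col == row else 0 for col in range(K)]
    G.append(unit + parity_bits(unit))

def encode(info):
    """Codeword = info * G over GF(2)."""
    return [sum(info[r] * G[r][c] for r in range(K)) % 2 for c in range(K + 3)]

for row in G:
    print(row)
```

Encoding through `G` and computing the parity bits directly give the same codeword, which is a quick way to confirm the table.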


(b) (12 points) The parity check matrix H of an (11,8) linear block code is given below:

$$H = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$$

Now prove or disprove the following. You must provide an example or a good explanation.

i. (2 points) This is a single error correcting code.
Answer: This is not true. There are many identical columns, e.g. columns 1 and 3 are the same, so an error in bit 1 or in bit 3 will give rise to the same syndrome.

ii. (3 points) It can detect any consecutive two bit errors.
Answer: This is true. The sum of any two consecutive columns is non-zero. If we were to enumerate all cases, we can reduce them by taking advantage of the observation that the first 8 columns contain only two different types of columns, which alternate. The remaining cases are easy to enumerate.

iii. (2 points) It can detect arbitrary two bit errors.
Answer: This is false. Errors in bit positions which correspond to identical columns in the H matrix (such as bits 1 and 3) will give a zero syndrome and hence will not be detected.

iv. (3 + bonus points) It can detect any consecutive three bit errors.
Answer: This is true. You can show it by demonstrating the syndrome for all consecutive three bit errors. However, you can reduce the cases by showing that in the first 8 columns of H we can dispose of them in two cases as follows:

$$\begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} \quad \text{OR} \quad \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}$$

Then there are only three more cases that remain, and they also lead to a non-zero syndrome.

v. (2 points) It can detect any consecutive 4 bit errors.
Answer: This is false. Consider for example errors in the first four bit locations. The corresponding syndrome is 0, hence the four bit error will not be detected.
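The case enumerations in parts i to v can be checked mechanically: an error pattern is detected exactly when the mod-2 sum of the corresponding columns of H is non-zero. The sketch below hard-codes the H matrix from the problem; the helper names are illustrative.

```python
H = [
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1],
]
N = 11  # code length

def syndrome(error_positions):
    """Mod-2 sum of the columns of H at the given 1-based error positions."""
    positions = list(error_positions)
    return tuple(sum(row[p - 1] for p in positions) % 2 for row in H)

def detects_all_bursts(length):
    """True if every run of `length` consecutive bit errors has a non-zero syndrome."""
    return all(
        any(syndrome(range(start, start + length)))
        for start in range(1, N - length + 2)
    )

print(syndrome([1]), syndrome([3]))      # part i: identical columns
print(detects_all_bursts(2))             # part ii
print(syndrome([1, 3]))                  # part iii: undetected double error
print(detects_all_bursts(3))             # part iv
print(syndrome([1, 2, 3, 4]))            # part v: undetected 4-bit burst
```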

7. (15 points) Cyclic code

A 4-bit information word is encoded using a single error correcting (7,4) cyclic code. The corresponding decoder (LFSR) is shown in Figure 5. The 7-bit encoded word is transmitted twice, and the two 7-bit received words (before decoding) are the following. In both cases the least significant bit (LSB) is written on the right.

i) 0 1 1 0 1 0 1
ii) 0 1 1 0 0 0 1

[Figure 5 diagram not reproduced. Recoverable labels: Encoded input, Decoded output.]

Figure 5: Cyclic code decoder

Now answer the following:

(a) (3 points) Write the generating polynomial used for encoding the information word.
Answer: The generator polynomial is $g(x) = 1 + x + x^3$.

(b) (3 points) Write the received word shown in i) above in polynomial form.
Answer: The received word (encoded word) in polynomial form is $R_i(x) = 1 + x^2 + x^4 + x^5$.

(c) (6 points) Assuming that no more than a single error can occur during transmission of 7 bits, identify which of the two received word(s) is (are) in error, if any. Use any method you like (such as decoding or polynomial algebra), but you must show the work you perform to identify the correct and/or the erroneous word(s).
Answer: We divide the polynomial $R_i(x) = 1 + x^2 + x^4 + x^5$ by $g(x)$ and find that the remainder is $x^2$. This is non-zero, hence we can conclude that this word is in error.

Similarly, when we divide the polynomial for the second word, $R_{ii}(x) = 1 + x^4 + x^5$, by $g(x)$ we find that the remainder is 0. Hence we conclude that this is an error-free received word.

(d) (3 points) What was the 4-bit information word (write LSB to the right) that was encoded in the above case?
Answer: To obtain the information word, we can decode the error-free received word using the decoder provided in Figure 5. The simulation of the decoding process is shown below, with each column of the original table listed on one line. Note that the input is $1 + x^4 + x^5$, i.e. 0110001 (LSB to the right).

Input:   1 0 0 0 1 1 0
State:   0 0 0 1 1 1 1
         0 1 1 0 1 0 0
         0 1 1 1 0 0 0
Output:  1 1 1 0 0 0 0

Thus the output will be 0000111. The rightmost 3 bits are the decoded word and the leftmost three bits are the remainder (syndrome) after division. Hence the information word must be 0111 ($1 + x + x^2$).
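The two divisions by g(x) can be reproduced with a few lines of GF(2) polynomial arithmetic. In the sketch below each polynomial is stored as an integer whose bit i is the coefficient of x^i, which matches the "LSB on the right" convention of the problem; the function name is illustrative.

```python
def gf2_divmod(dividend, divisor):
    """Polynomial long division over GF(2).
    Bit i of each integer is the coefficient of x^i. Returns (quotient, remainder)."""
    quotient = 0
    divisor_len = divisor.bit_length()
    while dividend.bit_length() >= divisor_len:
        shift = dividend.bit_length() - divisor_len
        quotient |= 1 << shift         # record the x^shift term of the quotient
        dividend ^= divisor << shift   # subtract (XOR) the shifted divisor
    return quotient, dividend

g = 0b1011            # g(x) = x^3 + x + 1
word_i = 0b0110101    # 1 + x^2 + x^4 + x^5
word_ii = 0b0110001   # 1 + x^4 + x^5

for name, word in (("i", word_i), ("ii", word_ii)):
    q, r = gf2_divmod(word, g)
    print(name, format(q, "04b"), format(r, "03b"))
```

A zero remainder marks a valid codeword, and for a codeword formed as information polynomial times g(x) the quotient is the information polynomial.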


8. (8 points) Checkpointing and recovery

Two processes, P and Q, running in parallel exchange messages and take uncoordinated checkpoints. Figure 6 shows all checkpoints taken by the processes P and Q; however, it does not show the message exchanges. You are required to create a message exchange scenario such that if the process P fails at the point marked in the figure, it will cause a domino effect. The scenario you create must satisfy the following constraint on the message exchanges: no process sends more than one message in any one checkpoint interval, and no process receives more than one message in any one checkpoint interval.

[Figure 6 diagram not reproduced. Recoverable labels: process P with checkpoints PC1, PC2, PC3 and the failure point; process Q with checkpoints QC1, QC2, QC3.]

Figure 6: Figure for demonstrating domino effect.

Answer: A possible message exchange scenario is given in Figure 7.

[Figure 7 diagram not reproduced; the message arrows of the scenario are not recoverable. Labels are the same as in Figure 6.]

Figure 7: Figure for demonstrating domino effect.
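Because the message arrows of Figure 7 are lost, the sketch below builds its own scenario to show how a domino effect unfolds. Every checkpoint time, message time and name in it is hypothetical (chosen so that the checkpoints of Q and P alternate and each checkpoint interval carries at most one send and one receive per process); it is not the scenario from the exam. The rollback rule used is the standard one: a message whose send has been undone but whose receipt is still recorded is an orphan, and its receiver must roll back to its last checkpoint before the receipt.

```python
import bisect

# Hypothetical timeline (not from the exam figures).
checkpoints = {"P": [2.0, 4.0, 6.0], "Q": [1.0, 3.0, 5.0]}
failed_process, failure_time = "P", 7.0
# (sender, send_time, receiver, receive_time)
messages = [
    ("P", 0.2, "Q", 0.6), ("Q", 1.2, "P", 1.6), ("P", 2.2, "Q", 2.6),
    ("Q", 3.2, "P", 3.6), ("P", 4.2, "Q", 4.6), ("Q", 5.2, "P", 5.6),
    ("P", 6.2, "Q", 6.5),
]

def latest_checkpoint_before(process, t):
    """Latest saved state of `process` strictly before time t (0.0 is the initial state)."""
    times = [0.0] + checkpoints[process]
    return times[bisect.bisect_left(times, t) - 1]

def recovery_line():
    """Roll processes back until no orphan message remains."""
    restored = {p: float("inf") for p in checkpoints}  # inf means not rolled back
    restored[failed_process] = latest_checkpoint_before(failed_process, failure_time)
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            # Orphan: the send was undone but the receipt is still in the receiver's state.
            if restored[sender] < sent and restored[receiver] > received:
                restored[receiver] = latest_checkpoint_before(receiver, received)
                changed = True
    return restored

print(recovery_line())
```

In this hypothetical scenario each rollback orphans a message received by the other process, so the rollbacks cascade until both processes return to their initial states, which is the domino effect.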
