
A PORTABLE ERROR MONITORING SOFTWARE TOOL FOR COMMERCIAL MICROPROCESSORS IN SPACE APPLICATIONS

BY HARI N. KOMMARAJU B.Tech., Regional Engineering College, Warangal, 1999

THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering in the Graduate College of the University of Illinois at Urbana-Champaign, 2001

Urbana, Illinois

Certificate of Committee Approval


ABSTRACT

Transient errors due to high-intensity radiation in outer space are a serious problem, especially for commercial off-the-shelf (COTS) microprocessors. These processors are typically not radiation hardened and are therefore susceptible to transient errors. A new portable software-monitoring tool is developed, which gives the transient error rate in a commercial microprocessor and also provides the information to localize these errors to a particular functional block in that processor. The tool is designed to be portable across several different platforms and is intended to provide crucial information about the state of the system. Experiments are conducted using the physical fault injection technique of power supply disturbances. Results of these experiments verify that this tool is capable of detecting transient errors within the processor.


ACKNOWLEDGMENTS

I would like to thank my advisor, Professor Janak H. Patel, for his guidance, and my co-researcher, Karen E. Wells, for her contribution to this work. I would also like to thank my parents for their continued support.


TABLE OF CONTENTS

1 INTRODUCTION
  1.1 Background
  1.2 Software-Monitoring Tool Description
2 PREVIOUS AND RELATED WORK
3 TRANSIENT ERROR OVERVIEW
  3.1 Single-Event Upsets
  3.2 Analysis of Transient Faults
  3.3 Fault Models
4 SOFTWARE ERROR DETECTION TECHNIQUES
  4.1 Time Redundancy
  4.2 Consistency Checks
5 PROCESSOR MODEL
  5.1 Overview of MPC750 Processor
6 TESTING THE FLOATING-POINT UNIT
  6.1 Addition
  6.2 Subtraction
  6.3 Multiplication
  6.4 Division
7 TESTING THE DATA CACHE UNIT
  7.1 Fault Model
  7.2 Functional Testing
8 TESTING THE VECTOR PROCESSING UNIT
  8.1 Vector Processing Unit of MPC7400
  8.2 AltiVec Instruction Set
  8.3 Functional Testing
9 FAULT INJECTION METHODS
  9.1 Physical Fault Injection Techniques
10 EXPERIMENTAL RESULTS
  10.1 Analysis of the Software Tool
    10.1.1 Error statistics
    10.1.2 Error latency
    10.1.3 Tool runtime
  10.2 Hardware and Software Setup
  10.3 Experimental Results
    10.3.1 Power supply variations
    10.3.2 Weak radiation exposure
    10.3.3 Changes in temperature
11 CONCLUSIONS AND FUTURE WORK


APPENDIX A REETOOL USER'S GUIDE
  A.1 Setting Up the Board with the Serial Connection
  A.2 Using vDESKTOP
    A.2.1 Establishing a serial connection
    A.2.2 Downloading the visionWARE file and updating the firmware
  A.3 Running the Program
  A.4 Statistical Output
  A.5 Making Changes to the Program
REFERENCES

LIST OF TABLES

Table
1: Minimum Test Set for Ripple Carry Adders Under MCFM
2: Vectors for Floating-Point Multiplication Operations in IEEE 754 Format
3: Vectors for Floating-Point Division Operations in IEEE 754 Format
4: Operations and Running Times for Each Routine
5: Results of Experiments Based on Nonuniform Power Supply Variations
6: Results of Uniform Power Supply Variation Experiments
7: Command Line Options for REETool

LIST OF FIGURES

Figure
1: MPC750 Processor Block Diagram
2: Example Code for Independent Floating-Point Addition Operations
3: Example Code for Dependent Floating-Point Addition Operations
4: Example Code for Independent Floating-Point Subtraction Operations
5: Example Code for Dependent Floating-Point Subtraction Operations
6: An LFSR of Length Eight
7: Routine to Test Data Cache Unit
8: High-Level Routine to Test Vector Processing Unit
9: Code Excerpt of the Software Tool
10: Experimental Setup

1 INTRODUCTION

This thesis describes a self-contained error-monitoring software tool that was developed to support NASA's Remote Exploration and Experimentation (REE) project, which will develop low-power, scalable, fault-tolerant, high-performance computing for use in space. NASA's Jet Propulsion Laboratory (JPL) supported this work. We present a new method for detecting errors in a microprocessor due to radiation or power supply fluctuations.

1.1 Background

Reliability of the microprocessors used in space applications should be characterized in order to assess the availability of a computer-based space system, of which a microprocessor is the principal part. With the increased use of software in space and the drive toward smaller, better, faster, and more cost-effective satellites, COTS systems are attractive for use in computer-based satellite systems. Even though a large number of radiation-hardened systems are available, they are often more costly, higher in mass and power consumption, higher in volume, and lower in performance than components presently available off-the-shelf. The rapid performance increase in commercial microprocessors continues to outstrip the performance of hardened processors. This performance growth and the need for more computational power in satellites without an accompanying increase in cost, volume, mass, and power consumption suggest the use of commercial processors [1]. However, since COTS microprocessors are not radiation-hardened, they are much more vulnerable to transient failures [2].

Due to the mission-critical nature of these systems, failures can be costly and hazardous. Prevention of such failures is extremely important. The design of highly reliable computer-based space systems requires a thorough understanding of errors in the underlying COTS components. Of particular concern are temporary errors, specifically those due to transient faults. Transient faults are induced by external perturbation such as radiation or power supply fluctuations. Alpha particle radiation in outer space is the main cause of the transient faults. However, there is currently little data for analyzing and detecting transient errors in microprocessors. As a result, various tools and methodologies are being researched to tolerate the effects of radiation on COTS microprocessors in space applications.

We developed a high-level self-contained software-monitoring tool to detect and characterize transient errors in spaceborne applications running on COTS components. It runs on a COTS microprocessor to measure the error rate and also estimates the location of any errors encountered.

1.2 Software-Monitoring Tool Description

The software-monitoring tool is written in the high-level language C. Since one of the goals is to keep the tool portable, low-level processor-specific routines are not implemented in it. This tool is broken down into separate routines for each major functional unit in a general-purpose microprocessor. More details on these functional units in a processor are given in Chapter 5. Each routine in this tool consists of a sequence of predetermined computations that are performed millions of times in order to test the specific functional unit. The operations in each routine are chosen in such a way as to maximize the switching activity in the corresponding functional unit.

There are many advantages of this software-monitoring tool for error measurements. This tool accurately localizes the single-event upset (SEU) to within a small functional block. It also keeps count of the number of errors found in each unit. Flooding the functional units with millions of operations per second minimizes the error latency. Error latency is only a few instruction cycles in most cases. Error coverage is maximized because most of the functional unit is used as fast as possible. Also, the probability of capturing an SEU is very high with the use of enhanced signal activities and the creation of sensitized paths. The tool is portable to different systems and can be used during long space missions to periodically measure the error rates. The error rates can vary during a long mission due to a changing environment and possibly due to aging of the electronics on the chip. This tool can give an early warning to the system of increased error rates.

Some assumptions are made for the tool to work successfully. The type and frequency of the errors are assumed to be such that the error rate is not so high as to crash the monitoring tool very frequently. The only data in that case would be the system crash rate, not an error rate. The errors that do not cause an immediate crash are the most worrisome. They are also far more likely than catastrophic errors, and if they are not detected early, they can multiply and lead to a system crash or to additional new errors. For these reasons, the most interesting case for measurement is for infrequent SEUs.
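The structure described in Section 1.2 (one routine per functional unit, predetermined computations repeated many times, and a per-unit error count) can be illustrated with a minimal sketch. The unit names, operand values, and the function run_monitor below are assumptions made for this illustration; they are not taken from the tool itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unit identifiers; the names are illustrative only. */
enum unit { UNIT_INT_ADD, UNIT_INT_MUL, UNIT_COUNT };

/* One error counter per functional unit under test. */
static unsigned long error_count[UNIT_COUNT];

/* One pass of a predetermined integer addition; returns 1 on a mismatch.
 * volatile keeps the compiler from folding the computation away. */
static int check_int_add(void)
{
    volatile uint32_t a = 0x55555555u, b = 0xAAAAAAAAu;
    return (uint32_t)(a + b) != 0xFFFFFFFFu;
}

/* One pass of a predetermined integer multiplication:
 * 0xFFFF * 0x10001 = 0xFFFFFFFF. */
static int check_int_mul(void)
{
    volatile uint32_t a = 0x0000FFFFu, b = 0x00010001u;
    return (uint32_t)(a * b) != 0xFFFFFFFFu;
}

/* Repeat every routine `iterations` times, tally mismatches per unit,
 * and return the total number of errors seen so far. */
unsigned long run_monitor(unsigned long iterations)
{
    unsigned long total = 0;

    assert(iterations > 0);  /* a zero-length run measures nothing */
    for (unsigned long i = 0; i < iterations; i++) {
        error_count[UNIT_INT_ADD] += (unsigned long)check_int_add();
        error_count[UNIT_INT_MUL] += (unsigned long)check_int_mul();
    }
    for (int u = 0; u < UNIT_COUNT; u++)
        total += error_count[u];
    return total;
}
```

Because each expected result is known in advance, a nonzero entry in error_count points at the unit whose computation disagreed, which is the localization idea described above.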

2 PREVIOUS AND RELATED WORK

Over the last several years, several techniques and software tools for testing the effects of SEUs on microprocessors have been presented:
• In [1], the SEU rates of the Pentium MMX and Pentium II microprocessors using proton irradiation are determined. A comparison is given of each processor's susceptibility to proton irradiation in various modes of operation.
• In Asenek et al. [2], software tools are presented for predicting the rate and nature of observable SEU-induced errors in microprocessor systems. These tools are built around a commercial microprocessor simulator and are used to analyze real satellite application systems. Results obtained from simulating the nature of SEU-induced errors are shown to correlate with ground-based radiation test data.
• In [3] through [7], further detection techniques and processor tests are reported:
  ♦ Two software-based techniques for on-line detection of control flow errors in small computer systems are presented and evaluated by fault injection.
  ♦ Upset tests are done on the MC68020 and the SPARC MHS 90C601, and the corresponding floating-point coprocessors (the 68882 and the 90C602). It is established that the SPARC processor is less vulnerable than the MC68020.
  ♦ Two generations of the 80Cx86 microprocessor family and two floating-point digital signal processors (DSPs) were tested for single-event effects (SEE). The SEE rates for each of the device types were measured, and the upset vulnerability was correlated to the size of the program running on each processor.
  ♦ Proton ion and single-event phenomena (SEP) tests were performed on R3000As from all commercial manufacturers along with tests on the Performance PR3400 family, IDT 79R3081, LSI LR33000HC, and Intel i80960MX parts. An evaluation of the performance of these microprocessors in the space radiation environment is also presented.
  ♦ SEE tests are done on the Intel 80386 microprocessor, the 80387 coprocessor, the 82380 peripheral device, and the 80486 microprocessor. Both single-event upset and latchup conditions were monitored.
• In [8], a feasibility study aimed at adapting a low-cost commercial off-the-shelf microprocessor simulator into a tool to predict the rate of observable SEU-induced errors in microprocessor systems is presented.

All of the above-mentioned techniques present data for specific microprocessors and do not attempt to localize the errors. Also, none of the tools is portable to other systems. Our research, therefore, aims at developing a portable tool, which can run on any microprocessor.

Another area of related work is that done by the NASA Remote Exploration and Experimentation (REE) project itself. REE is working to incorporate a custom, but architecturally insensitive, software-implemented fault tolerance (SIFT) middleware layer, as well as a generic library of algorithm-based fault tolerance (ABFT) techniques, to enable the direct use of latest-generation commercial hardware and software components in future space systems. This strategy will allow high-throughput computation even in the presence of relatively high rates of radiation-induced transient upsets as well as in the presence of permanent faults. Previous work in this area includes:
• In [9], a first-generation test bed is created to test the above concepts. The testbed is equipped with fault injection capabilities and is constructed out of COTS hardware and software.
• In [10], a software approach to overcome radiation-induced errors in spaceborne applications running on commercial off-the-shelf components is proposed. The approach uses numeric checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. Also, result checking is suggested as a way of enforcing hardware/software reliability. Their work focuses on computation errors inherent in the system, rather than on environmentally induced faults.

Other work at the microprocessor level has primarily focused on vulnerability assessment and software detection methods:
• In [11], a series of experiments aimed at error analysis through the physical insertion of faults have been conducted at the NASA AIRLAB test-bed facility. An experimental analysis is presented to study the susceptibility of a microprocessor-based jet engine controller to upsets caused by current and voltage transients.
• In [12], a novel concurrent error detection scheme for FFT networks with low hardware overhead using butterfly networks is proposed. A time redundancy method is used to locate the faulty modules. Another time redundancy method has also been proposed to achieve both concurrent error detection and location with negligible hardware overhead.
• Bagchi et al. [13] proposed a hierarchical error detection framework for a software-implemented fault tolerance (SIFT) layer for a distributed system. A four-level error detection hierarchy is proposed in the context of Chameleon, a software environment for providing adaptive fault-tolerance in an environment of COTS system components and software. The design and implementation of a software-based distributed signature-monitoring scheme are described. Software-implemented error injection is performed to evaluate the system.
• In [14], the design and evaluation of a preemptive control signature technique (PECOS) for on-line detection of control flow errors is presented. The technique uses assertions that can be embedded in the assembly language code and that are triggered by control flow instructions in the code. PECOS can detect errors in control flow that span multiple subroutines or source files as well as control flow that is determined at runtime. Software-based error injection techniques are used to evaluate PECOS.
• In [15], PECOS is used in combination with a data audit subsystem to eliminate fail-silence violations, reduce the incidence of crashes, and eliminate hangs.

All of the above studies have the drawback of being low-level and hardware-specific. Therefore, they are not portable. Our tool has the advantage of portability across various platforms.

3 TRANSIENT ERROR OVERVIEW

A system failure occurs when system behavior is incorrect or interrupted. Failures are caused by errors, which are manifestations of faults in the system. A fault is present in the system when there is a physical difference between good and incorrect (failing or bad) system behavior, but some time may elapse before a fault causes a detectable system error. A transient fault is a nonpermanent fault, which is present for only part of the time, and occurs randomly. A transient error is a manifestation of the transient fault. In this research we are interested in the transient faults and errors, and how to keep these transient errors from causing system failure. Common causes of the transient faults are:
• Cosmic rays
• α-particles (ionized helium atoms)
• Power supply fluctuations
• Electromagnetic interference
• Static electric discharges
• Air pollution
• Humidity
• Temperature
• Pressure
• Vibrations
• Ground loops

However, α-particles are the most common cause of transient errors in space applications; therefore, it is the alpha particle hits that are of most interest in this work.

3.1 Single-Event Upsets

A single-event upset (SEU) is a type of transient error that is commonly caused by cosmic rays and alpha particle radiation. An SEU is a soft error introduced when an ionizing heavy ion penetrates the depletion region of a reverse-biased pn-junction. The heavy ion affects only stored information by changing bit values from 0 to 1 or vice versa. No hardware is damaged by the ions. The majority of heavy ions affect only single bits. The SEUs are assumed to occur randomly both in time and in position within a circuit. Typically the transient introduced by an alpha particle lasts only for a very short period of time, on the order of a fraction of a nanosecond [3].

The shrinking feature sizes, lower supply voltages, lower noise margins, and lower threshold voltages of today's microprocessors make them more susceptible to SEUs. The higher density of interconnects and the high slew rate of logic signals together make crosstalk due to capacitive coupling a serious problem. Also, current leakage and charge injections in dynamic circuits contribute to the susceptibility of microprocessors.

SEUs affecting control result in program flow errors (sequencing errors), that is, upsets in the instruction fetch unit or the branch prediction unit. These are more severe than data errors since SEUs affecting control would be expected to cause an exception. SEUs affecting data are associated with upsets in the execution units such as the ALU, FPU, or data cache. However, SEUs affecting data are particularly troublesome because they typically have fewer obvious consequences than an SEU affecting control. It should be noted that since memory will be error detecting and correcting, faults to memory would be largely screened; most data faults will therefore affect the microprocessor or its cache.

3.2 Analysis of Transient Faults

It has been reported that transient faults account for 80% or more of the failures in digital systems [11]. If the digital system is a critical application, such as a biomedical, space, military, or avionic application, the results could be catastrophic. If the critical areas in the circuit that are more susceptible to transient-fault disturbances are identified, appropriate actions may be taken to prevent those catastrophes.

To analyze the nature of SEU-induced errors in microprocessor systems, it is useful to predict the rate of observable SEU-induced errors in microprocessor systems and to identify and classify the errors into different categories. It is also useful to understand whether the occurrence of an SEU will cause a system to fail catastrophically or to simply lose service for a short duration. This is essential in justifying its suitability for a particular application or mission.

In [8], results were obtained from simulating the nature of SEU-induced errors in a microprocessor as well as from radiation tests causing SEUs. The experiments in this paper used an 8052 microprocessor system executing the COTS2 operating system program [8]. For the radiation tests, of the 131 observed errors, 64% were resets, approximately 31% were due to transient errors, and 5% were propagated errors. This data gives us a good reference point for our studies.

3.3 Fault Models

Fault models allow us to characterize the behavior of errors. Also, we can represent the behavior of physical occurrences. We make the assumption that faults behave according to some fault model. This allows us to specifically define the types of faults that will be considered and the behavior these faults will have [13]. Several defects are usually mapped to one fault model. Though fault models are not 100% accurate, they make problems tractable and represent the types of faults that can occur. The general fault model used for transient faults in this research includes incorrect signal values caused by coupled disturbance. These disturbances include those due to particle irradiation. The fault models and types of errors expected for specific microprocessor units are presented in Chapter 5.

Since the software tool designed in this project is intended for use in space applications where radiation exposure is imminent, a radiation fault model proves to be helpful. Clearly, the design of such a system is dependent on a detailed understanding of the types of faults that will occur, the errors generated by these faults, and the rate at which they will appear. In a non-radiation-hardened system, faults due to radiation are not prevented, but rather are allowed to occur. The error (the logical manifestation of the fault as seen by the system or software) or a subsequent manifestation due to propagation of the error is then detected and handled by the software [9].

The principal faults for the REE environments are single bit flips. However, there is an expectation that multiple bit flips will become more prevalent due to shrinking feature sizes [9]. In this fault mode, several physically adjacent cells may be disrupted by the passage of a single energetic particle. However, these single-event multiple upsets are still relatively rare and, for this research, they are not considered.

From a high-level point of view, faults due to radiation can be grouped into two types:
• Latch faults include faults in latches, flip-flops, memory cells, and any other structure that persistently stores a bit. 'Visible' latches are included, such as data registers, as are 'invisible' latches, which are used to implement structures such as instruction pipeline stages and processor register reservation scoreboards [9].
• Gate faults occur when an SEU happens in a gate, or combinatorial logic, at approximately the same time as a clock transition, thus causing the gate to flip its effective bit value [9].

Due to the tight timing required for a gate fault to propagate to and be latched by a register, faults in latches are tens of thousands of times more likely than faults at the gate level. As the clock rate increases, this difference will shrink due to the increased fraction of time available for combinatorial logic to present erroneous values to registers, which may then latch these transients [9]. More information on these types of faults and the fault model structure can be found in [9].

The goal of this section is to provide a basic idea of what types of faults are expected from radiation and how these faults relate to errors that may be produced. The types of errors expected in specific functional blocks of a microprocessor from this fault model can be found in Chapter 5.
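The single-bit-flip model in Section 3.3 is easy to express in software. The sketch below is only an illustration of that model, not part of the tool: flip_bit inverts one bit of a stored 32-bit word, and bits_in_error counts how many bit positions differ between an expected and an observed value. Both function names are assumptions chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Model a single-bit upset: invert bit `bit` (0..31) of a stored word. */
uint32_t flip_bit(uint32_t word, unsigned bit)
{
    assert(bit < 32);
    return word ^ (UINT32_C(1) << bit);
}

/* Count the bit positions in which the observed value differs from the
 * expected one. A single-event upset under this model yields exactly 1. */
unsigned bits_in_error(uint32_t expected, uint32_t observed)
{
    uint32_t diff = expected ^ observed;
    unsigned n = 0;

    while (diff != 0) {
        diff &= diff - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

A count greater than one from bits_in_error would correspond to the multiple-bit upsets that the text sets aside as rare.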

4 SOFTWARE ERROR DETECTION TECHNIQUES

In general an error can be detected by several kinds of redundancy techniques such as hardware redundancy, information redundancy, time redundancy, and software redundancy. Software error detection techniques based on time and software redundancy are becoming more popular because they do not need the costly hardware as in hardware redundancy techniques. Also in space missions, where extra hardware means more payload, software-based error detection techniques are quite useful.

Several software error detection techniques have been implemented in the past; however, they are not usually portable because they change the original application code in some way, or they are coded completely in machine language. Also, often it is unknown what instruction sequences will be executed and what the results of various operations in the application will be at any given time. Thus all these tools have inherent nondeterministic characteristics.

The tool described in this thesis is a stand-alone monitoring program that was written from scratch in a high-level language, therefore making it deterministic as well as portable. Also, it is known in advance what instructions will be executed and what the results of those instructions should be. This allows the use of some common software error detection techniques such as time redundancy and consistency checks to detect the transient errors occurring in a COTS microprocessor.

4.1 Time Redundancy

Time redundancy is a method that uses the repetition of computations in ways that allow errors to be detected. Time redundancy techniques attempt to reduce the amount of extra hardware required at the expense of using additional time. The fundamental concept is to perform the same computation two or more times and compare the results to determine if a discrepancy exists. If an error is detected, the computations can be performed again to see if the disagreement remains or disappears [16]. If the discrepancy remains, then there is a permanent error present. However, if the discrepancy disappears, then a transient error has occurred.

4.2 Consistency Checks

The consistency check is a popular software redundancy technique. Consistency checks use a priori knowledge about the characteristics of information to verify the correctness of that information. For example, in some applications, it is known in advance that a digital quantity should never exceed a certain magnitude. If the signal exceeds that magnitude, then an error of some sort is present.

Consistency checks can also be used to capture transient errors in software. Combining this idea with the time redundancy method above, the given operation (instruction) can be performed several times, and if the result is consistently something other than expected, there is most likely a permanent error present. However, if the result is the same as expected most of the time and something other than that the rest of the time, then there is likely a transient error occurring.

The software program described in this thesis uses consistency checks as the primary method of detecting errors. A loop consisting of a group of operations that are checked for correctness upon completion is executed millions of times.
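The decision rule that results from combining a consistency check with time redundancy can be sketched as follows. This is a minimal illustration under stated assumptions: the function classify, the verdict names, and the three stand-in computations (healthy, stuck, glitch_once) are hypothetical and exist only to exercise the rule; in the tool the computation would be a real operation on the unit under test.

```c
#include <assert.h>
#include <stdint.h>

enum verdict { NO_ERROR, TRANSIENT_ERROR, PERMANENT_ERROR };

typedef uint32_t (*compute_fn)(void);

/* Compare a computation against its known result. On a mismatch, repeat
 * it up to `retries` times: if the discrepancy disappears the error was
 * transient; if it remains every time it is treated as permanent. */
enum verdict classify(compute_fn compute, uint32_t expected, unsigned retries)
{
    assert(retries > 0);
    if (compute() == expected)
        return NO_ERROR;
    for (unsigned i = 0; i < retries; i++) {
        if (compute() == expected)
            return TRANSIENT_ERROR;  /* discrepancy disappeared */
    }
    return PERMANENT_ERROR;          /* discrepancy remained */
}

/* Hypothetical stand-ins for a unit under test. */
uint32_t healthy(void) { return 42u; }  /* always correct */
uint32_t stuck(void)   { return 43u; }  /* always wrong */
uint32_t glitch_once(void)              /* wrong on the first call only */
{
    static int calls = 0;
    return (calls++ == 0) ? 43u : 42u;
}
```

With these stand-ins, classify(glitch_once, 42u, 3) reports a transient error and classify(stuck, 42u, 3) a permanent one, mirroring the two cases distinguished in Section 4.1.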

5 PROCESSOR MODEL

A model of the general purpose commercial off-the-shelf (COTS) microprocessor has been adopted for this research. Our model incorporates the major functional units, which are found in most commercial processors. It is assumed that the processors have 32-bit registers. Also, processor-specific functional units are not targeted in this tool. Though the vector unit, which is present in some of the processors, is not incorporated in this tool, a detailed study of the possible structure of the routine for the vector unit is provided for future work.

A fault model is then developed for each of these functional units. Since there are many ways to implement each functional unit, we will concentrate on obtaining high switching on the inputs and outputs and assume that this will maximize the activity in any given design. In this way, computations can be generated to test each unit without detailed knowledge of how the instruction sequencing and control section for each unit are implemented.

Also, some units are broken down into smaller units if it is possible to localize errors to smaller areas in the circuitry. For example, the integer unit can be broken down into an add unit, a subtract unit, a multiply unit, etc., because they all use different hardware that can be tested explicitly from a high-level program. However, some units, such as the branch processing unit, cannot be broken down further because it is impossible to write a high-level computation that will exclusively test its subunits such as the branch prediction hardware.

Functional units and their sub-blocks targeted are given below:
• Integer unit 1
  ♦ Addition
  ♦ Subtraction
  ♦ Multiplication
  ♦ Division
  ♦ Logical AND
  ♦ Logical OR
  ♦ Logical XOR
• Integer unit 2
• Register unit
• Floating point unit
  ♦ Addition
  ♦ Subtraction
  ♦ Multiplication
  ♦ Division
• Data cache unit
• Load/store unit
• Vector unit (not incorporated in the tool)
• Instruction fetch unit
• Branch processing unit

The control units, which are targeted in our program, are:
• The instruction fetch unit includes components such as the instruction cache, the instruction register, and any other hardware involved in fetching instructions.
• The branch processing unit includes hardware such as the branch target instruction cache, the branch history table, and any other circuitry involved in branch prediction.

There are other functional units not mentioned above that are extremely difficult to test from a high-level program. For example, detecting errors occurring in the memory management unit (MMU), the dispatch unit, the reservation stations, the completion unit, control registers, and the bus interface unit cannot be thoroughly and exclusively accomplished in a high-level program, and therefore they have not been explicitly targeted in the program. Also, differentiating between errors in the L1 and L2 cache is difficult. But, even though errors in these units are not tested for explicitly, many other operations that are performed will utilize these units, and if errors occur, it is possible that they will be detected indirectly.

The preliminary version of the tool developed in this research was written for a Motorola PowerPC processor, the MPC750. The MPC750 processor contains all of the units described previously and provides a basis for the software tool. A block diagram of this processor is shown in Figure 1.

Figure 1: MPC750 Processor Block Diagram.

5.1 Overview of MPC750 Processor

Some of the major features of the MPC750 are as follows:

• Six independent execution units and two register files
    BPU featuring both static and dynamic branch prediction
      ♦ 64-entry (16-set, four-way, set-associative) branch target instruction cache (BTIC)
      ♦ 512-entry branch history table (BHT) with two bits per entry for four levels of prediction
    Two integer units (IUs) that share 32 GPRs for integer operands
      ♦ IU1 can execute any integer instruction.
      ♦ IU2 can execute all integer instructions except the multiply and divide. Most instructions in the IU2 take only one cycle to execute.
    Three-stage FPU
      ♦ Fully IEEE 754-1985-compliant FPU for both single- and double-precision operations
      ♦ Thirty-two 64-bit FPRs for single- or double-precision operands
    Two-stage LSU
      ♦ Single-cycle, pipelined cache access
      ♦ Three-entry store queue
      ♦ Supports both big- and little-endian modes
• Rename buffers
    Six GPR rename buffers
    Six FPR rename buffers
• Separate on-chip instruction and data caches (Harvard architecture)
    32-kbyte, eight-way set-associative instruction and data caches
    32-byte (eight-word) cache block
    Cache write-back or write-through operation programmable on a per-page or per-block basis
    Instruction cache can provide four instructions per clock; data cache can provide two words per clock.
    Caches can be disabled or locked in software.

Since this research was split between two students, only the floating-point unit, data cache unit, and vector unit are discussed in this thesis. For a detailed description of the rest of the functional units, refer to the master's thesis by Wells [17]. The following chapters present more detail on how the testing was done for each unit.

6 TESTING THE FLOATING-POINT UNIT

The floating-point unit (FPU) is a quite complex functional unit in any COTS processor compared to the integer units. Quite often the FPU is called an arithmetic coprocessor, which works with the primary CPU. Therefore there is a good chance that more errors occur in the FPU than in the integer units. It is also reported in [19] that in a 32-bit microprocessor, the FPU is found to be more sensitive to fault injection than the integer unit and memory controller. They worked on a 32-bit SPARC compatible processor for space applications. Janusz et al. [18] proposed a new technique for extensive testing of the Intel-compatible floating-point units. They developed deterministic and pseudorandom (PSR) test procedures for testing FPUs, implemented in a high-level language. However, this technique is not suitable to our research because they deal with manufacturing testing rather than functional testing, which we are interested in.

Typically, the binary representation of floating-point numbers can be divided into three parts: sign (S), exponent (E), and mantissa (M), also called significand or fraction. The number value is (-1)^s * m * 2^e. We assume that all the COTS processors are compliant with the IEEE 754 floating-point standard. It is also assumed that the floating-point unit in a commercial processor handles both single- and double-precision arithmetic. The IEEE single precision floating-point standard representation requires a 32-bit word, which may be represented as bits numbered from 0 to 31. The first bit (bit 0) is the sign bit S, the next eight bits are the exponent bits E, and the final 23 bits are the mantissa M. The IEEE double precision floating-point standard representation requires a 64-bit word, which may be represented as bits numbered from 0 to 63. The first bit (bit 0) is the sign bit S, the next eleven bits are the exponent bits E, and the final 52 bits are the mantissa M. Accuracy of the floating-point computations is affected by the hardware implementation of the FPU. This sometimes leads to inconsistency in the results of floating-point computations across different platforms.

The operations used for each floating-point routine are either independent or dependent computations. The independent computations are executed in a separate loop from that of the dependent computations. Independent computations are executed one after another, and then the consistency checks are performed after the last computation is completed. The idea is to fill up the arithmetic pipeline by flooding it with operations. Several dependent instructions are performed consecutively, and then a single consistency check is performed at the end. Dependent operations in a pipeline typically require more complex control than independent operations, and this is a good way to check the functionality of that hardware. Both the independent and dependent instruction streams mentioned above will exercise different pipeline control. They will also check for some types of control flow errors.

6.1 Addition

The major subblocks in a typical FPU are: adder/subtractor, multiplier/divider, and elementary function calculation circuitry. All these subblocks are quite complex. For example, the adder/subtractor is composed of mantissa ALU, exponent comparator (adder/subtractor), argument alignment circuitry (barrel shifters), leading zero detector, rounding circuitry, and normalization circuitry. If two normalized floating-point numbers are added together, their mantissas are added in the mantissa ALU and the new exponent is obtained by a simple exponent circuitry. For normalized numbers the mantissa is 1.f with f (binary fraction) stored in the mantissa field (23 or 52 bits). It is also assumed that there are more normalized instructions than denormalized instructions. Since most of the processing happens in the mantissa ALU, it is primarily targeted in our work. The multiple cell fault model (MCFM) [20], which is used for integer addition, can be applied here again for the reasons mentioned above. The vectors generated for FPU addition are based on the minimum test set given in Table 1. The decimal values of the vectors, when converted into the binary form (IEEE standard), result in bit patterns similar to the minimal test set given in Table 1.

Table 1: Minimum Test Set for Ripple Carry Adders Under MCFM.

    Vector        c0       x0y0
    t1            0        00
    t2            0        00
    t3            0        01
    t4            0        00
    t5            0        10
    t6            0        00
    t7            0        11
    t8  (t8')     1 (0)    00 (10)
    t9  (t9')     1 (0)    01 (11)
    t10 (t10')    1 (0)    10 (11)
    t11 (t11')    1 (0)    11

    [The entries for columns x1y1 through x5y5 are not legible in the source.]

Note that in Table 1, control over the carry bit c0 is required in some of the vectors (t8-t11). However, it is difficult, if not impossible, to explicitly set the carry-in bit from a high-level program. As a result, these vectors must be modified. The x0y0 bits for vectors t8-t11 are modified so that the carry will be rippled through as intended. The new vectors are labeled t8'-t11' and represent the changes made in parentheses shown in the table.

Figure 2 shows the code excerpt with the independent operations along with the consistency checks. Figure 3 shows the code excerpt with the dependent operations with some random vectors.

    CheckFPUAddition()
    {
        register double R0, R1;
        ...
        for (int i = 0; i < LOOP_ITERATIONS; i++) {
            R0 = 2.0 + 2.0;                                 // vector t1
            R1 = 3.9999999999999996 + 3.9999999999999996;   // vector t2
            ...
            if (R0 != 4.00000000000000000000e+00) UpdateStats(Err, fpua);
            if (R1 != 7.99999999999999910000e+00) UpdateStats(Err, fpua);
            ...
        }
    }

Figure 2: Example Code for Independent Floating-Point Addition Operations.
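The correspondence between the decimal test vectors and their IEEE 754 bit patterns can be checked with a short sketch. The helper names below are hypothetical and not part of the tool; the sketch assumes that the C type double is an IEEE 754 double-precision value, which holds on the platforms discussed here but is not guaranteed by the C language. Note that the text numbers bits from the most significant end (bit 0 is the sign), whereas the shifts below count from the least significant end.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw 64-bit pattern of a double.
 * memcpy avoids the undefined behaviour of pointer type punning. */
uint64_t double_bits(double d)
{
    uint64_t bits;
    assert(sizeof(double) == sizeof(uint64_t));
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Field extraction for the double-precision layout:
 * 1 sign bit, 11 exponent bits, 52 mantissa bits. */
unsigned sign_field(double d)
{
    return (unsigned)(double_bits(d) >> 63);
}

unsigned exponent_field(double d)
{
    return (unsigned)((double_bits(d) >> 52) & 0x7FFu);
}

uint64_t mantissa_field(double d)
{
    return double_bits(d) & 0x000FFFFFFFFFFFFFull;
}
```

For instance, 2.0 has an all-zero mantissa field, while 3.9999999999999996 has an all-one mantissa field, which is why these two operands resemble the all-zeros and all-ones patterns of an adder test set.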

    CheckFPUAddition()
    {
        register double R0, R1;
        ...
        for (int i = 0; i < LOOP_ITERATIONS; i++) {
            // random vectors
            R0 = 0.0;
            R1 = R1 + 7.8479999e+02;
            R1 = R1 + 1.075959e+33;
            ...
            if (R0 != 2.1234318614070305e34) UpdateStats(Err, fpua);
        }
    }

Figure 3: Example Code for Dependent Floating-Point Addition Operations.

6.2 Subtraction

The fault model used for subtraction is the same as that of addition. Therefore, the same test vectors are used. A separate routine is used for subtraction because subtraction involves one more step than addition, a 2's complement step, and the instruction code for subtraction will be different. The most common implementation of subtraction is to take the 2's complement of the subtrahend and add it to the minuend. It is also possible that subtraction is done using the addition hardware and some additional circuitry for 2's complement. We execute the subtraction routine soon after the addition routine. If there are no errors in the addition routine, then any errors occurring in the subtraction routine are likely due to the extra hardware required for subtraction. Figure 4 shows the code excerpt with the independent operations along with the consistency checks. Figure 5 shows the code excerpt with the dependent operations with some random vectors.

    CheckFPUSubtraction()
    {
        register double R0, R1;
        ...
        for (int i = 0; i < LOOP_ITERATIONS; i++) {
            R0 = 2.0 - 2.0;                                    // vector t1
            R1 = 3.9999999999999996 - -3.9999999999999996;     // vector t2
            R2 = 2.0 - 3.333333333333333;                      // vector t3
            ...
            if (R0 != 0.00000000000000000000e+00) UpdateStats(Err, fpus);
            if (R1 != 7.99999999999999910000e+00) UpdateStats(Err, fpus);
            if (R2 != -1.33333333333333300000e+00) UpdateStats(Err, fpus);
            ...
        }
    }

Figure 4: Example Code for Independent Floating-Point Subtraction Operations.

    CheckFPUSubtraction()
    {
        register double R0, R1;
        ...
        for (int i = 0; i < LOOP_ITERATIONS; i++) {
            // random vectors
            R0 = 2.1234318614070305e34;
            R1 = R1 - 7.8479999e+02;
            R1 = R1 - 1.075959e+33;
            ...
            if (R0 != 2.0138161680000002e+34) UpdateStats(Err, fpus);
        }
    }

Figure 5: Example Code for Dependent Floating-Point Subtraction Operations.

6.3 Multiplication

The routine used to test the multiplication unit uses vectors in which the switching activity of the multiplicand, multiplier, and product is maximized. Ideally every bit needs to be flipped after each instruction. However, obtaining such vectors is difficult. Therefore, we settle for vectors which give reasonable switching activity. Also, several dependent instructions are executed and a final check is performed at the end. Table 2 shows the hexadecimal vectors used for the independent operations in the multiplication unit in the IEEE 754 format.

Table 2: Vectors for Floating-Point Multiplication Operations in IEEE 754 Format.

    Multiplicand          Multiplier            Product               Number of bit-flips in product
    0x4000000000000000    0x4000000000000000    0x4010000000000000     0
    0x400FFFFFFFFFFFFF    0x400FFFFFFFFFFFFF    0x402FFFFFFFFFFFFE    53
    0x4000000000000000    0x400AAAAAAAAAAAAA    0x401AAAAAAAAAAAAA    27
    0x400FFFFFFFFFFFFF    0x4000000000000001    0x4020000000000000    28
    0x4000000000000000    0x4005555555555555    0x4015555555555555    28
    0x4000000000000000    0x400FFFFFFFFFFFFF    0x4020000000000000    28
    0xC00AAAAAAAAAAAAA    0x4000000000000000    0xC01AAAAAAAAAAAAA    29
    0xC004924924924924    0x4004924924924924    0xC01A72F05397829B    24
    0xC005555555555555    0x4000000000000000    0xC015555555555555    27
    0xC009249249249249    0x4009249249249249    0xC023C14E5E0A72F0    28
    0xC002492492492492    0x4002492492492492    0xC014E5E0A72F0539    28

6.4 Division

Division and multiplication are usually done using the same hardware in an FPU. It also involves the same issues addressed in creating the vectors for multiplication. Again, it is difficult to maximize the switching in the hardware without knowing the exact implementation. The vectors are chosen so that the bits in the dividend, divisor, and quotient are switching as frequently as possible in a high-level program. While these vectors do not accomplish perfect switching among the bits for each step, they are satisfactory in attempting to catch the control flow errors in the division loop. Also, several dependent instructions are executed and a final check is performed at the end. Table 3 shows the hexadecimal vectors used for the independent operations in the division unit in the IEEE 754 format.

Table 3: Vectors for Floating-Point Division Operations in IEEE 754 Format.

    Dividend              Divisor               Quotient              Number of bit-flips in quotient
    0x4000000000000000    0x4000000000000000    0x3FF0000000000000     0
    0x400FFFFFFFFFFFFF    0x400FFFFFFFFFFFFF    0x3FF0000000000000     0
    0x4000000000000000    0x400AAAAAAAAAAAAA    0x3FE3333333333334    26
    0x400FFFFFFFFFFFFF    0x4000000000000001    0x3FFFFFFFFFFFFFFD    27
    0x4000000000000000    0x4005555555555555    0x3FE8000000000000    50
    0x4000000000000000    0x400FFFFFFFFFFFFF    0x3FE0000000000002     2
    0xC00AAAAAAAAAAAAA    0x4000000000000000    0xCFFAAAAAAAAAAAAA    26
    0xC004924924924924    0x4004924924924924    0xCFF0000000000000    26
    0xC005555555555555    0x4000000000000000    0xCFF5555555555555    26
    0xC009249249249249    0x4009249249249249    0xCFF0000000000000    26
    0xC002492492492492    0x4002492492492492    0xCFF0000000000000     0
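The bit-flip counts in the tables can be read as the number of bit positions that change between one result and the next. A minimal sketch of that measure follows; the function names are hypothetical, and treating the count as the Hamming distance between consecutive rows is an interpretation that matches the first rows of Table 2 rather than a definition given in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bit positions that differ between two 64-bit patterns. */
unsigned bit_flips(uint64_t previous, uint64_t current)
{
    uint64_t diff = previous ^ current;
    unsigned count = 0;

    while (diff != 0) {
        diff &= diff - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Total switching over a sequence of n results, summed pairwise. */
unsigned total_bit_flips(const uint64_t *patterns, size_t n)
{
    unsigned total = 0;

    assert(patterns != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        total += bit_flips(patterns[i - 1], patterns[i]);
    return total;
}
```

A candidate vector sequence could be scored with total_bit_flips before it is added to a test routine, so that sequences with little switching are rejected early.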

7 TESTING THE DATA CACHE UNIT

Cache is the first level of the memory hierarchy encountered once the address leaves the CPU. Performance of a COTS processor is significantly affected by the size and access time of the caches. Most of today's caches are divided into two levels based on their size and access time. Level 1 (L1) cache, which is smaller and faster, is on the chip, and level 2 (L2) cache, which is bigger and slower, is off the chip. On-chip L1 cache is further divided into data and instruction caches. Data and instruction caches, which are both on and off the chip, are continuing to occupy a significant part of the silicon in the present generation commercial processors. It has been reported that cache is more susceptible to single-event upsets than the central processing unit (CPU) and the floating-point processor (FPP) when subjected to radiation [21]. Therefore, testing on-chip cache is of utmost importance in the commercial processors used for space applications.

Our work in this section deals with the testing of the data cache unit. It is hard to explicitly test the instruction cache by a program written in a high-level language. However, the instruction cache is accessed whenever any instruction in the program is executed. Therefore it is possible that the tool will detect some transient errors in the instruction unit.

7.1 Fault Model

An error in a data cache may lead to the following types of errors:
1. Incorrect data returned by the data cache
2. Cache miss even if the data is in the data cache

7.2 Functional Testing

In our data cache test routine, we first try to flood the data cache unit with more data than its maximum capacity. This is done by defining an array which is bigger than the size of the data cache. Also, the array is divided into two halves and the elements of the first half are the same as those of the second half in the order of their index. Then two elements of this array, which are separated by the distance of half the size of the array, are accessed. Both these data, which are supposed to be the same, are then compared. If they are the same, there is no error in the data cache unit. If they are not equal, it means that there might be a transient error in the data cache unit. This array initialization and comparison of two of its elements is repeated millions of times. This strategy is very effective because the processor will be exercising most of the data cache continuously and the activity of the other functional units is minimal. It might be accessing the instruction fetch unit and load/store unit, but most of the processing is done using the data cache unit, which is our target in this routine.

The effectiveness of the strategy mentioned above depends on how good the test data (or vectors) stored in this array are. The more unique these vectors, the greater the chance of detecting a transient error in the data cache unit. The functional vectors for the data cache unit are derived using a pseudo-exhaustive test generator. Linear feedback shift registers (LFSRs) are the most efficient and popular pseudo-exhaustive test pattern generation mechanism. They are used in built-in self test (BIST) schemes. An LFSR is a shift register that, when clocked, advances the signal through the register from one bit to the next-most-significant bit. Some of the outputs are combined in exclusive-OR configuration to form a feedback mechanism. Figure 6 shows a typical LFSR of length eight.

Figure 6: An LFSR of Length Eight.

The quality of the tests generated by the LFSR depends on the characteristic polynomial and the initial state (seed). A maximal-length LFSR produces the maximum number of patterns (vectors) possible and has a pattern count equal to 2^n - 1, where n is the number of register elements in the LFSR. It produces patterns that have an approximately equal number of 1s and 0s and have an equal number of runs of 1s and 0s. We have selected one of the characteristic polynomials given in [22]. More information on LFSRs can be found in [22]. The test vectors generated in our program are 32 bits wide. The maximum size of the target cache being tested can be varied in the tool. By default the vectors are generated in this tool to test a data cache with a maximum size of around 780 kbytes, which is well above the size of the data cache found in present-generation commercial processors. A flowchart of the routine which tests the data cache unit is given in Figure 7.

    Start
    Initialize the lfsr_reg and mask_reg
        /* lfsr_reg is the initial state of the register;
           mask_reg is the characteristic polynomial */
    i = 0, j = 1, Cache_Errors = 0
    Is i > TEST_SIZE?
        No:  Generate a new vector V (or data);
             Mem_array[i] = V; Mem_array[TEST_SIZE+i] = V;
             i = i+1; repeat the test on i
        Yes: Is j > TEST_SIZE?
             No:  Is Mem_array[j] = Mem_array[TEST_SIZE+j]?
                      No:  Cache_Errors = Cache_Errors + 1
                      Yes: continue with the next j
             Yes: Exit

Figure 7: Routine to Test Data Cache Unit.
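The routine in Figure 7 can be sketched in C as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, the LFSR is written in the right-shifting Galois form rather than the form drawn in Figure 6, and the feedback mask 0x80200003 used in the usage note is a commonly cited 32-bit value that is assumed here and is not necessarily the polynomial selected from [22].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One step of a 32-bit Galois LFSR. mask plays the role of mask_reg
 * (the characteristic polynomial). With the top mask bit set, a nonzero
 * state can never become zero. */
uint32_t lfsr_next(uint32_t state, uint32_t mask)
{
    uint32_t lsb = state & 1u;

    state >>= 1;
    if (lsb)
        state ^= mask;
    return state;
}

/* Fill both halves of mem_array (2 * test_size words) with the same
 * sequence of pseudo-random vectors. volatile forces real stores. */
void fill_halves(volatile uint32_t *mem_array, size_t test_size,
                 uint32_t seed, uint32_t mask)
{
    uint32_t v = seed;   /* plays the role of lfsr_reg */

    assert(mem_array != NULL && seed != 0);
    for (size_t i = 0; i < test_size; i++) {
        v = lfsr_next(v, mask);
        mem_array[i] = v;
        mem_array[test_size + i] = v;
    }
}

/* Compare elements that lie half an array apart and count mismatches. */
unsigned long compare_halves(const volatile uint32_t *mem_array,
                             size_t test_size)
{
    unsigned long cache_errors = 0;

    for (size_t j = 0; j < test_size; j++) {
        if (mem_array[j] != mem_array[test_size + j])
            cache_errors++;
    }
    return cache_errors;
}
```

In use, test_size would be chosen so that the whole array exceeds the data cache capacity, and the fill and compare steps would be repeated in a long outer loop, for example fill_halves(buf, n, 1u, 0x80200003u) followed by compare_halves(buf, n).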

8 TESTING THE VECTOR PROCESSING UNIT

A vector processor is a processor that can operate on entire vectors with one instruction, i.e., the operands of some instructions specify complete vectors. For example, consider an addition instruction C = A+B. In a scalar machine the operands are numbers, but in a vector processor the operands are vectors and the instruction directs the machine to compute the pairwise sum of each pair of vector elements.

As processing power continues to increase, memory access latency and bandwidth become serious bottlenecks. Processors and DRAM memory will be packaged together more tightly, possibly on a single chip. This integration would introduce orders of magnitude superior bandwidth/latency to local memory than to remote memory. In this situation, an on-chip vector unit is advantageous since it can make efficient use of such high internal bandwidth. On-chip vector processing units expand the capabilities of COTS processors by providing leading-edge, general-purpose processing performance while concurrently addressing high-bandwidth data processing and arithmetic-intensive computations in a single-chip solution. Therefore, we assume that most of the future COTS processors which are used in space applications will have an on-chip vector processor unit.

It should be noted that vector processing units vary a lot in their implementation and instruction set, based on the applications they are used in. Therefore, it is not possible to write a single routine in a high-level language which can test all types of vector units. One way to tackle this problem is to add several routines, each for a specific target processor, either in a high-level or assembly language. Our preliminary target, the MPC750, does not have any on-chip vector unit. We have studied the MPC7400, which has an on-chip vector processing unit, and suggest a way of testing it.

8.1 Vector Processing Unit of MPC7400

The MPC7400 PowerPC microprocessor is a high-performance, low-power, 32-bit implementation of the PowerPC RISC architecture combined with a high-performance vector parallel processing unit based on Motorola's AltiVec(TM) technology. Key elements of this technology are:

• 128-bit vector execution unit based on the single-instruction multiple-data (SIMD) model
• Parallel processing
• 162 vector instructions
• Vector permute unit

8.2 AltiVec Instruction Set

The AltiVec instructions can be grouped as follows:

• Vector (integer) arithmetic instructions
• Vector floating point arithmetic instructions
• Vector load and store instructions
• Vector permutation and formatting instructions, which include pack, unpack, merge, splat, permute, select and shift instructions
• Processor control instructions
• Memory control instructions

8.3 Functional Testing

A simple fault model, which consists of errors occurring in various types of vector operations, is assumed for the vector processing unit. This correlates with the assumption that any SEUs occurring will be localized to a specific area. The general idea behind the test routine in this program is to execute several types of vector operations grouped into subsections and monitor the types of errors found in each of these subsections. Instructions are first grouped based on the type of data they operate on. Then a group of similar vector operations are performed and checked for the expected results. This process is repeated millions of times in a big loop. This way the vector unit will always be busy and there is a minimal chance of the CPU exercising other functional units. Therefore, transient errors occurring in the vector unit can be detected. Also, other functions like FIR_filter_Implementation and Sum_of_Absolute_Differences, which use different types of vector operations, are executed. Then the results are checked for errors. These functions are also repeated millions of times. This approach can be extended to other COTS processors with a vector unit, by adding similar routines to execute the vector instructions of those particular processors. A high-level algorithm for testing the vector processing unit is given in Figure 8.

    CheckVectorProcessingUnit()
    Begin
        /* Loops of operations grouped based on their type */
        Big_Loop begin
            Vector arithmetic operation1;
            Vector arithmetic operation2;
            ...
            Check the results of all arithmetic operations;
        end
        Big_Loop begin
            Vector permutation operation1;
            Vector permutation operation2;
            ...
            Check the results of all the Vector permutation operations;
        end
        ...
        /* Loops with mixed operations */
        Big_Loop begin
            FIR_filter_implementation();
            Check results of FIR_filter operations;
            Sum_of_Absolute_Differences();
            Check results of Sum_of_Absolute_Difference operations;
        end
    end

Figure 8: High-Level Routine to Test Vector Processing Unit.
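One group of the routine in Figure 8 can be sketched in portable C. This is only a scalar stand-in written under assumptions: a real routine for the MPC7400 would issue AltiVec instructions, which are not shown here, and the names sum_of_absolute_differences, check_sad_group, and LANES are hypothetical. The sketch shows the structure of a group, namely repeating an operation with fixed operands and comparing each result against a value known in advance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16   /* sixteen 8-bit elements fill one 128-bit vector */

/* Scalar reference for a sum of absolute differences over n bytes. */
unsigned sum_of_absolute_differences(const uint8_t *a, const uint8_t *b,
                                     size_t n)
{
    unsigned sum = 0;

    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        sum += (a[i] > b[i]) ? (unsigned)(a[i] - b[i])
                             : (unsigned)(b[i] - a[i]);
    return sum;
}

/* One "big loop" group: repeat the operation and count every result
 * that differs from the expected value. */
unsigned long check_sad_group(unsigned long iterations)
{
    uint8_t a[LANES], b[LANES];
    unsigned long errors = 0;

    for (int i = 0; i < LANES; i++) {
        a[i] = (uint8_t)i;                  /* 0, 1, ..., 15 */
        b[i] = (uint8_t)(LANES - 1 - i);    /* 15, 14, ..., 0 */
    }
    for (unsigned long k = 0; k < iterations; k++) {
        /* volatile keeps the check from being optimized away */
        volatile unsigned result = sum_of_absolute_differences(a, b, LANES);
        if (result != 128u)                 /* 2 * (1 + 3 + ... + 15) */
            errors++;
    }
    return errors;
}
```

Keeping a separate error counter per group, as check_sad_group does, is what allows errors to be attributed to a particular class of vector operations.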

9 FAULT INJECTION METHODS

Fault injection is a widely used method for evaluating dependable systems. It is the artificial insertion of faults into a real or simulated computer system under test. Both transient and permanent faults may be injected. Fault injection is intended to yield three benefits:

• an understanding of the effects of real faults
• feedback for system correction or enhancement
• forecast of expected system behavior

A dependable system can be evaluated experimentally at different phases of the system's life. In the design phase, software-based fault injection techniques are used to evaluate the design via simulation. In the prototype phase, controlled physical fault injection is used to evaluate the system behavior under faults. In the operation phase, a direct measurement-based approach is used to measure systems in the field under real workloads [23]. We are, however, mainly interested in techniques used in the prototype phase because our tool is in the preliminary stage.

Traditionally, fault injection experiments on prototypes of a fault-tolerant system were carried out by physical or hardware-implemented fault injection techniques. Several software-implemented fault injection techniques have also been developed to overcome the limitations of physical fault injection techniques. However, we use hardware-implemented fault injection techniques because these are most representative of the types of errors that our tool is trying to detect. Also, NASA's JPL plans to do radiation tests, which simulate the real-time environment more closely than any other technique. Therefore, there is no need for us to use the software-based techniques.

9.1 Physical Fault Injection Techniques

Pin-level fault injection is believed to have been the first physical technique used to inject faults. It employs specific hardware to change the electrical signals at selected target device pins. A definite drawback, though, is the need for custom hardware. Moreover, with the rising levels of integration, the pin-level approach has lost some of its usefulness. Obviously, because of internal pipelines and caches, only a minute part of what a complex processor does is visible at the pins. For these reasons we did not use any pin-level fault injection techniques in our experiments.

Radiation-induced fault injection is another effective way of producing transient faults at random locations inside ICs. One way is to expose the system to the heavy-ion radiation from a Californium-252 (Cf252) source. The heavy ions emitted from the source are capable of creating transient faults when they pass through the depletion regions of the IC. This method produces transient faults at random locations evenly and can cause single-bit or multiple-bit flips [23]. Irradiation of a circuit is usually performed in a vacuum chamber. Also, special care should be taken with the packaging of the circuit, as it has the potential to prevent the heavy ions from reaching the depletion regions.

Electromagnetic fields also can induce errors in a digital system, provided the generated electric spark is strong enough to affect the ICs on the system. However, we do not have a facility to control and direct these electromagnetic sparks only to the CPU without affecting other subsystems on the embedded board. Therefore, we have not done any experiments on fault injection using electromagnetic fields.

Power supply disturbances are also commonly used to inject faults in the system under test. It has been shown that voltage sags injected in power rails of ICs result in gate propagation delay faults, which generate bit errors during information transfer [24]. The experiments reported in [25] also support this idea of inducing transient errors by power variations. They used a test CPU and a reference CPU, both of which were run in lock step, and the external buses were compared while the power supply pin of the test CPU was being disturbed.

We use a stable external power supply as a source to the microprocessor directly instead of using the power signals generated on the embedded board (SBC603/740/750). For example, the MPC750 microprocessor, which we are working on for the preliminary experiments, has a nominal core voltage of +2.5 V. Any input voltage more than that will be destructive to the processor. However, some transient errors might be induced in the processor by decreasing the voltage below the nominal value or "stressing" the processor in terms of the input power. Decreasing the supply voltage might result in extra delays in the various components of the microprocessor, which in turn leads to timing violations. These timing violations manifest as errors in the functionality of the microprocessor. More details on the experimental setup are given in Chapter 10.

In the next chapter, we present results for fault injection experiments using power supply disturbances. Radiation experiments for our system will be conducted in the near future by the JPL.

Our tool can be directed to execute some or all of the test routines. 10. which we have added deliberately. 10. Otherwise. This development kit has some in-built routines to enable serial communication between the SBC target board and host. to initialize the target and to program EEPROM As explained earlier. It should be noted that all the optimizations possible in any compiler must be disabled before compiling our tool.1 Analysis of the Software Tool The error monitoring software tool.10 EXPERIMENTAL RESULTS Experimental results concerning the error behavior of the MPC750 processor for errors caused by power supply disturbances and a few other fault injection techniques are presented in this chapter. we have used the visionWARE Development Kit supplied by the WindRiver Systems to compile our program. we developed is written in C language. We first analyze our software tool in Section 10. This gives the flexibility of targeting only some functional units.1. and may not execute all the instructions as we expected. our program has separate routines to test various functional units in the processor. It also outputs the same statistics to enable the continuous status-monitoring of the target. UpdateStats() is invoked whenever an error is encountered in any of the test routines. It is also executed after the completion of each test routine.1.1 Error statistics Our tool not only detects the errors but also provides the following information when an error occurs: • • The routine in which the error occurred Number of errors found in that particular routine 32 . Figure 9 shows the code excerpt of the software tool (for “-all” input option). In our experiments we used the SBC603/740/750 embedded board as the target. UpdateStats() generates and updates the error statistics in our program. Then the experimental setup is presented in Section 10.2. This program can be compiled using any compiler that supports the target CPU type. As shown in the figure.3. Also. 
if at all required. and finally results of the experiments are discussed in Section 10. these compilers might consider some of the instructions in our code to be redundant.

. &units[dc]). /* Function definition */ void CheckFPUAddition(ExecutionUnit *FPUA) { register double R0.9999999999999996. CheckFPUAddition(&units[fpua]). UpdateStats(0.REETool() { . . } /* end of REETool() */ . . . .. for (done =0 . &units[fpua]). Figure 9: Code Excerpt of the Software Tool. CheckFPUSubstraction(&units[fpus]). FPUnitA). }/* end of for loop */ . FPUnitA). . }/* end of CheckFPUAddition */ .99999999999999910000e+00)UpdateStats(1. /* if input option is “-all”. 33 .9999999999999996 + 3.. . } /* end of if (pIn->all) */ .0 + 2. if (R1 != 7. UpdateStats( 0. done < LOOP_iterations. R1. which means execute all test routines */ if (pIn->all) { . CheckDataCache(&units[dc]). UpdateStats(0. . done ++) { R0 = 2. . . /* Checking for correct result */ if (R0 != 4. .0. . R1 = 3.00000000000000000000e+00)UpdateStats(1. &units[fpus]). . . . .

• Error rate of the particular routine, which is the number of errors divided by the run-time of the routine up to the time of the error
• Mean time between errors (MTBE), which is the running time divided by the number of errors
• Total number of errors found up to that point
• The line of code where the error was detected

10.1.2 Error latency

Error latency in our context is the time it takes to detect a transient error. Once a transient fault occurs, it may take several clock cycles before it becomes observable. Since transient errors appear for only very short periods of time, on the order of nanoseconds, it is of utmost importance to have minimum error latency. In our tool, error latency is reduced by frequent cross-checking of the results of all computations. In our test routines, a group of computations is performed and then the results of these computations are cross-checked with the expected values. This is done also to keep the functional unit busy most of the time, thereby increasing the chances of more transients in the unit. For example, the MPC750 can perform one floating-point operation per clock cycle. Therefore, in the worst case it takes 11 clock cycles to detect a transient error in the floating-point unit, as we are doing 11 independent operations in sequence, followed by their respective consistency checks. The best-case error latency for the floating-point unit is only one clock cycle.

10.1.3 Tool runtime

For our tool to be effective in detecting errors, it is required that there be no other major workload on the target processor, except its operating system. Our tool will execute each major functional unit of the processor and keep it busy most of the time. Therefore, our program requires most of the CPU time when it is executing. Also, other programs running on the same processor might lead to problems in locating the source of errors. The number of computations that are performed in each routine, along with the approximate number of operations executed every second, are given in Table 4. The embedded board that we are using for the experiments does not have any operating system. However, empirical results on the HP PA-RISC Unix machine showed that our tool utilized approximately 80% of the CPU cycles, and the operating system used 20% of the CPU cycles.

Table 4: Operations and Running Times for Each Routine.

    Routine                    Operations per loop
    Register Unit              40
    Instruction Fetch Unit     32
    Integer Addition           40
    Integer Subtraction        40
    Integer Multiplication     58
    Integer Division           50
    Logical AND                20
    Logical OR                 20
    Logical XOR                20
    Integer Unit 2             40 adds & multiplies
    Floating Point Add         20
    Floating Point Subtract    20
    Floating Point Multiply    20
    Floating Point Divide      20
    Branch Processing Unit     7
    Load/Store Unit            320 loads, 192 stores
    Data Cache                 2

[Remaining columns of Table 4 (Number of loops, Total Operations, Approx. Running Time, Approx. Operations Per Second): values not legible in this copy.]

It should be noted that our software tool does not incorporate any kind of error recovery mechanism. If the program hangs or crashes, the target has to be rebooted.

10.2 Hardware and Software Setup

To carry out our experiments, a comprehensive hardware and software system has been set up. The experimental setup is shown in Figure 10. The hardware consists of a target board (WindRiver’s SBC 603/740/750), a variable power supply (HP E3631A), and a host computer. Also, a software development kit (WindRiver’s visionWARE Development Kit), which is used to compile our tool and create the ROM build file, is used in our experimental setup. A HyperTerminal software (WindRiver’s vDesktop) is used to communicate with the target from the host computer. The host computer is connected to the target board both by a serial communication cable and an Ethernet cable.

After the above setup is completed, and after compiling and building all the modules in our tool, the ROM build file obtained from the visionWARE development kit is downloaded to the target board using the Ethernet connection. This downloading is done using the vDesktop software on the host. Then this ROM build file is integrated into the firmware in the target board (SBC 603/740/750) by programming the EEPROM from vDesktop. Once this ROM version of the tool is on the target, any future changes in the tool can be applied to the same binary file without the need for building everything from scratch again. Our tool can then be run from vDesktop with just the serial communication. Once our tool starts running, information about the current status of the processor is continuously output to vDesktop.

[Figure 10 block diagram: host with vDesktop and visionWARE Development Kit, linked by a serial connection and an Ethernet transceiver to the target embedded board (SBC603/740/750) carrying the MPC750 CPU; variable power supply (2.2 V, 1.12 A) feeding the CPU.]

Figure 10: Experimental Setup.

The core supply voltage pin on the target board is connected to the variable power supply. This pin supplies power only to the MPC750 processor on the target board. Now the output voltage of the variable power supply can be varied to see the error behavior of the processor for different voltages. More details on power supply disturbances are given in the next section. Note that the Ethernet connection is required only for file transfers. Serial communication is good enough to invoke the tool and see the output in vDesktop. A user’s manual for setting up the hardware and running the tool is included in Appendix A.

10.3 Experimental Results

Results of various experiments on the system setup mentioned in Section 10.2 are presented in this section. Our experiments are based on the following physical fault injection techniques:

• power supply variations
• weak radiation exposure
• changes in temperature (along with the power supply variations)

However, we got significant results only from the power supply variations. The other two techniques did not lead to any conclusive results.

10.3.1 Power supply variations

As explained earlier in Section 9.1, variations in the supply voltage of a processor might induce transient errors in the processor. The variable power supply (HP E3631A) is connected to the supply pin, which connects only to the CPU on the target board. Our experiments are based mostly on decreasing the supply voltage below the nominal operating voltage of the processor. For our experiments, the following terminology is used to identify different types of errors in the processors:

• Effective Errors: They are propagated to any of the external signals and they can change the processor state. They can be classified into control flow and data errors:
  ♦ Control Flow errors: They produce a change in the instruction flow that is not possible under correct operation of the program. They are usually caused by an error in the instruction or upsets in the control registers such as the program counter (PC).

  ♦ Data errors: They are caused by an error in the result of any arithmetic or logical computation and also by data memory upsets. They might lead to incorrect results in the subsequent dependent operations.
• Noneffective Errors: They do not change the processor state. They can be classified into overwritten or latent:
  ♦ Overwritten errors: The error is not detected, because a different action has masked or overwritten the faulty value. The processor state is not changed.
  ♦ Latent errors: These faults remain dormant in a subsystem that is not in use at the moment. They might eventually become effective.

For our study, effective errors are the most interesting as they change the processor state, which can be detected by our tool.

We used four target boards for our experiments. The processors (MPC750s) in these four boards have different lower threshold voltages, below which their respective embedded boards cannot be booted. There is also an upper threshold voltage, above which the processor is very unlikely to hang or show errors. Note that the upper threshold voltage is also usually lower than the nominal supply voltage of the respective processor. Both of these voltages are established empirically for all four processors. Then the supply voltage of each processor is varied between these two thresholds. These voltage variations are not necessarily uniform. Also, experiments on these four boards are done independently.

Results of these experiments are summarized in Table 5. In this table, the first column gives the board number, and the second column lists the total number of tests resulting in a hang or a detectable control flow error. The third column lists the range of voltages used in the experimentation. The fourth column gives the number of tests that resulted in a hang in the specified voltage range. Finally, the fifth column lists the number of tests that resulted in a control flow error detected by the tool in the voltage range specified in the third column. Our tool seems to have detected only control flow errors. No data errors were detected.
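The terminology above and the columns of Table 5 can be sketched as a small C data model. This is only an illustration under stated assumptions: the type and function names (ErrorClass, BoardSummary, is_effective, summary_is_consistent) are hypothetical and do not come from the tool's source. The consistency check encodes the fact that Table 5 counts only tests that ended in a hang or in a control flow error detected by the tool.

```c
/* Illustrative sketch only; all names below are hypothetical. */

/* Error classes from the terminology of Section 10.3.1. */
typedef enum {
    ERR_CONTROL_FLOW,  /* effective: illegal change in instruction flow  */
    ERR_DATA,          /* effective: wrong arithmetic/logical result     */
    ERR_OVERWRITTEN,   /* noneffective: faulty value masked/overwritten  */
    ERR_LATENT         /* noneffective: dormant in an unused subsystem   */
} ErrorClass;

/* Effective errors change the processor state; noneffective ones do not. */
int is_effective(ErrorClass c)
{
    return c == ERR_CONTROL_FLOW || c == ERR_DATA;
}

/* One row of a Table 5 style summary for a single board. */
typedef struct {
    int total_tests;          /* tests ending in a hang or detected error */
    int hangs;                /* tests that ended in a hang               */
    int control_flow_errors;  /* tests with a detected control flow error */
} BoardSummary;

/* Returns 1 when hangs and detected control flow errors account for
 * every counted test, 0 otherwise. */
int summary_is_consistent(const BoardSummary *b)
{
    return b->hangs >= 0 && b->control_flow_errors >= 0 &&
           b->hangs + b->control_flow_errors == b->total_tests;
}
```

For board 1 in Table 5 (65 tests, 46 hangs, 19 control flow errors), summary_is_consistent() returns 1.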

Table 5: Results of Experiments Based on Nonuniform Power Supply Variations.

    Board No.   Total Number of Tests   No. of Hangs   No. of Control Flow Errors Detected by the Tool
    1           65                      46             19
    2           35                      26             9
    3           25                      18             7
    4           70                      53             17

[Voltage Range (V) column of Table 5: values not legible in this copy.]

Some of the results of the uniform power variations are presented in Table 6. In these experiments the voltage is decreased from the “start voltage” in uniform steps of one of the three decimal places of the supply voltage (0.1 V, 0.01 V, or 0.001 V). For example, in the first experiment reported in Table 6, Integer Unit 1 is tested and the voltage is decreased from 2.2 V uniformly in steps of 0.001 V. This resulted in a hang at 2.124 V. However, not all the results are reported because of the amount of space they occupy.

Table 6: Results of Uniform Power Supply Variation Experiments.

    Experiment No.   Functional units tested
    1                Integer Unit 1
    2                Integer Unit 1
    3                Integer Unit 1
    4                Integer Unit 1
    5                Integer Unit 1
    6                Integer Unit 1
    7                All Units
    8                All Units
    9                FPU
    10               FPU
    11               BPU, LSU
    12               BPU, LSU
    13               Integer Unit 2
    14               Integer Unit 2

[Start Voltage (V), Voltage steps (V), and Hang Voltage (V) columns of Table 6: values not legible in this copy, apart from experiment 1 as described in the text (start 2.2 V, steps of 0.001 V, hang at 2.124 V).]
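The uniform sweep used in these experiments can be sketched in C as follows. This is an illustrative sketch, not code from the tool: the function names steps_until and sweep_setpoints are hypothetical, and voltages are held as integer millivolts (an assumption made here) so that stepping by 0.001 V does not accumulate floating-point rounding error.

```c
#include <stddef.h>

/* Illustrative sketch only; names and the millivolt convention are
 * assumptions, not part of REETool. */

/* Number of uniform downward steps from start_mv to target_mv.
 * Returns -1 if the inputs do not describe a valid downward sweep
 * or if target_mv does not lie on the sweep grid. */
long steps_until(long start_mv, long target_mv, long step_mv)
{
    if (step_mv <= 0 || target_mv > start_mv)
        return -1;
    if ((start_mv - target_mv) % step_mv != 0)
        return -1;
    return (start_mv - target_mv) / step_mv;
}

/* Writes the first n setpoints of the sweep (start, start - step, ...)
 * into out[] and returns how many were written. */
size_t sweep_setpoints(long start_mv, long step_mv, long *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = start_mv - (long)i * step_mv;
    return i;
}
```

For the first experiment described in the text (start 2.2 V, steps of 0.001 V, hang at 2.124 V), steps_until(2200, 2124, 1) returns 76, meaning the hang was observed 76 steps below the start voltage.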

These results reported in Table 6 do not provide much information about the error behavior of the target processor except the hang voltages.

10.3.2 Weak radiation exposure

In another set of experiments, we exposed the processor to the weak radiation of an ionization chamber, which is present inside a smoke detector. A small amount of Americium-241 is inside the ionization chamber. The ionization chamber is taken out of the smoke detector before it is placed on the microprocessor. No errors or program hangs were detected in the processor in spite of exposing the processor to this weak radiation for a few days. It is very likely that the weak alpha particle radiation of the smoke detector does not penetrate the microprocessor’s metal shielding. We are not able to conclude anything from these experiments, as we are not sure if any transient error was actually induced in the processor by this weak radiation.

10.3.3 Changes in temperature

In this set of experiments, we changed the ambient temperature of the processor along with the power supply variations mentioned in Section 10.3.1. Higher temperatures might lead to extra delay in the gates of the processor, which in turn might result in some transient errors. Our tool is run at a voltage near the lower threshold voltage and the temperature is increased. However, we did not have any facility to control the temperature. Therefore, we conducted only a few such experiments, which did not result in any hangs or detectable errors. More such experiments might result in detectable errors.

11 CONCLUSIONS AND FUTURE WORK

The goal of our research was to develop a self-contained error monitoring software tool, which characterizes the transient error behavior of the commercial off-the-shelf (COTS) processors used in space applications. The primary goal was to detect the transient errors effectively. The secondary goal was to provide adequate run-time status information and error statistics so that these transient errors can be traced down to the particular functional block in the processor. Such information is very useful in outer space applications and might help in providing early warning and averting fatal errors or system crashes. The final goal was to make this tool portable across different commercial processors. This overcomes the high cost and time involved in designing processor-specific monitoring tools for various commercial processors. Another advantage of this tool is that it requires minimal communication (just serial communication) with the host computer, which controls the execution of this tool.

Results of the physical fault injection based experiments are presented. Significant results are obtained using the power supply variation experiments. Most of the detected transient errors are control flow errors. These results not only verify that our tool is effective in detecting the errors, but also can help in locating the cause of these transients.

Our tool is presently in the prototype stage. However, the results obtained justify its efficacy in space applications. This tool is still in the preliminary stage, and it can be improved in many aspects by taking some of the following steps in the future:

Radiation-based fault injection experiments must be conducted to obtain more information about the performance of the tool in real space-like environments.

Some error recovery techniques should be incorporated in this tool to overcome the target processor “reboot” after the program hangs or crashes.

More test routines, for example vector test routines for different processors, can be included to address the problem of testing vector units, which are increasingly being pulled on-chip in commercial processors.

Some low-level routines (in assembly language) can be added to the tool, which can help in testing hard-to-test circuits in the processors. But any step in this direction means some loss in the portability of the tool. Such low-level routines should be added only when absolutely necessary.

Fault injection experiments on several other commercial processors might help in understanding the effectiveness of this tool across various platforms.

APPENDIX A REETOOL USER’S GUIDE

The REETool is intended to run continuously in the background of a computer system. The intention of this tool is to be compiled and run on any processor; however, this user’s guide describes how to run the tool on the SBC603/740/750. It is currently being tested using an MPC750 PowerPC processor on a single board computer (SBC603/740/750).

A.1 Setting Up the Board with the Serial Connection

To set up the board and run the program, the following things are needed:

• A host computer with the vDESKTOP software installed
  o The software for vDESKTOP is on the CD entitled “SINGLE BOARD COMPUTER Software & Documents”
• An available serial port
• An available Ethernet port
• The SBC750 board along with the Ethernet transceiver that comes with it

First, install the vDESKTOP software that comes on the CD entitled “SINGLE BOARD COMPUTER Software & Documents.” The setup file is located in the vDESKTOP directory on the CD. Second, connect the serial port on the test board with either the COM1 or COM2 port on the host computer. Third, connect the board to the network by first attaching the Ethernet transceiver to the JP28 slot on the board and then connecting the network cable to the transceiver.

NOTE: The Ethernet connection is needed only if file transfer is to be done. There is already a version of firmware flashed on the board containing the software tool. Only a serial connection is needed to run the tool. However, to update the firmware along with a new version of the tool, the Ethernet connection is required. See Section A.2 below.

Read the following sections to learn how to use vDESKTOP and to run the program. Additional information for setting up the serial connection and using vDESKTOP can be found in the “visionWARE SBC.doc” file on the “SINGLE BOARD COMPUTER Software & Documents” CD under \hsidocuments\visionWARE.

A.2 Using vDESKTOP

Once vDESKTOP has been installed, it can be used to obtain a serial connection with the board. There are three components in vDESKTOP: vCOM, vSHELL, and vFILE. vCOM uses a HyperTerminal to establish a serial connection with the board. Any input and output will be done in this window. vSHELL sets up the network parameters for the board. The vFILE component brings up a file viewer so that files may be transferred between the host and the target. If no file transfer is to be done, then the Ethernet transceiver and network connection do not need to be connected.

A.2.1 Establishing a serial connection

To establish a serial connection,

• first click on the vCOM icon to bring up the HyperTerminal window.
• make sure the correct COM port is selected by right clicking in the HyperTerminal window and choosing Communications.
• make the connection by either clicking the circular arrow button, or selecting connect in the Tools drop-down menu.

When a connection is made, there will be a BKM> prompt in the vCOM window. If it does not work at first, try resetting the board, disconnecting and reconnecting in the HyperTerminal window.

A.2.2 Downloading the visionWARE file and updating the firmware

The tool is incorporated with the firmware file that was created with the visionWARE development kit. The development kit creates a binary file that can be used to update the memory containing the current firmware. This file, in general, is named update_projectname.bin. In this case it is named update_REETool.bin. vDESKTOP uses tftp to transfer files between the host and the target board. The Ethernet transceiver is required by the tftp daemon. A valid Ethernet address, subnet mask, and gateway are needed for the Ethernet connection to work.

To transfer the file to the target, make sure the target is on and bring up the vDESKTOP application. Once a serial connection is made (see Section A.2.1), click on the vFILE button. This will start the tftp daemon and allow the transfer of files. In the file browser window, find the update_REETool.bin file on the host directory structure, right-click on it, and choose download. Once the file is downloaded, you should be able to see a list of all the files located in RAM on the target to confirm that it is there. Next, go to the vCOM window and click in it next to the BKM> prompt so that you can type in this window. Type the following command to update the firmware:

    update \\your_hostname\update_REETool.bin

When the firmware has been updated, there will be a floating-point exception that has occurred. This is not a problem; just reset the board and the new firmware should take effect.

A.3 Running the Program

To run the program, type ‘reetool’ followed by any command line options. Table 7 shows the command line options available when running the REETool. To see a listing of the options from vDESKTOP, type ‘help reetool’, or type ‘help’ to see all available commands. The command line options can be typed in any order; the order has no effect on how the program is run. For example, if you wanted to run all of the routines, then type ‘reetool -all’. Or, if you wanted to test only the instruction fetch unit and the branch processing unit, you would type ‘reetool -ifu -bpu’.

A.4 Statistical Output

Output from the program is accomplished with print statements that send the output through the serial port to the HyperTerminal window (vCOM) in vDESKTOP. If errors are found, a statement with the check that failed is printed (in verbose mode) along with the current error rate and the mean time between errors (MTBE).
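The two statistics named above follow the definitions given in Section 10.1.1: the error rate is the number of errors divided by the run time, and the MTBE is the run time divided by the number of errors. A minimal C sketch of that arithmetic is shown below. It is not the tool's UpdateStats() routine; the RoutineStats record, its field names, and the functions error_rate, mtbe, and print_stats are all hypothetical, as is the convention of returning -1.0 for an undefined MTBE.

```c
#include <stdio.h>

/* Illustrative sketch only; all names here are hypothetical. */
typedef struct {
    const char   *name;        /* routine label, e.g. "fpua"            */
    unsigned long errors;      /* errors found in this routine so far   */
    double        run_time_s;  /* run time of the routine, in seconds   */
} RoutineStats;

/* Error rate: errors divided by run time (errors per second).
 * Returns 0.0 when no time has elapsed yet. */
double error_rate(const RoutineStats *s)
{
    return (s->run_time_s > 0.0) ? (double)s->errors / s->run_time_s : 0.0;
}

/* MTBE: run time divided by the number of errors (seconds).
 * Returns -1.0 when no error has been seen, since MTBE is undefined. */
double mtbe(const RoutineStats *s)
{
    return (s->errors > 0) ? s->run_time_s / (double)s->errors : -1.0;
}

/* Prints one status line in the spirit of the serial output. */
void print_stats(const RoutineStats *s)
{
    if (s->errors > 0)
        printf("%s: errors=%lu rate=%.3f/s MTBE=%.3f s\n",
               s->name, s->errors, error_rate(s), mtbe(s));
    else
        printf("%s: errors=0 rate=0.000/s MTBE=n/a\n", s->name);
}
```

With 4 errors observed over 8 seconds of run time, this sketch reports an error rate of 0.5 errors per second and an MTBE of 2 seconds.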

Table 7: Command Line Options for REETool.

    Command Line Option   Description
    -all                  Check all units
    -ru                   Check register unit
    -ifu                  Check instruction fetch
    -ia                   Check integer addition
    -is                   Check integer subtraction
    -iu2                  Check integer unit 2
    -la                   Check logical AND
    -lo                   Check logical OR
    -lx                   Check logical XOR
    -im                   Check integer multiplication
    -id                   Check integer division
    -bpu                  Check branch processing unit
    -fpua                 Check floating point unit addition
    -fpus                 Check floating point unit subtraction
    -fpum                 Check floating point unit multiplication
    -fpud                 Check floating point unit division
    -fpu                  Check all floating point unit operations
    -lsu                  Check load/store unit
    -dc                   Check data cache
    -alli                 Check all the integer units (ia, is, im, id, iu2)
    -alll                 Check all the logical units (la, lo, lx)
    -once                 Run the routines specified only once
    -v                    Verbose mode - prints out more detailed information about any errors found

A.5 Making Changes to the Program

Changes to the REETool can easily be accomplished with the visionWARE development kit. Once this kit is installed, the project for REETool can be loaded from the Project > Open Project menu. The source code for all the available drivers and diagnostics is included in this project. The source code for REETool is located under the User Components > REETool > Source Files folder. After the changes have been made, the project should be built by choosing the Build > Build vWARE command, or by clicking the build icon in the icon bar. Make sure that the build is a ROM version, not a RAM version. This can also be chosen in the build menu or from the icon bar. The ROM version creates the update_projectname.bin file that is used to update the firmware on the board. Once the new build file has been created, it can be loaded to the target as described above in Section A.2.
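The order-independent handling of the Table 7 options can be sketched in C as below. This is a hedged illustration rather than the tool's own parser: the Options structure, its fields, and parse_options are hypothetical (the source shows only that a field named all is read through pIn->all in Figure 9), and only a handful of the Table 7 flags are covered to keep the example short.

```c
#include <string.h>

/* Illustrative sketch only; structure and function names are assumptions. */
typedef struct {
    int all;      /* -all  : check all units               */
    int ifu;      /* -ifu  : check instruction fetch       */
    int bpu;      /* -bpu  : check branch processing unit  */
    int once;     /* -once : run selected routines once    */
    int verbose;  /* -v    : verbose error reporting       */
} Options;

/* Parses a subset of the Table 7 flags from argv[0..argc-1].
 * Each flag simply sets a field, so argument order has no effect.
 * Returns 0 on success, -1 on an unrecognized option. */
int parse_options(int argc, const char *const argv[], Options *o)
{
    int i;
    memset(o, 0, sizeof *o);
    for (i = 0; i < argc; i++) {
        if      (strcmp(argv[i], "-all")  == 0) o->all = 1;
        else if (strcmp(argv[i], "-ifu")  == 0) o->ifu = 1;
        else if (strcmp(argv[i], "-bpu")  == 0) o->bpu = 1;
        else if (strcmp(argv[i], "-once") == 0) o->once = 1;
        else if (strcmp(argv[i], "-v")    == 0) o->verbose = 1;
        else return -1;
    }
    return 0;
}
```

Because each flag only sets a field, passing "-ifu -bpu" and "-bpu -ifu" yields the same Options value, which matches the statement in Section A.3 that option order does not matter.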

