
www.ietdl.org

Published in IET Computers & Digital Techniques


Received on 1st February 2009
Revised on 21st July 2009
doi: 10.1049/iet-cdt.2009.0011

In Special Issue on selected papers from the 18th International Conference on Field Programmable Logic and Applications (FPL 2008)
ISSN 1751-8601

Fault tolerance and reliability in field-programmable gate arrays
E. Stott, P. Sedcole, P. Cheung
Department of Electrical and Electronic Engineering, Imperial College, Exhibition Road, London, UK
E-mail: edward.stott07@imperial.ac.uk

Abstract: Reduced device-level reliability and increased within-die process variability will become serious issues for
future field-programmable gate arrays (FPGAs), and will result in faults developing dynamically during the lifetime
of the integrated circuit. Fortunately, FPGAs have the ability to reconfigure in the field and at runtime, thus
providing opportunities to overcome such degradation-induced faults. This study provides a comprehensive
survey of fault detection methods and fault-tolerance schemes specifically for FPGAs and in the context of
device degradation, with the goal of laying a strong foundation for future research in this field. All methods and
schemes are quantitatively compared and some particularly promising approaches are highlighted.

1 Introduction
As process technology scaling continues, integrated circuits face greater challenges from defects, process variability and reliability. Field-programmable gate arrays (FPGAs) are no exception to this; one recent study suggested defect tolerance will be necessary in future large FPGAs at and beyond the 45 nm technology node [1]. FPGAs have some key advantages over application specific integrated circuits (ASICs) for achieving fault tolerance. Firstly, they are (mostly) composed of regular arrays of generic resources, giving them inherent redundancy. Secondly, they can be reconfigured in the field. These advantages have been exploited in a wealth of research and some promising fault tolerant systems have been developed.

There have been different motivations for designs of fault tolerant FPGA systems. Early work was concentrated on increasing manufacturing yield through defect tolerance and some of this has found its way into commercial use [2]. The advent of SRAM FPGAs presented the problem of single-event upsets (SEUs), which are sporadic flips of configuration bits causing connectivity, logic and state errors. This has also led to a great deal of research, the benefits of which can be widely found in space and nuclear applications [3].

The focus of this study, however, is on work relating to the reliability of FPGAs and in-field tolerance of permanent faults that are caused by device degradation. This is a less established field of research, although techniques developed in defect and SEU tolerance schemes are highly relevant to degradation fault tolerance. This aspect of fault tolerance is set to become increasingly important with the continuing development of silicon technology. This paper is based on material previously published by the authors in [4]. It is extended with new sections on fault modelling and future work in the field. Existing sections are discussed at greater depth, with 17 additional papers surveyed and eight new figures.

A fault tolerant system consists of two main components: fault detection and fault repair. Section 3 surveys fault detection methods and Section 4 considers fault repair. Causes of faults, modelling and application issues are discussed in Section 2 and the possibilities for future development of the field are explored in Section 5.

2 Background

2.1 Causes of degradation
Degradation is the permanent deterioration of a circuit over time, resulting in a negative impact on performance. The effects can be progressive (a gradual change of a circuit parameter) or catastrophic (a sudden onset of a failed state in a circuit component). Degradation in VLSI circuits can be

196 IET Comput. Digit. Tech., 2010, Vol. 4, Iss. 3, pp. 196– 210
& The Institution of Engineering and Technology 2010 doi: 10.1049/iet-cdt.2009.0011

Authorized licensed use limited to: Thangal Kunju Musaliar College of Engineering. Downloaded on June 26,2010 at 07:00:23 UTC from IEEE Xplore. Restrictions apply.

attributed to a number of mechanisms [5]. The hot-carrier effect leads to a build-up of trapped charges in the gate-channel interface region [6]. This causes a gradual reduction in channel mobility and an increase in threshold voltage in CMOS transistors. The effect on the circuit is that switching speeds become slower, leading to delay faults. Negative-bias temperature instability (NBTI) has similar consequences for circuits and is also caused by a build-up of trapped charges [7].

Electromigration is a mechanism by which metal ions migrate over time, leading to voids and deposits in interconnects. Eventually, these can cause faults because of the creation of open and short circuits [8].

Time-dependent dielectric breakdown (TDDB) affects the gates of transistors, causing an increase in leakage current and eventually a short circuit. The mechanism here is the creation of charge traps within the gate dielectric, diminishing the potential barrier it forms [9, 10].

All of these degradation mechanisms have the potential to become more severe with the shrinking of process geometry. This is due to increasing gate field strength, higher current density, smaller feature size, thinner gate dielectrics and increasing variability [11]. In the case of TDDB, the situation is made complicated by the introduction of new processes such as high-K dielectrics and metal gates [12].

2.2 Other types of fault
In addition to degradation, there are two other types of faults that can affect FPGAs. These are relevant to this study as some of the techniques that have been developed in response to them can also be applied to faults caused by degradation.

The first of these is manufacturing defects. Manufacturing defects can be exhibited as circuit nodes which are stuck-at 0 or 1 or switch too slowly to meet the timing specification. Defects also affect the interconnect network and can cause short or open circuits and stuck open or closed pass transistors [13]. Testing for manufacturing defects is well established in VLSI and defect tolerance techniques are currently used in some types of device, including FPGAs [2], to increase yield.

The second class of fault which is widely discussed in relation to FPGAs comprises SEUs and single-event transients (SETs), caused by certain types of radiation [14]. This is of particular concern to aviation, nuclear research and space applications where devices are exposed to higher levels of radiation and high levels of reliability are required. The most commonly considered failure mode is the flipping of an SRAM cell in the configuration memory, leading to an error in the logic function that persists until the configuration is refreshed in a process known as scrubbing. Although this recovery method is not applicable to permanent faults caused by degradation, ways of detecting SEU faults are relevant.

2.3 Modelling of faults
In order to effectively detect, locate and repair faults, a model is needed of how they affect the circuit. Fault modelling has several aspects including (a) determining which fault mechanisms may occur; (b) simulating the effect that possible faults will have on the system; (c) predicting the rate and distribution of failures; and (d) establishing fault scenarios for evaluating potential repair strategies.

Faults can be modelled at different layers of the FPGA, as shown in Fig. 1. Although faults occur in the silicon structures which make up transistors and interconnect, fault tolerant systems deal with them at various levels of abstraction. A repair at each level of abstraction aims to be transparent to the level above it.

Logic: A low-level approach considers the underlying logic of the FPGA and models faults on particular circuit nets.

Fabric: Some fault tolerant systems consider faults in the FPGA fabric, that is the set of LUTs, registers, interconnect and so on that is available to the designer [15]. This has the advantage that these elements are easy to test with reconfiguration and BIST, though the behaviour of the configuration logic is obscured.

Array: A popular option is to consider the FPGA at an array level, that is to mark off entire clusters or interconnect lines as faulty. This best exploits the regular structure of FPGAs.

Application: A higher level of abstraction is possible when the application is modular and adaptable. This allows the fault model to extend to other parts of the circuit outside the FPGA for a very robust system.

Figure 1 Design of an FPGA and its application can be abstracted to several levels
Fault modelling and tolerance can be approached at numerous points in the hierarchy
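
As an illustration of the logic-level fault model, the following minimal Python sketch injects a stuck-at fault on one net of a small gate-level netlist and lists the input vectors that expose the fault at the output. The netlist, the net names and the functions evaluate and detecting_vectors are hypothetical and chosen only for illustration; they are not taken from any of the surveyed works.

```python
from itertools import product

# Hypothetical three-gate netlist: each net is driven by (gate type, input nets).
# Insertion order is topological, so nets can be evaluated in dict order.
NETLIST = {
    "n1": ("AND", ("a", "b")),
    "n2": ("OR", ("b", "c")),
    "out": ("XOR", ("n1", "n2")),
}

GATES = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}


def evaluate(inputs, stuck_at=None):
    """Evaluate the netlist; stuck_at maps a net name to a forced value (0 or 1)."""
    stuck_at = stuck_at or {}
    # Primary inputs can also be stuck, so apply the override here too.
    values = {name: stuck_at.get(name, v) for name, v in inputs.items()}
    for net, (gate, sources) in NETLIST.items():
        computed = GATES[gate](*(values[s] for s in sources))
        values[net] = stuck_at.get(net, computed)
    return values["out"]


def detecting_vectors(net, value):
    """Return the (a, b, c) vectors whose output differs when `net` is stuck at `value`."""
    vectors = []
    for a, b, c in product((0, 1), repeat=3):
        inputs = {"a": a, "b": b, "c": c}
        if evaluate(inputs) != evaluate(inputs, {net: value}):
            vectors.append((a, b, c))
    return vectors


if __name__ == "__main__":
    print("n1 stuck-at-0 detected by:", detecting_vectors("n1", 0))
    print("n1 stuck-at-1 detected by:", detecting_vectors("n1", 1))
```

Under these assumptions, a stuck-at-0 fault on net n1 is exposed only by the two input vectors with a = b = 1, whereas a stuck-at-1 fault on the same net is exposed by the remaining six vectors. This asymmetry is one reason why the choice of test patterns matters for fault coverage.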


Within this study, fault repair is defined to be the repair of a faulty system so that it returns to being fully operational. Invariably, at some level of the FPGA this repair is achieved by the replacement of a failed component with a functional one. The size and nature of the replaced component varies from scheme to scheme and this represents the granularity of the approach. All of the studies surveyed here fall into the fabric-level, array-level or application-level categories.

Approaching fault tolerance at different levels of abstraction places the burden of dealing with faults on different parties over the design, manufacturing and service phases of the product lifetime. A fabric-level repair, for example, may be completely transparent to the engineer who designs the application circuit and requires no alteration of the configuration data. On the other hand, an application-level strategy is likely to be embedded into the system design and be tailored to the application.

An important part of fault modelling is to determine the possible failure modes at the design level under consideration. At the circuit level, the simplest of fault models assumes that faulty circuit nodes can be either stuck at 0, stuck at 1, shorted to another node or an open circuit [13]. Although these hard-failure modes have been an effective approach to defect testing for a long period, worsening process variation and degradation require marginal and timing faults to be considered [16]. Since some of the VLSI wear-out mechanisms are progressive in nature, marginal faults are likely to be more prevalent in field failures than in failures because of manufacturing defects. Examples of marginal faults include slow switching, intermittent switching, weak drivers and unstable registers.

Another aspect to a degradation fault model is the rate at which faults occur and how this varies over time [5]. Traditionally, a bathtub curve of failures is described for VLSI circuits, in common with many other manufacturing processes. High numbers of 'infant mortality' failures occur shortly after manufacture, then the failure rate remains low until the end of the design life. Greater process variation and degradation will make these phases less distinct, for example a significant background rate of failure may be observed over the entire life of the product [11]. This is illustrated in Fig. 2.

Figure 2 The failure rate of electronic devices varies over time
Trace a depicts the traditional 'bathtub' curve, where most failures occur early or late in the device lifetime. Trace b shows the higher prevalence of mid-life failures that is caused by increased process variation and degradation

FPGA systems can be reconfigured in the field, either to change their functionality or as required by a fault-tolerance system. This raises the possibility of dormant faults, that is, faults that occur on resources which are unused when the fault occurs but which may be used in the future. The implication of this is that multiple faults may become apparent on reconfiguration and the system must be prepared for this.

2.4 Applications of fault tolerance
All of the fault detection and repair methods surveyed have individual strengths and weaknesses, and which method is most appropriate depends on the application.

In some cases, reliability is critical for safety or mission success. For example, an automotive application was discussed at a system level by Steininger et al. [17]. Fast detection and/or error correction is crucial here so that erroneous data or state is not acted upon, which could be hazardous.

A widely implemented application of fault tolerance in FPGAs is in space missions. Traditionally, this is because (a) they experience significant numbers of SEUs caused by increased radiation; (b) the breakdown of an electronic system could cause the mission to be lost; and (c) manual repair is impractical.

In the light of variability and reliability concerns associated with future VLSI process nodes, it may become economical to use fault tolerance in general purpose, high-volume applications. In this case, it will be important that the detection and repair method has the lowest possible overhead on timing performance and area. Such applications may be able to compromise data integrity and fault coverage to achieve this, for example infrequent visible errors and a small proportion of returns would be tolerable in a consumer video decoder.

3 Fault detection
The first function to take place in a fault-tolerant scheme is fault detection. Fault detection has two purposes: firstly, it alerts the supervising process that action needs to be taken for the system to remain operational and secondly, it identifies which components of the device are defective so that a solution can be determined. These two functions may be covered simultaneously, or detection may be a multi-stage process comprising different strategies.

Fault detection methods can be categorised into three broad types:

• Redundant/concurrent error detection (CED) uses additional logic as a means of detecting when a functional block is not generating the correct output.


• Off-line test methods test the FPGA when it is not operating.

• Roving test methods perform a progressive scan of the FPGA structure whilst it is on-line.

The different approaches to fault detection are evaluated against a set of metrics in Table 1.

3.1 Functional redundancy and CED
The principle of using redundancy to guarantee a reliable logical system was developed many years before the advent of VLSI [18, 19], and today it is widely used as a method of fault detection in FPGAs, particularly in the form of triple modular redundancy (TMR). The main driver for error detection of this kind is the need to detect and correct errors because of SEUs and SETs. However, these methods are also suitable for detecting permanent faults that occur while the system is operating.

These detection methods work by invoking more than the minimum amount of logic necessary to implement the logic function. When an error occurs, there is a disagreement between the multiple parts of the circuit over which a particular calculation is processed and this is flagged by some form of error detection function.

The simplest form of this kind of error detection is modular redundancy. A functional block is replicated, usually two or three times, and the outputs are compared. If there are two modules then a difference between the outputs indicates that one of the modules is faulty. If there are three modules then, assuming a single fault, one group of outputs will differ from the other two. The use of three modules identifies which of them is faulty and allows the correct output to be maintained whilst a repair is underway.

Practical implementations of TMR partition the design into multiple sets of modules so that transient errors do not stick in state machines or registers [20]. The system proposed by D'Angelo et al. [21, 22] is capable of distinguishing between permanent and transient errors, and also whether the fault has occurred in a module or the voting logic. In [15], a modular redundancy system is given which can detect multiple faults on start-up.

CED allows a more space efficient design than modular redundancy. Extra bits are added to data flows and stores that are encoded with redundant logic, for example parity information. Data validation circuitry at the output of functional blocks can then detect faults that arise. The application of CED is very much dependent on the data flows and algorithms of the design and the logic overhead that is required varies. It is least efficient for small width signals such as those found in control logic.

CED techniques have been widely researched both in theory and in application to ASICs [23], but implementation on FPGAs presents additional problems. Using traditional CAD tools for FPGAs, there is the problem that error detection logic can be removed or made ineffective once it is minimised and implemented on LUTs. Bolchini et al. [24] sought to address this by mapping a self-testing circuit using standard tools then testing and redesigning any part where fault coverage had been lost. In [25], a CED method is developed for state machines in FPGAs using embedded memory blocks. Here, the memory blocks are used as ROM look-up tables for next state and output logic with embedded parity data. Any error in either the memory or the controlling logic causes a parity mismatch and is detected by a parity checker. A similar function can be carried out when a state machine is encoded using 'one-hot' logic; the activation of more than one state at any time indicates an error. A full fault tolerant

Table 1 Comparison of fault detection methods

Method | Speed of detection | Resource overhead | Performance overhead | Granularity | Coverage
modular redundancy | fast – as soon as fault is manifest | very large – triplicate plus voting logic | very small – latency of voting logic | coarse – limited to size of module | good – all manifest errors are detected
concurrent error detection | fast – as soon as fault is manifest | medium – trade-off with coverage | small – additional latency of checking logic | medium – trade-off with resource | medium – not practical for all types of functionality
off-line | slow – only when off-line | very small | small – start-up delay | fine – possible to detect the exact error | very good – all faults including dormant
roving (segmented interconnect) | medium – order of seconds | medium – empty test block plus test controller | large – clock must be stopped to swap blocks. Critical paths may lengthen | fine – possible to detect the exact error | very good – multiple manifest and latent faults are detected


system using CED was proposed in [26], where the error detection is broken down into regions to provide a level of fault location.

Redundancy provides a very fast means of error detection as a fault is uncovered as soon as an error propagates to the voting or validation logic. In addition, this form of error detection has a small impact on timing performance: only the latency of voting or parity logic, or similar. Modular redundancy detects all faults that become manifest at the output of a functional block, including transient errors, providing the other block(s) are fully functional. In CED, coverage comes at a trade-off with area overhead [27]. These methods provide no means of detecting dormant faults, which may be relevant if an FPGA is going to be reconfigured in the field, either for fault repair or to alter the functionality.

The chief drawback of redundancy as a method of error detection is the area overhead needed to replicate functionality, which can be over three times in the case of TMR [3]. Furthermore, it provides a limited resolution for identification of the faulty component. The fault can only be pinned down to a particular functional block or, in the case of TMR, an instance of a functional block. Fault resolution can be increased to a certain extent by breaking functional areas down and adding additional error detection logic [26].

Redundancy does not have to be restricted to the circuit-area dimension. It is also possible to detect errors in a trade-off with latency/data throughput. Lima et al. [28] proposed a scheme where operations are carried out twice. In the second operation, operands are encoded in such a way that they exercise the logic in a different way. The output is then passed through a suitable decoder and compared to the original.

Although most of the work on redundancy has been aimed at detecting and correcting SEUs, there have been some notable publications which apply the techniques to fault detection. Parity checking is used in [29] as part of a fault tolerant scheme that is structured so that detection is applied to small regular networks, rather than being bespoke to the function that is implemented. In the evolutionary system of [30], dual modular redundancy (DMR) is used to grade the 'fitness' of competing configurations. The configurations are chosen in pairs from a pool and the outputs are compared on each clock cycle. The configurations that contain errors or faults cause more frequent output mismatches and accumulate a poor fitness score. More information on evolutionary fault tolerance is given in Section 4.2.4.

Redundant and data-checking detection systems are generally implemented as part of the application configuration, as they fit around the specific data and control functions that make up the circuit. They can be designed manually by the configuration engineer to fit the fault-detection requirements of the application, or integrated automatically using tools, such as Xilinx's TMRTool [31].

In [32], an FPGA architecture was proposed that has error detection built in, so that it is transparent to the configuration. The system uses a combination of area and time redundancy to identify errors at the cell level and subsequently trigger an automatic repair. The practicality of this architecture is limited by the severe timing and area overheads of incorporating the error detection logic in each cell. Hardware error checking is used in Altera's Stratix FPGAs to detect SEUs in configuration memory. This could be used to detect faults in configuration logic that might otherwise be difficult to detect, but would need supplementing with soft-logic error detection to provide adequate coverage of the whole device.

3.2 Off-line fault detection/built-in self-test (BIST)
Off-line fault detection is another widely used technique, usually as a means of quickly identifying manufacturing defects in FPGAs. Any scheme that does this without the need for any external equipment is known as BIST, and is a suitable candidate for fault detection in the field.

BIST schemes for FPGAs work by having one or more test configurations which are loaded separately to the operating configuration. Within the test configuration is a test pattern generator, an output response analyser and, between them, the logic and interconnect to be tested arranged in paths-under-test (PUTs). To be fully comprehensive, a BIST system will have to test not only the logic and interconnect, but also the configuration network. Specialised features such as carry chains, multipliers and PLLs also need to be considered. The Xilinx Virtex series of FPGAs feature a self-configuration port which can speed up this process and reduce the need for external logic [33].

Compared to traditional built-in and external test methods for ASICs, FPGAs have the advantage of a regular structure that does not need a new test programme to be developed for each design. Also, the ability to reconfigure an FPGA reduces or removes the need for dedicated test structures to be built into the silicon [34]. However, with the ability to reconfigure comes a vast number of permutations in the way the logic can be expressed, making optimisation of test patterns important.

Published BIST methods have competed for coverage, test duration and memory overhead. Many focus on testing just one subset of FPGA structures, e.g. interconnect, suggesting a multi-phased approach may be appropriate for testing the whole chip. Testing of LUTs is a mature field; the BIST scheme for LUTs in [35] is designed to minimise test time, whereas a means of detecting multiple


faults in a single test block or PUT is given in [36]. Alaghi et al. [37] also addressed multiple faults and concentrated on achieving the optimum balance between granularity and test logic overhead.

Many publications have focussed on testing interconnect in response to the large amount of configuration logic and silicon area it consumes [38, 39]. A method of exhaustive testing for multiple faults in global interconnect using six configurations is given in [40]. In [41], a BIST system for interconnect is given that reduces test time through a large degree of self-configuration. Harris et al. used a hierarchical approach which locates stuck-at faults, short circuits and open circuits with the highest accuracy [13, 42].

Recent developments have considered timing performance as well as stuck-at faults. Some authors have targeted resistive-open defects in interconnect by testing for delay faults in interconnect paths [43, 44]. Girard et al. considered the optimum test patterns for exercising delay faults [16, 45]. Methods for analysing the propagation delays of logic chains have been proposed in [46], by timing the difference between two paths using a ring oscillator, and more accurately in [47], by using a built-in PLL unit to match the clock speed to the propagation delay.

Elements of BIST can be found in roving test systems, where only a small part of the FPGA is taken off-line for testing at any point in time. Roving and off-line testing are both cited as suitable applications for the delay-test method in [46]. Doumar et al. proposed an off-line test that uses a roving sequence to remove the need for reconfiguration; instead a small self-test area is always present and is shifted around the array to gain full coverage [48].

The advantage of BIST as a fault detection method is that it has no impact on the FPGA during normal operation. The only overhead is the requirement to store test configurations, which may be compressed because of their repetitive nature. BIST also allows complete coverage of the FPGA fabric, including features that may be hard to test with an on-line test system, such as PLLs and the clock network. As the entire resource set is tested, the BIST process is common to all systems using a particular FPGA model and can be extended to cover a whole family of devices with little modification. The only additional work required to integrate BIST into a new FPGA design is to provide storage for the BIST configuration and set up a trigger to load it at the desired time.

The major drawback of BIST for fault tolerant systems is that it can only detect faults during a dedicated test mode when the FPGA is not otherwise operational. Typically, this would occur during system start-up, as part of a maintenance schedule, or in response to an error detected by some other means.

3.3 Roving fault detection
Roving detection exploits run-time reconfiguration to carry out BIST techniques on-line, in the field, with a minimum of area overhead. In roving detection, the FPGA is split into equal-sized regions. One of these is configured to perform self-test, whereas the remaining areas carry out the design function of the FPGA. Over time, the test region is swapped with functional regions one at a time so that the entire array can be tested while the FPGA remains functional. The process is illustrated in Fig. 3.

Figure 3 In roving test, blocks of the FPGA are taken off-line one at a time for testing
By shifting functionality between blocks, the device can remain operational

Roving test has a lower area overhead than redundancy methods; the overhead comprises one self-test region and a controller to manage the reconfiguration process. The method also gives excellent fault coverage and granularity, comparable to BIST methods.

Roving test is less intrusive than a full system halt to carry out off-line BIST and it is usually possible to detect faults earlier. However, the speed of detection is not as good as that of redundancy techniques. The detection latency depends on the period of a complete roving cycle; the best reported implementations of roving test have maximum detection latency in the order of a second [49].

Roving test impacts performance in two ways. Firstly, as the test region is moved through the FPGA, connections between adjacent functional areas are stretched. This results in longer signal delays and may force a reduction in the system clock speed, reported to be in the range of 2.5–15%. Secondly, implementations in current FPGAs require the functional blocks to be halted as they are switched. A 250 ms window for each swapping move has been reported [49].

The dominant work in the field of roving test and repair has been carried out by Abramovici, Stroud et al. [49, 50]. Called roving STARs, this system uses two test areas, one covering entire rows and one covering entire columns. A roving test method was also proposed in [51]: by using FPGAs with bussed, rather than segmented, interconnects, a system was devised which had no impact on system clock


performance. However, the resulting constraints on the system topology and connectivity limit its applicability in the majority of systems.

Like off-line BIST, roving detection does not need adapting to the application circuit, simplifying its deployment and allowing a market to develop in 3rd party fault-detection IPs. However, as the test process does have an impact on FPGA resource use and timing performance, careful verification of the application design would be needed once it was integrated.

4 Fault recovery
Once a fault is detected and located, it must be repaired. Depending on the fault modelling approach (see Fig. 1), repair can be approached at a number of different levels:

Hardware: A hardware level repair performs a correction such that the FPGA remains unchanged for the purposes of the configuration. The device retains its original number and arrangement of useable logic clusters and interconnects.

Configuration: A configuration level repair is achieved using spare resources that the design does not use. The spare resources can replace faulty ones in the event of a fault.

System: A higher level of repair can be carried out at the system level. When a design is highly modular a fault can be tolerated by the use of a spare functional block, or by providing degraded performance [52]. Such methods are not considered in more detail here, as they are not limited in application to FPGAs.

It should be noted that some fault detection methods also provide a level of fault tolerance. The voting system in TMR allows the erroneous output of one module to be ignored. Also, roving test provides fault tolerance by stopping the roving process if a fault is detected. If the fault stays within the test area it will not be used by the operational part of the FPGA. In both these situations, the system operates in a reliability degraded state where another fault would not be tolerated and may not even be detected. However, they do allow the system to carry on functioning whilst a permanent repair is carried out. Table 2 shows the classes of fault repair techniques that have been reported and evaluates them against a range of metrics.

To date, manufacturing defects are a far more common type of fault than failures in the field and have been the focus of practical fault-tolerance efforts in industry. The challenge of attaining an economic yield in FPGA manufacture will intensify along with the challenge of ensuring reliability, and

Table 2 Comparison of fault repair methods


Method Fault pattern tolerance Resource overhead Performance Complexity of Repair
overhead/ repair level
degradation
hardware poor: limited number and medium: spare low: transparent to low: effected with logic-
distribution tolerated resources required configuration hardware switches array
multiple poor: limited number and low: uses naturally low: each medium: selection array
configurations distribution tolerated. spare resources, but configuration can be and loading of
interconnect tolerance requires ROM for fully optimised configurations
causes complexity configurations
pebble shifting medium: relies on nearby low: uses naturally medium, rerouting high: re-routing array
spare PLBs spare resources causes uncertainty necessary
cluster reconfig. poor: reliant on spare low: uses naturally low: changes only medium: analysis fabric
resource in cluster. Poor spare resources local interconnect, of logic, no
tolerance in interconnect slight uncertainty re-routing
cluster good: flexible solutions low: uses naturally low: usually a fast high: analysis of fabric-
reconfig. þ possible spare resources alternative will be logic and rerouting array
pebble shifting found, medium
uncertainty
constrained poor: limited number and medium: a set of low low: alternative array
(chain shifting) distribution tolerated. interconnect must be routing already
Poor for interconnect reserved reserved
evolutionary good: implementation is large: configuration variable: solution is massive: may take app.
completely flexible grading and storage arrived through a long time to
random mutations repair


it is likely that any scheme for fault tolerance will be used to tackle both. The key difference is that overcoming manufacturing defects only needs to be done once and does not need to involve the end user of an FPGA or even, for hardware-level repair, the OEM who installs and configures it. Therefore, defect tolerance can be viewed as a subset of multi-purpose fault tolerance, where the process is applied during manufacture. The techniques discussed in this section are targeted at in-field faults, but could also be applied to manufacturing defects in this way.

4.1 Hardware-level repair
The regular structure of FPGAs makes them suitable candidates for hardware-level repair, using methods similar to those used for defect tolerance in memory devices. In the event of a fault, a part of the circuit can be mapped to another part with no change in function.

Hardware-level repair has the advantage of being transparent to the configuration. This makes repair a simple process, as the repair controller does not need any knowledge of the placement and routing of the design. Another benefit is that the timing performance of the repaired FPGA can be guaranteed, as any faulty element will be replaced by a pre-determined alternative. Switching in spare resources will change timing slightly, for example net lengths may change, but the configuration can be designed to function with the worst-case selection. A fault-tolerant system using hardware-level repair with off-line BIST could be packaged with an FPGA architecture and would require little work to integrate into an application and little maintenance intervention in the field. Hardware-level fault tolerance has a drawback in that it can tolerate just a low number of faults for a given area overhead and there are likely to be certain patterns of faults which cannot be tolerated.

The first methods of this kind were based on column/row shifting [53, 54]. Multiplexers are introduced at the ends of lines of cells that allow a whole row or column to be bypassed, by shifting across to a spare line of cells at the end of the array. A column/row shifting architecture was proposed in [32], which could repair faults in the field by shifting the configuration memory. If the FPGA is bus-based, the shifted cells can connect to the same lines of interconnect. For segmented interconnect, bypass sections need to be added to skip the faulty row/column. Today, column/row shifting has found its way into commercial FPGA designs for defect tolerance [2].

As illustrated in Fig. 4, adding more bypass connections and multiplexers allows greater flexibility for tolerating multiple faults and makes more efficient use of spare resources [55]. In [56], faults in the configuration logic were considered and the proposed solution was to split the FPGA up into sub-arrays which can be configured independently.

Figure 4 Trade-offs in row/column shifting methods

4.2 Configuration level repair
Although hardware-level repair is attractive for defect tolerance because it is mostly transparent at the configuration level, the proposed schemes have not proved flexible or efficient enough for use in reliability enhancement. As the computational power available to FPGAs increases, the complexity of self-repair is becoming less of a constraint. For these reasons, the most promising fault tolerant systems have used configuration level repair. Configuration level repair exploits two key features in FPGAs: reconfiguration and the availability of unused resources. Configuration level repair strategies can be divided into several subclasses:

4.2.1 Alternative configurations: A straightforward way of achieving fault tolerance is to pre-compile


alternative configurations. As long as a configuration exists in which a given resource is unused, the FPGA can be repaired should that resource become faulty. In [57–59] the FPGA is split into tiles, each with its own set of configurations which have a common functionality and interface to the adjacent tiles. The replacement of a faulty tile is illustrated in Fig. 5.

Pre-compilation of the alternative configurations makes a repair simple to compute in the field; the replacement configurations can simply be indexed by the location of the unused/faulty resource or resources. It also guarantees the timing performance of the repaired design. Software tools could generate the configuration set for any design, as long as sufficient spare resources are available, and a simple index could select a repair configuration based on the location of the fault.

This strategy performs relatively poorly in terms of area efficiency and multiple fault tolerance. It is dependent on there being a configuration available in which any given resource is set aside as a spare. If only a small amount of spare resource is available then a large number of configurations are needed to cover all possible faults. Allocating more spare resources allows a smaller number of configurations, but that reduces the amount of functionality that can fit in the FPGA. If the system is required to tolerate multiple faults, the number of configurations can easily become prohibitive, especially if they must be stored and recalled in the field. As the configurations are likely to have a significant amount of commonality, compression can be used to mitigate the ROM overhead, though this then requires decoding logic. Splitting the array into tiles allows multiple faults to be tolerated with superior overhead efficiency, but introduces complications if a fault occurs on the interface between adjacent tiles.

4.2.2 Incremental mapping, placement and routing: A popular approach to fault tolerance is to recompute the mapping, placement and routing of the FPGA in the field. This has the potential to be a very efficient method as it can exploit the residual amount of spare resource which is found in virtually all practical FPGA designs. It has also proven to be the only method capable of handling large numbers of faults in an arbitrary pattern. The challenge to overcome is that the mapping, placement and routing tools of an FPGA CAD system must be adapted to operate autonomously on an embedded platform. This will impose constraints on processing power, memory and the time available to compute a new configuration.

A simple method of tolerating faults in logic clusters exists if the cluster can be reconfigured to work around the fault [49, 60]. A typical example of this kind of repair is the swapping of a faulty LUT input with an unused one. Repair within a cluster is attractive because it has only a small impact on the global routing of the device, which makes the repair easy to compute and guarantees only a slight impact on timing performance. However, a repair of this kind is not always possible; there may be no spare resource of the type needed or there may be architectural constraints which prevent it being used without changing other clusters and global interconnect. This is especially true where hardware optimisations are used such as carry chains and distributed RAM.

If there are spare clusters, then these can be used to replace faulty ones. To minimise the impact on timing and routing in the area around the fault, pebble shifting may be used. An illustration of this method is shown in Fig. 6, where a number of clusters are shifted to carry out the repair. In [61], an algorithm is given that calculates the cost of potential shifting moves as a function of additional routing distance and congestion in routing channels. Using this information, the entire repair can be optimised so that it causes the smallest reduction in timing performance and the least perturbation to the wider routing of the device.
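
The idea of costing pebble-shifting moves can be sketched as follows. This is a minimal illustration and not the algorithm of [61]: the grid model, the row-then-column path, the `congestion` map and the cost formula (one unit of extra routing distance per hop plus a weighted congestion penalty) are all assumptions made for the example.

```python
FAULTY = "X"  # marker for a cluster that has been taken out of use


def l_path(src, dst):
    """Cells visited when moving from src to dst along the row, then the column."""
    (r, c), (r2, c2) = src, dst
    path = [(r, c)]
    while c != c2:
        c += 1 if c2 > c else -1
        path.append((r, c))
    while r != r2:
        r += 1 if r2 > r else -1
        path.append((r, c))
    return path


def shift_cost(path, congestion, weight=1.0):
    """Assumed cost: one unit per hop plus a penalty for congested channels."""
    return sum(1 + weight * congestion.get(cell, 0) for cell in path[1:])


def pebble_shift(grid, fault, congestion=None):
    """Shift functions from the faulty cluster towards the cheapest usable spare."""
    congestion = congestion or {}
    paths = []
    for r, row in enumerate(grid):
        for c, func in enumerate(row):
            if func is None:  # None marks a spare cluster
                path = l_path(fault, (r, c))
                if all(grid[pr][pc] != FAULTY for pr, pc in path[1:]):
                    paths.append(path)
    if not paths:
        return None  # no usable spare: this repair method fails
    best = min(paths, key=lambda p: shift_cost(p, congestion))
    # Move functions one hop at a time, starting from the spare end of the path.
    for i in range(len(best) - 1, 0, -1):
        (r, c), (pr, pc) = best[i], best[i - 1]
        grid[r][c] = grid[pr][pc]
    fr, fc = fault
    grid[fr][fc] = FAULTY
    return best


grid = [["a", "b", None],
        ["c", "d", "e"]]
path = pebble_shift(grid, fault=(0, 0))
print(path, grid)
```

Each displaced function moves by only one cluster, which is why the disturbance to surrounding routing stays local. A practical implementation would also need to reroute the nets attached to every shifted cluster.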

Figure 5 An alternative configuration is selected to repair a tile of clusters containing a faulty resource

Figure 6 In pebble shifting, logic functions are shifted to replace a faulty cluster with a spare one

In order to reconnect displaced clusters or to repair faulty interconnect an incremental router can be used. One such


algorithm is developed in [62]. The router is optimised for deployment in the field and hence has a small hardware and memory requirement. Invalid nets are ripped-up and rerouted one at a time within a restricted window of the array. The router determines the placement and routing of the existing design by reading back the configuration memory so it does not need to read or modify a separate netlist database. A similar strategy is adopted by Lakamraju and Tessier [60] and a performance evaluation is given.

Cluster reconfiguration, pebble shifting and incremental routing can be combined to implement a flexible and efficient fault tolerant system. Lakamraju et al. and Abramovici et al. both recognised that cluster reconfiguration is a good first line of defence against faults because it is simple to evaluate and makes use of ‘naturally’ spare resources [49, 60]. However, as it cannot guarantee a repair, pebble shifting and incremental routing are also needed to form a robust system. In [50], the approaches are merged so that a faulty cluster can still be used as a spare for logic expressions which do not require the faulty component. This further enhances the efficiency of the system as fewer dedicated spare clusters are needed.

Although self-redesign in the field is a complex task for an FPGA-based system, it is becoming increasingly feasible as FPGAs become larger and more powerful. Also, an increasing number of applications use microprocessor cores which are either implemented in soft-logic or are embedded into the silicon as dedicated modules. A general-purpose microprocessor platform could be turned over to the task of self-repair when a fault arises, provided it remains operational. An alternative would be to compute a repair configuration remotely and download it to the device; this would require some form of communication resource. Providing the necessary computation or communication resources would create a significant workload for designers integrating this form of fault recovery into a system, though the incremental CAD algorithm itself would be general purpose.

4.2.3 Constrained and coarse-grained repair: Incremental mapping, placement and routing provide a high degree of flexibility for dealing with random fault patterns, especially when cluster reconfiguration and pebble shifting are used together. However, this comes at the cost of increased computational effort for the repair which must be carried out in the field. Some publications have proposed less complex solutions by structuring the design such that the repair mechanism can operate over a limited set of parameters.

In [63], a repair method known as chain shifting is given. Here, clusters are arranged into chains, each with one or more spare clusters at the end. The spare clusters are allocated when the device configuration is compiled along with a set of spare interconnects. If a cluster becomes faulty in the field, the chain can be shifted along to use a spare at the end. The pre-allocated interconnect is used to restore the original connectivity (see Fig. 7). This method has the advantage that it does not require placement and routing in the field; the allocation of repair resources can be computed by software tools within the configuration CAD flow. Also, the worst-case performance of the repaired design is known. However, each chain requires a certain amount of spare resource and can only tolerate a given number of failures. Hence, this method does not use its area overhead as efficiently as more intelligent approaches.

Figure 7 Clusters are shifted along a predefined chain using spare interconnect that has already been allocated

In [29], a multi-level approach to fault tolerance is presented which consists of an array of small, reconfigurable multistage interconnection networks (MINs). Each MIN contains an amount of redundant logic so that some faults can be masked entirely [64]. After this, the system aims primarily to correct faults by reconfiguring the network. If a repair in this way is not possible the system can resort to more extensive slice or device-wide methods such as swapping a whole network with a spare and re-routing the design as necessary. The results show that, on average, several faults can be tolerated on each network using network-level repair. However, repair at this level cannot be guaranteed beyond the first fault. The area overhead of this system is large, given the redundancy that is built in and the use of a parity checker to validate every network.

4.2.4 Evolutionary algorithms: Reconfiguration makes FPGAs well suited to evolutionary algorithms, where random changes are made to a design to overcome faults. The outcome of each change is monitored so that, over time, beneficial changes are retained and the design produces fewer errors. Unlike other fault recovery methods, an evolutionary approach does not need to know the location and nature of all the faults affecting a device. Instead, some form of error checking is used to grade the ‘correctness’ of each attempted configuration.

Evolutionary algorithms were tested in [65, 66] as a means of synthesising and repairing FPGA configurations. In [66], stuck-at faults were simulated in a range of simple logic circuits by changing bits in the configuration data. The configuration then went through an iterative evolution


process. The results showed that in all cases there was an improvement in the correctness of the circuit output. However, a complete repair was not guaranteed even when the fault density was low. The authors suggest that a wider redundant scheme could be implemented around the evolved modules to correct any residual errors.

In [65], the use of redundancy is also suggested to complement an evolutionary repair mechanism. The focus in this work is the use of an evolutionary algorithm to generate a configuration that is tolerant to faults through the use of redundant elements. The granularity of the system is considered and the results show a trade-off between the level of fault tolerance possible and the number of iterations needed to reach a solution.

Demara et al. proposed a complete system for achieving fault tolerance, based on a pool of competitive configurations [30]. To detect errors and assign a ‘fitness’ to each configuration, two functionally identical configurations are invoked from the pool and the outputs are compared. Over time, different pairs are selected and each configuration accumulates a score representing the number of mismatches it has generated. Configurations that exceed a certain threshold are put through a mutation process to attempt to correct the fault. The process is illustrated in Fig. 8.

Evolutionary hardware is a research field that encompasses more than just FPGAs. A finer-grained system based on field-programmable transistor arrays is developed and tested in [67].

An evolutionary approach allows a large degree of flexibility with the number and distribution of faults that can be tolerated. It does not need to carry out tests to locate and classify faults and it can discover ways to use partially faulty resources that would otherwise require complex modelling. It is best suited to numerical and data-flow applications, where faults are less likely to be catastrophic and error-detection circuitry can be added to quantify the error rate.

The main disadvantage to evolutionary fault tolerance is that the overheads are very large. Error detection circuitry must be designed to check the outputs and this could require a large amount of resource if an error-free output is required. Also needed is a controller to manage the configuration updates. As the process is random, there is also no guarantee of how long a solution will take to evolve following a fault, or what its timing performance will be, although improved timing could be selected for by the evolutionary algorithm.

4.2.5 Architectural enhancements for fault tolerance: The majority of fault repair methods that are based on reconfiguration target standard generic or commercial FPGA architectures. There exists some research that considers possible enhancements to FPGA architectures to improve the effectiveness of configuration-based repair.

In [68], the architecture of switch blocks and switch block arrays is considered with respect to the ease of re-routing to avoid an interconnect fault. An algorithm is developed which evaluates the ‘routability’ of a generic parameterised interconnect model when faced with different numbers and types of fault. The results of the analysis show the expected trade-off between better fault tolerance and lower area overhead, and that switch matrices of different topologies exhibit different fault-tolerance characteristics.

Switch matrix design was also explored in [69] and the analysis used to develop a fault tolerant scheme. An algorithm is given which evaluates a given routing channel in terms of a connectivity matrix; this shows all the possible point-to-point connections and the number of these which have alternative routings. From this, it is possible to add extra strategic switches to give the greatest increase in routing redundancy for the smallest overhead. This scheme is aimed primarily at yield enhancement and does not aim to give complete fault coverage or tolerance for multiple faults, as doing these would not be an efficient use of additional silicon area.
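
A toy version of this kind of redundancy analysis is sketched below. It is not the algorithm of [69]: the graph model (nodes as wire segments, edges as programmable switches), the function names and the greedy selection of one extra switch are assumptions made for illustration. A point-to-point connection is counted as having an alternative routing if it survives the failure of any single switch.

```python
from itertools import combinations


def connected(nodes, switches, a, b):
    """Depth-first search: can a reach b through the given switches?"""
    adj = {n: set() for n in nodes}
    for u, v in switches:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {a}, [a]
    while stack:
        n = stack.pop()
        for m in adj[n] - seen:
            seen.add(m)
            stack.append(m)
    return b in seen


def redundant_pairs(nodes, switches):
    """Count connections that still exist after any one switch has failed."""
    count = 0
    for a, b in combinations(nodes, 2):
        if not connected(nodes, switches, a, b):
            continue
        if all(connected(nodes, [s for s in switches if s != failed], a, b)
               for failed in switches):
            count += 1
    return count


def best_extra_switch(nodes, switches, candidates):
    """Greedy choice: the candidate switch that adds the most routing redundancy."""
    return max(candidates, key=lambda c: redundant_pairs(nodes, switches + [c]))


nodes = [0, 1, 2, 3]
chain = [(0, 1), (1, 2), (2, 3)]  # no alternative routes at all
print(redundant_pairs(nodes, chain))
print(best_extra_switch(nodes, chain, [(0, 2), (0, 3)]))
```

In this small example the switch that closes the chain into a ring is preferred, because it gives every connection a second route for the cost of one switch.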

Figure 8 In this example of an evolutionary repair system [30], candidate configurations are taken from the pool, graded against each other and mutated if faulty

5 Future development
Without methods of achieving in-field fault tolerance, the growing challenges of process variation and reliability will limit the progress of FPGA technology. Fortunately, there remains plenty of scope for future work in this field, both in developing the promising approaches seen so far and in exploring new ideas. This section explores some of the possibilities that exist. We start by considering some improvements that can be made to detection and repair techniques using existing technology, and then consider


how fault tolerance could be aided in the long term through architectural enhancements.

5.1 Fault detection
BIST schemes for FPGAs have been widely researched and efficient methods are available for testing connectivity and logic function. The focus now is on self-testing of timing performance, which will become necessary as process variation and degradation become more severe [47]. For fault tolerance, this will allow the degradation of the chip (for certain wear-out mechanisms) to be monitored throughout the lifetime of the chip. Corrective and adaptive action, such as reducing clock speed or rerouting of critical paths, will then be possible.

For some applications, off-line BIST is not possible because fast fault detection is required or maintenance down-time is not available. Modular redundancy provides a reliable solution but it consumes a large amount of FPGA resource. Future development here could take the form of a systematic means of adding redundancy to a design in a more efficient way. Possible approaches include an intermediate level in the design hierarchy such as [29], or analysis of datapath and control logic within the automated design process. Fault detection based on high-level behaviour is another possibility; for example, output properties such as SNR could be monitored for unexplained changes.

5.2 Fault repair
Current methods of incremental mapping, placement and routing are quite effective at reconfiguring current architectures to operate in the presence of faults. Some possible enhancements to the technique are improved repair speed, awareness of timing requirements for critical paths, tolerance of faults in the fault detection/repair kernel and tolerance of faults in DSP and memory blocks.

An alternative to incremental, embedded CAD could be remote repair. If a system already has a means of communicating with a remote server, a repair configuration could be calculated remotely; this way the platform constraints are removed.

Evolutionary algorithms are proven to be capable, in principle, of evaluating repairs for FPGAs. However, more development is needed before they can be considered practical methods of achieving fault tolerance. As well as improving the speed and accuracy of the process, more thought is needed as to how evolution can be implemented in a complete system.

5.3 Fault tolerant architectures
Much of the work to date on fault tolerance has assumed an architecture broadly similar to that of contemporary commercial FPGAs. Some publications have considered parametric modifications to current architectures [68, 69], but there are plenty of opportunities beyond this. The growing importance of fault tolerance allows researchers and, subsequently, vendors to explore FPGA architectures that are designed to accommodate fault tolerance as a fundamental part of the system.

An example is roving test as a means of on-line fault detection. Current implementations are limited by the short periods when the clock must be stopped to move the test blocks. This is imposed by the nature of the point-to-point interconnect and configuration circuitry of current FPGAs. It may be possible to use techniques similar to those of proposed multi-context FPGAs to design an FPGA architecture that is more sympathetic to on-line test and repair. Multi-context FPGAs make more efficient use of hardware resource by allowing instantaneous swapping of blocks [70].

Another limitation which applies to virtually all proposed fault detection and repair methods is that they consider only faults in the programmable components of the FPGA. There are a few examples that consider other parts of the FPGA, such as the configuration network in [56]; however, there remain many components which are not fault tolerant. If a truly robust FPGA system is needed, every part of the FPGA must either be testable and repairable in the field or be intrinsically reliable, for example using redundancy or more robust feature geometry.

Recently, there has been some discussion of coarse-grained network-on-chip FPGA architectures that are better optimised for computing applications. This notion could also be explored as a means of implementing fault tolerance. Complex FPGA applications are likely to be modular by design and may be organised in some form of hierarchy, for example a microprocessor with several cores. If the modules can be reconfigured, tested and repaired independently and can serve as replacements for one another then that provides a method of achieving on-line fault tolerance.

A further avenue exists in the way FPGA configurations are described. Currently, fault repair algorithms based on incremental CAD operate on the affected portion of the configuration bitstream and aim to restore the described netlist. If the design information were stored in terms of higher-level functionality, rather than connectivity, the FPGA would have the freedom to correct faults using any form of available hardware. This would be particularly useful for heterogeneous arrays that contain a variety of different types of hardware resource. Repairs could also be carried out at a system level, using resources outside the FPGA where the fault has occurred.

6 Conclusion
Fault tolerance in FPGAs has been widely studied with works considering fault modelling, detection and repair.


These have exploited advantageous features of FPGAs such as regularity and the ability to reconfigure. This paper has sought to present all the important publications in this field and categorise and compare the techniques that they present.

The motivation behind the works studied here has varied between recovery from SEUs, resilience to manufacturing defects and tolerance of in-field faults because of degradation. The last of these goals is of growing interest because of the increasing challenge of reliability that faces ongoing process scaling. The development of circuit-level techniques to handle degradation could enable FPGAs to provide new opportunities for reliability-critical applications and further push the boundaries of VLSI technology improvement.

There is much scope for future work in this field; the techniques presented have differing strengths and weaknesses, or are suited to limited applications. Practical fault-tolerance schemes will have to be of low overhead and make efficient use of circuit resources. As the threat of degradation grows, so too will the impetus behind this field.

7 References

[1] CAMPREGHER N., CHEUNG P.Y., CONSTANTINIDES G.A., VASILKO M.: ‘Analysis of yield loss due to random photolithographic defects in the interconnect structure of FPGAs’. ACM Int. Workshop on FPGAs, 2005, pp. 138–148

[2] MCCLINTOCK C.: ‘Redundancy circuitry for logic circuits’. U.S. Patent 66 166 559, December 2000

[3] BERG M.: ‘Fault tolerance implementation within SRAM based FPGA designs based upon the increased level of single event upset susceptibility’. Int. On-Line Testing Symp., 2006

[4] STOTT E., SEDCOLE P., CHEUNG P.: ‘Fault tolerant methods for reliability in FPGAs’. Int. Conf. on Field Prog. Logic and Apps., September 2008, pp. 415–420

[5] SRINIVASAN S., MANGALAGIRI P., XIE Y., VIJAYKRISHNAN N., SARPATWARI K.: ‘FLAW: FPGA lifetime awareness’. Design Automation Conf., 2006, pp. 630–635

[6] GUÉRIN C., HUARD V., BRAVAIX A.: ‘The energy-driven hot-carrier degradation modes of nMOSFETs’, IEEE Trans. Device Mater. Reliab., 2007, 7, (2), pp. 225–235

[7] DIETER K., BABCOCK J.: ‘Negative bias temperature instability: road to cross in deep submicron silicon semiconductor manufacturing’, J. App. Phys., 2003, 94, (1), pp. 1–18

[8] CLARKE P., RAY A., HOGARTH C.: ‘Electromigration – a tutorial introduction’, Int. J. Electron., 1990, 69, (3), pp. 333–388

[9] ESSENI D., BUDE J.D., SELMI L.: ‘On interface and oxide degradation in VLSI MOSFETs – part I: deuterium effect in CHE stress regime’, IEEE Trans. Electron Devices, 2002, 49, (2), pp. 247–253

[10] ESSENI D., BUDE J.D., SELMI L.: ‘On interface and oxide degradation in VLSI MOSFETs – part II: Fowler–Nordheim stress regime’, IEEE Trans. Electron Devices, 2002, 49, (2), pp. 254–263

[11] SRINIVASAN J., ADVE S.V., BOSE P., RIVERS J.A.: ‘The impact of technology scaling on lifetime reliability’. Int. Conf. on Dependable Systems and Networks, 2004, pp. 177–186

[12] CHEUNG K.: ‘Can TDDB continue to serve as reliability test method for advance gate dielectric?’. Int. Conf. on Integrated Circuit Design and Technology, 2004

[13] HARRIS I., TESSIER R.: ‘Testing and diagnosis of interconnect faults in cluster-based FPGA architectures’, IEEE Trans. CAD Integ. Circuits Syst., 2002, 21, (11), pp. 1337–1343

[14] NORMAND E.: ‘Single event upset at ground level’, IEEE Trans. Nucl. Sci., 1996, 43, (6), pp. 2742–2750

[15] MOJOLI G., SALVI D., SAMI M.G., SECHI G.R., STEFANELLI R.: ‘KITE: a behavioural approach to fault-tolerance in FPGA-based systems’. Int. Workshop on Defect and Fault Tolerance in VLSI Systems, 1996, pp. 327–334

[16] GIRARD P., HÉRON O., PRAVOSSOUDOVITCH S., RENOVELL M.: ‘Defect analysis for delay-fault BIST in FPGAs’. Int. On-Line Testing Symp., 2003, pp. 124–128

[17] STEININGER A., SCHERRER C.: ‘On the necessity of on-line-BIST in safety-critical applications’. Int. Symp. on Fault-Tolerant Computing, 1999, pp. 208–215

[18] NEUMANN J.: ‘Probabilistic logics and the synthesis of reliable organisms from unreliable components’. ‘Automata studies’ (Ann. of Math. Studies, vol. 34, 1956), pp. 43–98

[19] RAMAMOORTHY C., HAN Y.-W.: ‘Reliability analysis of systems with concurrent error detection’, IEEE Trans. Comput., 1975, C-24, (9), pp. 868–878

[20] CARMICHAEL C.: ‘Triple module redundancy design techniques for Virtex FPGAs’. Xilinx Application Note XAPP197, 2006

[21] D’ANGELO S., METRA C., PASTORE S., POGUTZ A., SECHI G.R.: ‘Fault-tolerant voting mechanism and recovery scheme for TMR FPGA-based systems’. Int. Symp. on Defect and Fault Tolerance in VLSI Systems, 1998, pp. 233–240

[22] D’ANGELO S., METRA C., SECHI G.: ‘Transient and permanent fault diagnosis for FPGA-based TMR systems’. Int. Symp. on Defect and Fault Tolerance in VLSI Systems, 1999, pp. 330–338



[23] MITRA S., MCCLUSKEY E.J.: ‘Which concurrent error detection scheme to choose?’. IEEE Int. Test Conf., 2000, pp. 985–994

[24] BOLCHINI C., SALICE F., SCIUTO D.: ‘Designing self-checking FPGAs through error detection codes’. IEEE Int. Symp. on Defect and Fault Tolerance in VLSI Systems, 2002, pp. 60–68

[25] KRASNIEWSKI A.: ‘Low-cost concurrent error detection for FSMs implemented using embedded memory blocks of FPGAs’. IEEE Design and Diagnostics of Electronic Circuits and Systems, 2006, pp. 178–183

[26] HUANG W., MITRA S., MCCLUSKEY E.J.: ‘Fast run-time fault location in dependable FPGA-based applications’. IEEE Int. Symp. on Defect and Fault Tolerance in VLSI Systems, 2001, pp. 206–214

[27] LO J., FUJIWARA E.: ‘Probability to achieve TSC goal’, IEEE Trans. Comput., 1996, 45, (4), pp. 450–460

[28] LIMA F., CARRO L., REIS R.: ‘Designing fault tolerant systems into SRAM-based FPGAs’. Design Automation Conf., 2003, pp. 650–655

[29] ALDERIGHI M., CASINI F., D’ANGELO S., SALVI D., SECHI G.R.: ‘A fault-tolerant FPGA-based multistage interconnection network for space applications’. IEEE Int. Workshop on Electronic Design, Test and Applications, 2002, pp. 302–306

[30] DEMARA R.F., ZHANG K.: ‘Autonomous FPGA fault handling through competitive runtime reconfiguration’. NASA/DoD Conf. on Evolvable Hardware, 2005

[31] Xilinx Inc.: ‘Xilinx TMRTool product brief’, 2006

[32] DURAND S.: ‘FPGA with self-repair capabilities’. Int. Workshop on Field Programmable Gate Arrays, 1994, pp. 1–6

[33] Xilinx Inc.: ‘Virtex-5 FPGA configuration user guide’ (vol. v2.5, 2007)

[34] STROUD C., KONALA S., CHEN P., ABRAMOVICI M.: ‘Built-in self-test of logic blocks in FPGAs’. VLSI Test Symp., 1996, vol. 14

[35] LU S., YEH F., SHIH J.: ‘Fault detection and fault diagnosis techniques for lookup table FPGAs’, VLSI Des., 2002, 15, (1), pp. 397–406

[36] ITAZAKI N., MATSUKI F., MATSUMOTO Y., KINOSHITA K.: ‘Built-in self-test for multiple CLB faults of a LUT type FPGA’. Asian Test Symp., 1998, pp. 272–277

[37] ALAGHI A., YARANDI M.S., NAVABI Z.: ‘An optimum ORA BIST for multiple fault FPGA look-up table testing’. Asian Test Symp., 2006, pp. 293–298

[38] LIU J., SIMMONS S.: ‘BIST-diagnosis of interconnect fault locations in FPGA’s’. Canadian Conf. on Electrical and Computer Engineering, 2003, pp. 207–210

[39] CAMPREGHER N., CHEUNG P.Y., VASILKO M.: ‘BIST based interconnect fault location for FPGAs’. Int. Conf. on Field Programmable Logic, 2004, pp. 322–332

[40] SUN X., XU J., CHAN B., TROUBORST P.: ‘Novel technique for built-in self-test of FPGA interconnects’. Int. Test Conf., 2000, pp. 795–803

[41] SMITH J., XIA T., STROUD C.: ‘An automated BIST architecture for testing and diagnosing FPGA interconnect faults’, J. Electron. Test. Theory Appl., 2006, 22, (3), pp. 239–253

[42] HARRIS I., TESSIER R.: ‘Diagnosis of interconnect faults in cluster-based FPGA architectures’. Int. Conf. on Computer Aided Design, 2000, pp. 472–475

[43] CHMELAŘ E.: ‘FPGA interconnect delay fault testing’. Int. Test Conf., 2003, vol. 1, pp. 1239–1247

[44] TAHOORI M.B.: ‘Diagnosis of open defects in FPGA interconnect’. IEEE Int. Conf. on Field-Programmable Technology, 2002, pp. 328–331

[45] GIRARD P., HÉRON O., PRAVOSSOUDOVITCH S., RENOVELL M.: ‘High quality TPG for delay faults in look-up tables of FPGAs’. Int. Workshop on Electronic Design, Test and Applications, 2004

[46] ABRAMOVICI M., STROUD C.E.: ‘BIST-based delay-fault testing in FPGAs’, J. Electron. Test., 2003, 19, pp. 549–558

[47] WONG J.S.J., SEDCOLE P., CHEUNG P.Y.K.: ‘Self-characterization of combinatorial circuit delays in FPGAs’. Int. Conf. on Field Programmable Techniques, 2007, pp. 17–23

[48] DOUMAR A., ITO H.: ‘Testing approach within FPGA-based fault tolerant systems’. IEEE Asian Test Symp., 2000, p. 411

[49] ABRAMOVICI M., EMMERT J.M., STROUD C.E.: ‘Roving STARs: an integrated approach to on-line testing, diagnosis, and fault tolerance for FPGAs’. NASA/DoD Workshop on Evolvable Hardware, 2001, p. 73

[50] EMMERT J.M., STROUD C.E., ABRAMOVICI M.: ‘Online fault tolerance for FPGA logic blocks’, IEEE Trans. VLSI Syst., 2007, 15, (2), pp. 216–226

[51] SHNIDMAN N.R., MANGIONE-SMITH W.H., POTKONJAK M.: ‘On-line fault detection for bus-based field programmable gate arrays’, IEEE Trans. VLSI Syst., 1998, 6, (4), pp. 656–666

[52] NAKAMURA Y., HIRAKI K.: ‘Highly fault-tolerant FPGA processor by degrading strategy’. Pacific Rim Int. Symp. on Dependable Computing, 2002, pp. 75–78


[53] KELLY J., IVEY P.: ‘A novel approach to defect tolerant design for SRAM based FPGAs’. Int. Workshop on FPGAs, 1994

[54] HATORI F., SAKURAI T., NOGAMI K., ET AL.: ‘Introducing redundancy in field programmable gate arrays’. Custom Integrated Circuits Conf., May 1993, pp. 7.1.1–7.1.4

[55] KELLY J., IVEY P.: ‘Defect tolerant SRAM based FPGAs’. Int. Conf. on Computer Design, 1994, pp. 479–482

[56] HOWARD N.J., TYRRELL A.M., ALLINSON N.M.: ‘The yield enhancement of field-programmable gate arrays’, IEEE Trans. VLSI Syst., 1994, 2, (1), pp. 115–123

[57] LACH J., MANGIONE-SMITH W.H., POTKONJAK M.: ‘Enhanced FPGA reliability through efficient run-time fault reconfiguration’, IEEE Trans. Reliab., 2000, 49, (3), pp. 296–304

[58] LACH J., MANGIONE-SMITH W.H., POTKONJAK M.: ‘Low overhead fault-tolerant FPGA systems’, IEEE Trans. VLSI Syst., 1998, 6, (2), pp. 212–221

[59] LACH J., MANGIONE-SMITH W.H., POTKONJAK M.: ‘Algorithms for efficient runtime fault recovery on diverse FPGA architectures’. Int. Symp. on Defect and Fault Tolerance in VLSI Systems, 1999

[60] LAKAMRAJU V., TESSIER R.: ‘Tolerating operational faults in cluster-based FPGAs’. ACM Int. Workshop on FPGAs, 2000

[61] NARASIMHAN J., NAKAJIMA K., RIM C.S., DAHBURA A.T.: ‘Yield enhancement of programmable ASIC arrays by reconfiguration of circuit placements’, IEEE Trans. CAD Integ. Circuits Syst., 1994, 13, (8), pp. 976–986

[62] EMMERT J.M., BHATIA D.K.: ‘A fault tolerant technique for FPGAs’, J. Electron. Test., 2000, 16, (6), pp. 591–606

[63] HANCHEK F., DUTT S.: ‘Node-covering based defect and fault tolerance methods for increased yield in FPGAs’. Int. Conf. on VLSI Design, January 1996, pp. 225–229

[64] ADAMS G., AGRAWAL D., SIEGEL H.: ‘A survey and comparison of fault-tolerant multistage interconnection networks’, IEEE Comput., 1987, 20, (6), pp. 30–40

[65] SHANTHI A., PARTHASARATHI R.: ‘Exploring FPGA structures for evolving fault tolerant hardware’. NASA/DoD Conf. on Evolvable Hardware, 2003, pp. 174–181

[66] LARCHEV G., LOHN J.: ‘Evolutionary based techniques for fault tolerant field programmable gate arrays’. Int. Conf. on Space Mission Challenges for Information Technology, 2006

[67] ESSENI D., BUDE J.D., SELMI L.: ‘Fault-tolerant evolvable hardware using field-programmable transistor arrays’, IEEE Trans. Reliab., 2000, 49, (3), pp. 305–316

[68] HUANG J., TAHOORI M.B., LOMBARDI F.: ‘Fault tolerance of switch blocks and switch block arrays in FPGA’, IEEE Trans. VLSI Syst., 2005, 13, (7), pp. 794–807

[69] CAMPREGHER N., CHEUNG P.Y.K., CONSTANTINIDES G.A., VASILKO M.: ‘Reconfiguration and fine-grained redundancy for fault tolerance in FPGAs’. Int. Conf. on Field Programmable Logic, 2006, pp. 455–460

[70] HARIYAMA M., OGATA S., KAMEYAMA M.: ‘Multi-context FPGA using floating-gate-MOS functional pass-gates’, IEICE Trans. Electron., 2006, E89-C, (11), pp. 1655–1661

