
J Electron Test (2011) 27:627–633

DOI 10.1007/s10836-011-5245-4

Reliability Limits of TMR Implemented in a SRAM-based
FPGA: Heavy Ion Measures vs. Fault Injection Predictions
Gilles Foucard & Paul Peronnard & Raoul Velazco

Received: 2 September 2010 / Accepted: 17 August 2011 / Published online: 10 September 2011
© Springer Science+Business Media, LLC 2011

Abstract This paper presents experimental results highlighting
the potential weaknesses of a state-of-the-art
fault tolerance strategy, Triple Modular Redundancy
(TMR), when implemented in SRAM-based FPGAs. HW/SW
fault injection campaigns and accelerated radiation
ground tests were performed to quantify the number of
faults, i.e. Single Event Upsets (SEUs), required to provoke
critical failures.
Keywords TMR · FPGA · SRAM · SEUs · Laser tests · Fault
injection · Heavy ions

1 Introduction
Field Programmable Gate Arrays (FPGAs) are much
appreciated by designers because they offer low cost, high
performance, fast time to market and great design
flexibility. Among the available technologies, SRAM-based
FPGAs are well suited for space and avionics applications
thanks to their on-site reconfiguration, a feature which is not
available in Application Specific Integrated Circuits
(ASICs). Although ASICs offer better performance and
require less energy to operate, they are not the most suitable
candidates for small production markets such as the space
and avionics fields, because of their initial production
cost and their lack of reconfigurability.
Responsible Editor: F. Vargas
G. Foucard (*) · P. Peronnard · R. Velazco
Laboratoire TIMA,
Grenoble, France
e-mail: gilles.foucard@imag.fr

Despite the previously mentioned attractive characteristics,
designers are reluctant to use SRAM-based FPGAs
for critical applications as the mission profiles of these
systems may include harsh environments, in which they are
exposed, among other constraints, to the effects of ionizing
radiation [3, 7, 11]. Single-event effects (SEE) induced by
the interaction of energetic particles with integrated circuits
are a well-known threat for space systems directly exposed
to cosmic rays and solar flares. Moreover, they are
nowadays also a concern for applications built from devices
manufactured using modern process technologies and
operating in the Earth's atmosphere. Indeed, such devices
have been shown to be sensitive to the effects of
atmospheric neutrons, even when they operate at ground level.
The most probable consequences of energetic particles
impinging on SRAM-based FPGAs are Single-Event
Upsets (SEUs) and Multiple-Bit Upsets (MBUs) occurring
either in the embedded design or in the configuration memory.
This kind of effect provokes so-called bit-flips,
thus corrupting the content of memory cells. Such errors are
not destructive and can therefore be recovered by a reset.
However, faults induced in the FPGA's configuration memory
are permanent and the only way to cope with them is to
reconfigure the device. Such faults may directly result in a
mutation of the function implemented in the FPGA [10],
whose consequences may be critical.
Protection against SEUs in configuration memory must be
taken into account by the designer. Design level solutions,
such as the well-known Triple Modular Redundancy (TMR)
technique, are often adopted in order to build fault-tolerant
architectures in SRAM-based FPGAs [5], but they always
require finding a compromise between resource overhead /
performance penalty and fault tolerance. However, the TMR
technique has a weakness: its final voter. Indeed, this part of
the design is the only one which does not tolerate faults.
Previous work presented in [15] led to the development of a
platform to inject faults on the same test vehicle as the one
used in this study, the Xilinx Virtex-II SRAM-based FPGA,
by synchronizing the ATLAS laser platform [13, 14] with the
THESIC+ tester. This type of facility is able to generate
SEUs and MBUs from a single laser pulse. Nevertheless, no
correlation has been found between the beam energy and
particle stopping power. Thus, such a beam cannot be used
to quantify the sensitivity of integrated circuits. However,
a map of the sensitivity of the device's different
resources can be obtained from a static test by performing a
step-by-step laser scan of the full width of the die. The
test platform was then upgraded in order to
synchronize the laser pulse with the application clock
pulses. Hence it was possible to accurately trigger the
beam on a selected resource and at a chosen time. Such
dynamic tests showed that the timing of the fault
injection has a significant influence on the output value.
Indeed, for a selected resource it was possible to corrupt
the application's result, or to leave it intact, only by
modifying the laser triggering delay.
The testbed used for laser testing was adapted to perform
additional experiments using different test platforms. As a
result, this paper presents experimental results obtained from
fault injection campaigns performed on four different
versions of a representative application implemented in a
test vehicle: a crypto-core and three derived fault-tolerant
versions (duplex, TMR and X-TMR). The results of the fault
injection experiments performed on the TMR implementation
were complemented by those obtained from radiation
ground testing performed by means of a cyclotron, during
which the tested application was exposed to the effects of
heavy-ion beams.
The paper is organized as follows. Section 2 provides
details of the Device Under Test (DUT). Section 3
describes the experimental platform used in this work. The
crypto-core application implemented in the tested FPGA is
presented in Section 4. The test methodologies are
presented in Section 5. Experimental results obtained from
particle accelerator experiments and fault injection
campaigns are given in Section 6. Finally, conclusions and
perspectives are summarized in Section 7.

Fig. 1 THESIC+ block diagram (labelled blocks: COM FPGA, Power Supply, Ethernet, LEON2 IP, Latchup Mgt, SRAM, Chipset FPGA, DUT, User Design)

2 The Test Vehicle

The DUT used throughout these experiments belongs to the
Xilinx Virtex-II family. The XC2V1000-4FF896C is a 1
million system gates component mounted in an 896-ball BGA
(Ball Grid Array) flip-chip package [16]. The chip is
manufactured in a 0.15 µm / 0.12 µm CMOS 8-layer metal
technology. Heavy ions produced in particle accelerators
have energies much lower than natural ions, thus it is
mandatory to thin down the die to 90 µm for ground testing
in order to allow the particles to reach the sensitive volumes.
The specificity of SRAM-based FPGAs is the configuration
memory used to configure all the DUT resources. In
the case of the DUT chosen for this work, the memory size
is 4,082,592 bits. Each resource has its configuration bits
located next to it, therefore these bits are spread all over the
core area.
A bit-flip in the configuration memory may have a
serious impact on the design itself, for instance by changing
the behavior of the application. Moreover, faults in the
configuration memory are permanent and remain until
the next reconfiguration. Two significant examples are:

• SEUs occurring in the configuration of a Look-Up
  Table (LUT) may change its logic function.
• SEUs occurring in an interconnection may either create
  a short circuit between two routes or open a route.

Virtex FPGAs offer the possibility to read the configuration
memory at any time. This process, called readback, is
used in the following to detect SEUs occurring in the
configuration SRAM.
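To illustrate the first kind of configuration fault mentioned above (an SEU changing the logic function of a LUT), the following minimal sketch models a 4-input LUT as a 16-bit truth table. This is a simplified model written for illustration only: the names `lut_eval`, `flip_bit` and `AND4` are hypothetical, and the encoding does not reproduce the actual Virtex-II bitstream layout.

```python
def lut_eval(truth_table, a, b, c, d):
    """Evaluate a 4-input LUT modelled as a 16-bit truth table.

    Bit i of truth_table is the output for the input combination whose
    binary encoding is i (a is the least significant input).
    """
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (truth_table >> index) & 1


def flip_bit(word, position):
    """Model an SEU: invert one bit of a configuration word."""
    return word ^ (1 << position)


AND4 = 0x8000                   # output is 1 only when a = b = c = d = 1
upset_lut = flip_bit(AND4, 7)   # one bit-flip in the LUT configuration

# Inputs a=1, b=1, c=1, d=0 select truth-table entry 7.
print(lut_eval(AND4, 1, 1, 1, 0))       # fault-free LUT
print(lut_eval(upset_lut, 1, 1, 1, 0))  # LUT after the bit-flip
```

In this model the fault-free LUT outputs 0 for the inputs (1, 1, 1, 0), whereas the upset LUT outputs 1: a single flipped configuration bit has turned the AND function into a different function, and the change persists until the configuration word is rewritten.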

Fig. 2 Fault injection experiment setup

Table 1 FPGA resource utilization

Resources   Single         Duplex         TMR            X-TMR
Slice FF      798 (7%)     1064 (10%)     1330 (12%)     2199 (21%)
LUTs         2176 (21%)    4397 (42%)     6831 (66%)     8189 (79%)
Slices       1434 (28%)    2607 (50%)     3868 (75%)     4543 (88%)

3 THESIC+ Tester

THESIC+ (Testbed for Harsh Environments Studies on
Integrated Circuits) is an upgraded version of the generic
and flexible test platform presented in [4]. Figure 1 depicts
the block diagram of the THESIC+ testbed. Its architecture is
based on two FPGAs. The COM FPGA contains a LEON2
processor and is in charge of the communication between the
user computer and the resources available on the board. It also
monitors the DUT current in order to protect it against Single
Event Latchups (SELs). Data transfers are performed over a
100 Mbit/s Ethernet network. The Chipset FPGA contains the
user design capable of interfacing the DUT with the tester.
A daughterboard embeds the DUT so that the component
to be tested can easily be changed. Figure 2 depicts the
experimental testbed.
Radiation ground testing and fault injection campaigns
were automatically performed by means of a controller
implemented in the Chipset FPGA, which is in charge of the
following tasks:

• Load FPGA configuration binary file and input vectors
  for the DUT application
• Configure the FPGA
• Provide input vectors to the DUT application
• Check the DUT application results
• Readback FPGA configuration memory
• Send results to the computer

Fig. 4 Heavy ion campaign flow diagram (labelled steps: Run TEST; DUT bitstream to THESIC+; Configure DUT; Open shutter; Run application; Output error? (yes/no); Close shutter; Store application outputs; Store DUT Readback; Results to computer)

Fig. 5 Fault injection flow diagram (labelled steps: Run TEST; DUT bitstream to THESIC+; Injection vectors to THESIC+; Configure DUT; Start application; Halt application; Readback FPGA; Inject fault in readback; Configure FPGA; Resume application; Store application outputs; Results to computer)

4 Tested Applications

The chosen application is based on a triple DES (Data
Encryption Standard) algorithm, also named DES3, written
in Verilog [12]. It is based on the DES algorithm, which
encrypts 64-bit data using a 56-bit key in 16 clock cycles.
A DES3 encryption is achieved by three consecutive
DES encryptions, thus requiring three 56-bit keys and 48
clock cycles.
Loading the data and the keys in parallel would require
more Inputs/Outputs than the platform can offer. A
controller was added in order to sequentially initialize the
DES3 to cope with this technical limitation.
The first implemented application, named single, is
composed of two identical and chained DES3. The first one
encrypts the data and the second one does the reverse
process. Consequently the output data should be the same
as the input data.
The second application, named duplex, is composed of
two identical single applications. Thus, two chains of DES3
process the same data at the same time. A comparator tells
whether the two outputs are the same (output value 0) or
different (output value 1).
The third application, named TMR, is composed of
three single chains doing the same calculation. A majority
voter compares the three outputs and provides the status of
the result through five significant values:

• 0 when all the results are identical
• 1 if chain 1 provides a result different from the
  two others
• 2 if chain 2 provides a result different from the
  two others
• 3 if chain 3 provides a result different from the
  two others
• 4 if the three outputs are different

The fourth version, called X-TMR, was generated with the
TMRtool software [17]. This tool, proposed by Xilinx, is
able to automatically implement a TMR scheme for an
application embedded in their SRAM-based FPGA devices.
Robust applications produced by TMRtool were proven to
be more efficient than a conventional TMR. However, the
consequence of this improvement is an increase in the number
of resources used [8]. It is important to note that, unlike the
TMR version, this application does not offer the possibility
to monitor the internal behavior of the implemented
application. Consequently the fault diagnosis is
limited to the identification of a wrong result, by comparing
the DUT outputs to reference values.
The FPGA resource utilization is presented in Table 1.
As expected, the amount of resources required to upgrade
from single to duplex is the same as from duplex to TMR, as
each upgrade requires an additional branch composed of
two DES3 modules. On the other hand, the application
produced by TMRtool requires more resources than the
basic TMR implemented in the TMR application.
The architecture of the three hand-made applications is
given in Figure 3.

Fig. 3 The three tested applications (Single: data loader and two chained DES3; Duplex: data loader, two DES3 chains and a comparator; TMR: data loader, three DES3 chains and a comparator)

In all cases, the validity of the application output is
confirmed by the THESIC+ tester, as it is exposed
neither to the particle beams nor to fault injections.
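The duplex comparator and the five status values of the TMR majority voter described in this section can be sketched in software as follows. This is only a behavioural model under stated assumptions: the function names are hypothetical, the chain outputs are modelled as plain integers, and returning `None` as the voted value when all three chains disagree is an assumption, since the text does not specify which data value is output in that case.

```python
def duplex_compare(r1, r2):
    """Duplex comparator: 0 if both chain outputs match, 1 otherwise."""
    return 0 if r1 == r2 else 1


def tmr_status(r1, r2, r3):
    """Status code of the TMR voter, following the five values listed above."""
    if r1 == r2 == r3:
        return 0          # all results identical
    if r2 == r3:
        return 1          # chain 1 differs from the two others
    if r1 == r3:
        return 2          # chain 2 differs from the two others
    if r1 == r2:
        return 3          # chain 3 differs from the two others
    return 4              # the three outputs are different


def tmr_vote(r1, r2, r3):
    """Return (voted_value, status); voted_value is None when status is 4."""
    status = tmr_status(r1, r2, r3)
    if status == 4:
        return None, status   # assumption: no trustworthy value exists
    # For status 1 the majority is held by chains 2 and 3,
    # otherwise chain 1 belongs to the majority.
    voted = r2 if status == 1 else r1
    return voted, status


print(tmr_vote(0xCAFE, 0xCAFE, 0xCAFE))
print(tmr_vote(0xBAD, 0xCAFE, 0xCAFE))
print(tmr_vote(1, 2, 3))
```

Note that such a model assumes a fault-free voter. In the FPGA the voter is itself built from configurable resources, so a configuration upset may alter it; this software model does not capture that case.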

5 Test Methodologies
This section presents the methodologies used for radiation
ground testing and for fault injection campaigns.
5.1 Heavy Ion Campaign
The radiation ground testing campaign was performed at
the Heavy Ion Facility (HIF) located in Louvain-la-Neuve
(Belgium) [1, 2]. This cyclotron has a shutter which can
stop the particle beam before it reaches the DUT. This
feature allows some mandatory operations to be performed on
the DUT off-beam, for instance
configuring the DUT before executing the application and
reading the SRAM configuration memory to detect the
effects of radiation.

Table 2 Number of errors obtained on the TMR under heavy ions

Particle   1 way error   3 way error   Unexp. TMR error   Critical errors   Nb. of runs    Total fluence
Carbon     51            0             0                  0                 187,469,275    158,543
Argon      1,278         1             3                  34                750,688,226    437,095

Table 3 Mean number of seconds for the occurrence of each fault

Particle   Non-critical error   TMR failure   TMR critical failure   Critical errors
Carbon     7.81                 N/A           N/A                    N/A
Argon      1.25                 1,595         532                    46
Due to the allocated beam time, the TMR application
was exposed to only two different particle beams: Carbon
and Argon. When performing radiation ground testing on
integrated circuits, an important feature of a particle beam is
the energy the particles deposit in silicon, a magnitude called LET
(Linear Energy Transfer). A high-LET particle may
generate a high SEU rate in the configuration memory,
thus provoking errors on the application outputs after a few
runs. The shutter requires a few hundred milliseconds to
respond. Consequently, the exposure time of the DUT must
be significantly greater than the shutter response time in
order to get an accurate measure of the fluence, a magnitude
which is required for the evaluation of the DUT sensitivity to
the beam used. One of the final results of radiation ground
testing is the so-called device cross-section curve, which
provides, for a range of particle LETs, the average number
of particles required to get an event (single or multiple
bit-flip). In our experiments, the selected particles'
LETs were located in the bottom and in the elbow of the DUT
cross-section curve obtained in reference [6]. The Carbon
ion and the Argon ion have LETs of 1.2 MeV/mg/cm² and
10.1 MeV/mg/cm², respectively.
When performing radiation experiments with heavy ions,
the THESIC+ platform and the Virtex-II daughterboard
were both placed in a vacuum chamber, but only the
DUT was exposed to the particles. The so-called accelerated
test consists in exposing the application to the beam while data are
encrypted continuously (Figure 4). After each encryption
sequence, the results and the voter outputs are checked by
THESIC+ against the reference values.
Whenever an error is detected by the TMR voter and/or
by THESIC+, the beam is automatically stopped by a
mechanical shutter. This allows the readback to be performed,
followed by the reconfiguration of the FPGA, with the
guarantee that particles do not perturb this important
step. Afterwards the shutter is opened and the encryption
process starts again.
Table 4 Mean number of
particles required between 2
occurrences of each fault

5.2 Fault Injection Campaign


Fault injection campaigns were conducted by hardware/
software means on the four studied applications. As
depicted in Figure 5, a test starts by copying the FPGA
bitstream from the computer into the THESIC+ memory. A set of
injection vectors is then randomly generated by the
computer using a Mersenne Twister random number generator [9].
Each vector is composed of three values: the clock
cycle during which the fault will be injected, the target byte
and the bit mask to be applied to the target byte. The DUT
is then configured with its bitstream and the application's
execution starts. The THESIC+ tester monitors the application's
clock cycle count in order to halt the crypto-core when it
reaches the injection instant
defined in the vector. When the application is halted,
the FPGA's configuration is read back and the fault is
injected into the retrieved bitstream before reconfiguring the device
with this modified file. The application's execution is then
resumed. Both the encrypted value and the status value
provided by the comparator are stored in the memory.
Finally, the computer retrieves these results and the
procedure starts again.
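The generation of injection vectors and their application to a readback bitstream can be sketched as below. This is a minimal illustration, not the authors' tool: all names are hypothetical, the mask is assumed to be applied with an XOR, each mask is assumed to flip a single bit, and the upper bound on the injection cycle is a free parameter because the text does not state it. Python's standard `random` module is used because it is based on the Mersenne Twister generator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectionVector:
    cycle: int        # clock cycle at which the application is halted
    byte_index: int   # target byte in the configuration bitstream
    bit_mask: int     # mask applied to the target byte


def make_vectors(count, max_cycle, bitstream_len, seed=None):
    """Draw random single-bit injection vectors (assumption: one bit per mask)."""
    rng = random.Random(seed)  # Mersenne Twister based generator
    return [
        InjectionVector(
            cycle=rng.randrange(max_cycle),
            byte_index=rng.randrange(bitstream_len),
            bit_mask=1 << rng.randrange(8),
        )
        for _ in range(count)
    ]


def inject(bitstream, vector):
    """Return a copy of the bitstream with the masked bits of one byte flipped."""
    corrupted = bytearray(bitstream)
    corrupted[vector.byte_index] ^= vector.bit_mask  # assumption: XOR mask
    return bytes(corrupted)


# Hypothetical usage with a dummy all-zero image of the configuration memory.
CONFIG_BYTES = 4_082_592 // 8          # configuration size quoted for the DUT
readback = bytes(CONFIG_BYTES)         # placeholder, not a real bitstream
vector = make_vectors(1, max_cycle=96, bitstream_len=CONFIG_BYTES, seed=1)[0]
faulty = inject(readback, vector)
print(vector, faulty[vector.byte_index])
```

Applying the same vector twice restores the original content, which is convenient when checking that an injection touched only the intended bit.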

6 Experimental Results
6.1 Heavy Ion Campaigns Results
The radiation test campaign provided results on the
behavior of the TMR application when exposed to the selected
particle beams: Carbon and Argon. Four types of errors
were considered and detected:

• 1 way errors are errors detected by the TMR voter,
  showing that one of the TMR branches outputs a
  wrong result. The other two branches being right, the
  voter gives a correct result.
• 3 way errors occur when the three branches output
  different results. The voter cannot provide the correct
  answer, but it warns that the encryption results cannot be
  trusted and so the encryption must be done again.
• Unexpected TMR errors occur when the TMR voter
  provides an error code which does not exist in the
  application. Thus, the result cannot be considered as
  relevant and the encryption must be done again.
• Critical errors occur when the TMR voter states that the
  result is correct while the external THESIC+
  controller detects an error. This reveals a
  critical weakness of the TMR architecture, as an
  undetected error is propagated in the system.

Table 2 presents the total number of encryptions performed,
the particle fluence received by the tested FPGA during
the radiation experiment and the number of errors induced
by the two ion beams. Table 3 shows the
mean time between two occurrences of each type of fault.
Table 4 provides the average number of particles required to
generate each fault. In reference [8] it was shown that the
Virtex-II has a low sensitivity to Carbon ions. Indeed, it
requires about nine times more Carbon ions than Argon
ions to generate an error. The time ratio is different as the
selected fluxes were different.
Table 5 presents the percentage of each type of error
relative to the number of configuration memory SEUs
generated by the particles.

Table 4 Mean number of particles required between 2 occurrences of each fault

Particle   Non-critical error   TMR failure   TMR critical failure   Critical errors
Carbon     3,108                N/A           N/A                    N/A
Argon      342                  437,094       145,698                12,855

Table 5 Percentage for each type of error according to the number of SEUs in the configuration memory

Particle   1 way error      3 way error   Unexpected TMR error   Critical errors   Number of RB SEUs
Carbon     51 (10.83%)      0 (0%)        0 (0%)                 0 (0%)            471
Argon      1,278 (6.08%)    1 (0.005%)    3 (0.014%)             34 (0.16%)        21,030

6.2 Fault Injection Campaign
Table 6 presents the error rates obtained from fault injection
campaigns applied to the previously described
applications. The first observation is that the percentage of
injected faults provoking an error increases when applying
mitigation schemes. This is due to the increase in the
amount of resources required by the applications.
The single application provides a raw output, so the
only type of error observed is the critical error, detected by
THESIC+ when the result is wrong. Only 0.50% of injected
faults provoked an error on the outputs.
The duplex application is able to detect that its two
results differ. In total, 2.35% of the injected faults provoked
an error, of which 2.28% were detected by the comparator
(Table 6). However, it is important to notice that 0.066% of
the injected faults produced errors that were detected by
THESIC+ without being detected by the application.
Finally, the TMR application was able to detect the fault and
propose a correct result for 3.42% of the injected faults. 0.06% of
the injected faults provoked the detection of an error
although the result was correct, and 0.08% of the injections
produced an error which was not seen by the voter.
Depending on the final application, such faults escaping
the fault mitigation scheme may obviously have
critical consequences.
As shown in Table 6, the robustness of the X-TMR
version is significantly higher than that of the basic
TMR implementation. Indeed, provoking an X-TMR
fault requires about three times as many bit-flips
as provoking a standard TMR fault.
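The heavy ion figures reported in Section 6.1 can be cross-checked with simple arithmetic. The sketch below assumes that the values of Table 4 are the total fluence of Table 2 divided by the number of errors of each type, and that the percentages of Table 5 are the number of errors divided by the number of readback (RB) SEUs; these relations are inferred from the published numbers, not stated explicitly in the text, and the function names are hypothetical.

```python
# Counts copied from Table 2 and Table 5 of the paper.
FLUENCE = {"Carbon": 158_543, "Argon": 437_095}
RB_SEUS = {"Carbon": 471, "Argon": 21_030}
ERRORS = {
    "Carbon": {"1 way": 51, "3 way": 0, "unexpected": 0, "critical": 0},
    "Argon": {"1 way": 1_278, "3 way": 1, "unexpected": 3, "critical": 34},
}


def particles_per_error(fluence, n_errors):
    """Mean number of particles between two errors (None if no error seen)."""
    if n_errors == 0:
        return None
    return fluence // n_errors   # truncated to a whole number of particles


def percent_of_seus(n_errors, n_seus):
    """Share of configuration SEUs that led to a given error type, in percent."""
    return 100.0 * n_errors / n_seus


for ion, counts in ERRORS.items():
    for kind, n in counts.items():
        print(ion, kind,
              particles_per_error(FLUENCE[ion], n),
              round(percent_of_seus(n, RB_SEUS[ion]), 3))

# Ratio behind the "about nine times more Carbon ions" statement.
ratio = (particles_per_error(FLUENCE["Carbon"], 51)
         / particles_per_error(FLUENCE["Argon"], 1_278))
print(round(ratio, 2))
```

Under these assumptions the computed values match Table 4 (3,108 for Carbon; 342, 145,698 and 12,855 for Argon), except for the Argon 3 way entry, which comes out as 437,095 against the 437,094 reported in the table. The Carbon to Argon ratio evaluates to about 9.09.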

7 Conclusion and Perspectives


The work presented in this paper compared the results
obtained from radiation ground testing with those resulting from
fault injection campaigns, for a crypto-core application
and three fault-tolerant versions based on duplex and TMR
approaches implemented in an SRAM-based FPGA.
The results show that some faults in
the configuration memory may provoke an application
mutation which results in the inability of the TMR
comparator to detect the fault. Such faults constitute a
critical challenge for applications requiring high reliability,
such as those intended to operate in harsh environments
(space, avionics, etc.). For ground-level applications, the
probability of occurrence of such faults is very low, but the
omnipresence of integrated circuits and systems gives a
non-negligible probability of occurrence even if the impinging
particles, basically neutrons, have very low fluxes.

Table 6 Number of errors predicted with fault injection campaigns

Applications   Detected faults   Wrong error detections   Critical errors   % of faults provoking an error   Number of runs
Single         N/A (0%)          N/A (0%)                 1,339 (0.50%)     0.50%                            267,438
Duplex         4,853 (2.28%)     N/A (0%)                 141 (0.07%)       2.35%                            212,530
TMR            14,564 (3.42%)    237 (0.06%)              319 (0.08%)       3.55%                            426,217
X-TMR          N/A (0%)          N/A (0%)                 191 (0.03%)       N/A                              592,457
It is important to note that in this preliminary fault
injection study, multiple faults (MBUs), which may occur in
advanced integrated circuits as the result of the impact of a
single particle, were not considered. Such faults
should be considered in future work, but this requires precise
knowledge of the layout of the considered memories.
The weakness of the TMR application, the comparator,
can in principle be identified by means of a laser test.
Indeed, experiments developed and performed on such a platform
showed that this kind of facility is able to
inject faults in specific resources of the application. A
map obtained from an exhaustive step-by-step laser
fault injection experiment may provide precise information
on the weak physical points of the TMR application. Such
experiments will be explored in future research work.
Acknowledgment The authors thank Dr. Gary Swift from Xilinx
Corporation for his support with the X-TMR version of the application,
Dr. Vincent Pouget, from IMS/CNRS, for his support with the laser test
experiments and finally M. Guy Berger, from UCL, for his support
while performing heavy ion ground testing.

References
1. Berger G, Ryckewaert G, Harboe-Sorensen R, Adams L (1996)
The heavy ion irradiation facility at CYCLONE - a dedicated SEE
beam line. IEEE NSREC Workshop
2. Berger G, Ryckewaert G, Harboe-Sorensen R (1997) CYCLONE -
A Multipurpose Heavy Ion, Proton and Neutron SEE Test Site,
RADECS Workshop, 51–55
3. Caffrey M, Graham P, Johnson E, Wirthlin M (2002) Single-event
upsets in SRAM FPGAs. In: Proc. of the military and aerospace
applications of programmable devices Intl conference (MAPLD)
4. Faure F, Peronnard P, Velazco R (2002) Thesic+: A flexible system
for SEE testing. In: Proc. of RADECS
5. Kastensmidt FL, Sterpone L, Carro L, Reorda MS (2005) On the
optimal design of triple modular redundancy logic for SRAM-based
FPGAs. Proc. of Design, Automation and Test in Europe
(DATE) 2005. 2:1290–1295
6. Koga R, George J, Swift G, Yui C, Edmonds L, Carmichael C,
Langley T, Murray P, Lanes K, Napier M (2004) Comparison of
Xilinx Virtex-II FPGA SEE sensitivities to protons and heavy
ions. IEEE Trans Nucl Sci 51(5):2825–2833
7. Ma T, Dressendorfer P (1989) Ionizing radiation effects in MOS
devices and circuits. Wiley, New York
8. Manuzzato A, Gerardin S, Paccagnella A, Sterpone L, Violante
M (2008) Effectiveness of TMR-based techniques to mitigate
alpha-induced SEU accumulation in commercial SRAM-based
FPGAs. IEEE Trans Nucl Sci 55(4):1968–1973
9. Mersenne twister: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html
10. Morgan K, Caffrey M, Graham P, Johnson E, Pratt B, Wirthlin M
(2005) SEU-induced persistent error propagation in FPGAs. IEEE
Trans Nucl Sci 52(6):2438–2445
11. Normand E (1996) Single-event effects in avionics. IEEE Trans
Nucl Sci 43(2):461–474
12. Opencores: http://www.opencores.org/project,des
13. Pouget V, Fouillat P, Lewis D (2006) Using the SEEM software for
SET testing and analysis, Radiation effects in embedded systems,
to be published by Springer
14. Pouget V, Wan D, Jaulent P, Douin A, Lewis D, Fouillat P (2006)
Recent developments for SEE testing at the ATLAS laser facility.
Proc. of 15th Single-Event Effects Symposium
15. Pouget V, Douin A, Lewis D, Fouillat P, Foucard G,
Peronnard P, Maingot V, Ferron JB, Anghel L, Leveugle R,
Velazco R (2007) Tools and Methodology Development for
Pulsed Laser Fault Injection in SRAM-Based FPGAs, 8th
Latin American Test Workshop (LATW 2007), Cusco (Peru),
11–14
16. Xilinx (2005) Virtex-II platform FPGAs: complete data sheet,
http://www.xilinx.com, March
17. XTMR Tool User Guide (2004) Xilinx User guide UG156

Gilles Foucard received M.S.E. and Ph.D. degrees in Micro-Electronics
and Computer Science from the University of Grenoble
(France) in 2005 and 2010 respectively. He is currently a Post-Doc
at TIMA Laboratory (Grenoble, France), where he is in charge of
writing a handbook intended to help designers choose and
implement mitigation techniques in their applications.
Paul Peronnard received a Ph.D. degree in Micro-Electronics and
Computer Science from the University of Grenoble (France) in 2010.
He is currently a Post-Doc at TIMA Laboratory (Grenoble,
France), where he is in charge of the HW/SW development of a
platform devoted to evaluating the sensitivity of commercial SRAM
memories to high-altitude natural radiation.
Raoul Velazco has been with the CNRS (French Research Agency) since
1984, where he is Director of Research. Dr. Raoul Velazco is the
Co-Leader at TIMA Laboratory (Grenoble, France) of the ARIS research
group: Architectures for complex and Robust Integrated Systems.
His main research topics are the study of the effects of radiation on
integrated circuits, the development of test methods and tools for
complex circuits (processors, FPGAs, ASICs, microcontrollers, etc.)
and the design and exploitation of experiments intended to operate at
high altitude (balloons, airplanes, satellites) with the goal of
revealing the faults induced by energetic particles (present in the
Earth's atmosphere and in space) and exploring the efficiency of
hardware and/or software fault tolerance solutions. He has more than
180 publications, 36 of them in the prestigious IEEE Transactions on
Nuclear Science.