IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 3, MARCH 2016
Abstract— Radiation-induced multiple bit upsets (MBUs) are a major reliability concern in nanoscale technology nodes. Occurrence of such errors in the configuration frames of a field-programmable gate array (FPGA) device permanently affects the functionality of the mapped design. Periodic configuration scrubbing combined with a low-cost error correction scheme is an efficient approach to avoid such a permanent effect. Existing techniques employ error correction codes with considerably high overhead to mitigate MBUs in configuration frames. In this paper, we present a low-cost error-detection code to detect MBUs in configuration frames as well as a generic scrubbing scheme to reconstruct the erroneous configuration frame based on the concept of erasure codes. The proposed scheme does not require any modification to the FPGA architecture. Implementation of the proposed scheme on a Xilinx Virtex-6 FPGA device shows that the proposed scheme can detect 100% of MBUs in the configuration frames with only 3.3% resource occupation, while the recovery time is comparable with the previous schemes.
Index Terms— FPGA, multiple bit upsets, reliability, soft errors.
I. INTRODUCTION
is required to detect and correct multiple errors in memory arrays. More specifically, SRAM-based FPGAs are more prone to soft errors as a particle strike in a configuration frame¹ has a permanent impact on the functionality of the mapped design [10]. Since the configuration frames constitute the majority of SRAMs in an FPGA device (e.g., >80% for Xilinx Virtex-6 VLX240T), mitigation of MBUs in configuration frames is of decisive importance.
Several schemes have been presented to address the increasing soft error concern in the FPGA configuration frames. The main objective of these schemes is to reduce error latency, and hence, to avoid error accumulation within configuration frames. Costly modular redundancy [11]–[13] is a conventional technique to tolerate soft errors in both configuration frames and functional logic. However, accumulated errors in both data and configuration bits dramatically limit the mean time to failure of such schemes [14]. Another technique is to optimize the configuration frame circuitry for soft errors as detailed in [15] and [16]. However, such
A. InD Parity
In memory arrays of typical microprocessors such as cache units, the parity bits employed for error detection in each memory entry (i.e., word) are computed during each memory access. Hence, from the performance perspective, it is crucial that each memory entry has its own error detection coding.
B. I2D Parity
In the complete (traditional) 2-D parity, a parity bit is associated with each row (column) and is constructed by XORing all the bits in that particular row (column). In the I2D parity technique, each horizontal (vertical) parity bit is the
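
As a point of reference, the complete (traditional) 2-D parity described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `complete_2d_parity` and the list-of-rows representation of a memory block are assumptions made for the example, not the paper's implementation.

```python
from functools import reduce
from operator import xor
from typing import List, Tuple

def complete_2d_parity(block: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Return (row_parity, column_parity) for a rectangular block of 0/1 bits.

    Assumed representation: block[r][c] is the bit in row r, column c.
    """
    row_parity = [reduce(xor, row, 0) for row in block]        # one bit per row
    col_parity = [reduce(xor, col, 0) for col in zip(*block)]  # one bit per column
    return row_parity, col_parity

# A single flipped bit changes exactly one row parity and one column parity.
golden = [[1, 0, 1],
          [0, 1, 1]]
upset = [row[:] for row in golden]
upset[0][0] ^= 1  # inject one bit flip at row 0, column 0 (hypothetical upset)

print(complete_2d_parity(golden))  # ([0, 0], [1, 1, 0])
print(complete_2d_parity(upset))   # ([1, 0], [0, 1, 0])
```

A block with R rows and C columns needs R + C parity bits under this scheme. For instance, if a frame is viewed as 81 words of 32 bits (an assumption here), that amounts to 81 + 32 = 113 parity bits, which is the cost that interleaving aims to reduce.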
EBRAHIMI et al.: LOW-COST MBU CORRECTION IN SRAM-BASED FPGA CONFIGURATION FRAMES 935
TABLE I
PARITY COMPUTATION FOR COMPLETE 2-D AND I2D
TABLE II
ERROR DETECTION COVERAGE VERSUS INTERLEAVING DISTANCE FOR I2D PARITY
Fig. 5. Three major MBU patterns that cannot be detected by either I2D or traditional 2-D parity and their occurrence probability.
Fig. 6. Example of I3D with vertical, horizontal, and diagonal distance of 4, 3, and 5, respectively.
Fig. 7. Maximum detection coverage obtained by I2D and I3D parity for different number of parity bits.
TABLE III
INTERLEAVING DISTANCE VERSUS ERROR DETECTION COVERAGE FOR I3D PARITY
D. Comparison of I2D and I3D
The number of parity bits required by I2D to reach its maximum detection coverage is equal to that of I3D (both require 10 parity bits). However, the error detection coverage of I2D is less than that of I3D. The maximum error detection coverages for various numbers of parity bits employed in the I2D and I3D parity techniques are shown in Fig. 7. These are obtained by comparing the detection coverage of all possible cases that result in the desired number of parity bits. For instance, for four parity bits in I2D parity, we need to compare the detection coverage for five cases, which are (v = 4, h = 0), (v = 3, h = 1), (v = 2, h = 2), (v = 1, h = 3), and (v = 0, h = 4). As can be seen in this figure, the I3D parity always provides higher detection coverage than I2D parity when the number of parity bits is more than two. Although the I3D parity technique provides more detection coverage with the same number of bits, it occupies more logical resources on the FPGA. This is because more parity bits are required to be stored for the third dimension (diagonal parity) and the complexity of the controller unit increases for the generation of such additional parity bits.
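
The enumeration behind Fig. 7 can be sketched as follows. The helper names (`i2d_splits`, `i3d_splits`, `best_split`) and the `coverage` callback are hypothetical; the paper's coverage figures come from its own MBU pattern analysis, which is not reproduced here, so this sketch only shows how the candidate splits for a given parity-bit budget might be generated and compared.

```python
from typing import Callable, List, Sequence, Tuple

def i2d_splits(total_bits: int) -> List[Tuple[int, int]]:
    """All (v, h) pairs whose vertical + horizontal parity bits sum to total_bits."""
    return [(v, total_bits - v) for v in range(total_bits, -1, -1)]

def i3d_splits(total_bits: int) -> List[Tuple[int, int, int]]:
    """All (h, v, d) triples (horizontal, vertical, diagonal) summing to total_bits."""
    return [(h, v, total_bits - h - v)
            for h in range(total_bits + 1)
            for v in range(total_bits - h + 1)]

def best_split(splits: Sequence[tuple], coverage: Callable[..., float]) -> tuple:
    """Pick the split with the highest coverage; `coverage` is supplied by the caller."""
    return max(splits, key=lambda s: coverage(*s))

print(i2d_splits(4))        # [(4, 0), (3, 1), (2, 2), (1, 3), (0, 4)]
print(len(i3d_splits(10)))  # 66
```

The five pairs printed for four parity bits are the cases listed in the text, and (h = 5, v = 2, d = 3) is one of the 66 candidate triples for a budget of 10 bits.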
Computation of the horizontal and the vertical parity bits in I3D parity follows the same rules explained for I2D parity. Similar to the interleaving technique employed in I2D parity to reduce the number of horizontal and vertical parity bits, in I3D parity, several interleaving groups are formed for the diagonals as well. Then for each group, only one parity bit is computed. In order to uniformly distribute d diagonal parity bits for a configuration frame of size g × f, the interleaving group for the bit position (i, j) could be computed by [i + f · (j − 1)] mod d. An example of the I3D parity technique with vertical, horizontal, and diagonal interleaving distances of 4, 3, and 5, respectively, is shown in Fig. 6.
The error detection coverage of I3D parity for different horizontal, vertical, and diagonal interleaving distances is reported in Table III. As can be seen, the maximum error detection coverage of this coding technique is 100% and could be achieved with at least 10 parity bits (h = 5, v = 2, d = 3). Although the number of parity bits is much less than that of the complete 2-D parity (i.e., 113), this detection technique is able to detect all MBUs.
E. Extension to n Dimensions
In the smaller technology nodes, the sizes of MBU patterns grow; hence, a larger interleaving distance or even more parity dimensions might be required. However, it should be noted that this growth is not linear with the maximum size of MBU patterns. For example, in the MBU patterns of the investigated 45-nm technology, there are MBU patterns with up to 24 affected bits (maximum affected rows and columns of 11 and 8, respectively). However, such large MBU patterns can be detected by at least one of the horizontal, vertical, and diagonal parity bits, which serves the purpose of this paper. As mentioned earlier, our intention is not to locate and correct each erroneous bit in a configuration frame but rather only to detect the existence of the error in a configuration frame and mark it as erased, i.e., the granularity of error detection is a configuration frame, not an individual bit. Thus, even for smaller technology nodes, the interleaving distance will not increase drastically.
While a larger interleaving distance might increase the detection coverage at the expense of some additional parity bits, an additional dimension for the virtual interleaving could
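
The diagonal interleaving rule given for I3D parity, [i + f · (j − 1)] mod d, can be illustrated with a small Python sketch. The function names and the index convention are assumptions made for this example: i is taken as the 1-based position within a row of f bits and j as the 1-based row index, and the tiny frame is hypothetical.

```python
from typing import List

def diagonal_group(i: int, j: int, f: int, d: int) -> int:
    """Interleaving group of bit (i, j) following [i + f*(j-1)] mod d."""
    return (i + f * (j - 1)) % d

def diagonal_parity(frame: List[List[int]], d: int) -> List[int]:
    """XOR together all bits that fall into the same diagonal interleaving group."""
    f = len(frame[0])  # assumed: every row holds f bits
    parity = [0] * d
    for j, row in enumerate(frame, start=1):    # j: 1-based row index (assumed)
        for i, bit in enumerate(row, start=1):  # i: 1-based position in the row (assumed)
            parity[diagonal_group(i, j, f, d)] ^= bit
    return parity

frame = [[0, 0, 0],
         [0, 0, 0]]            # hypothetical 2 x 3 frame, all zeros
frame[0][1] ^= 1               # flip bit (i=2, j=1): group (2 + 0) % 5 = 2
print(diagonal_parity(frame, 5))  # [0, 0, 1, 0, 0]

# Two flips that land in the same diagonal group cancel out in that dimension.
print(diagonal_group(1, 1, 3, 5), diagonal_group(3, 2, 3, 5))  # 1 1
```

The last line shows why a single dimension is not enough on its own: two upset bits in the same diagonal group leave the diagonal parity unchanged, so the detection relies on at least one of the other dimensions flagging the frame.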
TABLE IV
COMPARISON OF DIFFERENT CONFIGURATION FRAME SOFT ERROR MITIGATION SCHEMES
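
The timing comparison that accompanies Table IV reduces to simple arithmetic, since the MTTR is taken as the sum of the error detection time and the error recovery time. The snippet below recomputes the relative MTTR overhead from the two MTTR values quoted in the text (9.343 ms for the previous schemes and 9.694 ms for the proposed one); the function and constant names are illustrative only.

```python
def mttr_overhead_percent(baseline_ms: float, proposed_ms: float) -> float:
    """Relative MTTR increase of the proposed scheme over the baseline, in percent."""
    return (proposed_ms - baseline_ms) / baseline_ms * 100.0

BASELINE_MTTR_MS = 9.343  # previous schemes, as quoted in the text
PROPOSED_MTTR_MS = 9.694  # proposed scheme, as quoted in the text

extra_ms = PROPOSED_MTTR_MS - BASELINE_MTTR_MS
overhead = mttr_overhead_percent(BASELINE_MTTR_MS, PROPOSED_MTTR_MS)
print(f"extra recovery time: {extra_ms:.3f} ms")
print(f"MTTR overhead: {overhead:.2f}%")
```

The difference of about 0.35 ms corresponds to the correction latency reported for 50 clusters, and the ratio comes out close to the 3.75% overhead quoted in the text (the small gap is a rounding effect).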
scheme provides a very high error coverage, the employed FPGA device has only 416 BRAMs and there are not enough BRAM units to store all of the required Hamming bits. Furthermore, this scheme cannot correct some of the MBUs in the Hamming data, which significantly reduces its correction coverage compared with our proposed scheme.
Our proposed scheme with I2D parity detection and 50 clusters needs only four BRAM units and is able to correct 99.30% of the soft errors. By employing the proposed I3D parity technique, the error correction probability increases to 100%. As explained earlier, for both detection techniques, the parity bits could be stored in the redundant bits of the corresponding frame, and hence, do not impose any additional resource overhead. The proposed scheme only occupies 1% of the available BRAMs for storing erasure frames and is a good candidate to be used alongside very large designs. Besides the area overhead for storing the error detection and correction data, all schemes have a common requirement for performing the scrubbing. Since it is a common requirement among all schemes, it is not reported in Table IV.
As reported in Table IV, the error detection time of all the schemes is almost the same. Indeed, all the schemes read the entire configuration frames using the ICAP interface and perform the error checking in parallel with reading frames; hence, the type of coding scheme does not affect the timing. The slight increase in the error detection time of our proposed scheme is due to the time required for scrubbing 50 additional erasure blocks employed for clusters. The error recovery time of the previous schemes is negligible as the error correction data are already loaded to the scrubber unit. Thus, the only added time would be due to the writing of corrected contents to the erroneous frame. In contrast, our proposed scheme requires some additional time to read all frames in the affected cluster and compute the erased frame contents. However, the sum of the error detection and the error recovery time determines the MTTR, which is the average time required by the system to return to its normal operation. The MTTR of previous schemes is 9.343 ms while our scheme has an MTTR of 9.694 ms. The 3.75% overhead in the MTTR is reasonable considering the high error recovery coverage and the low area overhead of the proposed scheme.
Although the Xilinx CRC + Reload scheme can provide 100% recovery coverage, it requires an additional external nonvolatile memory for storing contents of configuration memory. Our proposed technique eliminates the need for such an external memory. Moreover, it significantly reduces MTTR compared with that scheme.
An important point is that although these schemes remove the permanent effect of soft errors from configuration frames, the errors could affect the mapped design functionality until the end of the recovery operation. Since all these schemes have the same behavior from this perspective, this is not included in the comparison.
VI. CONCLUSION
Radiation-induced MBUs are a serious reliability concern in nanoscale technology nodes. Aggressive transistor downscaling and emerging dense integration schemes make FPGAs prone to MBUs. The configuration frames are the most vulnerable resources on the FPGA fabric to soft errors as they constitute the majority of the FPGA memory bits and, once affected by soft errors, permanently change the functionality of the mapped design.
In this paper, we presented a cost-efficient scheme based on erasure codes for MBU detection and correction in the configuration frames of SRAM-based FPGAs. This scheme is implemented as a generic soft core alongside the user design and does not require any changes to the existing FPGA architecture. Compared with the previous solutions, our scheme provides the highest level of MBU protection at very low cost with a negligible recovery time. The implementation results reveal that the proposed scheme occupies only 1% of the memory and 3% of the logic resources on a Xilinx Virtex-6 device. Furthermore, the error correction latency is also very small (0.35 ms for 50 clusters). These results confirm that the proposed scheme is a practical solution for MBU mitigation in FPGA configuration frames.
REFERENCES
[1] A. Dixit and A. Wood, “The impact of new technology on soft error rates,” in Proc. IEEE Int. Rel. Phys. Symp. (IRPS), Apr. 2011, pp. 5B.4.1–5B.4.7.
[2] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar, “Near-threshold voltage (NTV) design: Opportunities and challenges,” in Proc. 49th Annu. Design Autom. Conf. (DAC), 2012, pp. 1153–1158.
[3] C. Hu and S. Zain, “NSEU mitigation in avionics applications,” Xilinx Corporation, San Jose, CA, USA, Appl. Note XAPP1073, 2011, pp. 1–12.
[4] P. Dorsey, “Xilinx stacked silicon interconnect technology delivers breakthrough FPGA capacity, bandwidth, and power efficiency,” Xilinx Corporation, San Jose, CA, USA, White Paper Virtex-7 FPGAs, 2010, pp. 1–10.
[5] Altera Corporation, “Meeting the performance and power imperative of the zettabyte era with generation 10,” Altera Corporation, San Jose, CA, USA, White Paper WP-01200-1.0, 2013.
[6] N. Seifert et al., “Soft error susceptibilities of 22 nm tri-gate devices,” IEEE Trans. Nucl. Sci., vol. 59, no. 6, pp. 2666–2673, Dec. 2012.
942 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 3, MARCH 2016
[7] M. Ebrahimi, H. Asadi, and M. B. Tahoori, “A layout-based approach for multiple event transient analysis,” in Proc. 50th Annu. Design Autom. Conf. (DAC), 2013, pp. 1–6.
[8] E. Ibe, H. Taniguchi, Y. Yahagi, K. Shimbo, and T. Toba, “Impact of scaling on neutron-induced soft error in SRAMs from a 250 nm to a 22 nm design rule,” IEEE Trans. Electron Devices, vol. 57, no. 7, pp. 1527–1538, Jul. 2010.
[9] D. Radaelli, H. Puchner, S. Wong, and S. Daniel, “Investigation of multi-bit upsets in a 150 nm technology SRAM device,” IEEE Trans. Nucl. Sci., vol. 52, no. 6, pp. 2433–2437, Dec. 2005.
[10] F. Siegle, T. Vladimirova, J. Ilstad, and O. Emam, “Mitigation of radiation effects in SRAM-based FPGAs for space applications,” ACM Comput. Surv., vol. 47, no. 2, 2015, Art. ID 37.
[11] C. Carmichael, “Triple module redundancy design techniques for Virtex FPGAs,” Xilinx Corporation, San Jose, CA, USA, Appl. Note XAPP197, 2001.
[12] Y. Ichinomiya, S. Tanoue, M. Amagasaki, M. Iida, M. Kuga, and T. Sueyoshi, “Improving the robustness of a softcore processor against SEUs by using TMR and partial reconfiguration,” in Proc. 18th IEEE Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), May 2010, pp. 47–54.
[13] B. Pratt, M. Caffrey, J. F. Carroll, P. Graham, K. Morgan, and M. Wirthlin, “Fine-grain SEU mitigation for FPGAs using partial TMR,” IEEE Trans. Nucl. Sci., vol. 55, no. 4, pp. 2274–2280, Aug. 2008.
[14] M. Ebrahimi, S. G. Miremadi, H. Asadi, and M. Fazeli, “Low-cost scan-chain-based technique to recover multiple errors in TMR systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 8, pp. 1454–1468, Aug. 2013.
[15] S. Srinivasan, A. Gayasen, N. Vijaykrishnan, M. Kandemir, Y. Xie, and M. J. Irwin, “Improving soft-error tolerance of FPGA configuration bits,” in Proc. IEEE/ACM Int. Conf. Comput. Aided Design (ICCAD), Nov. 2004, pp. 107–110.
[16] B. S. Gill, C. Papachristou, and F. G. Wolff, “A new asymmetric SRAM cell to reduce soft errors and leakage power in FPGA,” in Proc. Design, Autom., Test Eur. Conf. Exhibit. (DATE), Apr. 2007, pp. 1–6.
[17] M. Berg et al., “Effectiveness of internal versus external SEU scrubbing mitigation strategies in a Xilinx FPGA: Design, test, and analysis,” IEEE Trans. Nucl. Sci., vol. 55, no. 4, pp. 2259–2266, Aug. 2008.
[18] A. Sari, M. Psarakis, and D. Gizopoulos, “Combining checkpointing and scrubbing in FPGA-based real-time systems,” in Proc. IEEE 31st VLSI Test Symp. (VTS), Apr./May 2013, pp. 1–6.
[19] A. Sari and M. Psarakis, “Scrubbing-based SEU mitigation approach for systems-on-programmable-chips,” in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2011, pp. 1–8.
[20] G. A. Vera, S. Ardalan, X. Yao, and K. Avery, “Fast local scrubbing for field-programmable gate array’s configuration memory,” J. Aerosp. Inf. Syst., vol. 10, no. 3, pp. 144–153, 2013.
[21] L. Jones, “Single event upset (SEU) detection and correction using Virtex-4 devices,” Xilinx Corporation, San Jose, CA, USA, Appl. Note XAPP714, 2007.
[22] Altera Corporation, “Enhancing robust SEU mitigation with 28-nm FPGAs,” Altera Corporation, San Jose, CA, USA, White Paper WP-01135-1.0, 2010.
[23] S. P. Park, D. Lee, and K. Roy, “Soft-error-resilient FPGAs using built-in 2-D Hamming product code,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 2, pp. 248–256, Feb. 2012.
[24] M. Lanuzza, P. Zicari, F. Frustaci, S. Perri, and P. Corsonello, “A self-hosting configuration management system to mitigate the impact of radiation-induced multi-bit upsets in SRAM-based FPGAs,” in Proc. IEEE Int. Symp. Ind. Electron. (ISIE), Jul. 2010, pp. 1989–1994.
[25] Xilinx Corporation, “LogiCORE IP soft error mitigation controller v3.4,” Xilinx Corporation, San Jose, CA, USA, Product Guide PG036, 2012.
[26] R. H. Morelos-Zaragoza, The Art of Error Correcting Coding. New York, NY, USA: Wiley, 2006.
[27] P. M. B. Rao, M. Ebrahimi, R. Seyyedi, and M. B. Tahoori, “Protecting SRAM-based FPGAs against multiple bit upsets using erasure codes,” in Proc. 51st ACM/EDAC/IEEE Design Autom. Conf. (DAC), Jun. 2014, pp. 1–6.
[28] E. Costenaro, D. Alexandrescu, K. Belhaddad, and M. Nicolaidis, “A practical approach to single event transient analysis for highly complex design,” J. Electron. Test., vol. 29, no. 3, pp. 301–315, 2013.
[29] JEDEC89C Standard, document JEDEC89C. [Online]. Available: http://www.jedec.org/standards-documents, accessed Apr. 2015.
[30] J. S. Plank, “Erasure codes for storage applications,” in Proc. 4th Usenix Conf. File Storage Technol., 2005, pp. 1–74.
[31] L. Rizzo, “Effective erasure codes for reliable computer communication protocols,” ACM SIGCOMM Comput. Commun. Rev., vol. 27, no. 2, pp. 24–36, 1997.
[32] J. S. Plank and M. G. Thomason, “A practical analysis of low-density parity-check erasure codes for wide-area storage applications,” in Proc. IEEE Int. Conf. Dependable Syst. Netw., Jun./Jul. 2004, pp. 115–124.
[33] A. BanaiyanMofrad, M. Ebrahimi, F. Oboril, M. B. Tahoori, and N. Dutt, “Protecting caches against multiple bit upsets using embedded erasure coding,” in Proc. Eur. Test Symp. (ETS), 2014.
[34] J. Kim, N. Hardavellas, K. Mai, B. Falsafi, and J. C. Hoe, “Multi-bit error tolerant caches using two-dimensional error coding,” in Proc. 40th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2007, pp. 197–209.
[35] J. M. Park, E. K. P. Chong, and H. J. Siegel, “Efficient multicast stream authentication using erasure codes,” ACM Trans. Inf. Syst. Secur., vol. 6, no. 2, pp. 258–285, 2003.
[36] S. Baeg, S. Wen, and R. Wong, “SRAM interleaving distance selection with a soft error failure model,” IEEE Trans. Nucl. Sci., vol. 56, no. 4, pp. 2111–2118, Aug. 2009.
[37] M. Ebrahimi, A. Evans, M. B. Tahoori, R. Seyyedi, E. Costenaro, and D. Alexandrescu, “Comprehensive analysis of alpha and neutron particle-induced soft errors in an embedded processor at nanoscales,” in Proc. Design, Autom., Test Eur. Conf. Exhibit. (DATE), Mar. 2014, pp. 1–6.
[38] D. Alexandrescu, “A comprehensive soft error analysis methodology for SoCs/ASICs memory instances,” in Proc. 17th Int. On-Line Test. Symp. (IOLTS), Jul. 2011, pp. 175–176.
[39] K. Rupnow, W. Fu, and K. Compton, “Block, drop or roll(back): Alternative preemption methods for RH multi-tasking,” in Proc. 17th IEEE Symp. Field Program. Custom Comput. Mach. (FCCM), Apr. 2009, pp. 63–70.
[40] D. K. Pradhan and N. H. Vaidya, “Roll-forward and rollback recovery: Performance-reliability trade-off,” in 24th Int. Symp. Fault-Tolerant Comput. (FTCS), Dig. Papers, Jun. 1994, pp. 186–195.
[41] M. Ebrahimi, S. G. Miremadi, and H. Asadi, “ScTMR: A scan chain-based error recovery technique for TMR systems in safety-critical applications,” in Proc. Design, Autom., Test Eur. Conf. Exhibit. (DATE), Mar. 2011, pp. 1–4.
[42] Xilinx Corporation, “Virtex-6 FPGA configuration user guide,” Xilinx Corporation, San Jose, CA, USA, User Guide UG360 (V3.6), 2013.
[43] K. Chapman, “SEU strategies for Virtex-5 devices,” Xilinx Corporation, San Jose, CA, USA, Appl. Note XAPP864 (v2.0), 2010.
Mojtaba Ebrahimi received the B.Sc. degree in computer engineering from Shahed University, Tehran, Iran, in 2008, and the M.Sc. degree in computer engineering from Sharif University, Tehran, in 2010. He is currently pursuing the Ph.D. degree with the Chair of Dependable and Nano Computing, Karlsruhe Institute of Technology, Karlsruhe, Germany.
He was a Research Assistant with the Dependable System Laboratory, Sharif University, from 2010 to 2011. His current research interests include soft error rate estimation of microprocessors and selective protection techniques.
Parthasarathy Murali B. Rao received the bachelor’s degree in electrical engineering from Bharathiar University, Coimbatore, India, and the M.Sc. degree in electrical engineering from Linköping University, Linköping, Sweden.
He was a Research Assistant with the Chair of Dependable Nano Computing, Karlsruhe Institute of Technology, Karlsruhe, Germany, from 2012 to 2014. His current research interests include reconfigurable systems, reliability issues in field-programmable gate array, and multicore systems.
Razi Seyyedi received the B.Sc. degree in computer engineering from Shahed University, Tehran, Iran, in 2011, and the M.Sc. degree in computer engineering from Bonn University, Bonn, Germany, in 2014. He performed the master’s thesis with the Chair of Dependable and Nano Computing, Karlsruhe Institute of Technology, Karlsruhe, Germany, under the supervision of Prof. Tahoori. His current research interests include computer architecture and reliability.
Mehdi B. Tahoori (S’02–M’04–SM’08) received the B.S. degree in computer engineering from the Sharif University of Technology, Tehran, Iran, in 2000, and the M.S. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, USA, in 2002 and 2003, respectively.
He was an Assistant Professor with the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA, in 2003, where he became an Associate Professor in 2009.
He was a Research Scientist with the Fujitsu
Laboratories of America, Sunnyvale, CA, USA, from 2002 to 2003, where he
was involved in advanced computer-aided research, including reliability issues
in deep-submicrometer mixed-signal VLSI designs. He is currently a Full
Professor and the Chair of Dependable Nano Computing with the Department
of Computer Science, Institute of Computer Science and Engineering,
Karlsruhe Institute of Technology, Karlsruhe, Germany. He has authored
over 140 publications in major journals and conference proceedings
on a wide range of topics, from dependable computing and emerging
nanotechnologies to system biology. He holds five pending and granted
U.S. and international patents. His current research interests include
nanocomputing, reliable computing, VLSI testing, reconfigurable computing,
emerging nanotechnologies, and systems biology.
Prof. Tahoori was a Program Committee Member, and a Workshop, Panel,
and Special Session Organizer of various conferences and symposia in VLSI
testing, reliability, and emerging nanotechnologies, such as the Conference
on Information, Telecommunication and Computing, the International
Conference on Computer-Aided Design, Design, Automation and Test in
Europe Conference, European Test Symposium, the International Conference
on Intelligent Computing, Communication and Devices, the Asia and South
Pacific Design Automation Conference, Great Lakes Symposium on VLSI,
and VLSI Design Conference. He was a recipient of the National Science
Foundation Early Faculty Development Award. He is an Associate Editor of
the ACM Journal of Emerging Technologies for Computing. He is also the
Chair of the ACM SIGDA Technical Committee on Test and Reliability.