
IEEE TRANSACTIONS ON RELIABILITY, VOL. 62, NO. 1, MARCH 2013

Using Single Error Correction Codes to Protect Against Isolated Defects and Soft Errors
Costas Argyrides, Member, IEEE, Pedro Reviriego, Member, IEEE, and Juan Antonio Maestro, Member, IEEE
Abstract—Different techniques have been used to deal with defects and soft errors. Repair techniques are commonly used for defects, while error correction codes are used for soft errors. Recently, some proposals have been made to use error correction codes to deal with defects. In this paper, we analyze the impact on reliability of such approaches, in which error correction codes resolve defects in addition to soft errors at the cost of a reduced ability to correct soft errors. The results show that low defect rates or small memory sizes are required to have a low impact on reliability. Additionally, a technique that can improve reliability is proposed and analyzed. The results show that our new approach can achieve a similar reliability in terms of time to failure as that of a defect free memory at the cost of a more complex decoding algorithm.

Index Terms—Defects, error correcting codes, fault tolerance, soft errors.

ACRONYMS

ECC        Error correcting codes
1-D        One-dimensional redundancy
2-D        Two-dimensional redundancy
SEU        Single-event upsets
SEC        Single error correction
MCU        Multiple cell upsets
SEC-DED    Single error correction-double error detection
METF       Mean number of events to failure
MTTF       Mean time to failure

I. INTRODUCTION

AS TECHNOLOGY scales, reliability becomes a challenge for CMOS circuits. Reliability issues appear, for example during device manufacturing, as defects that can compromise production yield. Once the devices are in the field, other reliability issues appear in the form of soft errors or age-induced permanent failures. Memory devices are among those affected by those issues due to their high level of integration. Current techniques to address those reliability issues in memories include the use of redundant elements to repair manufacturing defects, and the use of Error Correcting Codes (ECC) to deal with soft errors once the device is in operation. Different techniques are used to deal with defects versus soft errors. ECC can also be used to correct errors caused by defects, but then their ability to correct soft errors may be compromised, leading to a reduced reliability. However, to the best of our knowledge, there is no previous work on how the use of ECC to deal with defects affects the reliability of memory in the field.

In this paper, an effective technique to use ECC to deal with isolated defects and soft errors on memory chips is presented. A technique that can cope with either stuck-at defects, soft errors, or both at the same time is illustrated. The following analysis shows that the reliability is approximately the same as when the code is used to correct soft errors only.

II. RELATED WORK

The technology scaling process provides high-density, low cost, high-performance integrated circuits. These circuits are characterized by high operating frequencies, low voltage levels, and small noise margins with increased defect rate [1]. To cope with defects in memory chips, many different techniques have been proposed, all of them based on the use of redundant elements to replace defective ones. Those techniques vary from those applied during the manufacturing process, in the test phase, to the use of built-in circuits able to repair the memory chips even during normal operation in the field, with different tradeoffs in terms of cost and speed. Redundant rows and columns have been widely used in memory design to cope with this problem.

One-dimensional (1-D) redundancy is the simplest variation in which only redundant rows (or columns) are included in the memory array and used to replace the defective rows (or columns) detected during test. The main advantage of this approach is that its implementation does not require any complex allocation algorithms. Unfortunately, its repair efficiency can be low because a defective column (row) containing multiple defective cells cannot be replaced by a single redundant row (column). Examples of such techniques are presented in [2] and [3].

In [4] and [5], the authors proposed a two-dimensional (2-D) redundancy approach which improves the efficiency of the 1-D approach. This approach adds both redundant rows and columns to the memory array to provide more efficient repair when multiple defective cells exist in the same row or column of the array. When multiple faulty cells are detected, the choice between the use of a redundant row or a redundant column to replace them

Manuscript received February 19, 2012; revised August 24, 2012; accepted August 24, 2012. Date of publication January 29, 2013; date of current version February 27, 2013. Associate Editor: J.-C. Lu. C. Argyrides is with the Research Division, EVOLVI.T., 3010 Limassol, Cyprus (e-mail: costas@computer.org). P. Reviriego and J. A. Maestro are with the Departmento de Ingenieria Informatica, Universidad Antonio de Nebrija, 28040 Madrid, Spain (e-mail: previrie@nebrija.es; jmaestro@nebrija.es). Digital Object Identifier 10.1109/TR.2013.2240901

0018-9529/$31.00 © 2013 IEEE

is made based on the maximum repair capability of each alternative. The main drawback of this approach is that the optimal redundancy allocation problem becomes NP-complete as discussed in [6] and [7]. Although many heuristic algorithms have been proposed to solve this problem, it is still difficult to develop built-in repair implementations using them.

For both redundancy approaches, when the number of defective cells in the array exceeds the repair capability through the use of redundant elements, the last alternative before discarding the defective chip is to try to use it as a downgraded version of memory. For example, when all remaining defective cells are located in one half of the array, the other half can still be used as a memory with reduced capacity. This reduction is done by permanently setting the most significant bit of the addresses either to 0 or 1, depending on which part of the memory is to be used. Unfortunately, in most cases, the remaining defective cells are evenly distributed across the whole array, and not clustered in one half of the array, making this technique useless.

Another major issue in designing memory chips in submicron technologies is the susceptibility to single-event upsets (SEU) produced by atmospheric neutrons and alpha particles. When these particles hit the silicon bulk, they create minority carriers, which if collected by the source-drain diffusions, could change the voltage level of the node [1]. This change in the voltage level will change the state of the transistor, which will result in a change of the value in a memory cell. For example, if a memory cell holds “1,” an SEU will force it to “0.” Radiation induced soft errors are a major issue for reliability, and is a critical factor for memories when they operate in environments where there are many sources of error [8].

Traditionally, memories have been protected with Single Error Correction (SEC) codes [9]–[12] that can correct up to one error per memory word. Per-word parity bits are also commonly used when the objective is only to detect errors. However, these techniques will fail in the appearance of multiple cell upsets (MCU). The most common approach to deal with multiple errors has been the use of interleaving in the physical arrangement of the memory cells, so that cells that belong to the same logical word are separated. As the errors in an MCU are physically close as discussed in [15], they will cause single errors in different words that can be corrected by the Single Error Correction-Double Error Detection (SEC-DED) codes. However, as discussed in [16], [17], interleaving cannot be used, for example, in small memories or register files, and in other cases, its use may have an impact on floor-planning, access time, and power consumption.

Another technique to cope with SEUs is the scrubbing process [18]. The scrubbing process periodically reads the memory words, and corrects the errors so that only if two errors arrive in the same scrubbing period can a failure occur. This approach is usually a valid solution to prevent error accumulation.

There have been some approaches that propose to use redundancy (as explained before) together with Error Correction Codes (ECC) to improve the protection effectiveness. In [13], a mechanism is proposed to identify if detected errors are permanent, or temporary (soft). In the permanent case, redundancy would be used to solve the problem. In the temporary case, ECC would correct the soft errors. In [14], redundancy is used when the number of defects is large enough to utilize a whole spare row or column, leaving SEC-DED codes to handle soft errors. However, for isolated errors, this approach produces an excessive waste of redundant bits. In these cases, ECC is proposed as a better option. However, using ECC to handle a permanent defect would leave that particular word unprotected if a soft error affects it. This outcome can significantly reduce memory reliability.

In this paper, we propose to use the standard memory protection approach of SEC-DED plus interleaving to deal not only with soft errors but also with isolated stuck-at defects. Stuck-at defects are the faults where memory cells (as well as lines or transistors) permanently store (stuck-at) the same value regardless what is supposed to be saved. A technique is proposed to locate the defects such that they do not compromise the ability of SEC-DED codes to correct single soft errors, even when a word has a stuck at defect. This approach enables us to achieve a similar reliability as that of a defect free memory. This alternative would provide several benefits:
• detection of defects,
• detection of defects that appear on the field (after the manufacturing process, and therefore not detected in the test processes), and
• address isolated defects, and clustered defects that are transformed into isolated defects at a logical level when interleaving is used.

III. PROPOSED TECHNIQUE

The main problem when using SEC-DED correction to deal with manufacturing defects is that words in which a cell has a defect are left unprotected, as a single soft error will cause a failure. However, for defects that manifest as isolated stuck-at failures such that a cell that is read always gives the same value, an alternative correction scheme can be used to improve reliability.

The proposed technique is as follows. When a word is read and an error is detected, if it is classified as a single error, then it is corrected as in a normal SEC-DED memory. But, if an uncorrectable error is detected, a procedure is triggered to detect if the word contains defects. This procedure stores the contents of the word in a register, and then writes all-zeros into the word and reads it back to check that there are no errors. The same operation is then done for the all-ones pattern. If there is a stuck-at defect on that word, a defect will be detected. If there is no defect, a failure is triggered as the error is in fact uncorrectable. However, if there is a defect, the corresponding bit in the register is inverted, and the modified word is decoded again. This technique can effectively correct words that contain either a soft error, or a stuck-at defect, or both simultaneously.

An example is illustrated in Fig. 1. In this case, two bits are affected by errors: one by a soft error, and another by a defect. When the word is read, two errors are detected, thus uncorrectable. Following the proposed technique, the procedure will detect the defect and locate it, and that defective bit will be inverted. In this case, the word will be correctly decoded as there is only a single error (the soft error). Therefore, when the proposed technique is used, single bit errors will not cause failures, even if they affect a word that contains a stuck-at defect.
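The read procedure of Section III can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a toy (8,4) extended Hamming code as the SEC-DED code, and the names `encode`, `decode`, `MemoryWord`, and `read_word` are hypothetical. Writing the decoded word back after the test patterns is also an assumption, since the text does not say how the word is restored.

```python
N = 8  # codeword length: 4 data bits, 3 Hamming check bits, 1 overall parity bit


def encode(data):
    """Encode 4 data bits into an (8,4) extended Hamming (SEC-DED) codeword."""
    d1, d2, d3, d4 = data
    # Hamming positions 1..7 are p1 p2 d1 p3 d2 d3 d4; index 7 is the overall parity.
    word = [d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d1, d2 ^ d3 ^ d4, d2, d3, d4]
    return word + [sum(word) % 2]


def decode(word):
    """Return (status, word); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if word[pos - 1]:
            syndrome ^= pos
    parity_odd = sum(word) % 2 == 1
    if syndrome == 0 and not parity_odd:
        return "ok", list(word)
    if parity_odd:
        # Odd number of flipped bits: treated as a single error and corrected.
        fixed = list(word)
        fixed[syndrome - 1 if syndrome else 7] ^= 1
        return "corrected", fixed
    return "uncorrectable", list(word)  # even number of flips: double error


class MemoryWord:
    """One memory word; `stuck` maps a bit index to its stuck-at value."""

    def __init__(self, stuck=None):
        self.stuck = dict(stuck or {})
        self.cells = [self.stuck.get(i, 0) for i in range(N)]

    def write(self, bits):
        self.cells = [self.stuck.get(i, b) for i, b in enumerate(bits)]

    def read(self):
        return list(self.cells)

    def soft_error(self, i):
        if i not in self.stuck:  # a stuck cell cannot be flipped
            self.cells[i] ^= 1


def read_word(mem):
    """Decode normally; on an uncorrectable error, look for stuck-at bits."""
    register = mem.read()
    status, fixed = decode(register)
    if status != "uncorrectable":
        return status, fixed
    mem.write([0] * N)
    zeros = mem.read()
    mem.write([1] * N)
    ones = mem.read()
    defects = [i for i in range(N) if zeros[i] != 0 or ones[i] != 1]
    if not defects:
        mem.write(register)  # assumption: restore the raw contents
        return "failure", register
    for i in defects:
        register[i] ^= 1  # invert the defective bit held in the register
    status, fixed = decode(register)
    mem.write(fixed)  # assumption: write the decoded word back
    return ("failure" if status == "uncorrectable" else status), fixed


if __name__ == "__main__":
    codeword = encode([1, 0, 1, 1])
    mem = MemoryWord(stuck={4: 1})  # bit 4 is stuck at 1, the codeword needs 0
    mem.write(codeword)
    mem.soft_error(6)               # one soft error on top of the defect
    print(read_word(mem))
```

In the demo, the defect and the soft error together look like a double error, the test patterns locate the stuck bit, and the second decoding sees only the soft error. In this toy code every triple error is decoded as a single error, so a defect hidden by matching data plus two soft errors ends in a silent miscorrection.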

This approach would greatly increase the reliability when ECC is used to correct defects. In Fig. 4, an exhaustive analysis of the different sorts of error-defect combinations is presented showing how the algorithm would be applied, and its outcome (only up to two soft errors are considered).

Fig. 1. Example of a correctable error.

In the proposed approach, failures will occur when two soft errors affect a word that contains a defect. This case is illustrated in Fig. 2. In this case, the defect does not cause an error as the original data matches the stuck-at value. In this case, the proposed technique will transform the double error into a triple one, which in turn can provoke a miscorrection. When trying to decode this word, there are two possible outcomes: 1) an uncorrectable error is flagged, or 2) a correctable error is detected. This last outcome is due to the fact that SEC-DED codes have a Hamming distance of four, and therefore triple errors can be interpreted as single errors and miscorrected. If a single error is detected, it is corrected, and otherwise a failure is flagged. For case 1), an uncorrectable error will be detected, causing a failure. For case 2) a miscorrection will be performed, causing a failure with silent data corruption. In both cases, there will be a failure, but in the second the failure will not be detected. Obviously, this last type of failure is more dangerous. This result is obviously a drawback of the proposed technique, whose impact will be analyzed in the next section.

Fig. 2. Example of a situation in which the proposed method can provoke a miscorrection or a failure.

There is another situation for a word that suffers two soft errors and a defect, as illustrated in Fig. 3. For the second case, when reading the word, there are three errors: a defect, and two soft errors. In this case, the procedure will correct the defect, and end with a double and therefore uncorrectable error.

Fig. 3. Example of a situation in which the proposed method can provoke a miscorrection or a failure.

The technique described in Fig. 4 may be implemented not only in memories but in existing devices that implement SEC-DED as well. In those cases, there is no additional cost as the proposed technique can be implemented in the system processor. In the case of memories, it requires a register to store the word while the defect detection procedure is used, and some control logic to implement the correction algorithm. The cost of that logic should be negligible compared to the rest of the memory. However, for the proposed technique to work, it needs access to all bits in each word, including the SEC-DED bits. In terms of speed, two additional read operations and two additional write operations are required to detect the defects. Therefore, the use of the proposed technique should have a negligible impact on average access time as it only slows down accesses when there are multiple errors in a word, a situation that should occur in a very small percentage of the accesses.

IV. ANALYSIS

In this section, we study the reliability of a memory on which SEC-DED is used to deal with soft errors and isolated stuck-at failures. For the analysis, soft errors are assumed to arrive following a Poisson process, and they are uniformly distributed among all memory cells as in previous studies [10], [11]. Finally, a word will contain a defect with probability F (per-word defect rate).

The analysis starts by considering a memory in which the proposed approach is not used, and words are decoded normally. In this case, if we consider only failures caused by an error falling on a word that contains a defect, the mean number of events to failure (METF) would be (1) And the METF for soft errors only would be (see [10]) (2) where M is the number of memory words. Failures caused by soft errors only will be dominant when (3) If we define the METF ratio as (4) then (3) holds when (5)
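The two failure mechanisms compared in this analysis can be explored numerically. The sketch below is not a transcription of (1) to (5). It assumes that each soft error lands on a uniformly random word, so that the first hit on a defective word takes about 1/F events, and that two hits on the same word take about sqrt(pi*M/2) events (the birthday-problem approximation). The function names, the orientation of the ratio (soft-error METF over defect METF), and the simplification that a defect always shows up as an error are assumptions of this illustration.

```python
import math
import random


def metf_defect(defect_rate):
    """Approximate events until a soft error hits a defective word (assumed 1/F)."""
    return 1.0 / defect_rate


def metf_soft(num_words):
    """Approximate events until two soft errors hit the same word (birthday bound)."""
    return math.sqrt(math.pi * num_words / 2.0)


def metf_ratio(num_words, defect_rate):
    """Assumed ratio: soft-error METF over defect METF; small means soft errors dominate."""
    return metf_soft(num_words) / metf_defect(defect_rate)


def max_defect_rate(num_words, ratio=0.01):
    """Largest defect rate that keeps the assumed ratio at or below `ratio`."""
    return ratio / metf_soft(num_words)


def simulate_metf(num_words, defect_rate, use_technique, trials=2000, seed=1):
    """Monte Carlo estimate of the mean number of soft errors until the first failure.

    Without the technique, one soft error on a defective word is a failure.
    With it, any word fails only on its second soft error.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        defective = [rng.random() < defect_rate for _ in range(num_words)]
        hits = [0] * num_words
        events = 0
        while True:
            w = rng.randrange(num_words)
            events += 1
            hits[w] += 1
            tolerated = 1 if (use_technique or not defective[w]) else 0
            if hits[w] > tolerated:
                break
        total += events
    return total / trials


if __name__ == "__main__":
    words = 32 * 2**20  # a memory with 32 M-words
    print("defect rate bound for ratio 0.01:", max_defect_rate(words))
    print("simulated METF, standard SEC-DED:", simulate_metf(1000, 0.2, False))
    print("simulated METF, defect-aware read:", simulate_metf(1000, 0.2, True))
```

Under these assumptions, the tolerable defect rate shrinks as 1/sqrt(M), which is the qualitative trend of needing lower defect rates for larger memories. The simulation uses a deliberately high defect rate so that the gap between the two decoding policies is visible in a short run.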

As an example, in Fig. 5, the defect rates (F) and memory sizes (M) for which condition (5) is valid are illustrated. As a reference, a value of 0.01 is shown as a straight line. The plots show how as memory sizes increase lower defect rates are required to ensure that defects do not affect reliability. If a memory with 32 M-words is considered, a defect rate of less than is needed to ensure that defects do not affect memory reliability. This result means that, as technology scales and larger memory sizes are used, smaller defect rates are needed to ensure that defects do not affect memory reliability. This impact of defects becomes more prominent if scrubbing is used, as the METF for soft errors will increase substantially while the METF for defects would remain the same, further restricting the values of F and M for which (3) is valid. Therefore, the use of standard SEC-DED to deal with isolated stuck-at failures will not be effective unless the defect rate is very small.

Fig. 4. Procedures when a word with defects or errors or both is read.

Fig. 5. Value of the METF ratio in (5) for different values of F and M.

When the proposed technique is used, failures would occur only when at least two soft errors affect a given word. Therefore, the time to failure would be exactly the same as for a defect free memory protected with SEC-DED. The same result applies when scrubbing is used as the same number of soft errors are required to cause a failure in a cell with or without defects. The main issue with the proposed approach is that failures that are

Etoh.. Y. Hsiao. M. Y. and C.-T. Pham. Available: http://public. Ottavi. and R. and C. “SRAM interleaving distance selection with a soft error failure model. On-Line Testing Symp. 1998. 1990. Goodman. W.” in Proc. Some of the values have been obtained from reference [21]. Itoh. 22nd IEEE VLSI Test Symp. 1991. (VLSI) Syst. Yeh. “Multiple transient faults in logic: An issue for next generation ICS?. pp.-C. An analysis has been presented to evaluate the defect rates and memory sizes for which SED-DED codes with no modifications can be used to deal with defects without compromising reliability. Toma. in memory technology.net/ Jul. Symp. Defect Fault Tolerance in VLSI Syst. Higgins. leading to silent data corruption. 3. “Built-in self repair for embedded high density SRAM. no. This approach would provide a complete approach to protect against defects and soft errors in memories. 37. Huang. 2006. Re.-H. Techn. no. 2008. Int. 21. Oct. Achouri.-W. pp. Pontarelli. Aoki. VOL. Huang. and large Mean Time to Failures (MTTF). .” IEEE Electron Device Lett. “An algorithm for row–column self-repair of RAMs and its implementation in the Alpha 21264. Cardarilli. “Multiple bit upset tolerant memory using a selective cycle avoidance based SEC-DED-DAEC code. A. 476–491. pp.” in Proc. Komoriya. 60–64.S. Anghel. International Technology Road Map for Semiconductors [Online]. C. and R. Defect Fault Tolerance VLSI Syst. 884–896. pp. Lewandowski. Wang. no. 1970. Inf. Testing. 4.-K. [11] R. vol. or provokes silent data corruption that affects the system (if not detectable). and F. Under this assumption.. vol.” IBM J. 26. vol. 1995. Sayano. M.” IEE Proc. K. “New approaches for the repairs of memories with redundancy by row/column deletion for yield enhancement. “Reliability of scrubbing recovery techniques for memory systems. pp. 81–89. and A. Dec. Sep.-W. The miscorrection probability depends on the type of SEC-DED code. 114–122. 37. vol. Computer-Aided Des.. Dig. 
Serrano.242 IEEE TRANSACTIONS ON RELIABILITY. Theory. Y.Develop. Y. “Reliability of semiconductor RAMs with soft-error scrubbing techniques. 4. Zorian. Goessel. 14.” in Proc. 313–318. Richter. M.. Yang. pp. Saleh. NO. Wu. Apr. Nicolaidis. 5.-N. [12] M. “The reliability of semiconductor ram memories with on-chip error-correction coding. no.” IEEE Trans. pp. 142. Comput. “Efficient built-in redundancy analysis for embedded memories with 2-D redundancy.” in Proc. Mar. 62. Touba. 114–119. vol. To analyze the probability that a failure is not detected. [3] I. the proposed technique can effectively deal with stuck-at defects without compromising reliability. “An integrated ECC and redundancy repair scheme for memory reliability enhancement. 2005. pp. K.. “A class of optimal minimum odd-weight column SEC-DED codes. G..” IEEE J. V.-H. Rel. 323–328. and J. CONCLUSIONS In this paper. vol. pp. 1. the probability that a failure is an undetected failure that affects the system is equal to the defect rate F multiplied by the miscorrection probability for the SEC-DED code used. no.T. 2004. and L. and T. REFERENCES [1] I. R. J. Symp. 14th IEEE Int. 34–42. Lu. K. Kim. no. Nuclear Sci. 39. vol. “A diversified memory built-in self-repair approach for nanotechnologies. no. 310–312. no. pp. Salsano. Blaum. May 1991. (VTS04). 9. “Built-in self-test and repair (BISTR) techniques for embedded RAMs. Wen. 1.. 1112–1119. 1. The main issue of the proposed approach is that a small percentage of the failures will be undetected. Rel.. 2010.” IEEE Trans. J. 52. and isolated defects are handled by the SEC-DED codes. “New linear SEC-DED codes with reduced triple bit error miscorrection probability. 1988. Y. 20th IEEE Int. no.-K. [9] G. 3.” in Proc. pp. Hamming. [7] S. Matsumoto. 2003. [5] M. 352–360. Shen. [14] A. Baeg. 337–344. J. However. 2. Bhavsar. 1. 2005. F. vol. Aug. Omana. 1950. pp. Tosaka. [18] M. 25th IEEE VLSI Test Symp. 
for small defect rates and low failure probabilities during device operation.” IEEE Trans.” IEEE Trans. Aug. 2009. [13] S. “Error detecting and error correcting codes. [16] G. 147–160. [10] A. Workshop Des. [8] D. The miscorrection probabilities for different SEC-DED codes are illustrated in Table I. Solid-State Circuits. the use of SEC-DED to deal with both soft errors and isolated stuck-at defects has been studied. Integr. Patel. it will be assumed that as soon as a failure occurs it is detected (if detectable). Marinucci. (IOLTS’08). H. MARCH 2013 PERCENTAGE OF TABLE I 3 ERRORS MISCORRECTION TECHNIQUES FOR DIFFERENT CODING The proposed technique can also be combined with traditional 2-D repair approaches such that row and column failures and defects on multiple bits are repaired. Jan. [20] R. 12–17. “Geometric effect of multiple-bit soft errors induced by cosmic ray neutrons on DRAM’s. [15] S. pp.. pp. pp. (DFT 2005). Wu. Goodman and M. Horiguchi. Jun. M. OWC: Odd–Weight Column Codes. Res. Test Conf. 349–354. Jul. while those would occur only with three soft errors on a word for a defect free memory. [4] W. S. 2000. 37–42. no. 395–401. Circuits Syst. “The reliability of singleerror protected computer memories. and M. “Design of a fault tolerant solid state mass memory.” Bell Syst. Then a technique has been proposed that can deal with both types of errors effectively by applying a modified error correction process. 2111–2118. Test Conf. 1. May 2007. Satoh. Apr. “A flexible redundancy technique for high-density DRAMs. 6. Oct.” in Proc. 2004. Int. pp. Jan... 4. [6] S. Y. (DFT 2005). Tsai. Comput. The analysis shows that the mean time to failure for the proposed technique would be the same as that of a defect free memory that incorporates SEC-DED. no. and can be minimized by appropriately selecting the code at the cost of a more complex encoder-decoder. and S. Oberlaender. and J. vol.” in Proc. Su. Hsu. Leandri. 
not detected would appear with two soft errors in a word that contains a defect. the probability that an undetectable failure occurs during memory operation can be negligible. vol. pp. Lu and S. [17] C.R.-L. Metra.” in Proc. update [2] D. showing that linear SEC-DED codes provide the lowest values. [21] M.-C. F. Rossi. S. Very Large Scale Integr. Lombardi.itrs.” IEEE Trans. pp. Wong. no. and C. Dutta and N. 26. For small values of the defect rate. [19] M. In those cases.” IEEE Trans... McEliece. vol. 20th IEEE Int. 14. Jan. Records of the 2004 Int. vol..-C. Wender. P. The rest of the entries have been simulated. pp. 56. N.” IEEE Trans. Techn. pp. 1990. the probability of undetected failures will be negligible. K. 311–318. 1999.

. managing projects as a Project Management Professional. From 2004 to 2007.: USING SEC CODES TO PROTECT AGAINST ISOLATED DEFECTS AND SOFT ERRORS 243 Costas Argyrides (S’07–M’10) received the B. Madrid. signal processing. in 1994.D. His research interests include fault-tolerant systems. the Universidad Nacional de Educación a Distancia (Open University). working on router implementation. He is the author of numerous technical publications. Pradhan presented at the 22nd Symposium on Integrated Circuits and Systems Design SBCCI 2009. he joined Massana to work on the development of 1000 BaseT transceivers. During 2003. Madrid. he served as a Research Assistant at the Universities of Bristol. and Cambridge. Madrid. in 1994. working on the development of Ethernet transceivers. fault tolerance. From 1997 to 2000. K. Madrid. His research interests include fault-tolerant computer systems. where he currently manages the Computer Architecture and Technology Group. and D. algorithmic based fault tolerance.Sc. Dr. Aside from this work. and Ph. degrees (Hons) in telecommunications engineering from the Technical University of Madrid. Madrid.ARGYRIDES et al.Sc. performance evaluation of communication networks. he was a Distinguished Member of Technical Staff with the LSI Corporation. error correcting codes. degree in informatics and computer science from Moscow Power Engineering Institute-Technical University (MPEI-TU)— Moscow. Spain. His current activities are oriented to the space field. degree in advanced computing. Pedro Reviriego (A’03–M’04) received the M.3 standardization for 10 GBaseT. In 2000. as well as collaborations with the European Space Agency. Argyrides received a Best Paper Award for his paper “Reliability Aware Yield Improvement Technique for Nanotechnology Based Circuits” with C. Spain. degree in computer science from the University of Bristol.D. he was a Visiting Professor with the University Carlos III.Sc. 
degree in computer science from Universidad Complutense de Madrid. He is currently with the Universidad Antonio de Nebrija. with several projects on reliability and radiation protection. and reliability. he was an R&D Engineer with Teldat. and 1999. Lisboa. Madrid. He is currently a Validation Engineer at Intel Corporation. Prior to this.D. Madrid. both in journals and international conferences. and nanotechnology-based designs.K. His areas of interest include high-level synthesis and cosynthesis. and the design of physical layer communication devices. respectively. and 1997. Carro. Juan Antonio Maestro (M’07) received the M. L. Spain. respectively. and the Universidad Antonio de Nebrija. Leganés.Sc. He received the M. Bristol. he has worked for several multinational companies. with distinction. reliability improvement. He is the author of numerous papers in international conference proceedings and journals. Warwick. and real-time systems. He has served both as a Lecturer and a Researcher at several universities. He has also participated in the IEEE 802. and the Ph. Madrid. Oxford Brookes. Russia. degree in physics and the Ph. and organizing support departments. software fault tolerance. U. such as the Universidad Complutense de Madrid. He is the author or coauthor of more than 40 technical papers. Saint Louis University.