You are on page 1of 5

A Cost-Effective Self-Healing Approach for

Reliable Hardware Systems


Kasem Khalil, Omar Eldash, Magdy Bayoumi
The Center for Advanced Computer Studies School of Computing and Informatics
University of Louisiana at Lafayette, Louisiana, USA
Emails: kmk8148,oke1206,mab0778@louisiana.edu

Abstract—In this paper, self-healing concept for hardware sys- healing capabilities in mind [10]. DMR and TMR are two
tems is investigated and a new approach is proposed. Hardware types of redundancy techniques. In DMR technique, there are
systems have been proposing imitations to biological organisms two identical instances running in parallel for the same appli-
in the way they offer healing and recovery abilities. Digital
systems with inspired homogeneous architecture have improved cation and their outputs are connected to a comparator that
capabilities to compensate for any faults. Self-healing is defined triggers a mismatch whenever a fault occurs. TMR has three
by the ability of a system to detect faults or failures and fix them. identical modules running in parallel to provide fault masking
One of the main problems in current self-healing approaches is by a majority voter at their outputs to mask a signal fault. Em-
area overhead and scalability for complex structures considering bryonic hardware is a hardware system with self-healing based
they are based on redundancy and spare blocks. This paper pro-
poses a different approach for self-healing based on embryonic on replication of small building blocks with exact architecture
structures without a need for spare cells. The area overhead is and it is inspired by the multicellular organisms cell division
lower compared to other approaches relying on spare cells. The and differentiation mechanisms. The drawbacks of current
proposed approach relies on time multiplexing two functions in techniques are area and power consumption where they require
one cell within one clock cycle. The reliability of the proposed redundant hardware which is an extra overhead. EmHW is the
technique is studied and compared to conventional system with
different failure rates. This approach is capable of healing up same principle of self healing mechanism of multi-cellular
to 50% of the cells where each cell can cover another neighbor organisms for fault tolerance hardware system. Similar to
failed cell at most. The area overhead is 9% for the proposed multi-cellular organisms in nature the structure of circuits
approach which is much lower compared to other approaches based on EmHW is two-dimensional array of electronic cells
using spare cell. The proposed approach is applied to investigate (e-cell) with hardware reconfiguration capability. E-cells have
two case studies; ALU array, and neural network.
the abilities of adaptive self-healing (recovering) based on its
Keywords: Self-Healing, Embryonic hardware, Fault toler- structure. The structure is homogeneous where every cell can
ant, Evolvable hardware. perform all set of functions and can be configured to operate
one of them as shown in Fig. 1. Every cell is configured
I. I NTRODUCTION to performs only a specific function that is decided by the
configuration information in each cell. When a cell fails the
Nowadays, the electronic circuit industry is increasing in responsible module of self-diagnosis triggers self-healing and
complexity very rapidly with new devices and architectures the faulty cell will be replace by the spare cell. Fig. 1 shows
of hardware systems. Hardware failures happening -while general cellular structure, it is composed of six modules;
running some real time tasks- are common during operation for I/O module, address module, configuration module, control
heating, aging or surrounding environment parameters. Such module, function module, and detection module. The operation
failures will be a hindrance to the system in order to work of each one is as follows; I/O module is used for transmission
properly. Self-healing is a key approach for reliable electronic signals between different cells. Address module determines
systems that run under harsh environments, such as cosmic gene information from its coordinate of location information
radiation and cruel temperature in space applications. Self- and determines the location of the cells. Configuration module
healing is done via multiple approaches and many of them are stores the configuration information of all the cells which
based on redundancy which relies on recovering faulty cells mimics DNA in biological cells. it also provides information
and the reintegration of them back into the system. Self-repair for realization of self healing. Control module controls all
is the replacement of faulty cells by functioning spare cells. operations of the cell such as fault case, transparent case, and
Self-healing term can be used for both concepts. The role of idle case. Function module is the processing unit or the cor-
self healing is to recover the fault in the system to maintain the responding function which is determined by the configuration
operation with highest performance and longest time possible. information. Detection module detects cell working status, if
Some of the self healing research in the circuit level is based it is in a normal working condition or not.
on the circuit replication [1], [2]. Where, the faulty cell is
replaced by the spare one after the detection of a faulty cell Self-healing is not only crucial for reliability but it’s also
by the controlling part. Self-healing can be achieved in systems one of the main aspects for an intelligent system [10]. Other
over different levels. The first is application level healing [3], self-healing approaches inspired from hardware intelligence
which is the ability of an application, or a service, to heal paradigms include evolvable hardware, where a system can
itself. Next comes the System level healing [4], [5] which be evolved from an initial operating point with errors or low
depends on a programming language and design patterns that fitness until it finds a way around faulty cells through rearrang-
we apply internally. Finally, the level of hardware healing from ing cells with genetic programming. Another mechanism can
architecture and down to circuit techniques is the lowest level be considered where an online synthesis process takes place
of healing [6]. allowing the system to find a new synthesized architecture
There are multiple methods for self-healing such as Dual whenever a fault is detected. Such systems rely on a similar
Modular Redundancy (DMR) [7], Triple Modular Redundancy uniform organization like FPGAs or Genetic Programming
(TMR) [8], Embryonic Hardware EmHW [6], [9], as well units [11].
as other techniques like evolvable hardware and Intelligent The rest of the paper is organized as follows.Section II in-
Hardware System (IHW) framework that are designed with troduces proposed self-healing techniques. Section III presents

978-1-5386-4881-0/18/$31.00 ©2018 IEEE


Fig. 2. Proposed cell structure.

Fig. 1. Architecture of Embryonic Hardware.

evaluation indexes on the proposed work. Implementation and


analysis results are presented in Sections IV followed by the
conclusion in Section V.
II. P ROPOSED S ELF -H EALING T ECHNIQUE
Most of self healing approaches are based on redundancy
by adding spare cells. When a failure happens, the faulty cells
are replaced by the spare cells after detection of faults. Along
with the mentioned area overhead and scalability, the mapping
of spare cells may cause delay overhead after self-healing
because the nearest spare cell is very far away from the faulty
one. For improved mapping and system’s performance, adding
more spare cells will help but on the other side the area will Fig. 3. Dual edge triggered circuits.
be higher. The area overhead can be reduced by adding less
spare cells. Less spare cells leads to major mapping challenges
and more complex healing system with increasing number of
faults over time and whole system failure or delays.
The proposed approach provides self-healing without adding
spare cells which overcome area overhead problems. The
proposed approach treats each cell as a spare cell for its
neighbor. Also, each cell can do its task and the task of the
neighbor cell. The proposed technique is based on time divi-
sion multiplexing for self-healing. Each cell will be capable
to run two tasks within the same clock when a fault happens.
One operation runs during the first half of the clock cycle
and the second during the other half of the clock. If the fault Fig. 4. Block diagram of the proposed technique.
detection module detects a fault then the neighbor of the faulty
one will run its task along with the task of the faulty cell. Fig. 4 shows a simple array to demonstrate the idea. Assume
The area overhead is very small compared to the previous there is a fault in cell 1, according to that the detection block
work. The mapping problem becomes similar to the case of it sends a signal for the neighbors to check there activation
having a spare cell between each two neighbor cells which is and select one on them. Now cell2 will compensate the faulty
a reasonable organization for mapping. cell1 and control signal C1 will be one. Cell2 has two inputs
The operation of the proposed technique can be explained applied to DET’s input, one comes from the normal input and
as follows; When a fault is detected in a cell, the neighbor the second which is supposed to go the input of cell1. The
cell will receive a control signal from controller to run this selection input of DET selects one input to enter the cell 2
faulty cells task. The active cell divides run time between its depending on the clock value. Cell2 has the operation of cell
task and faulty cells task using a dual edge triggered ”DET” 1 task such that when cell2 receives a control signal C1, the
cell. So, at the first half of the clock cycle with the positive cell2 selects the operation of cell 1 from configuration module
edge the active cell run its original task. In the second half of to be done. For the negative value of clock and whether C1
the clock cycle it runs the task of the faulty one. Fig. 2 shows equals 1 or 0 the DET block selects inp2 and runs a function
the cell structure of the proposed work. There are two inputs of cell 2. During the positive value of clock and C1 equals
applied to DET cell and selection of one of them depends on 1, the DET selects inp1 and runs a function of cell 1. Also,
the value of control signal (C) and clock (clk) value. When a the output of cell 2 during the process time of cell 1’s task is
fault happens, the detection block detects fault location and the fed to cell 3. The same will happen on any cell where typical
control block pulls the signal C up for this specific location. every cell has the same set of functions to be configured. So,
Thus, the DET cell drives normal input (I1) when clk value instead of adding spare cells and increase area overhead, the
rises and selects input of faulty cell (I2) when clk value drops self-healing is done on the system using the active cells itself.
as presented in Fig. 3. If there are multiple faults other active cell will compensate

978-1-5386-4881-0/18/$31.00 ©2018 IEEE


faults as shown in Fig. 5. Fig. 5 shows that there is fault in
cell 2 and cell 9 and after self-healing organization cell 6 will
run task of cell6 and task of cell2. Also, cell 13 will run task
of cell 13 and dead cell 9. The timing diagram is shown in
Fig. 6, where for normal operation out1 is A and out2 is B.
In case of faulty cell1, signal of fault will rise to one and cell
two runs the operation of two cells as shown in Fig. 6. Out1
is Z (floating) and out2 is the output of its task (B) and output
task of faulty cell (A).
Throughput of the system depends on the clock operation
of the system. For the system which relies on full cycle multi-
plexing where only one function operates per cycle, throughput
will decrease to the half. The proposed method however, keeps Fig. 5. Cellular architecture under fault case and self-healing using the
proposed technique.
the cell which runs two tasks (its original task and faulty cell’s
task) to work with twice the rate of the normal operation.
This cell will finish two tasks within one clock cycle. So, the
proposed approach works with the same throughput with self
healing. This type of self-healing belongs to the architecture
level of intelligent hardware stack [10] where the monitoring
happens on the level of e-cells and the controller decides to
replace a faulty cell with one of the available nearby cells
in a multiplexing fashion as explained. The reconfigurable
fabric here is the same as an EmHW based structure. This Fig. 6. Timing Diagram.
can be further extended to use Genetic Programming units
and evolving techniques for decision making to enhance the TABLE I
PROPOSED WORK
self-healing further.
Evaluation Indexes for Proposed Embryonic Self-Healing Proposed work
Device Utilization 156 out of 1268
III. E VALUATION I NDEXES FOR P ROPOSED E MBRYONIC Area overhead 9%
S ELF -H EALING
Mechanism EmHW
Evaluation index [12] for self-healing determines an evalu- Reliability High
ation of self healing strategy, analysis and comparison on self-
healing strategy. The evaluation indexes which are covered in Maximum number of repairing 50%
this paper are redundancy rate, maximum number of repair,
and self-healing time consumption.
A. Redundancy Rate
Redundancy rate is the ratio of the number of spare cells
and the number of active cells in embryonic array. Designer
of embryonic array needs to make sure that spare cells of the
full array of hardware resources is lesser. Also, make sure the
ability of self-healing of full array is higher. The proposed
approach provides each cell as spare cell to its neighbor. So,
the redundancy rate is 50% without adding spare cells.
B. Maximum Number of Repair
Maximum number of repair refers to the maximum number
of fault repair that can be repaired by self-healing strategy in Fig. 7. The Architecture of cells include ALU operations.
embryonic array to return system to normal working state. For
the proposed technique the maximum number of repair is half A. First Case Study: ALU Array
of the total number of cells in embryonic array. The modules are implemented using VHDL and the simu-
lation results are obtained using ISE Xilinx 14.4 on vertex 5.
C. Self-Healing Time Consumption Each cell includes ALU operation inside it and each one has
Self-healing time consumption is the time consumption a specific task that is determined by configuration block using
from failure of cell to the fault fixed to retrieve the system the cell number Fig.7. Each cell includes ALU block operation
back to normal work. So, the repair speed is important aspects and ROM to store the configuration of operation. In case of
in fault repairing stage. The proposed technique is based there is fault in any block, the neighbor one will compensate
on neighbor repairing which make the mapping very simple this fault using self-healing approach and configuration of each
doesnt take more delay compared to the redundancy technique block.
for self-healing.
B. Second Case Study: Neural Network
IV. I MPLEMENTATION A ND A NALYSIS R ESULTS An Artificial Neural Network is based on a collection
The proposed technique relies on neighbors for repairing of nodes (neurons) and each connection between nodes can
which make the mapping very simple and doesnt require transmit a signal from one to another as shown in Fig.8. Each
routing delay compared to other redundancy techniques for input is multiplied by a weight then send the result to the
self-healing. These features make the proposed approach an equivalent of a cell body. Using adder, the weighted signal s
attractive approach for self-healing. Table I declares the pro- are summed together to supply a node activation or activation
posed work attributes. function [13], [14]. The unit provides a high value output, if

978-1-5386-4881-0/18/$31.00 ©2018 IEEE


the activation function is higher than the threshold. The unit
provides a low value output, if the activation function is lower
than the threshold. Output of each neuron is calculated by:
n
X 
ai = f Wji ∗ Xj + bi (1)
j=0

Where Xj is the j th neuron output in the proceeding layer,


Fig. 8. Neural Network architecture.
and n is the number of neurons, Wji synaptic weight from jth
neuron to the ith neuron in the proceeding layer, f is activation
function, and b is bias. 1

Self-healing approach provides a healing to any faulty node 0.9


Failure rate of 0.05
in NN. If any node has fault, the neighbor node will run 0.8 Failure rate of 0.1
Failure rate of 0.2
two function one is its function and the second one is the 0.7 Failure rate of 0.3
Failure rate of 0.4
faulty node function. We implemented NN with input layer 0.6

Reliability
which has three nodes, three hidden layer with 4 nodes 0.5
in each layer, and output layer with one node. This NN 0.4
is used for classification between two different classes. NN
0.3
is implemented using VHDL and the simulation results are
0.2
obtained using ISE Xilinx 14.4 on vertex 5.
0.1

C. Reliability and MTTF 0


0 10 20 30 40 50 60 70 80
The reliability is one of the important indicators for the Time (hours x10E6)

system. The reliability is the ability of the system to perform Fig. 9. Reliability of conventional system.
its function within a certain period of time. In this paper the
reliability analysis is considered. As many analyses are done
1
in previous work on reliability [15], [16]. The probability of
success for system is 0.9

0.8

p(t) = exp(−λt) (2) 0.7


Failure rate of 0.05
0.6
Failure rate of 0.1
Reliability

Where all units are identical and p(t) is assumed to be an 0.5


Failure rate of 0.2
Failure rate of 0.3
exponential distribution failure and is the failure rate. Each 0.4
Failure rate of 0.4

cell can run two functions at same clock period for neighbor 0.3
fault cell case then the reliability of the system is given by
0.2

Xm  −λit   −λt m−i 0.1


i
R(t) = Cm exp 1 − exp (3) 0
2 2 0 10 20 30 40
Time (hours x10E6)
50 60 70 80
i=k

Where m is number of active unites for k function out from Fig. 10. Reliability of the proposed technique.
these unites. Fig. 9 and Fig. 10 show the reliability of the
conventional and proposed technique for different failure rate. 350

The proposed technique has more reliability compared to the 300


conventional one which makes it attractive for most applica-
tions. The assessment of system dependability is determined 250

by MTTF. MTTF is the whole time operation of a number 200


Conventional Method
Proposed Method
MTTF

component divided the total failures number. So, MTTF refers


150
to the average time before system fails. The MTTF is given
by: 100

Z ∞ 50

MTTF = R(t)dt (4) 0


5 10 15 20 25 30 35
0 Number of Cells

The MTTF of the traditional and proposed design with fail- Fig. 11. MTTF of the proposed and conventional technique.
ure rate 0.00315(times/year) is shown in Fig. 11 . Therefor, the
MTTF indicates proposed method has a better performance.
V. C ONCLUSION multiplexing the active cell will perform its task operation and
Digital systems performance reduces with time in case of the task of the faulty cell. So, it gives the flexibility to repair
faulty parts in the system. So, researchers work on self-healing 50% of cells without using spare cells. Also, the reliability of
system to save performance with its age. Many of approaches proposed technique is studied and compared to conventional
are done on self healing which is based on redundancy; the system in different failure rate. The proposed approach is
spare parts are used instead of faulty parts. The main drawback implemented for two cases; ALU array and NN. The proposed
of these approaches is area overhead. This paper presented method is implemented using VHDL and the simulation results
an approach of self healing without redundancy. The idea is are obtained using ISE Xilinx 14.4 vertex 5. For area overhead
based on instead using spare blocks, every block (cell) will the proposed technique has 9% overhead which is small as we
perform as spare cell for its number and using time division compared it with the previous work in the paper.

978-1-5386-4881-0/18/$31.00 ©2018 IEEE


R EFERENCES
[1] M. Furdek, M. Danko, P. Glavica, L. Wosinska, B. Mikac, N. Amaya,
G. Zervas, and D. Simeonidou, “Efficient optical amplification in self-
healing synthetic ROADMs,” in 2014 International Conference on
Optical Network Design and Modeling, May 2014, pp. 150–155.
[2] S. M. Bowers, K. Sengupta, K. Dasgupta, B. D. Parker, and A. Ha-
jimiri, “Integrated self-healing for mm-wave power amplifiers,” IEEE
Transactions on Microwave Theory and Techniques, vol. 61, no. 3, pp.
1301–1315, 2013.
[3] I. Ivan, C. Boja, and A. Zamfiroiu, “Self-healing for mobile applica-
tions,” Journal of Mobile, Embedded and Distributed Systems, vol. 4,
no. 2, pp. 96–106, 2012.
[4] K. Khalil, O. Eldash, and M. Bayoumi, “Self-healing router architecture
for reliable network-on-chips,” in IEEE International Conference on
Electronics, Circuits and Systems (ICECS). IEEE, 2017.
[5] S. Narasimhan, S. Paul, R. S. Chakraborty, F. Wolff, C. Papachristou,
D. J. Weyer, and S. Bhunia, “System level self-healing for parametric
yield and reliability improvement under power bound,” in Adaptive
Hardware and Systems (AHS), 2010 NASA/ESA Conference on. IEEE,
2010, pp. 52–58.
[6] M. R. Boesen, J. Madsen, and P. Pop, “Application-aware optimiza-
tion of redundant resources for the reconfigurable self-healing eDNA
hardware architecture,” in 2011 NASA/ESA Conference on Adaptive
Hardware and Systems (AHS), June 2011, pp. 66–73.
[7] Q.-Z. Zhou, X. Xie, J.-C. Nan, Y.-L. Xie, and S.-Y. Jiang, “Fault tol-
erant reconfigurable system with dual-module redundancy and dynamic
reconfiguration,” vol. 9, no. 2, 2011, pp. 167–173.
[8] T. Koal, M. Ulbricht, P. Engelke, and H. T. Vierhaus, “On the feasibility
of combining on-line-test and self repair for logic circuits,” in 2013 IEEE
16th International Symposium on Design and Diagnostics of Electronic
Circuits Systems (DDECS), April 2013, pp. 187–192.
[9] K. Khalil, O. Eldash, and M. Bayoumi, “A novel approach towards less
area overhead self-healing hardware systems,” in International Midwest
Symposium on Circuits and Systems (MWSCAS). IEEE, 2017.
[10] O. Eldash, K. Khalil, and M. Bayoumi, “On on-chip intelligence
paradigms,” in IEEE 30th Canadian Conference on Electrical and
Computer Engineering (CCECE), April 2017.
[11] L. Shao, L. Liu, and X. Li, “Feature learning for image classification
via multiobjective genetic programming,” IEEE Transactions on Neural
Networks and Learning Systems, vol. 25, no. 7, pp. 1359–1371, 2014.
[12] F. Meng, J. Cai, Y. Meng, S. Wu, and T. Wang, “Evaluation index system
for embryonic self-healing strategy,” in 2016 International Conference
on Integrated Circuits and Microsystems (ICICM), Nov 2016, pp. 86–90.
[13] Z. Zhang, “Artificial neural network,” in Multivariate Time Series
Analysis in Climate and Environmental Research. Springer, 2018, pp.
1–35.
[14] V. Vapnik and R. Izmailov, “Knowledge transfer in svm and neural
networks,” Annals of Mathematics and Artificial Intelligence, vol. 81,
no. 1-2, pp. 3–19, 2017.
[15] C. Wongyai and P. Nilagupta, “Improving reliability in cell-based evolve
hardware architecture using fault tolerance control,” in Control System,
Computing and Engineering (ICCSCE), 2014 IEEE International Con-
ference on. IEEE, 2014, pp. 190–195.
[16] Z. Zhang and Y. Wang, “Method to self-repairing reconfiguration
strategy selection of embryonic cellular array on reliability analysis,” in
2014 NASA/ESA Conference on Adaptive Hardware and Systems (AHS),
July 2014, pp. 225–232.

978-1-5386-4881-0/18/$31.00 ©2018 IEEE

You might also like