Professional Documents
Culture Documents
Published in:
Design, Automation, and Test in Europe Conference
Publication date:
2008
Document Version
Publisher's PDF, also known as Version of record
Citation (APA):
Eles, P., Izosimov, V., Pop, P., & Peng, Z. (2008). Synthesis of Fault-Tolerant Embedded Systems. In Design,
Automation, and Test in Europe Conference (pp. 1117-1122). DOI: 10.1109/DATE.2008.4484825
General rights
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners
and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.
• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain
• You may freely distribute the URL identifying the publication in the public portal
If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately
and investigate your claim.
Synthesis of Fault-Tolerant Embedded Systems
Petru Eles1, Viacheslav Izosimov1, Paul Pop2, Zebo Peng1
1 2
{petel|viaiz|zebpe}@ida.liu.se Paul.Pop@imm.dtu.dk
Dept. of Computer and Information Science Dept. of Informatics and Mathematical Modelling
Linköping University Technical University of Denmark
SE–581 83 Linköping, Sweden DK-2800 Kongens Lyngby, Denmark
1. The number of faults k can be larger than the number of processors in the system. Figure 2. Active Replication (b) and Primary-Backup (c)
P1(1) and P1(2) are executed in parallel, which, in this case, im- WCET
proves system performance. However, active replication occu- P1
N1 N2
pies more resources compared to primary-backup because P1(1)
and P1(2) have to run even if there is no fault, as shown in
N1 N2 P1 20 30
P2 P3
P2 40 60
Fig. 2b1. In the case of primary-backup (Fig. 2c), the “backup”
P3 60 X
replica P1(2) is activated only if a fault occurs in P1(1). However,
P4 40 60
if faults occur, primary-backup takes more time to complete, P4 P5 P5 40 60
compared to active replication, as shown in Fig. 2c2 and Fig. 2b2.
a) b) c)
In our work, we are interested in active replication. This type
of replication provides the possibility of spatial redundancy, Figure 3. A Simple Application and a Hardware Architecture
which is lacking in rollback recovery. Moreover, rollback recov- Processes are non-preemptable. They send their output values
ery with a single checkpoint is, in fact, a restricted case of pri- encapsulated in messages, when completed. All required inputs
mary-backup where replicas are only allowed to execute on the have to arrive before activation of the process. Fig. 3a shows an
same computation node with the original process. application represented as a graph composed of five nodes.
Time constraints are imposed with a global hard deadline D,
3.3 Transparency at which the application A has to complete. Some processes may
Tolerating transient faults leads to many alternative execution also have local deadlines dlocal.
scenarios, which are dynamically adjusted in the case of fault The mapping of an application process is determined by a
occurrences. The number of execution scenarios grows expo- function M: V →N, where N is the set of nodes in the architec-
nentially with the number of processes and the number of toler- ture. The mapping will be determined as part of the design opti-
ated transient faults. In order to debug, test, or verify the system, mization. For a process Pi, M(Pi) is the node to which Pi is
all its execution scenarios have to be taken into account. There- assigned for execution. Each process can potentially be mapped
fore, debugging, verification and testing become very difficult. on several nodes. Let NP ⊆ N be the set of nodes to which Pi
i
A possible solution against this problem is transparency. can potentially be mapped. We consider that for each Nk ∈ NP ,
i
Originally, Kandasamy et al. [19] propose transparent re-exe- we know the worst-case execution time (WCET) C PNik of process
cution, where recovering from a transient fault on one computa- Pi, when executed on Nk.
tion node is hidden from other nodes. In our work we apply a Fig. 3c shows the worst-case execution times of processes of
more flexible notion of transparency by allowing the designer to the application depicted in Fig. 3a when executed on the archi-
declare arbitrary processes and messages as frozen (see tecture in Fig. 3b. For example, P2 has the worst-case execution
Section 4). Transparency has the advantage of fault containment time of 40 ms if mapped on computation node N1 and 60 ms if
and increased debugability. Since the occurrence of faults in cer- mapped on node N2. By “X” we show mapping restrictions. For
tain process does not affect the execution of other processes, the example, process P3 cannot be mapped on computation node N2.
total number of execution scenarios is reduced. Therefore, less In the case of processes mapped on the same node, message
number of execution alternatives have to be considered during transmission time between them is accounted for in the worst-
debugging, testing, and verification. However, transparency can case execution time of the sending process. If processes are
increase the worst-case delay of processes, reducing perfor- mapped on different nodes, then messages between them are
mance of the embedded system [14, 16]. sent through the communication network. We consider that the
worst-case size of messages is given, which, implicitly, can be
4. Application Model translated into a worst-case transmission time on the bus.
We consider a set of real-time periodic applications Ak. Each ap- The combination of fault-tolerance policies to be applied to
plication Ak is represented as an acyclic directed graph each process (Fig. 4) is given by four functions:
Gk(Vk,Ek). Each process graph is executed with the period Tk. • P: V →{Replication,Checkpointing,Replication&Checkpointing}
The graphs are merged into a single graph with a period T ob- determines whether a process is replicated, checkpointed, or
tained as the least common multiple (LCM) of all application pe- replicated and checkpointed.
riods Tk. This graph corresponds to a virtual application A, • The function Q: V → Ν indicates the number of replicas for
captured as a directed, acyclic graph G(V, E). Each node Pi ∈V each process. For a certain process Pi, and considering k the
represents a process and each edge eij ∈ E from Pi to Pj indi- maximum number of faults, if P(Pi) = Replication, then Q(Pi)
cates that the output of Pi is the input of Pj. = k; if P(Pi) = Checkpointing, then Q(Pi) = 0; if P(Pi) =
kk == 22
N1 P1(1)
N1 P1(1) χ11==55 ms
ms
N3 P1(3) µ11 == 55 ms
ms
m iS are also connected through edges to regular and conditional 5.2 Schedule Table
processes and other synchronization nodes. The output produced by the FT-CPG scheduling algorithm is a
Edges e ijmn ∈ EC are conditional edges and have an associated schedule table that contains all the information needed for a distrib-
condition value. The condition value produced is “true” (denot- uted run time scheduler to take decisions on activation of pro-
ed with F Pim ) if P im experiences a fault, and “false” (denoted cesses. It is considered that, during execution, a non-preemptive
with F Pim ) if P im does not experience a fault. Alternative paths scheduler located in each node decides on process and commu-
starting from such a process, which correspond to complemen- nication activation depending on the actual values of conditions.
tary values of the condition, are disjoint1. Regular and condition- Only one part of the table has to be stored in each node, name-
al processes are activated when all their inputs have arrived. A ly, the part concerning decisions that are taken by the corre-
sponding scheduler. Fig. 6 presents the schedules for the nodes
1. They can only meet in a synchronization node.
N1 true F 1 F 1 F 1 ∧F 2 FP1 ∧ FP 2 FP1 ∧ FP 2 ∧ FP 4 FP1 ∧ FP2 ∧ FP4 FP 1 ∧ FP 1 F 1 ∧F 1 ∧F 2 F 1 ∧F 1 ∧F 2 F 1 ∧F 1
P1 P1 P1 P1 1 1 1 1 2 1 1 2 1 2
P1 P2 P2 P1 P2 P2 P1 P2
N2 true F 1 F 1 F 1 ∧ F 2 F 1 ∧ F 2 F 1 ∧ F 2 ∧ F 4 F 1 ∧ F 2 ∧ F 4 F 1 ∧ F 1 F 1 ∧ F 1 ∧ F 2 F 1 ∧F 1 ∧F 2 F 1 ∧ F 1 F1 F 1 ∧F 2
P1 P1 P 1 P P 1
P P 1 P P 1 P 1 P P
1 P P P
4 P 1 P P P1 P P P 4 1 4 1 4 4 1 4 4 1 4
P3 P3 P3
P3 136( P38 ) 136( P 31 ) 136( P 31 ) 136( P 31 ) 136( P 31 ) 136( P 31 ) 161( P32 ) 186( P33 )
P4 36( P41 ) 105( P46 ) 71( P44 ) 106( P45 ) 71( P42 ) 106( P43 )
70
the functionality implemented by the process, the required level
60
of reliability, hardness of the constraints, legacy constraints, etc.
Many processes, however, do not exhibit particular features or 50