August 2019
© 2019 Anand Ganpat Bhat.
All rights reserved.
Dedication
This thesis is dedicated to my parents Ganpat and Mukta Bhat, who have been a constant
source of inspiration. To my advisor, colleagues, and friends, without whose guidance this
would not have been possible. Lastly, to my wife, Neha Hegde, for her constant love and
support.
Acknowledgements
This dissertation would not have been possible without the help and support of many
people. First and foremost, I would like to thank my advisor, Prof. Raj Rajkumar. I con-
sider myself extremely fortunate to have had an opportunity to work with Prof. Rajku-
mar. Working closely under his guidance and expertise has definitely made me a better
thinker, engineer and researcher. I am grateful for the opportunity to work on several
diverse and exciting projects to demonstrate my research, ranging from system-level
ones like building fault-tolerant system architectures for various platforms including
the CMU autonomous driving platform to application-level projects like SysAnalyzer to
analyze and deploy real fault-tolerant systems. Prof. Rajkumar also gave me several
opportunities to work with and mentor several other students, which led me to expand
my horizons and become an independent thinker.
I am grateful to the members of my thesis committee, Prof. Anthony Rowe, Prof.
Pei Zhang and Dr. Soheil Samii for their time, effort and inputs in completing this
dissertation.
I would like to thank Dr. Soheil Samii for his constant feedback with regard to
several aspects of my work and his guidance towards making my research practical
and industry relevant. It has been a great pleasure working closely with him. I would
also like to thank Tom Furhman and Dr. Massimo Ossella for their insights on various
aspects of my work.
A special thanks to General Motors (GM) for funding my research. I wish to thank the
members of the CMU’s autonomous driving team: Prof. John Dolan, Jongho Lee, Tianyu
Gu, Chiyu Dong, Adam Werries, and all other former members. Their passion and
efforts made me proud of being part of the team and contributing to our autonomous
car.
Most of my time during my doctoral studies was spent at the Real-Time and Mul-
timedia systems Lab (RTML). Thanks to all the members of RTML who shared their
time with me: Gaurav Bhatia, Hyoseung Kim, Junsung Kim, Reza Azimi, Alexei Colin,
Sandeep D’souza, Iijoo Baek, Shunsuke Aoki, Peter Jan, Mengwen He and Weijing Shi.
Also, I would like to thank Toni M. Fox, Chelsea Mendenhall, Brittany Frost and Brid-
gette Bernagozzi for their kind support on administrative matters.
Besides the RTML members, I am grateful to my friends: Rupesh Mehta, Fiona Britto,
Mihir Dattani, Naman Jain, Oliver Shih, Ashvin Swaminathan, Swati Rajendran and
Abhijeet Mishra. Without these people, I could not have fully enjoyed my time at CMU.
I would like to thank my parents for their constant love and guidance. Lastly, my thanks
go to my wife, Neha Hegde. She has been a great source of love and support through
these final stages of my PhD.
Abstract
With advances in sensing, machine learning, and computing systems, various semi-
autonomous and autonomous driving applications have become feasible. This has re-
sulted in a dramatic increase in the amount and complexity of computational resources
needed in vehicles. Tasks such as perceiving the environment using sensors like li-
dars, radars, and cameras, fusing data from these sensors to create a road-world model,
route planning, and modeling behaviors, are all computationally intensive and safety-
critical. Conventionally, system reliability in safety-critical applications including avi-
ation is achieved by replicating hardware and running multiple instances of the same
software on different pieces of hardware. Often, a voting mechanism is used to generate
the output, and measures are taken in the system design to ensure that hardware compo-
nents fail independently. However, this approach is extremely inefficient in terms of cost,
weight, space and power, especially for the automotive industry. High automation levels
impose more stringent fault-tolerance requirements in terms of the number of tasks that
need redundancies (standbys), as well as the number of failures that are required to be
tolerated for each task (i.e., the number of standbys for each task). Also, the operational
design domain (ODD) of the automated vehicle has a significant impact on the fault-
tolerance requirements. This motivates the need for adaptive cost-optimized software
fault-tolerance solutions to reduce overall resource utilization. This dissertation aims to
achieve this objective in the context of resource-constrained fault-tolerant autonomous
driving applications by considering a comprehensive set of system-level design considerations
together. First, we present a family of optimal and sub-optimal harmonic search
algorithms and heuristics for selecting task execution parameters. We then present a
framework to derive replication parameters for a given task set and a family of heuris-
tics to allocate tasks to computing nodes while optimizing CPU resource utilization. We
next design and implement our software architectures to support fault-tolerant execu-
tion on popular automotive platforms and demonstrate that our primitives are practical
through experimental evaluations. Finally, we also present the tools and methodologies
we use to test and verify the safe operation of an autonomous driving vehicle.
Contents
Dedication
Acknowledgements
Abstract
Contents
List of Tables
1 Introduction
1.1 Motivation
1.2 Thesis Statement
1.3 Scope and Approach of the Thesis
4 System Model
4.1 Computation Model
4.2 Task Model
4.3 Fault Model
4.4 Chapter Summary
9 Conclusions
9.1 Research Contributions
9.2 Future Work
A Glossary
Bibliography
List of Tables
List of Figures
3.1 Evaluation: Branch And Bound Harmonic Search Algorithm (Best Viewed in Color)
3.2 Rational Cost Function Example
3.3 Brute-Force Geometric Series Search Plot
3.4 Brute-Force Geometric Series Search tables (Best Viewed in Color)
3.5 Discrete Piecewise Function Example
3.6 Run-time Performance Evaluation: DPHS vs BBHS vs PRHS vs Sr (Best Viewed in Color)
3.7 FOE Evaluation: DPHS vs BBHS vs PRHS vs Sr (Best Viewed in Color)
3.8 Run-time Performance Evaluation: DPHS vs BBHS vs PRHS vs Sr (Best Viewed in Color)
3.9 MPE Evaluation: DPHS vs BBHS vs PRHS vs Sr (Best Viewed in Color)
Introduction
1.1 Motivation
With advances in sensing, machine learning, and computing systems, various semi-
autonomous and autonomous driving applications have become feasible. This has re-
sulted in a dramatic increase in the amount and complexity of computational resources
needed in vehicles. Tasks such as perceiving the environment using sensors like li-
dars, radars, and cameras, fusing data from these sensors to create a road-world model,
route planning, and modeling behaviors, are all computationally intensive and safety-
critical. Conventionally, system reliability in safety-critical applications including avi-
ation is achieved by replicating hardware and running multiple instances of the same
software on different pieces of hardware. Often, a voting mechanism is used to derive
the output, and measures are taken in the system design to ensure that hardware com-
ponents fail independently. However, this approach is extremely inefficient in terms of
cost, weight, and space for many applications, especially for the automotive industry,
where reliability requirements can be diverse. For example, five driving automation
levels (DALs) have been defined in the SAE J3016 standard to characterize the spectrum
of self-driving features. To put such systems in context with redundancy requirements,
consider a Level 2 system active on highways only. In such a system, although the driver
is not in direct control of the vehicle motion, the driver has a supervisory role: the driver
is expected to take over control in case of any subsystem or component failure. In such
systems, only a small subset of all software tasks need redundancy. Now, consider a
Level 4 system active in the same operational domain (highways). The system itself is
now responsible for bringing the vehicle to a safe stop in case of failures. High automation
levels may impose more stringent fault-tolerance requirements in terms of the number
of tasks that need redundancies (standbys), as well as the number of failures that are
required to be tolerated for each task (i.e., the number of standbys for each task). Also,
the operational design domain (ODD) of the automated vehicle has a significant impact
on the fault-tolerance requirements. For example, in urban-driving applications, certain
tasks, like pedestrian detection, are perhaps more safety-critical than in highway-driving
applications. Hence, the type and number of standbys required for every task also
depend heavily on the ODD. This motivates the need for adaptive cost-optimized software
fault-tolerance solutions to reduce overall resource utilization.
The selection of task periods is also driven by the safety and performance speci-
fications of a real-time application [4]. This choice naturally has a direct impact
on system schedulability. For example, a feedback control application can perhaps
produce very accurate control if it runs at a very high frequency, i.e., if the period
assigned to the task running the feedback control application is small. However,
since smaller periods mean higher CPU utilization, system schedulability is re-
duced [5].
There are several advantages to these periods being harmonic, i.e., to every
period in the task set being an integer multiple of every shorter period. It has been
shown that the exact schedulability analysis for the RMS policy is an NP-Complete
problem [6], unless the periods are harmonic [7]. Also, polynomial-time solutions
exist for the response-time analysis of systems with harmonic task sets [8]. Hav-
ing harmonic periods allows for phase optimizations that reduce communication
latencies [9] and also enables energy-saving optimizations [10]. Phase optimizations
can improve recovery latencies [11] and allow for optimal checkpointing for
recovery [12]. Harmonic task sets also play an important role in reducing the com-
plexity in the design of distributed time-triggered embedded systems [13]. Given
all these distinct advantages, in practice, harmonic task sets are widely chosen in
real-time safety-critical systems like automobiles and avionics [14–17]. Given these
wide-ranging implications of the choices, task periods must be carefully selected
to meet all the safety and application requirements while ensuring that the advan-
tages mentioned above are also gained.
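The definition of a harmonic task set given above can be checked mechanically. The following is a minimal sketch, assuming positive integer periods; the function name `is_harmonic` is hypothetical and is introduced here only for illustration.

```python
from typing import Sequence


def is_harmonic(periods: Sequence[int]) -> bool:
    """Return True if every period is an integer multiple of every shorter period.

    Assumes positive integer periods (an assumption of this sketch).
    """
    ordered = sorted(periods)
    # Divisibility is transitive, so checking adjacent pairs covers all pairs.
    return all(longer % shorter == 0
               for shorter, longer in zip(ordered, ordered[1:]))


print(is_harmonic([10, 30, 180]))  # 30 = 3 * 10 and 180 = 6 * 30
print(is_harmonic([20, 45, 136]))  # 45 is not a multiple of 20
```

Sorting first and comparing only neighbouring periods keeps the check linear after the sort, which is sufficient because divisibility is transitive.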
These recovery time requirements in turn influence the type of replication strategy
chosen.
The type of run-time framework also heavily depends on the underlying software
architecture in use. Many automotive applications use AUTOSAR (AUTomotive
Open System ARchitecture [22]), an open and standardized automotive software
architecture jointly developed by automobile manufacturers, suppliers, and tool
vendors. The AUTOSAR Classic Platform (CP) [22] standard, which is a widely-
accepted standardized software architecture for automotive electronic control units
(ECUs), addresses the needs of deeply-embedded low-complexity devices. The
AUTOSAR Adaptive Platform provides mainly high-performance computing and
communication mechanisms. It also offers flexible software configuration, e.g. to
support software updates over the air. Features specifically defined for the Classic
Platform, such as access to electrical signals and automotive-specific bus systems,
can be integrated into the Adaptive Platform as well. Compared to the AUTOSAR
Classic platform, the Adaptive platform supports Service-Oriented Architecture
(SOA), which allows the dynamic linking of services and clients during runtime,
making it flexible for application developers. The Adaptive Platform supports
manycore processors and heterogeneous computing platforms that offer parallel
processing, as well as fast and high-bandwidth communication technologies such
as Ethernet. The Adaptive Platform also supports several safety and security
features like priority-based scheduling, execution of authenticated code and controlled
allocation of memory and CPU resources. It is therefore important to design
and implement support for fault-tolerance primitives on these platforms.
not static, and vary significantly based on the ODD. It is therefore important for
the system infrastructure to enable dynamic reconfiguration of system resources.
Thus, there is a need for a system-level framework to support these changes in
operational mode and manage their impact on the fault-tolerance requirements.
This chapter presents the background and related work on the following
topics: (a) Selection of task execution parameters, (b) Selection of replica-
tion parameters, (c) Fault-tolerant assignment of tasks to computing nodes,
(d) Software architecture to support and maintain fault-tolerance guaran-
tees and (e) Testing of self-driving safety-critical automotive systems. Each
section of this chapter reviews and presents relevant systems techniques
and related prior work.
There has been significant work in the field of task period selection in
hard real-time systems for optimizing various use cases. For example, in
[25], Henriksson et al. attempt to find an optimal period assignment to
distribute computing resources between tasks. In [26], Cervin et al. try to
optimize the system performance of a control system. Neither, however,
focuses on creating harmonic task sets.
In [7], Han et al. have presented the Sr and the DCT algorithms which
attempt to create a harmonic task set to verify the schedulability of a given
input non-harmonic task set. Since there exist linear-time solutions to check
the schedulability of systems with harmonic task sets, both algorithms at-
tempt to find a harmonic task set such that every period assigned is less
than or equal to the original period and the utilization of every task is less
than 1. The schedulability of the original task set is conservatively inferred
from the schedulability of the harmonic task set. However, the Sr algo-
rithm assigns artificial periods which are always multiples of 2, whereas
the DCT algorithm creates a harmonic series using each element in the
original task set. Unlike our work presented in this dissertation, neither
algorithm attempts to find the optimal assignment. Our solution in this
dissertation can be applied readily to solve the above problem. Similarly,
in [27], harmonic deadline assignments were created using the least common
multiple (LCM) of the original task periods to derive a utilization-based
feasibility test for systems with composite deadlines, but no attempt is made
to find an optimal assignment.
In [28] and [29], Nasri et al. address a problem similar to the one ad-
dressed in this dissertation, but accept a range of period values for each
task as input and target feasible real-number solutions allowing for a user
to choose harmonic sets with high or low utilization by controlling the uti-
lization bounds for a given harmonic range. In contrast, this dissertation
looks to find an optimal integer solution, where chosen values must be be-
low specified thresholds. The solution in this dissertation also allows the
user to select from a variety of cost metrics to optimize. In [30], Mohaqeqi
et al. attempt to minimize the total weighted sum of the periods where the
periods are not restricted.
CHAPTER 2. BACKGROUND AND RELATED WORK 10
1. Active Replica: In active replication, all redundant copies are identical and treated
uniformly. Each replica performs all operations, like accepting and processing
application inputs, performing state calculations, performing application calcula-
tions and producing output. This implies that, under normal operation, the system
needs to support duplicate suppression to filter out duplicate outputs.
to each task, (b) the type of replica assignment (Active, Cold, Hot), and (c) the
allocation scheme determining replica-to-processor assignment.
The problem of supporting fault tolerance at the level of task scheduling
has been widely studied in the literature. A number of real-time task
allocation algorithms that tackle this problem in distributed real-time
systems have been described [34–36]. In [37], Oh et
al. present an online allocation heuristic to assign replicas to a minimum
number of processors such that all replicas guarantee that task deadlines
are met. They also derive the bound on the number of processors required
to feasibly schedule a task set using their heuristic. These approaches as-
sume the replica type and number of replicas as input parameters. For
example, [34] focuses only on active replication, where the redundant soft-
ware executes regardless of failure modes. In this dissertation, we present
a framework to analyze recovery time for a given task to determine the
replica type assignment.
Nodes
Fault-Tolerance Guarantees
Automotive Systems
Testing plays a very important role in the safety and verification of safety-critical
systems, especially autonomous driving systems. Autonomous driving
systems have a long history dating back to the NAVLAB project at Carnegie
Mellon, where a series of experimental platforms and tools were built [55].
Such tools include simulation/emulation capabilities, system-level
testing capabilities, fault injection and system verification capabilities.
In the 2000s, the Defense Advanced Research Projects Agency (DARPA)
held three competitions for autonomous driving vehicles, and a variety of
technologies were developed and demonstrated. In particular, the third
competition, the DARPA Urban Challenge in 2007, was designed to foster
innovation in autonomous vehicle operation in urban settings. In the Ur-
ban Challenge, the competitors developed full-scale autonomous driving
vehicles to navigate through a mock city environment, including merging
into moving traffic, navigating traffic circles, traversing busy intersections
and avoiding obstacles. The developed vehicle systems and software are
well-explained by each ranked competitor [56–66]. Each of these competi-
tors developed suites of tools and methodologies to test and verify their
capabilities. For example, in [58] the authors describe their Joint Archi-
tecture for Unmanned Systems (JAUS) for inter-process communications
which allowed for configuring software applications in a modular, recon-
figurable and reusable fashion. In our testing framework we adopt a pub-
lish/subscribe based communication infrastructure [56] to provide similar
capabilities. They also developed a data logging and playback system in-
tegrated at the communication level with the JAUS infrastructure which
CHAPTER 3. SELECTION OF TASK EXECUTION PARAMETERS 18
Theorem 1. An unschedulable task set cannot be made schedulable by transforming task periods
to values less than or equal to the original periods under the rate-monotonic scheduling policy with
implicit deadlines.
Proof. For a task set $\tau = \{\tau_1, \tau_2, \ldots, \tau_n\}$ with tasks arranged in order of non-decreasing period, i.e.,
$T_i \le T_{i+1}$ for all $i$ from 1 to $n-1$, the schedulability under rate-monotonic scheduling can be
checked using the iterative response-time test
$$a^i_{k+1} = C_i + \sum_{j=1}^{i-1} \lceil a^i_k / T_j \rceil C_j \qquad (3.1)$$
where $a^i_k$ represents an estimate of the response time for task $\tau_i$ [73].
This implies that, for an unschedulable task set, at least one task remains unschedulable,
i.e., its response time is greater than its deadline. Hence, if $\tau_m$ ($1 \le m \le n$) is
unschedulable, we have,
$$a^m_{k+1} = C_m + \sum_{j=1}^{m-1} \lceil a^m_k / T_j \rceil C_j > T_m \qquad (3.2)$$
From the above equation, we see that the response time estimate depends only on the
periods of tasks with $j < m$, i.e., the tasks with smaller periods and hence higher priority
under RMS. Since $T_j$ appears in the denominator, transforming period values from $T_j$ to
$T_j'$ such that $T_j' \le T_j$ can only result in equal or larger response times. Because the
deadline of $\tau_m$ does not increase either ($T_m' \le T_m$), $\tau_m$ remains unschedulable.
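The argument above can be illustrated numerically with the recurrence in Equation (3.2). The sketch below is an assumption-laden illustration rather than code from this dissertation: the function name `response_time`, the initial estimate, and the example task sets are all hypothetical.

```python
from typing import Optional, Sequence


def response_time(C: Sequence[int], T: Sequence[int], m: int) -> Optional[int]:
    """Fixed-point iteration a_{k+1} = C_m + sum_{j<m} ceil(a_k / T_j) * C_j.

    Tasks are indexed from 0 and sorted by non-decreasing period, so tasks
    0..m-1 have higher rate-monotonic priority than task m. Returns the
    response time of task m, or None if the estimate exceeds the implicit
    deadline T[m].
    """
    a = sum(C[: m + 1])  # initial estimate: one job of each relevant task
    while True:
        # -(-a // t) is integer ceiling division.
        nxt = C[m] + sum(-(-a // T[j]) * C[j] for j in range(m))
        if nxt > T[m]:
            return None
        if nxt == a:
            return a
        a = nxt


C = (2, 4, 6)
print(response_time(C, (5, 10, 20), 2))  # lowest-priority task misses its deadline
print(response_time(C, (4, 8, 16), 2))   # shorter periods do not rescue it
```

In this example the lowest-priority task is unschedulable with the original periods and stays unschedulable after every period is reduced, which is the behaviour Theorem 1 describes.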
In this section, we describe our system model and present our problem
statement.
Given a task set $\tau$, generate an optimal harmonic task set $\tau'$ with integer
periods (i.e., $T_i' \in \mathbb{N}_{>0}$), if one exists, such that,
1. The period of every task in $\tau'$ is an integer less than or equal to its corresponding
period in $\tau$, i.e., $T_i' \le T_i$.
2. The worst-case execution time of every task $\tau_i$ remains the same, i.e., $C_i' = C_i$,
and the task set after harmonization continues to remain schedulable, i.e., $C_i \le T_i'$.
That is, every period $T_i'$ in the harmonic period set can be represented as
$$T_i' = T_1' \cdot \prod_{j=1}^{i} r_j \qquad (3.8)$$
Example: Consider the integer harmonic set {10, 30, 180}. These numbers
correspond to $T_1' = 10$, $r = \{1, 3, 6\}$. Since every number in the harmonic
period set is less than or equal to the corresponding number in the
original task set, we have,
Lemma 1. Given an input period set $T$, the valid values for $T_1'$ of the output harmonic
period set range from 1 to $\lfloor T_1 \rfloor$.
Proof. Every element of the output harmonic period set is constrained to be an integer
less than or equal to the corresponding element in the input period set. Hence, the
largest value $T_1'$ can take is $\lfloor T_1 \rfloor$. Since $T_1'$ is a positive integer, its minimum value is 1.
Lemma 2. Given an input period set $T$, any period ratio $r_i$, $\forall i > 1$, for the output harmonic
period set ranges from 1 to $\lfloor T_i / T_{i-1}' \rfloor$.
Proof. We have, $\forall i$, $T_i' \le T_i$, and $T_i'$ is constrained to be a positive integer. Hence, $r_i$ must
also be a positive integer and, since all tasks are ordered by non-decreasing periods, the
maximum value $r_i$ can take is $\lfloor T_i / T_{i-1}' \rfloor$ and the minimum value of $r_i$ is 1.
the input period set. On completion, we pick the output harmonic period
set that optimizes the selected cost function. For the example in Table 3.1,
we select the First Order Error (FOE) cost function. Hence, for the input
period set $T = \{20, 45, 136, 415\}$, we select the output harmonic period set
$T' = \{15, 45, 135, 405\}$. It is important to note that the priority ordering of
the harmonized task set remains identical to the RM priorities of the original
task set, if ties are broken in favor of smaller original period values.
The algorithm also checks the feasibility of the solution, i.e., $C_i \le T_i'$, and
it returns the optimal solution if one exists; otherwise, it returns null.
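An exhaustive search of this kind can be sketched as follows, using the ranges from Lemmas 1 and 2 and the FOE cost function. This is an illustrative sketch, not the BFHS listing itself: the names `first_order_error` and `brute_force_harmonic_search` are hypothetical, and the execution times of 1 in the usage line are assumed values chosen only so that every candidate is feasible.

```python
from typing import Callable, Optional, Sequence, Tuple

Periods = Tuple[int, ...]


def first_order_error(original: Sequence[int], harmonic: Sequence[int]) -> int:
    """FOE: total amount by which the periods were shortened."""
    return sum(t - h for t, h in zip(original, harmonic))


def brute_force_harmonic_search(
    T: Sequence[int],
    C: Sequence[int],
    cost: Callable[[Sequence[int], Sequence[int]], float] = first_order_error,
) -> Optional[Periods]:
    """Enumerate every harmonic set with T'_i <= T_i; return the cheapest feasible one."""
    best: Optional[Periods] = None
    best_cost = float("inf")

    def extend(prefix: Periods) -> None:
        nonlocal best, best_cost
        i = len(prefix)
        if i == len(T):
            c = cost(T, prefix)
            if c < best_cost:
                best, best_cost = prefix, c
            return
        # Lemma 2: the ratio r_i ranges over 1 .. floor(T_i / T'_{i-1}).
        for r in range(1, T[i] // prefix[-1] + 1):
            period = prefix[-1] * r
            if C[i] <= period:  # feasibility check C_i <= T'_i
                extend(prefix + (period,))

    # Lemma 1: T'_1 ranges over 1 .. floor(T_1).
    for first in range(1, T[0] + 1):
        if C[0] <= first:
            extend((first,))
    return best


print(brute_force_harmonic_search((20, 45, 136, 415), (1, 1, 1, 1)))
# (15, 45, 135, 405), whose FOE is 5 + 0 + 1 + 10 = 16
```

The function returns `None` when no feasible harmonic set exists, mirroring the null result described above.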
The BFHS algorithm, being an exhaustive search, checks a very large num-
ber of output harmonic period sets in order to find the optimal solution.
The Branch-And-Bound Harmonic Search (BBHS) algorithm attempts to
reduce this search space by bounding the number of output harmonic period
sets that must be examined to find the optimal solution, using the following
properties.
• Error Bounding: Similar to the BFHS algorithm, the BBHS algorithm keeps track
of the minimum error seen up to the current point of execution. The BBHS algorithm
also calculates the error of the output harmonic period set with respect to the input
period set every time a new element gets added to the series. If, at any
step, this error exceeds the currently known minimum error value, the BBHS algorithm
stops exploring the branches along that path. This is illustrated in
Table 3.2. The optimal error when the algorithm is in the process of forming the
series $T' = \{14, 42, 126, -\}$ is 16, but, in this case, the error with just the first 3
elements is 19, which already exceeds the known minimum error of 16 up to this
point. This allows the BBHS algorithm to bound the search space efficiently. It is
also interesting to note that this bounding criterion gets stricter as new period sets
with lower errors are found.
Lemma 3. Given a harmonic period set $T'$ arranged in non-decreasing order, the values
of the subsequent elements, i.e., $T_j' \ \forall j > i$, for a given element $T_i'$, depend completely on
the current element $T_i'$ and not on any of the previous elements, i.e., $T_j' \ \forall j < i$.
Proof. From the mathematical representation of harmonic sets from Section 3.2.3,
we have,
$$T_i' = T_1' \cdot \prod_{j=1}^{i} r_j \qquad \text{(from Equation (3.7))}$$
$$T_{i+1}' = T_1' \cdot \prod_{j=1}^{i+1} r_j = T_i' \cdot r_{i+1}$$
and so on.
Hence, the value of the next element in the harmonic series only depends on the
value of the previous element.
From Lemma 3, we see that subsequent elements of the harmonized series only
depend on the current element under consideration, irrespective of the preceding
elements. Hence, when BBHS adds a new element, also referred to as a node, we
are guaranteed that the elements to follow will be independent of the elements
before the current one. Hence, the BBHS stores each visited node and the current
error associated with that node. If a node is revisited, the error is checked. If the
error is greater than the stored value, BBHS terminates the series currently under
consideration. This can be seen in Table 3.2. When BBHS visits the node 40 for the
first time, it is processing the series $T' = \{20, 40, -, -\}$. It stores the corresponding
error, which in this case is 5, along with the node value of 40. When BBHS revisits
the node 40 while processing the series $T' = \{10, 40, -, -\}$, the error (15) is
greater than 5, so the search with the prefix {10, 40} is pruned.
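The two pruning properties described above (error bounding and pruning of revisited nodes) can be sketched for the FOE cost function as follows. This is a hedged illustration and not the BBHS listing from this chapter: the function name, the choice to key nodes by (position, period value), and the largest-ratio-first visiting order are assumptions made for this sketch.

```python
from typing import Dict, Optional, Sequence, Tuple

Periods = Tuple[int, ...]


def branch_and_bound_harmonic_search(
    T: Sequence[int], C: Sequence[int]
) -> Tuple[Optional[Periods], Optional[int]]:
    """Depth-first search over harmonic sets, minimizing FOE = sum(T_i - T'_i)."""
    best: Optional[Periods] = None
    best_error: Optional[int] = None
    # (position, period value) -> smallest prefix error seen at that node
    node_error: Dict[Tuple[int, int], int] = {}

    def visit(prefix: Periods, error: int) -> None:
        nonlocal best, best_error
        i = len(prefix)
        # Error bounding: FOE only grows as elements are added.
        if best_error is not None and error > best_error:
            return
        # Node pruning (Lemma 3): the suffix depends only on the current element.
        key = (i, prefix[-1])
        if key in node_error and error > node_error[key]:
            return
        node_error[key] = min(error, node_error.get(key, error))
        if i == len(T):
            if best_error is None or error < best_error:
                best, best_error = prefix, error
            return
        # Assumed order: try larger ratios first so low-error sets appear early.
        for r in range(T[i] // prefix[-1], 0, -1):
            period = prefix[-1] * r
            if C[i] <= period:  # feasibility check C_i <= T'_i
                visit(prefix + (period,), error + T[i] - period)

    for first in range(T[0], 0, -1):
        if C[0] <= first:
            visit((first,), T[0] - first)
    return best, best_error


print(branch_and_bound_harmonic_search((20, 45, 136, 415), (1, 1, 1, 1)))
# ((15, 45, 135, 405), 16)
```

With this visiting order the prefix {20, 40} is reached before {10, 40}, so the second visit to node 40 arrives with a larger error and is cut off, as in the example above.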
Figure 3.1: Evaluation: Branch And Bound Harmonic Search Algorithm (Best Viewed in
Color)
The algorithms described in Section 3.3 find optimal solutions for the prob-
lem statement from Section 3.2.2.
Specifically, offline task set choices can greatly benefit from an optimal
harmonic assignment. However, run-time applications may benefit from
a sub-optimal solution that can be obtained very quickly. In this section,
we present two such algorithms that can produce a sub-optimal solution
to the problem statement in Section 3.2.2 in a fraction of the time taken by
admission control of task sets.
$$\zeta_i = m \cdot b^{x_i} \qquad (3.11)$$
where we refer to $m \in I$ as the multiplier, $b \in \mathbb{N}_{>0}$ as the base and $x_i \in \mathbb{N}_{\ge 0}$ as the
exponent. From the mathematical representation of harmonic sets from Section 3.2.3,
we have,
$$T_i' = T_1' \cdot \prod_{j=1}^{i} r_j$$
$$T_i' = T_1' \cdot r^{i-1}, \quad \text{if all period ratios are identical.} \qquad (3.12)$$
Equation (3.12) describes a geometric series; hence, an integer geometric series is always
harmonic.
riod set with respect to a given cost function. For example, consider the
input set {10, 31, 92, 183}. For the FOE cost function, the optimal solution
obtained from the BBHS algorithm is {10, 30, 90, 180}, with FOE = 6. DPHS
outputs the series {10, 20, 80, 160}, with FOE = 46, which is a sub-optimal
solution. However, in Section 3.4.3, we show that DPHS can calculate sub-optimal
solutions significantly quicker than BBHS.
$$\zeta_i = m \cdot b^{x_i} \qquad (3.13)$$
That is, every period $T_i'$ in the harmonic period set can be represented as
$$T_i' = m \cdot b^{x_i} \qquad (3.14)$$
Given that every $T_i'$ represents a task period, the multiplier is a positive
integer, i.e., $m \in \mathbb{N}_{>0}$.
Example: Consider the integer harmonic set {10, 20, 40}. These numbers
correspond to $m = 10$, $b = 2$, $x = \{0, 1, 2\}$. Since every number in the
harmonic period set is less than or equal to the corresponding number in
the original task set, we have,
Definition 1. A cost function for harmonization is said to be rational if its value
increases as the deviation of any period in the resultant harmonic task set from the
original task set increases.
Our goal is to find $m^*$ and $b^*$, which represent the values of the multiplier
and the base optimizing a given rational cost function, as defined
above. DPHS works for all rational cost functions. The commonly used
error cost functions presented in Section 3.2.2 are rational. Below is an
example.
The total percentage error (TPE) cost function is given by,
$$cf(m, b, x) = \underset{T'}{\text{Minimize}} \sum_{i=1}^{n} (T_i - T_i')/T_i$$
$$\Rightarrow cf(m, b, x) = \underset{m,b}{\text{Minimize}} \sum_{i=1}^{n} (T_i - (m \cdot b^{x_i}))/T_i \qquad (3.16)$$
Figure 3.2 illustrates the rationality of the TPE cost function. The ratio-
nality of the other cost functions from Section 3.2.2 can be inferred in the
same way.
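The TPE expression in Equation (3.16) and its rationality can be checked with a small sketch. The function names below are hypothetical, and the input set is borrowed from the earlier {10, 31, 92, 183} example purely for illustration.

```python
from typing import List, Sequence


def total_percentage_error(original: Sequence[int], harmonic: Sequence[int]) -> float:
    """TPE: sum of (T_i - T'_i) / T_i over all tasks."""
    return sum((t - h) / t for t, h in zip(original, harmonic))


def geometric_periods(m: int, b: int, x: Sequence[int]) -> List[int]:
    """Periods of the form T'_i = m * b**x_i."""
    return [m * b ** xi for xi in x]


T = (10, 31, 92, 183)
closer = geometric_periods(10, 2, (0, 1, 3, 4))   # [10, 20, 80, 160]
farther = geometric_periods(10, 2, (0, 1, 2, 4))  # [10, 20, 40, 160]

# Moving the third period further from its original value raises the cost.
print(total_percentage_error(T, closer) < total_percentage_error(T, farther))
```

Only the third period differs between the two candidate sets, so the comparison isolates the effect that Definition 1 requires of a rational cost function.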
Lemma 5. Given a set of periods $T$, the multiplier $m$ of the output
harmonic period set ranges from 1 to $T_1$.
Proof. Every element of the output harmonic period set is constrained to be less than or
equal to the corresponding element in the input period set. Consider the first element
$T_1'$, which can be represented as follows,
$$T_1' = m \cdot b^{x_1} \le T_1$$
$$m_{max} \le T_1 \quad \text{when } b = 1 \text{ or } x_1 = 0$$
The value of the multiplier is maximized when the second term, $b^{x_1}$, is 1, i.e., either
$b = 1$ or $x_1 = 0$. Hence, the maximum value of the multiplier is $T_1$ and, since the multiplier
is a positive integer, the minimum value for the multiplier is 1.
Lemma 6. Given a set of periods $T$, the base $b$ of the output harmonic period set, for a given
multiplier $m$, ranges from 1 to $\lfloor T_n / m \rfloor$.
Proof. We have, $\forall i$, $T_i' \le T_i$. Consider the last element $T_n'$, which can be represented as
follows,
$$T_n' = m \cdot b^{x_n} \le T_n$$
$$b^{x_n} \le T_n / m \quad \text{since } m \in \mathbb{N}_{>0}$$
$$b_{max} \le T_n / m \quad \text{when } x_n = 1$$
Lemma 7. Given an input set of periods $T$, the exponent $x_i$ for each element of the output
harmonic period set ranges from 0 to $\lfloor \log_2(T_i/m) \rfloor$. For $b = 1$, the exponent is irrelevant (i.e.,
any value of $x_i \ge 0$ has the same effect).
$$T_i' = m \cdot b^{x_i} \le T_i$$
$$x_i \le \log_b(T_i / m)$$
As the base decreases, the value of the exponent increases. For a given multiplier $m$, the
value of the exponent is maximized when the value of the base is minimum, i.e., in this
case, $b = 2$ since $b \in \mathbb{N}_{>1}$. Hence, the maximum value of the exponent is $\lfloor \log_2(T_i/m) \rfloor$
for $b \in \mathbb{N}_{>1}$ and, since the exponent is a non-negative integer, the minimum value for
the exponent is 0. When $b = 1$, the second term $b^{x_i}$ will always result in a value of 1, i.e.,
every $T_i' = m$, so any value of $x_i \ge 0$ has the same effect.
Figure 3.4: Brute-Force Geometric Series Search tables (Best Viewed in Color)
return $T'$ (returns the closest feasible harmonic period set, or null to indicate an infeasible result)
each element in this set, for a given multiplier and base combination. It
also shows the corresponding scaled cumulative percent error ($SC_{TPE}$) of
the output harmonic period set with respect to the original task set $T$. Notice
that, for this range, the values of $x_i$ remain constant across all entries,
i.e., $x = \{0, 0, 1\}$. As can be seen, only the last entry in this range is of
interest in the search for an optimal assignment. This is the key insight on
which the Discrete Piecewise Search Algorithm presented in Algorithm 5
is based.
As before, we will compare our approach with a brute-force search.
Algorithm 3 presents the brute-force geometric series search approach. Notice
that the brute-force algorithm also checks the feasibility of the solution, i.e.,
$C_i \le T_i'$, and it returns a solution if one exists, else returns null.
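The brute-force idea can be sketched as an exhaustive enumeration of every (multiplier, base, exponent) combination. This is an illustrative sketch and not the thesis's Algorithm 3: it assumes integer periods, uses the total percentage error as the cost function, and the function names are hypothetical.

```python
from itertools import product


def floor_log2(value: int) -> int:
    """floor(log2(value)) for an integer value >= 1."""
    return value.bit_length() - 1


def brute_force_search(periods, wcets):
    """Return (best_period_set, best_cost), or (None, None) if infeasible."""
    best_set, best_cost = None, None
    for m in range(1, min(periods) + 1):               # multiplier: 1 .. T_1
        for b in range(1, max(periods) // m + 1):      # base: 1 .. floor(T_n / m)
            # For b = 1 the exponent is irrelevant, so only x_i = 0 is tried.
            exp_ranges = [
                range(1) if b == 1 else range(floor_log2(t // m) + 1)
                for t in periods
            ]
            for xs in product(*exp_ranges):
                candidate = [m * b ** x for x in xs]
                # Output periods may not exceed the input periods.
                if any(tp > t for tp, t in zip(candidate, periods)):
                    continue
                # Feasibility check: C_i <= T_i'.
                if any(c > tp for c, tp in zip(wcets, candidate)):
                    continue
                cost = sum((t - tp) / t for t, tp in zip(periods, candidate))
                if best_cost is None or cost < best_cost:
                    best_set, best_cost = candidate, cost
    return best_set, best_cost


print(brute_force_search([10, 20, 45], [1, 2, 3]))
```

The enumeration grows quickly with the number of tasks, which is the cost the pruned search is meant to avoid.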
$$f(x) = \begin{cases} g_1(x), & x \in \text{Set } S_1 \\ g_2(x), & x \in \text{Set } S_2 \\ \;\;\vdots & \end{cases}$$
As can be seen above, the harmonization cost function takes the form
of a discrete piecewise function. We now find the local optimal values of
Lemma 8. Given a set of harmonic period sets with a fixed multiplier and a range of bases such
that all values of $x_i$ remain identical, the harmonic period set with the largest base is the local
optimum (i.e., within the range of bases specified) with respect to rational cost functions.
Proof. Since the multiplier and exponent for all the periods in each set remain constant,
the value of each period depends solely on the value of the base. Hence, the greater the
value of the base, the greater the value of the second term, i.e., $b^{x_i}$, and the closer the
period is to the corresponding period in the original task set, resulting in a lower error value.
Hence, under the above conditions, any cost function that can be reduced to the form
$$cf(m, b, x) = \underset{b}{\text{Maximize}}\; b$$
is a rational cost function and has a local optimum value at
$b = b_{max}$ in the given base range. Equivalently, every base $b$ before
which the exponent $x_i$ of any term changes is a locally optimal solution. Consider the
cost function used in Figure 3.3.
$$cf(m, b, x) = \underset{T'}{\text{Minimize}} \sum_{i=1}^{n} (T_i - m \cdot b^{x_i})/T_i$$
$$cf(b) = \underset{b}{\text{Minimize}} \sum_{i=1}^{n} (T_i - m \cdot b^{x_i})/T_i$$
$$cf(b) = \underset{b}{\text{Maximize}}\; b$$
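The step from the summation to maximizing the base can be spelled out. This is a short derivation sketch, assuming the multiplier $m$ and all exponents $x_i$ are held fixed over the base range, as Lemma 8 requires.

```latex
\begin{align*}
cf(b) &= \sum_{i=1}^{n} \frac{T_i - m \cdot b^{x_i}}{T_i}
       = n - m \sum_{i=1}^{n} \frac{b^{x_i}}{T_i}
\end{align*}
% Since m > 0, T_i > 0 and x_i >= 0, each term b^{x_i} / T_i is non-decreasing in b.
% Hence cf(b) is non-increasing in b, and minimizing cf(b) over the range is
% equivalent to choosing the largest base b for which m * b^{x_i} <= T_i still holds.
```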
Thus, the total percentage error cost function's local optimal value will
be at $b = b_{max}$ in the given base range. The property of Lemma 8 can
be clearly seen in Figure 3.4, where the multiplier is fixed at 5, the base $b$
increases from 8 to 22, and $x = \{0, 0, 1\}$. $T_3'$ increases from 40 to 110, which
is the highest value it can take for $x_3 = 1$. Hence, under these conditions, in
the above example, $b = 22$ will always be the local optimal solution for
$m = 5$ and $b = 8$ to 22. Corresponding results also apply to other rational
cost functions.
Lemma 9. Given a harmonic period set represented as $m \cdot b^{x_i}$, for a given multiplier $m$,
the bases where the local optimal solutions occur are $\lfloor \sqrt[p]{T_i/m} \rfloor$, where $p$
ranges from 1 to $(x_i)_{max}$, and $b = 1$.
Proof. From Lemma 6, we know that for a fixed multiplier $m$, the base varies from 1 to
$\lfloor T_n/m \rfloor$. From Lemma 7, we know that, as the base increases, the exponent decreases
from $\lfloor \log_2(T_i/m) \rfloor$ to 0. We know, from Lemma 8, that every base before which the
exponent changes is a local optimal solution. This power flip occurs when the base is
just big enough that the exponent has to be decreased. This happens when a base reaches
the maximum possible value a given $x_i$ can support.
$$T_i' = m \cdot b^{x_i} \le T_i$$
Given the above context, we now present the Discrete Piecewise Harmonic
Search algorithm in Algorithm 5. We consider the entire multiplier range
from 1 to T1 (from Lemma 10), but prune the number of bases using Lemma
Algorithm 4 GetLocalMinima
1: procedure getLocalMinima
2:   m ← multiplier
3:   T ← period set from τ
4:   B ← {1}
5:   for each $T_i$ in T do
6:     for each p from 1 to $\lfloor \log_2(T_i/m) \rfloor$ do
7:       B ← B ∪ {$\lfloor \sqrt[p]{T_i/m} \rfloor$}
8:   return unique(B)
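The candidate-base computation can be sketched in a few lines. This is a minimal sketch, assuming integer periods and an integer multiplier; `iroot` and `get_local_minima` are hypothetical names, and the integer root is corrected after the floating-point estimate to avoid rounding errors.

```python
def iroot(value: int, p: int) -> int:
    """Largest integer r >= 0 such that r**p <= value."""
    r = int(round(value ** (1.0 / p)))
    while r ** p > value:            # floating-point estimate was too high
        r -= 1
    while (r + 1) ** p <= value:     # floating-point estimate was too low
        r += 1
    return r


def get_local_minima(periods, m):
    """Candidate bases for multiplier m: b = 1 plus floor((T_i/m)^(1/p))."""
    bases = {1}
    for t in periods:
        q = t // m                   # floor(T_i / m); integer roots are unchanged
        if q < 1:
            continue                 # multiplier exceeds this period
        max_exp = q.bit_length() - 1  # floor(log2(T_i / m))
        for p in range(1, max_exp + 1):
            bases.add(iroot(q, p))
    return sorted(bases)


print(get_local_minima([10, 20, 45], m=5))
```

Using a set removes duplicates as they are generated, which plays the role of the final `unique(B)` step.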
14: return $T'$ ▷ Return the closest feasible harmonic period set, or null to indicate an infeasible result
closest to the original period value. This limits all tasks with periods larger
than a given harmonic to the value of the harmonic resulting in priorities
identical to the RM priorities of the original task set.
3.4.3 Evaluation
Figure 3.7: FOE Evaluation: DPHS vs BBHS vs PRHS vs Sr (Best Viewed in Color)
both the DPHS and PRHS algorithms are again significantly faster than
the BBHS algorithm. Figure 3.9 plots the average error per task set against
the task set cardinality. Again, the DPHS and PRHS algorithms produce
sub-optimal results but only slightly worse. It is interesting to note that,
between DPHS and PRHS, there is also a trade-off between accuracy and
run-time duration. On average, DPHS is faster but has greater error. Sr,
on the other hand, is significantly faster than the other approaches, but its
error performance is also significantly worse.
We next show the benefits of the harmonic search algorithms by considering
real-world applications. We consider an avionics task set used by
Locke et al. [74]. Table 3.3 presents the outputs of the harmonic search
algorithms proposed in this chapter.
Let us first consider the TSU cost function. The goal of the TSU cost
function is to reduce the total utilization of the resultant harmonic period
set. Hence, the period assignment will be biased towards maintaining the
Figure 3.9: MPE Evaluation: DPHS vs BBHS vs PRHS vs Sr (Best Viewed in Color)
run-time admission control and other contexts that are time-sensitive and
can afford sub-optimal solutions. We also demonstrated the benefits of our
approach by considering real-world task sets.
Chapter 4
System Model
CHAPTER 4. SYSTEM MODEL 49
$$WCCT_i = r_i + J_i \qquad (4.2)$$
1. Accept inputs,
2. Perform calculations,
3. Produce outputs
In our system model, we assume that all tasks are independent from one
another and they do not self-suspend during execution. In other words, the
lifecycle of every task follows the above steps. All I/O operations, includ-
ing reading sensor values, are completed before performing calculations
and outputs including those driving actuators are produced at the end
of the task life-cycle. This architecture is consistent with the AUTOSAR
standard [22]. A task that performs all of the above operations is called a
primary. A fail-stop failure of a primary is tolerated by the system by pro-
visioning one or more standbys corresponding to each primary. Byzantine
faults caused by security breaches are beyond the scope of this dissertation.
1. Active Replica: In active replication, all redundant copies are identical and treated
uniformly. Each replica performs all operations, like accepting and processing
application inputs, performing state calculations, performing application calcula-
tions and producing output. This implies that, under normal operation, the system
needs to support duplicate suppression to filter out duplicate outputs.
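[Figure: heartbeat (HB) timeline for a primary and its backup. After a primary failure, the backup observes a missing heartbeat (No HB) within the delivery bound δ, is promoted to primary, and produces output.]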
The system designer can decide which tasks are considered critical for the
application and which are considered non-critical. In this dissertation, we
assume that non-critical tasks do not have replicas and can be terminated
in order to allow a cold standby to execute when a primary fails. For
fault detection, we assume that the replicas monitor the status and health
of the primary, for example, by using heartbeats and producing outputs
when necessary [79, 81]. This is illustrated in Figure 4.1. We assume that
the underlying communication framework is reliable¹, i.e., it guarantees
that a message will either be delivered within a fixed message delivery
bound δ or not be delivered at all. Common communication protocols
like CAN/CAN-FD [82], FlexRay [52] and many variants of real-time
Ethernet [53, 54] can support these guarantees. The successful reception of a
heartbeat indicates to the replica that the primary is operational.
1 Safety-critical real-time systems must deal with communication failures. The communication layer
can utilize solutions like redundant CAN bus links, dual FlexRay configurations with built-in support for
fault tolerance, and replicated Ethernet switches. In the interest of brevity, we abstract away the details
of such solutions with our assumption of a reliable communication layer in this dissertation.
In this chapter, we presented our system, task and fault models. We assume
a distributed system consisting of several computational nodes, where each
node can communicate with every other node in the system by sending
messages. We assume that applications are independent and that they ac-
cept inputs, perform calculations and produce outputs. We assume that
tasks are replicated by using either the active or passive replication strate-
gies to support fault-tolerance. Our primary fault-model is fail-stop, which
is commonly used in the automotive industry. We use the system, task and
fault model presented in this chapter throughout the rest of the disserta-
tion.
Chapter 5
CHAPTER 5. SELECTION OF REPLICATION PARAMETERS 55
1. We derive the bounds on the recovery time of different types of redundancies, i.e.,
active or passive, used in software fault-tolerance techniques for real-time systems.
[Figure: recovery time. Timeline from primary failure, through the missing heartbeat (No HB) observed by the backup, to the instant the backup is promoted to primary and produces output.]
Definition 3. Recovery time (RT) is the time elapsed from the instant of primary failure to the
instant when a redundant task is able to produce the desired output. This duration is shown in
Figure 5.1.
The choice of the type of redundant task to be used has a major impact
on a task’s recovery time. For example, an active replica can virtually
provide seamless recovery since it runs alongside the primary. The hot
and cold standbys, on the other hand, have to first detect primary failure.
In addition, the cold standby needs to then prime its state, which results in
an even longer recovery time.
The number of redundant copies assigned to each task is also an impor-
tant design parameter. Every task can be assigned m ( m ∈ N ) redundant
copies. It is important to note that different tasks can utilize different re-
dundancy types (i.e., active, hot or cold). The number of replicas and
their types are system parameters which are application-specific. Their
choice determines the number of failures a given task can tolerate and how
quickly a task can recover from a failure. The former is a system designer’s
choice and the latter can be captured by specifying a recovery-time require-
[Figure: heartbeat timelines for a primary and its backup. (a) Backup not following the Primary: backup arrivals are independent of the primary, and 2 heartbeats are missed. Second panel: the backup is released after $WCCT_{PRI}$ plus the delay $\delta = w + C + QJ$, and no heartbeats are missed.]
3. find an allocation where all tasks satisfy their recovery-time requirements while
minimizing the number of nodes used for allocation.
we derive the recovery time bounds for hot and cold standbys.
Previous work [81] has shown that the bounds on the recovery time for
passive backups can be reduced if the backup task execution follows the
execution of the primary. The intuition for this can be seen in Figures
5.2a and 5.2b. As seen in Figure 5.2a, if the backup can execute at any
time independent of its primary, it is possible for a backup to miss up to
two heartbeats without primary failure. Hence, the backup must wait for
three consecutive missed heartbeats to declare failure of the primary and
initiate recovery, resulting in a longer recovery time. In contrast, when the
backup follows the execution of the primary, it needs only a single missed
heartbeat to detect primary failure.
For the backup to follow the primary, the following requirements must
be satisfied:
1. Global Time Synchronization: To ensure that the backup follows the primary, the
release time of the backup w.r.t that of the primary must be explicitly controlled.
Since fault-tolerant task allocation requires primaries and replicas to run on distinct
nodes, the nodes must be time-synchronized. This constraint can be relaxed for a
system which allows tasks to be released with offsets at boot up and has negligible
clock drift.
$$\delta_m = w_m + QJ_m + C_m \qquad (5.1)$$
where,
• The queuing jitter $QJ_m$ corresponds to the longest time between the initiating
event and the message being queued, ready to be transmitted on the network.
• The queuing delay $w_m$ corresponds to the longest time that the message can
remain in the device queue, before commencing successful transmission on
the network.
• The transmission time $C_m$ corresponds to the longest time that the message
can take to be transmitted. In the case of standbys, the transmission time
depends on the standby type. Cold standbys need to accept state, and normally
require longer transmission times than hot standbys.
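As a small worked example of Equation (5.1), the delivery bound is the sum of the three components listed above. The sketch below uses a hypothetical `MessageTiming` type and made-up values in milliseconds; none of the numbers come from the original text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageTiming:
    queuing_jitter: float      # QJ_m: initiating event -> message queued
    queuing_delay: float       # w_m: longest wait in the device queue
    transmission_time: float   # C_m: longest time on the network

    def delivery_bound(self) -> float:
        """delta_m = w_m + QJ_m + C_m"""
        return self.queuing_delay + self.queuing_jitter + self.transmission_time


# Hypothetical values: the cold standby also receives state, so C_m is larger.
hot_heartbeat = MessageTiming(queuing_jitter=0.2, queuing_delay=0.5, transmission_time=0.3)
cold_heartbeat = MessageTiming(queuing_jitter=0.2, queuing_delay=0.5, transmission_time=1.1)
print(hot_heartbeat.delivery_bound(), cold_heartbeat.delivery_bound())
```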
the backup corresponding to the failure of the primary. The total time
from the release of the primary to the execution of the backup would be
WCCTpri + δhot + WCCThot . Hence, the recovery time is
1. The frequency of state transfer from the primary to the standby: The higher the
frequency of state transfer, the fresher is the state of the cold standby and hence
lower is the number of periods required for state priming (i.e., a lower value for p).
Thus, for a cold standby to recover from a primary failure, the recovery
time would be
RTR = 0
For a hot standby to recover from primary failure and maintain RTR = 0,
the recovery time, $RT_{hot}$, should be less than or equal to $T$, i.e., the
redundant task must recover before the deadline of its primary. Hence, from
Equation (5.2) we have,
$$T + \delta_{hot} + T \nleq T$$
Hence, with the worst-case values for WCCT, a hot standby cannot satisfy
RTR = 0.
RTR > 0
Consider the case where RTR = n and n ∈ N>0 , allowing the task to
tolerate up to n missed deadlines when the primary fails.
In the case of a hot standby, RTR = n can be satisfied if $RT_{hot} < (n + 1)T$.
Considering $n \ge 2$ and the worst-case values for the terms in the above
equation,
$$WCCT_{pri} + \delta + WCCT_{bkp} \le 3T \qquad (5.6)$$
$$T + \delta + T \le 3T$$
RTR = 0
For a cold standby to satisfy RTR = 0, the recovery time should be less
than or equal to T. From Equation (5.3),
Standby Selection
RTR (n)   Condition                                                      Standby Assignment
0         $WCCT_{pri} + \delta_{cold} + WCCT_{cold} \le T$               Cold (p = 0)
0         $WCCT_{pri} + \delta_{hot} + WCCT_{hot} \le T$                 Hot
0         $WCCT_{pri} + \delta_{hot} + WCCT_{hot} > T$                   Active
>0        $WCCT_{pri} + \delta + WCCT_{bkp} + pT \le (n + 1)T$           Cold
>0        $WCCT_{pri} + \delta_{hot} + WCCT_{hot} \le (n + 1)T$          Hot
>0        $WCCT_{pri} + \delta_{hot} + WCCT_{hot} > (n + 1)T$            Active
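The selection table can be read as a decision procedure that tries the cheapest standby type first. The sketch below is one possible reading, under two assumptions that are not stated in the table: the rows are checked in order (cold, then hot, then active), and the generic $\delta$ and $WCCT_{bkp}$ in the cold row for RTR > 0 stand for the cold-standby values. The type and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTiming:
    period: float           # T
    wcct_pri: float         # worst-case completion time of the primary
    wcct_hot: float         # worst-case completion time of a hot standby
    wcct_cold: float        # worst-case completion time of a cold standby
    delta_hot: float        # message delivery bound for a hot standby
    delta_cold: float       # message delivery bound for a cold standby
    priming_periods: int    # p: periods needed for cold-standby state priming


def select_standby(t: TaskTiming, rtr: int) -> str:
    """Pick the redundancy type for a recovery-time requirement RTR = rtr."""
    budget = (rtr + 1) * t.period
    cold_rt = t.wcct_pri + t.delta_cold + t.wcct_cold + t.priming_periods * t.period
    if cold_rt <= budget:
        return "cold"
    hot_rt = t.wcct_pri + t.delta_hot + t.wcct_hot
    if hot_rt <= budget:
        return "hot"
    return "active"


timing = TaskTiming(period=10, wcct_pri=2, wcct_hot=2, wcct_cold=3,
                    delta_hot=1, delta_cold=2, priming_periods=1)
print(select_standby(timing, rtr=0), select_standby(timing, rtr=1))
```

With RTR = 0, a cold standby that needs at least one priming period can never fit within $T$, which matches the table's restriction to p = 0 in that row.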
RTR > 0
As shown in Figure 5.5a, a single primary can have more than one backup.
Both backups in the figure are released such that they follow the primary
to satisfy the primary’s recovery-time requirement. We assume that the
order of promotion to primary is statically configured. Suppose that the
first backup in Figure 5.5a is designated to take over execution first after
primary failure. On primary failure, it is not guaranteed that the current
second backup would always satisfy the recovery time requirements of the
first which would now become the new primary. In order for the second-
level backup to now satisfy the RTR of the new primary, the release time of
the task needs to be corrected and this is shown in Figure 5.5b. Also, since
we are delaying the release time of the task, and deadlines are therefore
correspondingly postponed, the deferred start does not affect the overall
schedulability of the task set [90].
CHAPTER 6. FAULT-TOLERANT ASSIGNMENT OF TASKS TO COMPUTING NODES
68
[Figure: example task set arranged by tier order (tierOrder = 0, i.e., ψ0; tierOrder = 1, i.e., ψ1). Tasks and utilizations: Steering Control (0.2), Throttle Control (0.16), Brake Control (0.1), Safety Audio (0.1).]
In order to aid the discussion in the sections to follow we will use the figure
above for reference.
A typical task assignment heuristic has two steps: task ordering and map-
ping tasks to processors. For TPCD, the task ordering is derived from three
pieces of intuition:
1. The members of a group should be placed as far away from each other in the task
order as possible. This reduces the chances of a task facing a placement conflict.
[Figure: step-by-step allocation diagrams. Tasks and utilizations shown: Video Playback task (0.55), Audio Playback Task (0.5), HVAC Task (0.4), Steering Control (0.2), Throttle Control (0.16), Brake Control (0.1), Safety Audio (0.1), together with their replicas.]
conflict.
3. Having group members close to each other in the task order is more expensive
towards the end than at the beginning of the task order. This is because tasks
towards the end may result in new processors being allocated while having only a
few tasks left in the queue to populate them.
It is important to note here that these intuitions solely target the fault-
tolerant placement constraints, and are independent of the system pa-
rameters used to determine schedulability. In the following sections, we
consider task utilization while deciding schedulability for task allocation.
Algorithm 6 TPCD
1: procedure TPCD(Γ = {$\tau_0^0, \tau_0^1, \ldots, \tau_1^0, \ldots, \tau_n^0, \ldots$})
2:   for each task $\tau_j$ in Γ do
3:     $\Psi_i \leftarrow \tau_j^i$ ▷ Create tiers consisting of tasks of the same order
4:   for each tier $\Psi_i$ in Ψ do
5:     Sort tasks in descending order
6:     Task Assignment(α) ← BFD-P($\Psi_i$)
7:   return α ▷ Return the task set assignment
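The tiering and packing steps of the listing above can be sketched as follows. This is an illustrative sketch under several assumptions that the listing does not spell out: BFD-P is taken to be best-fit decreasing that never places two members of the same group on one node, a node is schedulable up to a total utilization of 1.0, tasks are sorted by utilization, and tiers are processed from the highest order down. All type and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    group: str    # primary and its replicas share a group (task id j)
    order: int    # tier order i
    util: float   # utilization


@dataclass
class Node:
    tasks: list = field(default_factory=list)

    @property
    def load(self) -> float:
        return sum(t.util for t in self.tasks)

    def can_host(self, task: Task, capacity: float = 1.0) -> bool:
        no_conflict = all(t.group != task.group for t in self.tasks)
        return no_conflict and self.load + task.util <= capacity + 1e-9


def bfd_p(tasks, nodes):
    """Best-fit decreasing with the placement constraint (assumed behavior)."""
    for task in sorted(tasks, key=lambda t: t.util, reverse=True):
        candidates = [n for n in nodes if n.can_host(task)]
        if candidates:
            target = max(candidates, key=lambda n: n.load)  # tightest fit
        else:
            target = Node()
            nodes.append(target)
        target.tasks.append(task)
    return nodes


def tpcd(tasks):
    tiers = {}
    for task in tasks:
        tiers.setdefault(task.order, []).append(task)
    nodes = []
    for order in sorted(tiers, reverse=True):  # assumed tier processing order
        bfd_p(tiers[order], nodes)
    return nodes
```

Because each tier holds at most one member of any group, members of the same group end up far apart in the overall task order, which is the first intuition behind the heuristic.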
Algorithm 7 TPCDC
1: procedure TPCDC(Γ = {$\tau_0^0, \tau_0^1, \ldots, \tau_1^0, \ldots, \tau_n^0, \ldots$})
2:   α ← TPCD($\tau_j^i\ \forall f(j) \ne 0$) ▷ Step 1: Treat all tasks as hot standbys
3:   Step 2: Stop treating cold standbys as hot standbys. Change all cold-standby
     utilizations to the default normal-operation cold-standby utilizations.
4:   Task Assignment(α′) ← BFD-P(α, $\tau_j^i\ \forall f(j) = 0$)
5:   return α′
Figure 6.6: Tasks allocated using TPCDC and their utilizations (best viewed in color)
Figure 6.7: Tasks allocated using TPCDC and their utilizations with standby
redistribution (best viewed in color)
the backups of some of the primaries on the dominant processor run identical
copies of the primary, it is possible to swap their positions with the
primary to produce a more balanced distribution of primaries and standbys.
This is highlighted in Figure 6.9.
Figure 6.8: TPCD Solution to 6.1 with backup types highlighted (best viewed in color)
Figure 6.9: TPCD Solution to 6.1 with Primary redistribution (P: Primary, B: Backup; best
viewed in color)
(TPCDC)
The TPCD heuristic allocates primaries and standbys enforcing the place-
ment constraint while trying to minimize the number of processors used.
However, it does not leverage the fact that a cold standby under normal
operation runs with much lower processor utilization, allowing other non-
critical tasks to be scheduled along with the cold standby. This section
describes the TPCDC heuristic.
TPCDC divides all tasks into two categories as follows. A task is con-
sidered placement-critical if it has at least one backup or it is considered
application critical, else it is considered non-placement-critical. Any non-
6.1.4 Evaluation
TPCD: Discussion
Figure 6.10 highlights the savings in the number of processors per iteration that
would be obtained when using TPCD over R-BFD. As the figure indicates,
TPCD saves up to 0.43 processors per iteration for a purely random task
set and up to 0.91 processors per iteration for an L2 task set. Figures 6.11
to 6.13 compare the performance of TPCD and R-BFD directly by plotting
the percentage of cases under which they outperform each other, i.e., they
1 This is feasible with EDF (Earliest Deadline First) and RMS (Rate-Monotonic Scheduling) with
harmonic task sets.
Figure 6.10: R-BFD vs TPCD Processors saved by TPCD over R-BFD per task set (best
viewed in color)
use at least one less processor than the other for a feasible task assignment
while enforcing the placement constraint. As Figures 6.11 to 6.13 show,
TPCD outperforms R-BFD as both $U_{max}$ (0.3, 0.5 and 0.7) and the number of
primaries (1 to 40) are varied.
The performance of TPCD against R-BFD is primarily influenced by two
factors:
1. Presence of placement-critical tasks of low utilization: Any task with a low uti-
lization and a backup has a very high placement cost, as it can leave a processor
relatively empty on assignment, retaining the potential to cause placement con-
straints for its group members. However, a task with high utilization would have a
lower placement cost. This can be intuitively visualized as the extreme case where
Figure 6.11: $U_{max}$ = 0.3, R-BFD vs TPCD: Percentage of task sets where one technique
does better (i.e., uses at least one less processor) than the other (best viewed in color)
Figure 6.12: $U_{max}$ = 0.5, R-BFD vs TPCD: Percentage of task sets where one technique
does better (i.e., uses at least one less processor) than the other (best viewed in color)
Figure 6.13: $U_{max}$ = 0.7, R-BFD vs TPCD: Percentage of task sets where one technique
does better (i.e., uses at least one less processor) than the other (best viewed in color)
we have a task with a utilization value of 1.0 with two backups. Any allocation
scheme would result in each of these tasks ending up on different processors and
not cause any placement constraints. TPCD performs much better when a task set
contains placement-critical tasks (i.e., tasks that have one or more backups) with
low utilization because, unlike R-BFD, TPCD prioritizes the allocation of higher
order tasks first. The effect of this can be seen in Figures 6.10 and 6.13, where
TPCD does much better than R-BFD when $U_{max}$ is low.
2. Performance of BFD-P when single task utilizations are high and when the number
of tasks increases: The significance of placement constraints decreases as $U_{max}$
increases or when the number of tasks in the system is high. This results in the
[Figure: bins used per iteration versus number of primaries (1 to 40); legend: $U_{max}$ = 0.3 TPCD, $U_{max}$ = 0.3 TPCDC.]
TPCDC: Evaluation
In order to evaluate TPCDC, we use a setup similar to the one used for
evaluating TPCD. We vary the maximum single task utilization ($U_{max}$), and the
results presented here are at $U_{max}$ values of 0.3, 0.5 and 0.7. We plot
the number of processors used by each technique in Figure 6.15.
TPCDC is able to reduce the number of processors required to
have a feasible task assignment while obeying the placement constraint.
The gains for TPCDC increase as $U_{max}$ increases, because
cold standbys can then have larger utilizations on initial assignment, freeing
up larger spaces when the lower normal-mode utilizations of cold standbys are
used while assigning the non-placement-critical tasks.
We now have a technique for assigning tasks with standbys to different
processors. In the following sections, we present a run-time framework
based on AUTOSAR that supports the allocation of hot and cold standbys,
and show how the system recovers from failures. We also look at practical
considerations such as the worst-case behaviors and overheads of our
solution. We complement our analysis with experimental results for failure
recovery latencies and overheads from a test bench running
AUTOSAR-compliant code.
In this section, we have so far considered the ($C_i$, $T_i$, $D_i$) task
model for independent tasks, where $C_i$ represents the worst-case execution
time of a task. We have assumed RM scheduling, a fixed-priority preemptive
scheduling policy which allows schedulability to be maximized while
respecting the deadlines of all tasks. This approach assumes that every task
is equally important, and that its worst-case execution time never exceeds
$C_i$.
Mixed-criticality systems offer more flexibility. In particular, tasks can
have different criticality levels. For example, in automobiles, actuation con-
trol tasks like brake, steering and throttle control are more critical than
tasks that control the HVAC system. Hence, if there is ever a situation
where we can only satisfy the deadline of some tasks, we should meet
those with higher criticality.
Calculating the worst-case execution time for many real-time cyber-
physical systems, including automobiles, is unfortunately difficult. This
is in part due to the run-time dependencies on the operating environment
that can influence the execution time of a given task. For example, consider
the task that detects, classifies and tracks objects in a vehicle’s environment.
As the number of objects increases, the computational resources needed to
where,
Cio is the overload execution budget, and
ζ i is the criticality of the task.
2. Overloaded, with an execution budget of Cio which represents the load during over-
loaded operation.
[Figure: bins used per iteration versus number of primaries (0 to 25).]
satisfied.
Task ordering can influence the allocation of certain tasks to certain nodes,
which in turn affects schedulability. To this end, we propose the C-TPCD
heuristic to evaluate the impact of the task order on resource utilization.
Since higher-criticality tasks cannot afford to miss deadlines, we prioritize
Algorithm 10 C-TPCD
1: procedure C-TPCD(Γ = {$\tau_0^0, \tau_0^1, \ldots, \tau_1^0, \ldots, \tau_n^0, \ldots$})
2:   for each task $\tau_j$ in Γ do
3:     $\zeta_i \leftarrow \tau_j^k$ ▷ Create tiers consisting of tasks of the same criticality
4:   for each task $\tau_j$ in ζ do
5:     $\Psi_i \leftarrow \tau_j^i$ ▷ Create tiers consisting of tasks of the same order
6:   for each tier $\Psi_i$ in Ψ do
7:     Sort tasks in descending order
8:     Task Assignment(α) ← BFD-P($\Psi_i$)
9:   return α ▷ Return the task set assignment
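The two-level grouping in the listing above can be sketched as an ordering step that produces the batches handed to the packing routine. This is a minimal sketch with assumptions that the listing leaves open: a lower ζ value is treated as higher criticality, tiers within a criticality level are visited from the highest order down, and tasks within a tier are sorted by decreasing utilization. The names `CritTask` and `ctpcd_batches` are hypothetical.

```python
from typing import NamedTuple


class CritTask(NamedTuple):
    name: str
    criticality: int   # zeta; assumed: lower value = more critical
    order: int         # tier order
    util: float        # utilization used for sorting


def ctpcd_batches(tasks):
    """Group by criticality, then by tier order; sort each batch by utilization."""
    batches = []
    for crit in sorted({t.criticality for t in tasks}):
        level = [t for t in tasks if t.criticality == crit]
        for order in sorted({t.order for t in level}, reverse=True):
            tier = [t for t in level if t.order == order]
            batches.append(sorted(tier, key=lambda t: t.util, reverse=True))
    return batches


example = [
    CritTask("HVAC", 3, 0, 0.4),
    CritTask("BC", 1, 0, 0.1),
    CritTask("SC", 1, 0, 0.2),
    CritTask("BC'", 1, 1, 0.1),
    CritTask("VP", 3, 0, 0.55),
]
for batch in ctpcd_batches(example):
    print([t.name for t in batch])
```

Each batch would then be passed, in sequence, to the placement-constrained packing step, so that all tasks of the most critical level are admitted before any less critical task is considered.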
[Figure: tasks grouped into criticality tiers (ζ = 1, 2, 3), with utilization pairs: BC, BC', BC'' (0.1, 0.2); SC, SC' (0.2, 0.4); TC, TC' (0.16, 0.4); SA, SA', SA'' (0.1, 0.2); and VP (0.55), AP (0.5), HVAC (0.4).]
verify admission. This ensures that, in the case of overload, all the highest-
criticality tasks will meet their deadlines. Once the highest tier is allocated,
the next criticality tier is allocated in the same way. This is shown in Figures
6.19 and 6.20. Figure 6.21 depicts the final allocation.
We use an identical experimental setup as described in Section 6.2.1
to evaluate the performance of CTPCD-ZSRM against TPCD-ZSRM. As
Figure 6.22 depicts, TPCD-ZSRM still performs better in terms of resource
utilization. This implies that prioritizing the fault-tolerant packing insights
Figure 6.21: C-TPCD: Final assignment
[Figure: bins used per iteration versus number of primaries (0 to 25); legend: TPCD with ZSRM admission control, CTPCD with ZSRM admission control.]
produce a better allocation in a small but sizable fraction of the cases. This
is due to the fact that, for a particular task set, CAPA admission control
can result in a node assignment that improves the fault-tolerant allocation.
Hence, in practice, an ensemble approach [96] which applies all approaches
and picks the best allocation would be ideal.
[Figure: percentage of better allocations (at least 1 bin) versus number of primaries (0 to 25); legend: TPCD with ZSRM admission control, CTPCD with ZSRM admission control, TPCD CAPA admission control.]
Algorithm 11 TPCDC+R
1: procedure TPCDC+R(Γ = {$\tau_0^0, \tau_0^1, \ldots, \tau_1^0, \ldots, \tau_n^0, \ldots$}) ▷ ($\tau_j^i$: j → TaskId, i → TierOrder)
2:   for each task $\tau_j$ in Γ do
3:     $\Psi_i \leftarrow \tau_j^i$ ▷ Create tiers consisting of tasks with redundancies of the same order
4:   for each tier $\Psi_i$ in Ψ do
5:     Sort tasks in descending order of their utilizations
6:     for each task $\tau_i$ in $\Psi_i$ do
7:       Check recovery time to primary and assign redundant-task type
8:       Task Assignment(α) ← BFD-P($\tau_i$)
9:   Apply lower run-time utilizations for cold standbys
10:  Allocate the tasks that do not have redundancies
11:  return α ▷ Return the task set assignment
Algorithm 12 TRTI
1: procedure TRTI(Γ = {$\tau_0^0, \tau_0^1, \ldots, \tau_1^0, \ldots, \tau_n^0, \ldots$}) ▷ ($\tau_j^i$: j → TaskId, i → TierOrder)
2:   for each task $\tau_j$ in Γ do
3:     $\Psi_i \leftarrow \tau_j^i$ ▷ Create tiers consisting of tasks with redundancies of the same order
4:   for each tier $\Psi_i$ in Ψ do
5:     Sort tasks in ascending order of RTR constraints
6:     for each task $\tau_i$ in $\Psi_i$ do
7:       Check recovery time to primary and assign redundant-task type
8:       Task Assignment(α) ← BFD-P($\tau_i$)
9:   Apply lower run-time utilizations for cold standbys
10:  Allocate the tasks that do not have redundancies
11:  return α ▷ Return the task set assignment
take over primary execution. Hence, if multiple redundant task options are
available, we prioritize cold standbys over hot standbys and active replicas
because they are the most resource-efficient. Next, hot standbys do not
normally produce outputs. Hence, the overhead for duplicate suppres-
sion is avoided and hot standbys can potentially run a degraded version
of the primary with lower utilization values. However, they may have a
scheduling penalty since they need to satisfy RTR constraints. Therefore,
the heuristic first checks if the hot standby satisfies the RTR constraint of
the task. If so, it assigns a hot standby. Else, it chooses an active replica
instead of opening a new node for assignment.
It must be noted that the choices among three redundant-task types
would be different if the goal was different. For example, if communication
bandwidth is constrained, the cold standby overheads for state transfer
need to be factored in.3
As stated before, we prioritize cold standbys over hot standbys or active
replicas. Figure 6.25a shows the distribution of standby types produced
by TPCDC+R. We plot the percentage of active, hot or cold redundant
3 We will consider this overall system resource optimization problem as part of our future work.
[Figure: example task allocations across nodes, showing primaries and their cold, hot and active redundant copies with utilization values. Panels: (d) TPCDC-R+ critical task allocation; (e) RTT critical task allocation; (f) TPCDC-R+ non-critical task allocation; (g) RTT non-critical task allocation.]
Algorithm 13 RTT
1: procedure RTT(Γ = {τ_0^0, τ_0^1, ..., τ_1^0, ..., τ_n^0, ...})
2: for each task τ_j in Γ do
3: Ψ_i ← τ_j^i    ▷ Create tiers consisting of tasks of same RTR
4: for each tier Ψ_i in Ψ do
5: TPCDC+R(Ψ_i)
6: return α    ▷ Return the task set assignment
[Figure: percentage of active replicas, hot standbys and cold standbys versus the number of primaries (1 to 24), plotted separately for TPCDC+R, TRTI and RTT; a further plot shows the % of allocations with fewer nodes versus the number of primaries for RTT, TRTI and TPCDC+R.]
task assignments against the number of primary tasks in each task set.
The results are averaged across 50,000 task sets, where tasks are randomly
generated. Each task is randomly assigned 0, 1 or 2 redundancies, an RTR
constraint from 0 to 5, and a value for p (i.e., the number of periods for
cold standby priming) from 0 to 5.
TPCDC+R prioritizes tasks with higher utilization values by assigning
them first in the task allocation order for each tier. This introduces addi-
tional placement constraints for tasks which have tight RTR requirements.
An example occurs when a task with low utilization with strict RTR re-
quirements gets placed later in the allocation order. As a result, cold stand-
bys may become unschedulable forcing the use of active replicas, which in
turn can cause new nodes to be added.
To address this problem, we introduce two new heuristics based on
TPCDC+R that prioritize RTR constraints in the task allocation order.
1. In the first heuristic, we order tasks within each tier of TPCDC by their RTR
requirements instead of utilization values. We refer to this extension as the Tiered
RTR constraint Increasing (TRTI) heuristic. Algorithm 12 captures this TRTI heuristic.
2. In the second heuristic, we divide tasks into groups with different RTR
requirements and allocate each group separately using the TPCDC heuristic. We refer
to this as the RTR-tiered (RTT) heuristic. Algorithm 13 presents this heuristic.
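The difference between the two orderings can be illustrated with a short sketch. The `Task` record and both function names are hypothetical; the sketch covers only how tasks are grouped and ordered, not the node assignment itself (BFD-P or TPCDC+R). Processing RTT groups in ascending RTR order is also an assumption of this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: int
    tier: int           # redundancy order: 0 = primary, 1 = first copy, ...
    utilization: float
    rtr: int            # RTR constraint


def trti_order(tasks):
    """TRTI: one tier per redundancy order, each tier sorted by ascending RTR."""
    tiers = defaultdict(list)
    for task in tasks:
        tiers[task.tier].append(task)
    return [sorted(tiers[k], key=lambda t: t.rtr) for k in sorted(tiers)]


def rtt_groups(tasks):
    """RTT: one group per RTR value; each group would be handed to TPCDC+R."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task.rtr].append(task)
    return [groups[r] for r in sorted(groups)]
```

In TRTI a low-utilization task with a strict RTR moves to the front of its tier, whereas in RTT all copies of tasks sharing an RTR value are allocated together.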
In the previous section, we saw that the RTT heuristic on average pro-
duces a better solution than TPCDC+R and TRTI. In this section, we look
at further improving on the RTT heuristic solution by utilizing the simu-
lated annealing method to solve the fault-tolerant task allocation problem
instead.
Simulated annealing is a general-purpose combinatorial optimization
technique first proposed by Kirkpatrick et al. [44]. The fault-tolerant task
assignment problem can be stated as an optimization problem as follows.
Given n tasks (τ_1, τ_2, ..., τ_n) with utilizations (u_1, u_2, ..., u_n), where u_i ≤ 1,
find the number of nodes M of size 1 that are needed to pack all tasks
such that a primary task and its corresponding redundant copies obey
the placement constraint of not being co-located on the same node, while
optimizing the following cost function [98]:
\[
\text{Maximize } c_f = \sum_{j=1}^{M} \Big( \sum_{i \in k_j} u_i \Big)^2 \tag{6.2}
\]
empty bin, we remove it from the allocation.⁴ The value of the objective
function is calculated for this new state. Let ∆C represent the change in
the cost function, i.e., ∆C = c_f(α) − c_f(α′). This state is unconditionally
accepted if ∆C < 0. If not, the Metropolis condition [99] is applied and the
state is accepted with a probability given by the acceptance function
P = e^{−∆C/T}. We start with a large value for the initial temperature,
T = T_∞. When there is no appreciable change in the value of the cost
function across a few chains of computation, or when a maximum number of
iterations is reached, we lower the temperature. The annealing terminates
when the temperature T reaches a low-enough value T_o, and the current
best α is returned as the solution. We derive the values for T_∞ and T_o for
the fault-tolerant task allocation problem in Section 6.4.2.
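The acceptance rule above can be sketched as follows. This is an illustrative sketch, not the thesis implementation: an allocation is modeled as a list of nodes, each node a list of task utilizations, and the reading that α is the current state and α′ the candidate state is an assumption of this example.

```python
import math
import random


def cost(allocation):
    """Cost function of Equation 6.2: sum of squared node utilizations."""
    return sum(sum(node) ** 2 for node in allocation)


def accept(current, candidate, temperature, rng=random):
    """Metropolis-style acceptance with delta_c = c_f(current) - c_f(candidate)."""
    delta_c = cost(current) - cost(candidate)
    if delta_c < 0:
        return True  # candidate has a higher cost-function value: always accept
    # Otherwise accept with probability exp(-delta_c / T).
    return rng.random() < math.exp(-delta_c / temperature)
```

Since the cost function is maximized, a candidate that packs tasks more tightly yields ∆C < 0 and is always accepted; at low temperatures, worse candidates are almost never accepted.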
4 In our experiments, we found no significant improvement in the quality of solutions obtained by
retaining an empty bin.
Lemma 10. The maximum reduction ∆C max for the cost function in Equation 6.2, for a
system of two nodes, k and l, by moving a task from node k to l occurs when Uk = 1 and
Ul = 0, where Uk and Ul are the total utilization values of the respective nodes.
Proof. Let u_t represent the utilization of the task that is moved from bin k to l. Let
U_k′ and U_l′ be the transformed utilization values after the task is moved from node
k to l. Hence, U_k′ = U_k − u_t and U_l′ = U_l + u_t, and ∆C for this operation can be
represented as
\[
\begin{aligned}
\Delta C &= U_k^2 + U_l^2 - U_k'^2 - U_l'^2 = U_k^2 + U_l^2 - (U_k - u_t)^2 - (U_l + u_t)^2 \\
         &= 2 U_k u_t - 2 U_l u_t - 2 u_t^2
\end{aligned}
\tag{6.3}
\]
From Equation (6.3), ∆C is maximum when the positive terms are maximized and
the negative terms are minimized. Ul only appears in the second term which
is negative, and Uk appears only in the first term which is positive. Hence ∆C
is maximized when Uk = 1 and Ul = 0 corresponding to their maximum and
minimum possible values.
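The algebra in Equation (6.3) and the claim of the lemma can be checked numerically. The sketch below is only a sanity check under the definitions used in the proof; the function names are made up for this example.

```python
def delta_c_move(u_k, u_l, u_t):
    """Change in cost when a task of utilization u_t moves from node k to node l."""
    before = u_k ** 2 + u_l ** 2
    after = (u_k - u_t) ** 2 + (u_l + u_t) ** 2
    return before - after


def delta_c_closed_form(u_k, u_l, u_t):
    """Closed form of Equation (6.3)."""
    return 2 * u_k * u_t - 2 * u_l * u_t - 2 * u_t ** 2


# For a fixed u_t, search a coarse grid of node utilizations for the largest value.
grid = [(k / 10, l / 10) for k in range(11) for l in range(11)]
best = max(grid, key=lambda p: delta_c_move(p[0], p[1], 0.3))
```

On this grid the maximum is attained at U_k = 1 and U_l = 0, in agreement with the lemma.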
For the fault-tolerant task allocation problem, moving a task from one bin to an-
other can result in a different redundant-task-type assignment resulting in different
run-time utilizations. Let the factor s capture this utilization change. The associ-
ated change in the cost function for this operation is given by,
From Lemma 10, the maximum value of ∆C, which represents the largest reduction
in the cost function, occurs when a task is moved from a completely-packed node
to a completely-empty node. Since we apply a greedy optimization of removing
empty bins, we consider U_l = ε. Hence,
2. We randomly select two tasks currently located in two different bins and swap
them.
Lemma 11. The maximum reduction ∆C_max for the cost function in Equation 6.2, for a
system of two nodes, k and l, by swapping two tasks occurs when one of the nodes has
U = 1 and the other has U = ε.
Proof. Let U_k and U_l be the total utilization values of the respective nodes. Let u_t1
represent the utilization of the task that is moved from bin k to l and u_t2 represent
the utilization of the task that is moved from bin l to k. Let U_k′ and U_l′ be the
transformed utilization values after the tasks are swapped. Hence, U_k′ = U_k −
u_t1 + u_t2, U_l′ = U_l + u_t1 − u_t2, and ∆C for this operation can be represented as
\[
\Delta C = U_k^2 + U_l^2 - U_k'^2 - U_l'^2
\]
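For our fault-tolerant task allocation problem, let the factors s_t1 and s_t2 capture the
utilization changes after the swap. The associated change in the cost function for
this operation is given by
\[
\begin{aligned}
\Delta C &= U_k^2 + U_l^2 - \left[ (U_k - u_{t1} + s_{t2} u_{t2})^2 + (U_l + s_{t1} u_{t1} - u_{t2})^2 \right] \\
         &= 2 U_k u_{t1} - 2 U_k s_{t2} u_{t2} + 2 u_{t1} s_{t2} u_{t2} - u_{t1}^2 - (s_{t2} u_{t2})^2 \\
         &\quad + 2 U_l u_{t2} - 2 U_l s_{t1} u_{t1} + 2 u_{t2} s_{t1} u_{t1} - u_{t2}^2 - (s_{t1} u_{t1})^2
\end{aligned}
\tag{6.7}
\]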
From Lemma 11, the cost function is maximized when one bin has U = 1 and the
other has U = ε. Hence, for U_k = 1 and U_l = ε,
\[
\begin{aligned}
\Delta C_{max2} &= 2 u_{t1} - 2 s_{t2} u_{t2} + 2 u_{t1} s_{t2} u_{t2} - u_{t1}^2 - (s_{t2} u_{t2})^2 \\
                &\quad + 2 u_{t2} s_{t1} u_{t1} - u_{t2}^2 - (s_{t1} u_{t1})^2
\end{aligned}
\tag{6.8}
\]
Given a task set, the value of ∆C_max = max(∆C_max1, ∆C_max2) can be easily
calculated by substituting actual values into Equations (6.5) and (6.8) for all
combinations of tasks.
6.4.3 Evaluation
CHAPTER 7. SOFTWARE ARCHITECTURE TO SUPPORT AND MAINTAIN
FAULT-TOLERANCE GUARANTEES 110
7.1 Fault-Tolerant Software Architecture for the
[Figure: task-level group for the Traction Control application distributed over Node1, Node2 and Node3 with a primary, a hot standby and a cold standby, shown in normal operation and after a primary failure. Message legend: HB = Heartbeat, GI = Group Information, ST = State.]
It is our central goal that a standby can take over execution when the
primary fails. This is achieved by the formation of task-level groups. A
group includes a single primary and all of its corresponding standbys. On
startup, the designated primary for each application initiates a group for-
mation protocol [45], [103] with its standbys resulting in the formation of
a task-level group for which the primary acts as the leader. For example,
[Figure: Traction Control group across Node1, Node2 and Node3 with the roles (Hot), (Primary) and (Cold) and the messages GI+HB_PRI, HB_HOT, HB_COLD and ST. Message legend: HB = Heartbeat, GI = Group Information, ST = State.]
The standbys in the group have a precedence order dictated by the primary
to decide which standby would take over in the case of a primary failure.
A standby failure while the primary is still running results in the standbys
next in line being promoted to a higher position in the precedence queue.
As one would expect, a hot standby sits higher in the precedence order
than a cold standby.
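The precedence rule can be sketched with a small ordering helper. The tuple layout and function names are hypothetical, and in the described design the order is dictated by the primary; the sketch only shows that hot standbys rank above cold ones and that a standby failure promotes those behind it.

```python
def precedence_order(standbys):
    """Order standbys so that hot standbys precede cold ones.

    standbys: list of (name, kind) tuples, kind being 'hot' or 'cold'.
    sorted() is stable, so standbys of the same kind keep their given order.
    """
    rank = {"hot": 0, "cold": 1}
    return sorted(standbys, key=lambda s: rank[s[1]])


def remove_failed(order, failed_name):
    """Drop a failed standby; everyone behind it moves up one position."""
    return [s for s in order if s[0] != failed_name]
```

For example, removing a failed hot standby from `[("h1", "hot"), ("c1", "cold")]` leaves the cold standby next in line to take over.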
When a hot standby has detected ‘k’ consecutive missed heartbeats,
it declares that the primary has failed and begins to produce outputs, as
shown in Figure 7.2. It also maintains the group information that it
regularly received from the previous primary. This allows the new primary to
take over as the leader of the group and start producing heartbeats and
state messages for the other standbys.
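A minimal sketch of the k-consecutive-missed-heartbeats rule is shown below. The class name and its per-period interface are assumptions made for illustration; the actual fault-tolerance library is not shown in this text.

```python
class HeartbeatMonitor:
    """Declares primary failure after k consecutive missed heartbeats."""

    def __init__(self, k):
        self.k = k
        self.missed = 0
        self.primary_failed = False

    def on_period(self, heartbeat_received):
        """Call once per monitoring period; returns True once failure is declared."""
        if heartbeat_received:
            self.missed = 0            # a received heartbeat resets the count
        else:
            self.missed += 1
            if self.missed >= self.k:
                self.primary_failed = True   # latched: the standby takes over
        return self.primary_failed
```

A larger k tolerates more transient message loss at the price of a longer detection time, which is the trade-off against the RTR constraint.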
As pointed out in Section 4.2, a typical use case of a cold standby
would involve it being scheduled to run on a node which runs other lower-
priority, non-placement-critical tasks. Since the cold standby does not
perform any calculations, it would be computationally very inexpensive,
thus allowing the other tasks to execute without missing any deadlines. In
case of the failure of a placement-critical primary task, some of the lower-
priority non-placement-critical tasks would be degraded or switched off to
make CPU resources available to allow the cold standby to take over for the
failed primary. The cold standby can also be configured to run as a pure
backup of a hot standby: once it detects a primary failure and the hot
standby takes over execution, the cold standby is immediately promoted
to a hot standby.
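[Figure: timing diagram of the heartbeat (HB) exchange between the primary and the hot standby over the periods nT_pri and (n+1)T_pri, marking RxFTmsg, TxFTmsg, the execution time C, the overhead δ, the primary failure, the missing heartbeat (NO HB) and the point at which the hot standby produces output.]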
1. Applications With State: These applications maintain state information derived from
previous inputs, which they use to perform calculations on current inputs. In our
case, the traction control task employs digital filters that require some state
information to process its current inputs. The use of such filters is very common in
automotive applications. A hot standby calculates and maintains all the state
information necessary, whereas the cold standby does not produce any state
information. Instead, the primary sends this state information to the cold
standby. In our case, state transfer occurs every primary period, but the frequency
of state transfer can be optimized based on the data freshness and recovery time
requirements of the cold standby. Quantifying this parameter is out of the scope
of this section, and we will consider this problem as part of our future work.
The cold standby also maintains a log of the input messages, which allows it to
recover state after a primary failure. The cold standby prunes this log every time it
successfully receives state from the primary.
2. Applications Without State: These applications do not need any state to be main-
tained for their operation. In this case, no state information needs to be exchanged
between the primary and the cold standby.
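The state-transfer and input-log behavior of a stateful cold standby can be sketched as follows. The class and method names are hypothetical, and recovering state by replaying the logged inputs onto the last received state is an assumption about how the log is used.

```python
class ColdStandbyLog:
    """Holds the last state received from the primary plus the inputs seen since."""

    def __init__(self):
        self.state = None
        self.inputs = []

    def on_input(self, message):
        self.inputs.append(message)   # log every input message

    def on_state(self, state):
        self.state = state
        self.inputs.clear()           # prune the log on successful state reception

    def recover(self, step):
        """Rebuild the current state by replaying logged inputs through `step`."""
        state = self.state
        for message in self.inputs:
            state = step(state, message)
        return state
```

Here `step` stands for the application's own state-update function, for example one iteration of a digital filter.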
In this section, we describe some of the messages that are passed between
the primary and its backups to maintain group membership. The primary
acts as the leader of the group and has information about all its members.
It shares this information with its group members so that, in the case of
a primary failure, the remaining group members and the new leader can
continue to maintain the group.
Since the task-to-node assignment is done beforehand, each task is
predefined to run as either a primary, a hot standby or a cold standby. On
startup, all nodes enter a group formation phase. In this phase, a primary
periodically broadcasts a Group_create message along with its heartbeat. It
then waits for a pre-configured interval T_timeout_pri (typically a multiple of
its period T_pri) to receive any Group_created responses, which would indicate
that a primary already exists. If there is no such response in this interval,
the primary moves out of the startup phase and enters normal operation, where it
starts producing application outputs, listening for heartbeats from its standbys
and producing heartbeats, state and group information (i.e., the normal life
cycle outlined in Section 4.2).
Each standby on startup periodically produces a Group_join message.
On receiving the Group_create or Group_created message from the primary, it
transitions out of the startup phase, starts its normal operation and begins
producing heartbeat messages. If a standby does not receive a Group_create
message for a period of T_timeout_sb units, it declares primary failure and
transitions to the normal mode of operation as the new primary. Figure 7.1
represents the final stable state of the system after the groups are created
and also shows the messages exchanged within the group during normal
operation.
If a primary fails and is later restarted, it will broadcast a Group_create
message along with its heartbeat. This time, the standby will have taken
over as the new primary and it will respond to the Group_create message
with a Group_created message, indicating to the re-launched primary that it
should run as a standby. Figure 7.3 represents the final stable state of the
system after such a dynamic reconfiguration.
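The startup decisions described above can be summarized in a few functions. This is a simplified sketch: message names are written as plain strings, the timeouts are abstracted into "the messages seen before the timeout expired", and all function names are hypothetical.

```python
def primary_startup(responses_before_timeout):
    """Role of a (re)started primary after waiting for its startup timeout."""
    # A Group_created response means another task already leads the group.
    return "standby" if "Group_created" in responses_before_timeout else "primary"


def standby_startup(messages_before_timeout):
    """Role of a standby after waiting for its startup timeout."""
    if {"Group_create", "Group_created"} & set(messages_before_timeout):
        return "standby"   # a primary exists: join its group
    return "primary"       # nothing heard: declare primary failure and take over


def reply(role, message):
    """A running primary answers Group_create with Group_created."""
    if role == "primary" and message == "Group_create":
        return "Group_created"
    return None
```

The restart scenario then follows directly: the relaunched primary sends Group_create, the acting primary replies with Group_created, and `primary_startup` yields the standby role.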
Architectures
[Figure: timeline of the hot standby over the periods nT_hot, (n+1)T_hot, (n+2)T_hot and (n+3)T_hot.]
Now that we have studied the overheads and the worst case behavior of
our implementation, in this section we present an experimental evaluation
to demonstrate the feasibility and performance of our solution.
We build our test applications and the fault-tolerant library using AU-
TOSAR Version 4.2.1-compliant software architecture from ARCCORE [107].
Every task is configured as an AUTOSAR runnable. Each task invokes
methods from the fault-tolerance library to produce heartbeats, form groups
and monitor the health of its group members. The assignment of the type
of standby is made statically offline. We use three STM3210C boards
connected to each other over a 100 Mbps Ethernet connection [54] through a
Fast Ethernet switch. We ensure that the primary and the standbys are assigned
to different ECUs. We create different phase offsets between the tasks by
powering on the boards at different offset times.
[Figure: measured recovery time versus execution phase offset (ms) for Task 1 (T = 300 ms) and Task 2 (T = 300 ms), followed by histograms over the bins 360-400 through 760-800 and 90-100 through 190-200.]
Experimental Results
Overhead measurement
Discussion
[Figure: AUTOSAR Adaptive Platform architecture with the blocks Persistency, Security Management, Hardware Acceleration, Operating System, Other Adaptive Platform Services and Adaptive Platform Services, exposed through APIs.]
We now describe the main contribution of this section: a system design
for fault-tolerance support in the AUTOSAR Adaptive Platform. Figure 7.11
presents a high-level view of our design. We design and implement
a fault-tolerance wrapper for both clients and servers to make them
agnostic of fault-tolerance considerations. Maintaining the fault-tolerance
support in the application layer allows for portability across different types
of platforms and applications. We consider support for the following SOA
elements / abstractions that in combination make up an offered service:
2. Fields: Fields represent data elements. The fields abstraction supports three
operations: set, which assigns values to data elements; get, which retrieves values
of data elements; and notify, which notifies clients of changes in the values of the
data elements.
3. Methods: Remote procedure calls executed by the server on behalf of the client.
These SOA abstractions can be broadly divided into two categories based
on the unidirectional or bidirectional nature of messages exchanged be-
tween servers and clients.
1. Servers can fail: Multiple tasks must provide the same service. These services can
be identical copies or one can be a degraded copy of the other. Clients must be able
to locate, subscribe and use all copies of the service, allowing for implicit failure
handling.
2. Clients can fail: Multiple tasks must act as clients. Servers must be able to handle
commands and requests from multiple identical clients.
The current Adaptive AUTOSAR standard (version 18.10) does not specify
requirements for fault-tolerance support. Having fault-tolerance
requirements defined in the standard is especially important for
platform-functional clusters like Execution Management and Communication
Management, since failures of these components can be catastrophic. Mapping
these replication requirements to the SOA abstractions described above, we
have the following considerations:
1. Client-side handling: Events, Fields-Get and Fields-Notify abstractions only modify
data on the client side, hence the fault-tolerance support only applies to the client
side.
2. Server-side handling: The Fields-Set abstraction only modifies data on the server
side, hence the fault-tolerance support only applies to the server side.
3. Client and Server-side handling: The Method abstraction can modify data on both
the server side and the client side, hence the fault-tolerance support applies to both
the server side and the client side.
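The three cases above amount to a lookup from SOA abstraction to the side(s) on which fault-tolerance handling is applied. The table below is an illustrative encoding only; the key spellings and the helper name are made up for this sketch.

```python
# Side(s) on which each SOA abstraction modifies data, per the list above.
FT_HANDLING = {
    "Event": {"client"},
    "Field-Get": {"client"},
    "Field-Notify": {"client"},
    "Field-Set": {"server"},
    "Method": {"client", "server"},
}


def needs_ft_handling(abstraction, side):
    """Return True if fault-tolerance support applies on `side` for `abstraction`."""
    return side in FT_HANDLING[abstraction]
```

A wrapper could consult such a table to decide whether a given call has to pass through duplicate filtering or replica fan-out.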
1. Data Time stamping: Every request and reply from a server or client is times-
tamped with the latest global time value.
2. Replication Information: Every request and reply from a server or client is ap-
pended with data essential for fault-tolerance support like replica-type, status and
task and node id.
[Figure: message-sequence diagrams over time showing Service Offer, Service Tracking, Field Get, Fusion, RPC, Filter and RPC Output.]
pled with replication information can be used to create a unique identifier for each
request or reply.
We now present the Reliable Client and Reliable Server abstractions and
describe them in detail.
Reliable Client
3. Request Interface: This interface allows a client to make a single request call which
is relayed to all servers maintained by the ReliableClient.
Reliable Server
1. Filter Interface: This interface allows a server to supply a custom filter strategy
to apply if duplicate clients request an identical service. The ReliableServer uses
the unique identifier and time stamp primitives to identify duplicate inputs if the
filter strategy requires it.
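One possible filter strategy is "first request wins": the first copy of a request is processed and later copies from replicated clients are dropped. The sketch below is an assumption made for illustration and is not the ReliableServer API; it keys duplicates on a hypothetical (task id, timestamp) pair taken from the replication information and time stamp attached to each request.

```python
class FirstWinsFilter:
    """Accepts the first copy of each request and rejects later duplicates."""

    def __init__(self):
        self._seen = set()

    def accept(self, task_id, timestamp):
        key = (task_id, timestamp)
        if key in self._seen:
            return False       # duplicate from a replicated client
        self._seen.add(key)
        return True
```

A production filter would also need to bound the size of the seen set, for example by discarding entries older than the largest expected skew between replicas.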
Timing Analysis for Failure Detection and Recovery for Service failures
1. Initial Wait Phase (IWP): In this phase, the client remains silent for a random time
between the control parameters CliDel_IWPmin and CliDel_IWPmax. The server also remains
silent for a random time between the control parameters SerDel_IWPmin and
SerDel_IWPmax. If a client receives an offer of a service in this phase, it directly jumps
to the Main Phase.
2. Repetition Phase (RP): In this phase, the client starts sending out find messages.
The time between every two find messages increases successively, up to a maximum
number of find messages given by CliFindMsgMax. The delay between two
find messages can be controlled by setting the control parameter CliRepDel. A client
transitions to the Main Phase on reception of an offer message. A service behaves
almost exactly like the client, with the exception of its reaction to a received find
message. In this case, the server waits a random amount of time between the control
parameters SerDel_RPmin and SerDel_RPmax and then sends out
a unicast offer message. A service will transition into the Main Phase once the number
of offer messages set by the control parameter SerFindMsgMax, sent at a rate given by
SerRepDel_RP, is reached.
3. Main Phase (MP): In this phase, the client remains silent and the service continues
to send out offer messages to indicate its availability at a period set by the control
parameter SerRepDel_MP.
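The client side of the Repetition Phase can be sketched as a schedule of find-message send times. The text only states that the gap between find messages increases successively; the doubling of the gap used below is an assumption of this sketch, and the function name is hypothetical.

```python
def find_message_times(initial_wait, cli_rep_del, cli_find_msg_max):
    """Send times of find messages, assuming the gap doubles after each one.

    initial_wait: time spent in the Initial Wait Phase.
    cli_rep_del: base delay between find messages (CliRepDel).
    cli_find_msg_max: maximum number of find messages (CliFindMsgMax).
    """
    times = []
    t = initial_wait
    for i in range(cli_find_msg_max):
        times.append(t)
        t += cli_rep_del * (2 ** i)   # gaps: 1x, 2x, 4x, ... the base delay
    return times
```

With an initial wait of 100 ms, a base delay of 50 ms and four find messages, the gaps between consecutive messages are 50, 100 and 200 ms.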
The above SOME/IP-SD phases can be categorized into two distinct
stages for the purpose of analyzing the fault recovery behavior as follows.
If the service fails before a connection has been established, the service
must be restarted. In the Adaptive Platform, this can be done using the
Platform Health Management and Execution Management functional clusters.
The Platform Health Management cluster sets a HW watchdog at
process launch. If the process launch fails and the HW watchdog expires,
Execution Management is triggered to restart the process. Let T_wd be
the expiration duration of the watchdog timer and let T_EM represent the
time taken by the Execution Management cluster to restart a process. Let
T_connect represent the start-up delay associated with establishing a
connection between a server and a client; a formal bound on this startup
latency is derived in [111]. This latency bound can be controlled
using the control parameters described above. Hence, the maximum time
required for recovery is,
It is possible that the task fails to start on multiple retries. In this case,
the Platform Health Management (PHM) Functional Cluster should be able
to trigger appropriate actions with the help of Execution Management if
Trecovery > Trecmax . An important addition to the PHM definition would be
to allow for different actions based on the number of times the watchdog
timer expires.
[Figure, panels (a) and (b): sequence diagrams between the client, a non-concurrent replica and the server for notification-based services. The replica detects the server failure either through a Stop Offer message or after the timeout T_TTL, and then sends notifications.]
[Figure, panels (a) and (b): sequence diagrams for request-reply-based services. The failure is detected either through a Stop Offer message or after the timeout T_tout, after which the replica replies to requests.]
[Figure: experimental setup with a Renesas R-Car-H3 node, a QEMU node and the client.]
In this section, we describe our experimental setup and present our
evaluation results. As Figure 7.16 depicts, we have a two-node system, where
each node is running the AUTOSAR Adaptive Platform implementation
from Elektrobit, compliant in most parts with the 18.03 Adaptive
AUTOSAR [101] Standard. One of the nodes is a Renesas R-Car-H3
development kit [112] and the other is a virtual instance running on a QEMU
emulator on an x86 laptop. The two nodes communicate over a 100 Mbps
Ethernet switch. We create a server-client example for testing. Both the
server and client use the ReliableServer and ReliableClient APIs. The
service is replicated by a non-co-located replica. The control parameters
selected for the experiments are presented in Table ??.
In our first experiment, the offered service produces periodic events and
supports mechanisms to handle exceptions. We inject artificial faults that
present as segmentation faults. Figure 7.17 presents the results. Since the
faults are handled, the primary is able to trigger the replica to take over
before it fails.
In our next experiment, the offered service produces periodic events,
but does not support mechanisms to handle exceptions. We inject artificial
faults by disconnecting the RCar-H3 board from its power source. Figure
7.18 presents the results. Since the faults are not handled by the primary, it
is unable to trigger the replica. The replica has to incur an additional delay
of Tttl before it can detect the failure of the service. Hence, the recovery
takes longer compared to handled failures.
7.2.4 Request-Reply-Based Services
In our next experiment, the offered service supports remote procedure calls
and mechanisms to handle exceptions. The client task periodically sends
out RPC requests. We inject artificial faults into the service
that present as segmentation faults. Since the faults are handled, the pri-
mary is able to trigger the replica to take over before it fails. We observe
that, for most practical cases, the recovery is implicit since the replica takes
over from the primary before a new request from the client is made.
In our next experiment, the offered service supports remote procedure
calls, but does not support mechanisms to handle exceptions. We inject
artificial faults by disconnecting the RCar-H3 board from its power source.
Since the faults are not handled by the primary, it is unable to trigger the
replica. The replica waits till the timer Ttout expires before it can declare
the failure of the service and reply to the RPC request.
Infrastructure
Figures 7.22 and 7.23 show the operation of the effective clones. Under
normal conditions (Figure 7.22), the effective clones run alongside the primary.
In this example, let us consider the BehaviorTask, which produces behavioral
decisions that are used by the planning subsystem to plan trajectories
for driving. It is interesting to note here that the frequency of the
behavioral outputs is effectively doubled because the effective clone of the
BehaviorTask also produces outputs. However, this does not affect the
downstream module, which in this case is the planning subsystem, because it
simply chooses the latest behavioral decision as its input. Figure 7.23
shows that the applications continue to run successfully without the loss
of any functionality even when the primaries fail.
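The "latest decision wins" behavior of the downstream consumer can be sketched in one function. The tuple layout (timestamp, source, decision) is a hypothetical message format used only for this illustration.

```python
def latest_decision(decisions):
    """Return the most recent behavioral decision among all received ones.

    decisions: iterable of (timestamp, source, decision) tuples, where source
    may be the primary or its effective clone.
    """
    return max(decisions, key=lambda d: d[0])
```

Because the consumer keys only on recency, receiving twice as many decisions does not change its behavior, and it keeps working unchanged if one of the two producers stops.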
8.1 Introduction
CHAPTER 8. EVALUATION AND TESTING OF SELF-DRIVING SAFETY-CRITICAL
AUTOMOTIVE SYSTEMS 149
of these software components with each other and the hardware compo-
nents. The next step in the development process involves the deployment
of these components onto a real system. This deployment must ensure the
feasibility of correct operation of all components under myriad conditions.
CAVs are inherently safety-critical cyber-physical systems that interact
directly with the environment that they operate in. Testing requirements to
ensure correct system behavior add another significant layer of complexity
to the CAV development process. For any CAV, it is highly desirable
that the functional behavior of all hardware and software components is
thoroughly verified before on-road testing. In order to ensure safety, CAVs
also need to meet para-functional requirements like reliability, safety and
timeliness. The final level of testing involves on-road testing, often under
controlled environments first. During this testing phase, run-time
monitoring and data collection utilities are very useful, since they allow
system verification and diagnosis in case of a deviation from system objectives.
Given the large number of hardware and software components and the
complexity of testing requirements, a well-defined methodology supported
by a suite of tools is needed for validating correct and safe behavior. In this
chapter, we describe the tools and methodologies that we use to model,
design, develop and test applications for a CAV.
The organization of the rest of this chapter is as follows. In Section 8.2,
we discuss various elements in our design and development of CAVs. We
present a standard reference architecture for a CAV. We describe the var-
ious components within the reference architecture and their interactions.
We also present our development cycle and the process and tool flow that
we follow for developing CAV features. From Section ?? to Section ??, we
describe in detail the tools that we use as part of our development cycle. In
Section 8.4, we describe our run-time mechanisms for system monitoring
and Development
Figure 8.1 presents a reference architecture for a CAV. As shown in the fig-
ure, a CAV consists of various sensors (such as cameras, radars and lidars),
processors, actuators (such as controllers for braking, throttling and steer-
ing) and software components (e.g., behaviors and planning) that interact
with each other. Next, we present a brief overview of each component in
the reference architecture.
1. Sensors: A CAV uses a number of sensors to sense its environment. For example,
a CAV may use lidars and radars to detect and track objects. It may also use
sensors like GPS, accelerometers, gyroscopes and wheel-speed sensors for localization.
Figure 8.2 highlights some of the sensors installed in our CAV.
2. V2X1 Communication Interfaces: A CAV can communicate and interact with V2X-
enabled vehicles, pedestrians and infrastructure. These V2X interfaces may make
use of DSRC which is a two-way short-to-medium-range wireless communications
technology [114]. DSRC messages can be used to alert CAVs about imminent
hazards like vehicles stopped ahead; potential collision at intersections or during
merging; sharp curves or slippery patches of roadway ahead [114]. DSRC mes-
sages can also be used to communicate intersection and traffic light information to
CAVs. CAVs can also co-ordinate and execute complex intersection protocols [115]
by exchanging DSRC messages.
3. Perception: A CAV continually needs to make decisions based on its surrounding
environment. The perception component is responsible for accepting and fusing
data from various sensors on the CAV. It is also the responsibility of the
perception system to interpret the data and to identify and track various objects
with a very high degree of reliability.
4. Road-World Model: The road-world model accepts the processed sensor informa-
tion from the perception system, and using predefined map information, creates a
composite model of the world around the CAV for use by other sub-systems [1].
The composite model can be divided into several discrete interfaces: static obstacle
maps, dynamic obstacle maps, visibility, health, current vehicle pose, and road
structure [102].
1 V2X - Vehicle to X, where X can represent infrastructure (V2I), other vehicles (V2V), pedestrians
(V2P), bicyclists, cloud or other possibilities.
5. Route Planning: The route planning component is responsible for utilizing map
information to generate an optimized route for travel. This route can be optimized
for various metrics like distance from current CAV location to a desired location,
time and fuel economy.
6. Behaviors: The behaviors component is responsible for the high-level decisions the CAV
needs to make to safely navigate its environment and reach its intended destination.
Examples include slowing down at a red light or a stop sign, safely yielding
to a pedestrian, and changing lanes to maintain the global route plan.
7. Short-term path planning: Unlike route planning, which attempts to find an end-to-end
route for the CAV, the short-term path planner is responsible for determining the
immediate near-term path on the road that the CAV must take. For example, the
short-term path planner is responsible for determining a safe path to follow while
avoiding static and dynamic obstacles on the road, maintaining a safe distance
from a bicyclist sharing the road or moving into or out of a parking spot.
8. Health Monitor: As a safety-critical system, a CAV must at all times ensure the
safety of its passengers and everything in its environment. The health monitor
component tracks the health status of all hardware and software components on
the CAV. It is responsible for reporting failures and any necessary actions that must
be taken by the user.
9. Vehicle By-Wire Controls: The drive-by-wire controls allow the software components
to control the motion and safe operation of the CAV. Primary controls include
braking, throttling, steering and gearing. Secondary controls may include turn
signals, hazard indicators, wipers and door locks.
10. Data Logger: The data logger logs state information from all software components.
The health monitor logs allow for diagnosis in the case of faults or excessive devi-
ation from expected behavior.
11. Embedded Computing Platform: The embedded computing platform hosts and ex-
ecutes all the software components. The embedded computing system consists
of multiple inter-connected computers. Typically, a CAV supports communication
technologies like CAN [82], Ethernet [54] and FlexRay [?], which enable the processors
to communicate with each other and to interface with the CAV sensors and
actuators.
12. User Interface: The user interface allows users to interact with the CAV and accepts
inputs like destination location and preferred route. It also allows the user
to choose from different configuration settings for various components like the
route planner and behaviors. The user interface is responsible for communicating
information to the user such as the road-world model as seen by the CAV, the
short-term path, long-term route and system health statuses. The user interface is
also responsible for providing timely alerts to the user.
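As a rough illustration of the route planning component (item 5), the sketch below runs Dijkstra's algorithm over a small, made-up road graph and selects a route per metric. The graph, the edge attributes (`distance_km`, `time_min`, `fuel_l`) and the function name `plan_route` are hypothetical and are not taken from the CAV software; the sketch only shows how one planner can serve several optimization metrics.

```python
import heapq

# Hypothetical road graph: node -> list of (neighbor, edge attributes).
ROAD_GRAPH = {
    "A": [("B", {"distance_km": 2.0, "time_min": 6.0, "fuel_l": 0.20}),
          ("C", {"distance_km": 5.0, "time_min": 4.0, "fuel_l": 0.35})],
    "B": [("D", {"distance_km": 2.0, "time_min": 7.0, "fuel_l": 0.22})],
    "C": [("D", {"distance_km": 4.0, "time_min": 3.0, "fuel_l": 0.30})],
    "D": [],
}

def plan_route(graph, start, goal, metric):
    """Return (cost, path) minimizing the chosen edge metric using Dijkstra."""
    frontier = [(0.0, start, [start])]
    settled = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, attrs in graph[node]:
            if neighbor not in settled:
                heapq.heappush(
                    frontier, (cost + attrs[metric], neighbor, path + [neighbor]))
    return float("inf"), []  # goal unreachable

shortest = plan_route(ROAD_GRAPH, "A", "D", "distance_km")
fastest = plan_route(ROAD_GRAPH, "A", "D", "time_min")
print(shortest, fastest)
```

In this toy graph the shortest route and the fastest route differ, which is why the metric is left as a parameter rather than fixed in the planner.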
1. Reliability: A CAV must ensure reliable operation for prolonged periods of time
under varied load and failure conditions. For example, a CAV may have a reliabil-
ity requirement of maintaining normal operation in the presence of independent
software crash faults with a minimum inter-arrival period of two minutes.
[Figure: CAV development cycle. Analysis & Design: System Requirements and Architecture; System Modeling and Design. Development: Implementation. Testing: Simulation Testing and Analysis; Emulation Testing and Analysis; On-Road Testing and Analysis.]
2. Safety: A CAV must ensure at all times the safety of its passengers and its sur-
rounding environment. Safety must also be ensured in the case of faults. This
is typically achieved by a combination of hardware and software fault-tolerance
techniques. For example, a CAV may have a safety requirement of maintaining a
minimum inter-vehicular distance of 5 meters in urban driving conditions.
3. Timeliness: The end-to-end latency from sensing the environment to actuating must
be small enough to be responsive to the speed and distance constraints of the
operating context. The detection and recovery from faults should also be timely.
For example, a CAV may have a timeliness requirement of processing sensor inputs
and producing control outputs once every 33ms.
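The three example requirements above can be phrased as checks over logged data. The sketch below is illustrative only: the thresholds come from the examples in the text (two-minute fault inter-arrival, 5 m inter-vehicular distance, 33 ms output period), while the function names and the log formats are assumptions made for this example.

```python
MIN_FAULT_INTERARRIVAL_S = 120.0  # reliability example: two minutes
MIN_GAP_M = 5.0                   # safety example: urban inter-vehicular distance
CONTROL_PERIOD_MS = 33.0          # timeliness example: one output every 33 ms

def faults_within_assumption(fault_times_s):
    """True if consecutive crash faults are at least two minutes apart."""
    ordered = sorted(fault_times_s)
    return all(later - earlier >= MIN_FAULT_INTERARRIVAL_S
               for earlier, later in zip(ordered, ordered[1:]))

def safety_violations(gaps_m):
    """Indices of samples where the gap to the lead vehicle is below 5 m."""
    return [i for i, gap in enumerate(gaps_m) if gap < MIN_GAP_M]

def missed_deadlines(output_times_ms):
    """Indices of control outputs produced more than 33 ms after the previous one."""
    return [i for i in range(1, len(output_times_ms))
            if output_times_ms[i] - output_times_ms[i - 1] > CONTROL_PERIOD_MS]

# Hypothetical log excerpts.
print(faults_within_assumption([0.0, 130.0, 300.0]))
print(safety_violations([12.0, 6.5, 4.8, 5.0]))
print(missed_deadlines([0, 33, 66, 104, 137]))
```

The first check validates the fault model that the reliability requirement assumes, whereas the other two flag direct violations of the safety and timeliness examples.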
The development cycle for a CAV consists of three phases: a design phase,
a development phase and a testing phase. These stages are depicted in
Figure 8.3 and described next.
1. System Requirements and Architecture: This phase involves the gathering of system-
level requirements and defining a system architecture that meets the objectives of
the reference architecture in Section 8.2.1.
2. System Modeling and Design: In this phase, a system designer models various software
and hardware components and chooses specific components to implement
the CAV architecture. In addition, the software task interfaces must be designed
and mapped to the hardware configuration.
4. Simulation testing: Since CAVs are safety-critical, in this phase, the software com-
ponents are first tested in a simulated environment for correct behavior. A wide
range of scenarios that mimic real-world situations are tested to ensure the func-
tionality of the software implementation. Simulation testing must include injecting
artificial faults into the system to test system behavior in the presence of faults.
6. On-Road testing: The on-road testing phase involves testing the system initially
in controlled real-world environments to verify the integrity of the entire system.
Information from such tests is used to validate design choices and fine-tune con-
troller parameters, task allocation and frequencies. After confidence in correct sys-
tem behavior is established, on-road testing is commenced on public roads with
uncontrolled traffic. Complex scenarios are introduced gradually and considerable
caution is exercised throughout.
[Figure: tool flow; legible labels include the Run-Time Diagnosis Framework and SysAnalyzer.]
Various combinations of these tools are used to develop and test each
subsystem from the reference architecture. For example, the path planning
subsystem uses SysWeaver and SysAnalyzer to generate and deploy code,
and TROCS for testing functional behavior.
In this section, we presented the various pieces in our CAV reference
architecture capturing our development and tool flow. Detailed descriptions
of SysWeaver, TROCS and AutoSim can be found in [24]. In the sections
to follow, we describe SysAnalyzer, the Run-Time Diagnostics Framework
and EMERALD in more detail.
[Figure: SysDeployer allocates tasks and their replicas to Node 1, Node 2 and Node 3. Tasks as labeled in the figure: Brake Control (BC), (10,100), Ui = 0.1, 2 replicas; Safety Audio (SA), (10,100), Ui = 0.1, 2 replicas; Steering Control (SC), (20,100), Ui = 0.2, 1 replica; Throttle Control (TC), (16,100), Ui = 0.16, 1 replica; HVAC, (40,100), Ui = 0.4, 0 replicas; Audio Playback (AP), (40,100), Ui = 0.5, 0 replicas; Video Playback (VP), (55,100), Ui = 0.55, 0 replicas.]
8.3 SysAnalyzer
2. Data Logging: The run-time diagnostics framework is also responsible for logging
important information during on-road tests. Not only does the framework log
exceptional conditions, it is also used to create logs that can be used to re-create
tests using the playback feature in TROCS. The playback tests in combination with
detailed logs can be used to fix problems in the functional behavior of the CAV.
3. Fault Injection: EMERALD supports a large suite of fault injection features, such as
varying sensor noise models for different weather patterns, failures in system
and application processes, and faults in sensor outputs.
5. Scenario Generation: Simulation testing is a very important step in ensuring the safe
operation of various autonomous driving applications. Many scenarios are too
dangerous to test in the real world, for example CAV behavior in the presence
of unexpected jaywalking pedestrians. EMERALD allows a system designer
to generate and test autonomous driving applications under various conditions in
simulation by simply editing a set of configuration files that expose all the editable
parameters.
6. Automated Testing: EMERALD can also generate random scenarios to continuously
stress test the system and identify edge cases and gaps in the application logic.
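The scenario generation, fault injection and automated testing features listed above can be pictured with a small sketch. EMERALD's actual configuration format and interfaces are not shown in this chapter, so every name below (`SCENARIO_CONFIG`, `NOISE_STD_M`, `generate_scenario`, `noisy_reading`) and every parameter value is hypothetical. The sketch draws random scenarios from editable parameter ranges and injects weather-dependent noise into a sensor reading.

```python
import random

# Hypothetical editable parameters; not EMERALD's real configuration format.
SCENARIO_CONFIG = {
    "weather": ["clear", "rain", "fog"],
    "num_pedestrians": (0, 5),             # inclusive integer range
    "jaywalking_probability": (0.0, 0.5),  # range for a per-pedestrian probability
    "crashed_process": [None, "perception", "planner"],  # injected process failure
}

# Assumed sensor noise (standard deviation in meters) per weather pattern.
NOISE_STD_M = {"clear": 0.05, "rain": 0.20, "fog": 0.40}

def generate_scenario(config, rng):
    """Draw one random scenario from the configured parameter ranges."""
    low_p, high_p = config["num_pedestrians"]
    low_j, high_j = config["jaywalking_probability"]
    return {
        "weather": rng.choice(config["weather"]),
        "num_pedestrians": rng.randint(low_p, high_p),
        "jaywalking_probability": rng.uniform(low_j, high_j),
        "crashed_process": rng.choice(config["crashed_process"]),
    }

def noisy_reading(true_range_m, weather, rng):
    """Inject weather-dependent Gaussian noise into a range measurement."""
    return true_range_m + rng.gauss(0.0, NOISE_STD_M[weather])

rng = random.Random(42)  # fixed seed so that a failing scenario can be replayed
scenarios = [generate_scenario(SCENARIO_CONFIG, rng) for _ in range(100)]
print(scenarios[0], noisy_reading(20.0, scenarios[0]["weather"], rng))
```

Seeding the generator is a design choice made for this sketch: it keeps randomly generated stress tests reproducible, so that an edge case found once can be re-run while debugging.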
Conclusions
CHAPTER 9. CONCLUSIONS 165
Autonomous driving systems are not only safety-critical, but they are also
resource-constrained. Hence, once we have selected the task execution and
replication parameters, the next step is to assign these tasks to comput-
ing nodes such that we minimize the system resource utilization. To this
Guarantees
Automotive Systems
ing complexity in CAV systems, having the right tools and methodologies
in place is critical to developing functionally correct, safe and reliable CAV
applications.
Devices
Glossary
Appendix B
In this section, we provide a brief overview of the BFD-P and R-BFD heuris-
tics [20].
The BFD-P algorithm follows these steps:
2. Fit every task into the best-fit processor while obeying the placement constraint, i.e.,
no task may be co-located with its replica.
2. The primary tasks are extracted and allocated first using the BFD-P heuristic.
APPENDIX B. EXISTING TASK PARTITIONING HEURISTICS 172
3. The replicas are then allocated one by one, highest order replicas first, i.e., opposite
to the TPCD approach.
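The steps listed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation from [20]: we assume that tasks are first ordered by decreasing utilization, that "best fit" means the fullest node that can still accept the task, and that a node accepts tasks up to a utilization bound of 1.0 (`CAPACITY`), which stands in for whatever schedulability test is actually used. The function names `best_fit` and `r_bfd` and the task set are hypothetical.

```python
CAPACITY = 1.0  # assumed per-node utilization bound, standing in for a real schedulability test

def best_fit(nodes, name, util, group):
    """Place a task on the fullest node that fits and holds no copy of `group`."""
    candidates = [n for n in nodes
                  if n["load"] + util <= CAPACITY + 1e-9 and group not in n["groups"]]
    if candidates:
        node = max(candidates, key=lambda n: n["load"])
    else:
        node = {"load": 0.0, "groups": set(), "tasks": []}  # open a new node
        nodes.append(node)
    node["load"] += util
    node["groups"].add(group)  # enforces the placement constraint for later copies
    node["tasks"].append(name)

def r_bfd(tasks):
    """tasks maps name -> (utilization, replica count); returns the node list."""
    nodes = []
    # Assumed first step: order tasks by decreasing utilization.
    ordered = sorted(tasks.items(), key=lambda item: item[1][0], reverse=True)
    # Primaries are allocated first (BFD-P).
    for name, (util, _) in ordered:
        best_fit(nodes, name, util, name)
    # Replicas follow one by one, highest-order replicas first.
    highest = max(count for _, count in tasks.values())
    for order in range(highest, 0, -1):
        for name, (util, count) in ordered:
            if count >= order:
                best_fit(nodes, name + "'" * order, util, name)
    return nodes

# Hypothetical task set: name -> (utilization, number of replicas).
nodes = r_bfd({"T1": (0.5, 1), "T2": (0.3, 2), "T3": (0.2, 0)})
for index, node in enumerate(nodes, start=1):
    print(f"Node {index}: {node['tasks']} (load {node['load']:.2f})")
```

For this task set the sketch uses three nodes, which is also the minimum possible, because T2 and its two replicas must each sit on a distinct node.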
[2] Ragunathan (Raj) Rajkumar, Insup Lee, Lui Sha, and John Stankovic. Cyber-
physical systems: The next computing revolution. In Proceedings of the 47th Design
Automation Conference, DAC ’10, pages 731–736, New York, NY, USA, 2010. ACM.
3, 17
[4] Minsoo Ryu and Seongsoo Hong. A period assignment algorithm for real-time
system design. In W. Rance Cleaveland, editor, Tools and Algorithms for the Con-
struction and Analysis of Systems, pages 34–43, Berlin, Heidelberg, 1999. Springer
Berlin Heidelberg. 3, 17
[6] Joseph Y. T. Leung. A new algorithm for scheduling periodic, real-time tasks.
Algorithmica, 4(1):209, Jun 1989. 3, 17
BIBLIOGRAPHY 174
[7] C. C. Han and H. Y. Tyan. A better polynomial-time schedulability test for real-
time fixed-priority scheduling algorithms. In Proceedings Real-Time Systems Sympo-
sium, pages 36–45, Dec 1997. 3, 8, 17, 30, 43
[10] S. DSouza, A. Bhat, and R. Rajkumar. Sleep scheduling for energy-savings in multi-
core processors. In 2016 28th Euromicro Conference on Real-Time Systems (ECRTS),
pages 226–236, July 2016. 3, 18
[11] Anand Bhat, Soheil Samii, and Ragunathan (Raj) Rajkumar. Recovery Time Con-
siderations in Real-Time Systems Employing Software Fault Tolerance. In Sebas-
tian Altmeyer, editor, 30th Euromicro Conference on Real-Time Systems (ECRTS 2018),
volume 106 of Leibniz International Proceedings in Informatics (LIPIcs), pages 23:1–
23:22, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Infor-
matik. 4
[12] Seong Woo Kwak and Jung-Min Yang. Optimal checkpoint placement on real-time
tasks with harmonic periods. Journal of Computer Science and Technology, 27(1):105–
112, Jan 2012. 4
[16] T. King. An overview of arinc 653 part 4. In 2012 IEEE/AIAA 31st Digital Avionics
Systems Conference (DASC), pages 1–7, Oct 2012. 4, 18
[18] D. Oh and T. Baker. Utilization bounds for n-processor rate monotonic scheduling
with static processor assignment. Real-Time Systems, 15(2):183–192, 1998. 5,
159
[19] D. Johnson. Near optimal allocation algorithms. Ph.D. Dissertation, MIT, MA. 5,
57, 159
[20] J. Kim et al. R-BATCH: task partitioning for fault-tolerant multiprocessor real-
time systems. In CIT 2010, Bradford, West Yorkshire, UK, June 29-July 1, 2010, pages
1872–1879, 2010. 5, 12, 14, 57, 67, 71, 159, 171
[21] A. Bhat, S. Samii, and R. Rajkumar. Practical task allocation for software fault-
tolerance and its implementation in embedded automotive systems. In 2017 IEEE
Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 87–98,
April 2017. 5
[23] Lingfeng Wang and K. C. Tan. Software testing for safety critical applications.
IEEE Instrumentation Measurement Magazine, 8(2):38–47, June 2005. 7
[24] A. Bhat, S. Aoki, and R. Rajkumar. Tools and methodologies for autonomous
driving systems. Proceedings of the IEEE, 106(9):1700–1716, Sep. 2018. 7, 157
[25] D. Henriksson and A. Cervin. Optimal on-line sampling period assignment for
real-time control tasks based on plant state information. In Proceedings of the 44th
IEEE Conference on Decision and Control, pages 4469–4474, Dec 2005. 8
[26] A. Cervin, M. Velasco, P. Marti, and A. Camacho. Optimal online sampling pe-
riod assignment: Theory and experiments. IEEE Transactions on Control Systems
Technology, 19(4):902–910, July 2011. 8
[27] Nasro Min-Allah, Samee Ullah Khan, and Wang Yongji. Optimal task execution
times for periodic tasks using nonlinear constrained optimization. The Journal of
Supercomputing, 59(3):1120–1138, Mar 2012. 9
[29] M. Nasri and G. Fohler. An efficient method for assigning harmonic periods to
hard real-time tasks with period ranges. In 2015 27th Euromicro Conference on Real-
Time Systems, pages 149–159, July 2015. 9
[30] Morteza Mohaqeqi, Mitra Nasri, Yang Xu, Anton Cervin, and Karl-Erik Årzén.
Optimal harmonic period assignment: complexity results and approximation al-
gorithms. Real-Time Systems, Apr 2018. 9
[31] Jean-Claude Laprie and Brian Randell. Fundamental concepts of computer systems
dependability. In Proceedings of the 3rd IEEE Information Survivability, Boston,
Massachusetts, USA, October 2000, pages 24–26, 2001. 10, 50
[34] J. J. Chen, C. Y. Yang, T. W. Kuo, and S. Y. Tseng. Real-time task replication for fault
tolerance in identical multiprocessor systems. In 13th IEEE Real Time and Embedded
Technology and Applications Symposium (RTAS’07), pages 249–258, April 2007. 11
[35] S. Gopalakrishnan and M. Caccamo. Task partitioning with replication upon het-
erogeneous multiprocessor systems. In 12th IEEE Real-Time and Embedded Technol-
ogy and Applications Symposium (RTAS’06), pages 199–207, April 2006. 11
[38] J. Kim et al. Safer: System-level architecture for failure evasion in real-time ap-
plications. In Real-Time Systems Symposium (RTSS), 2012 IEEE 33rd, 2012. 12, 117,
118
[39] P. Guo and Z. Xue. Improved task partition based fault-tolerant rate-monotonic
scheduling algorithm. In 2016 International Conference on Security of Smart Cities,
Industrial Control System and Communications (SSIC), pages 1–5, July 2016. 12
[41] Kay Klobedanz et al. Embedded Systems: Design, Analysis and Verification: 4th IFIP
TC 10, IESS 2013, Paderborn, Germany, June 17-19, 2013. Proceedings, pages 238–249.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2013. 12, 14
[43] Ping Zhu, Fumin Yang, and Gang Tu. Fault-tolerant rate-monotonic compact-
factor-driven scheduling in hard-real-time systems. Wuhan University Journal of
Natural Sciences, 15(3):217–221, 2010. 12
[48] P. Narasimhan et al. Mead: support for real-time fault-tolerant corba. Concurrency
and Computation: Practice and Experience, 17(12):1527–1545, 2005. 13
[49] Caroline Lu, Jean-Charles Fabre, and Marc-Olivier Killijian. An approach for im-
proving Fault-Tolerance in Automotive Modular Embedded Software. 14
[50] A. Bhat, S. Samii, and R. Rajkumar. Practical task allocation for software fault-
tolerance and its implementation in embedded automotive systems. In 2017 IEEE
Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 87–98,
April 2017. 14
[51] Jean-Charles Fabre, Marc-Olivier Killijian, and François Taïani. Robustness of au-
tomotive applications using reflective computing: lessons learnt. In SAC, 2011.
14
[52] Traian Pop, Paul Pop, Petru Eles, Zebo Peng, and Alexandru Andrei. Timing
analysis of the flexray communication protocol. Real-Time Systems, 39(1):205–235,
Aug 2008. 14, 52, 59
[54] D. Thiele, P. Axer, and R. Ernst. Improving formal timing analysis of switched
ethernet by exploiting fifo scheduling. In 2015 52nd ACM/EDAC/IEEE Design Au-
tomation Conference (DAC), pages 1–6, June 2015. 14, 52, 121, 153
[55] Charles Thorpe, Martial H Hebert, Takeo Kanade, and Steven A Shafer. Vision and
navigation for the carnegie-mellon navlab. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 10(3):362–373, 1988. 15
[56] Chris Urmson, Joshua Anhalt, Drew Bagnell, Christopher Baker, Robert Bittner,
MN Clark, John Dolan, Dave Duggins, Tugrul Galatali, Chris Geyer, et al. Au-
tonomous driving in urban environments: Boss and the urban challenge. Journal
of Field Robotics, 25(8):425–466, 2008. 15
[57] Michael Montemerlo, Jan Becker, Suhrid Bhat, Hendrik Dahlkamp, Dmitri Dolgov,
Scott Ettinger, Dirk Haehnel, Tim Hilden, Gabe Hoffmann, Burkhard Huhnke,
et al. Junior: The stanford entry in the urban challenge. Journal of field Robotics,
25(9):569–597, 2008. 15
[58] Andrew Bacha, Cheryl Bauman, Ruel Faruque, Michael Fleming, Chris Terwelp,
Charles Reinholtz, Dennis Hong, Al Wicks, Thomas Alberi, David Anderson, et al.
Odin: Team victortango’s entry in the darpa urban challenge. Journal of Field
Robotics, 25(8):467–492, 2008. 15
[59] John Leonard, Jonathan How, Seth Teller, Mitch Berger, Stefan Campbell, Gaston
Fiore, Luke Fletcher, Emilio Frazzoli, Albert Huang, Sertac Karaman, et al. A
perception-driven autonomous urban vehicle. Journal of Field Robotics, 25(10):727–
774, 2008. 15
[60] Jonathan Bohren, Tully Foote, Jim Keller, Alex Kushleyev, Daniel Lee, Alex Stew-
art, Paul Vernaza, Jason Derenick, John Spletzer, and Brian Satterfield. Little ben:
The ben franklin racing team’s entry in the 2007 darpa urban challenge. Journal of
Field Robotics, 25(9):598–614, 2008. 15
[61] Isaac Miller, Mark Campbell, Dan Huttenlocher, Frank-Robert Kline, Aaron
Nathan, Sergei Lupashin, Jason Catlin, Brian Schimpf, Pete Moran, Noah Zych,
et al. Team cornell’s skynet: Robust perception and planning in an urban environ-
ment. Journal of Field Robotics, 25(8):493–527, 2008. 15
[62] Fred W Rauskolb, Kai Berger, Christian Lipski, Marcus Magnor, Karsten Cor-
nelsen, Jan Effertz, Thomas Form, Fabian Graefe, Sebastian Ohl, Walter Schu-
macher, et al. Caroline: An autonomously driving vehicle for urban environments.
Journal of Field Robotics, 25(9):674–724, 2008. 15
[63] Benjamin J Patz, Yiannis Papelis, Remo Pillat, Gary Stein, and Don Harper. A
practical approach to robotic design for the darpa urban challenge. Journal of Field
Robotics, 25(8):528–566, 2008. 15
[65] James R McBride, Jerome C Ivan, Doug S Rhode, Jeffrey D Rupp, Matthew Y
Rupp, Jeffrey D Higgins, Doug D Turner, and Ryan M Eustice. A perspective on
emerging automotive safety applications, derived from lessons learned through
participation in the darpa grand challenges. Journal of Field Robotics, 25(10):808–
840, 2008. 15
[66] Felix Von Hundelshausen, Michael Himmelsbach, Falk Hecker, Andre Mueller,
and Hans-Joachim Wuensche. Driving with tentacles: Integral structures for sens-
ing and motion. Journal of Field Robotics, 25(9):640–673, 2008. 15
[67] S Shah, D Dey, C Lovett, and A Kapoor. Aerial informatics and robotics platform.
Technical report, Technical report MSR-TR-9, Microsoft Research, 2017. 16
[69] Daniel Krajzewicz, Georg Hertkorn, Christian Rössel, and Peter Wagner. Sumo
(simulation of urban mobility)-an open-source traffic simulation. In Proceedings
of the 4th middle East Symposium on Simulation and Modelling (MESM20002), pages
183–187, 2002. 16
[70] Martin Fellendorf and Peter Vortisch. Validation of the microscopic traffic flow
model vissim in different real-world situations. In transportation research board 80th
annual meeting, 2001. 16
simulator for vanets. ACM SIGMOBILE mobile computing and communications re-
view, 12(1):31–33, 2008. 16
[72] Christoph Sommer, Reinhard German, and Falko Dressler. Bidirectionally coupled
network and road traffic simulation for improved ivc analysis. IEEE Transactions
on Mobile Computing, 10(1):3–15, 2011. 16
[73] M. Joseph and P. Pandya. Finding response times in a real-time system. The
Computer Journal, 29(5):390–395, 1986. 20
[80] Kay Klobedanz, Jan Jatzkowski, Achim Rettberg, and Wolfgang Mueller. Fault-
tolerant deployment of real-time software in autosar ecu networks. In Gunar
Schirner, Marcelo Götz, Achim Rettberg, Mauro C. Zanella, and Franz J. Ram-
mig, editors, Embedded Systems: Design, Analysis and Verification, pages 238–249,
Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. 51
[81] A. Bhat, S. Samii, and R. Rajkumar. Practical task allocation for software fault-
tolerance and its implementation in embedded automotive systems. In 2017 IEEE
Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 87–98,
April 2017. 52, 54, 57, 59, 93, 159
[82] Robert I. Davis, Alan Burns, Reinder J. Bril, and Johan J. Lukkien. Controller area
network (can) schedulability analysis: Refuted, revisited and revised. Real-Time
Systems, 35(3):239–272, Apr 2007. 52, 59, 153
[83] Chris Urmson, Joshua Anhalt, Drew Bagnell, Christopher Baker, Robert Bittner,
M. N. Clark, John Dolan, Dave Duggins, Tugrul Galatali, Chris Geyer, Michele
Gittleman, Sam Harbaugh, Martial Hebert, Thomas M. Howard, Sascha Kolski,
Alonzo Kelly, Maxim Likhachev, Matt McNaughton, Nick Miller, Kevin Peterson,
Brian Pilnick, Raj Rajkumar, Paul Rybski, Bryan Salesky, Young-Woo Seo, Sanjiv
Singh, Jarrod Snider, Anthony Stentz, William “Red” Whittaker, Ziv Wolkowicki,
Jason Ziglar, Hong Bae, Thomas Brown, Daniel Demitrish, Bakhtiar Litkouhi, Jim
Nickolaou, Varsha Sadekar, Wende Zhang, Joshua Struble, Michael Taylor, Michael
Darms, and Dave Ferguson. Autonomous Driving in Urban Environments: Boss and
the Urban Challenge, pages 1–59. Springer Berlin Heidelberg, Berlin, Heidelberg,
2009. 54
[85] Thomas Wolf and Alfred Strohmeier. Fault tolerance by transparent replication for
distributed ada 95. In Michael González Harbour and Juan A. de la Puente, editors,
Reliable Software Technologies — Ada-Europe’ 99, pages 412–424, Berlin, Heidelberg,
1999. Springer Berlin Heidelberg. 54
[87] Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg. Distributed
systems (2nd ed.). chapter The Primary-backup Approach, pages 199–216. ACM
Press/Addison-Wesley Publishing Co., New York, NY, USA, 1993. 54
[88] KapDae Ahn, Jong Kim, and SungJe Hong. Fault-tolerant real-time scheduling
using passive replicas. In Proceedings Pacific Rim International Symposium on Fault-
Tolerant Systems, pages 98–103, Dec 1997. 54
[89] Dong-Ik Oh and T. P. Baker. Utilization bounds for n-processor rate monotone
scheduling with static processor assignment. Real-Time Systems, 15(2):183–192, Sep
1998. 57
[90] Jorge Real and Alfons Crespo. Mode change protocols for real-time systems: A
survey and a new proposal. Real-Time Systems, 26(2):161–197, Mar 2004. 65
[91] Taxonomy and definitions for terms related to on-road motor vehicle automated
driving systems. 77
[92] Karthik Lakshmanan, Dionisio De Niz, Ragunathan (RAJ) Rajkumar, and Gabriel
Moreno. Overload provisioning in mixed-criticality cyber-physical systems. ACM
Trans. Embed. Comput. Syst., 11(4):83:1–83:24, January 2013. 85, 86
[96] Mike Phillips, Venkatraman Narayanan, Sandip Aine, and Maxim Likhachev. Ef-
ficient search with an ensemble of heuristics. In Proceedings of the 24th International
Conference on Artificial Intelligence, IJCAI’15, pages 784–791. AAAI Press, 2015. 92,
159
[97] Paul Emberson, Roger Stafford, and Robert I Davis. Techniques for the synthesis
of multiprocessor tasksets. In proceedings 1st International Workshop on Analysis Tools
and Methodologies for Embedded and Real-time Systems (WATERS 2010), pages 6–11,
2010. 99, 107
[98] Krzysztof Fleszar and Khalil S. Hindi. New heuristics for one-dimensional bin-
packing. Comput. Oper. Res., 29(7):821–839, June 2002. 100
[99] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equation
of state calculations by fast computing machines. The Journal of Chemical Physics,
21(6):1087–1092, Jun 1953. 101
[100] R.L. Rao and S.S. Iyengar. Bin-packing by simulated annealing. Computers and
Mathematics with Applications, 27(5):71 – 82, 1994. 102, 105
grul Galatali, Hartmut Geyer, Michele Gittleman, Sam Harbaugh, Martial Hebert,
Thomas Howard, Alonzo Kelly, David Kohanbash, Maxim Likhachev, Nick Miller,
Kevin Peterson, Raj Rajkumar, Paul Rybski, Bryan Salesky, Sebastian Scherer,
Young-Woo Seo, Reid Simmons, Sanjiv Singh, Jarrod M. Snider, Anthony (Tony)
Stentz, William (Red) L. Whittaker, and Jason Ziglar. Tartan racing: A multi-modal
approach to the darpa urban challenge. Technical Report CMU-RI-TR-, Pittsburgh,
PA, April 2007. 109, 141, 152
[104] A. Avizienis et al. Basic concepts and taxonomy of dependable and secure com-
puting. Dependable and Secure Computing, IEEE Transactions on, 2004. 113
[105] Robert I. Davis, Alan Burns, Reinder J. Bril, and Johan J. Lukkien. Controller area
network (can) schedulability analysis: Refuted, revisited and revised. Real-Time
Systems, 35(3):239–272, Apr 2007. 118
[106] T. Pop, P. Pop, P. Eles, Z. Peng, and A. Andrei. Timing analysis of the flexray com-
munication protocol. In 18th Euromicro Conference on Real-Time Systems (ECRTS’06),
pages 11 pp.–216, July 2006. 118
[110] Object management group (omg), data distribution service for real-time systems.
https://www.omg.org/spec/DDS/1.4/. 126
[113] S. Oikawa and R. Rajkumar. Portable rk: a portable resource kernel for guaranteed
and enforced timing behavior. In Proceedings of the Fifth IEEE Real-Time Technology
and Applications Symposium, pages 111–120, 1999. 146
[115] Reza Azimi. Co-operative Driving at Intersections using Vehicular Networks and Vehicle-
Resident Sensing. PhD dissertation, Carnegie Mellon University, 2015. 151
[116] Dionisio de Niz, Raj Rajkumar, and Gaurav Bhatia. Model-based development
of embedded systems: The sysweaver approach. 2013 IEEE 19th Real-Time and
Embedded Technology and Applications Symposium (RTAS), 00:231–242, 2006. 156