
Practical Solutions for Fault-Tolerance in

Connected and Autonomous Vehicles (CAVs)

Submitted in partial fulfillment of the requirements for


the degree of
Doctor of Philosophy
in
Electrical and Computer Engineering

Anand Ganpat Bhat

B.S., Vivekanand Education Society’s Institute of Technology


M.S., Electrical and Computer Engineering, Carnegie Mellon University

Carnegie Mellon University


Pittsburgh, PA

August 2019
© 2019 Anand Ganpat Bhat.
All rights reserved.

Dedication

This thesis is dedicated to my parents Ganpat and Mukta Bhat, who have been a constant
source of inspiration. To my advisor, colleagues, and friends, without whose guidance this
would not have been possible. Lastly, to my wife, Neha Hegde, for her constant love and
support.

Acknowledgements

This dissertation would not have been possible without the help and support of many
people. First and foremost, I would like to thank my advisor, Prof. Raj Rajkumar. I con-
sider myself extremely fortunate to have had an opportunity to work with Prof. Rajku-
mar. Working closely under his guidance and expertise has definitely made me a better
thinker, engineer and researcher. I am grateful for the opportunity to work on several
diverse and exciting projects to demonstrate my research, ranging from system-level
ones like building fault-tolerant system architectures for various platforms including
the CMU autonomous driving platform to application-level projects like SysAnalyzer to
analyze and deploy real fault-tolerant systems. Prof. Rajkumar also gave me several
opportunities to work with and mentor several other students, which led me to expand
my horizons and become an independent thinker.
I am grateful to the members of my thesis committee, Prof. Anthony Rowe, Prof.
Pei Zhang and Dr. Soheil Samii for their time, effort and inputs in completing this
dissertation.
I would like to thank Dr. Soheil Samii for his constant feedback with regards to
several aspects of my work and his guidance towards making my research practical
and industry relevant. It has been a great pleasure working closely with him. I would
also like to thank Tom Furhman and Dr. Massimo Ossella for their insights on various
aspects of my work.
A special thanks to General Motors (GM) for funding my research. I wish to thank the
members of the CMU’s autonomous driving team: Prof. John Dolan, Jongho Lee, Tianyu
Gu, Chiyu Dong, Adam Werries, and all other former members. Their passion and
efforts made me proud of being part of the team and contributing to our autonomous
car.
Most of my time during my doctoral studies was spent at the Real-Time and Mul-
timedia Systems Lab (RTML). Thanks to all the members of RTML who shared their
time with me: Gaurav Bhatia, Hyoseung Kim, Junsung Kim, Reza Azimi, Alexei Colin,
Sandeep D’souza, Iijoo Baek, Shunsuke Aoki, Peter Jan, Mengwen He and Weijing Shi.
Also, I would like to thank Toni M. Fox, Chelsea Mendenhall, Brittany Frost and Brid-
gette Bernagozzi for their kind support on administrative matters.
Besides the RTML members, I am grateful to my friends: Rupesh Mehta, Fiona Britto,
Mihir Dattani, Naman Jain, Oliver Shih, Ashvin Swaminathan, Swati Rajendran and
Abhijeet Mishra; without them, I could not have fully enjoyed my time at CMU.
I would like to thank my parents for their constant love and guidance. Lastly, my thanks
go to my wife, Neha Hegde. She has been a great source of love and support through
these final stages of my PhD.

Abstract

With advances in sensing, machine learning, and computing systems, various semi-
autonomous and autonomous driving applications have become feasible. This has re-
sulted in a dramatic increase in the amount and complexity of computational resources
needed in vehicles. Tasks such as perceiving the environment using sensors like li-
dars, radars, and cameras, fusing data from these sensors to create a road-world model,
route planning, and modeling behaviors, are all computationally intensive and safety-
critical. Conventionally, system reliability in safety-critical applications including avi-
ation is achieved by replicating hardware and running multiple instances of the same
software on different pieces of hardware. Often, a voting mechanism is used to generate
the output, and measures are taken in the system design to ensure that hardware compo-
nents fail independently. However, this approach is extremely inefficient in terms of cost,
weight, space and power, especially for the automotive industry. High automation levels
impose more stringent fault-tolerance requirements in terms of the number of tasks that
need redundancies (standbys), as well as the number of failures that are required to be
tolerated for each task (i.e., the number of standbys for each task). Also, the operational
design domain (ODD) of the automated vehicle has a significant impact on the fault-
tolerance requirements. This motivates the need for adaptive cost-optimized software
fault-tolerance solutions to reduce overall resource utilization. This dissertation aims to
achieve this objective in the context of resource-constrained fault-tolerant autonomous
driving applications by considering a comprehensive set of system-level design consid-
erations together. First, we present a family of optimal and sub-optimal harmonic search
algorithms and heuristics for selecting task execution parameters. We then present a
framework to derive replication parameters for a given task set and a family of heuris-
tics to allocate tasks to computing nodes while optimizing CPU resource utilization. We
next design and implement our software architectures to support fault-tolerant execu-
tion on popular automotive platforms and demonstrate that our primitives are practical
through experimental evaluations. Finally, we also present the tools and methodologies
we use to test and verify the safe operation of an autonomous driving vehicle.
Contents

Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

Contents viii

List of Tables xi

List of Figures xii

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Scope and Approach of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background and Related Work 8


2.1 Selection of Task Execution Parameters . . . . . . . . . . . . . . . . . . . . . 8
2.2 Selection of Replication Parameters . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Fault-tolerant Assignment of Tasks to Computing Nodes . . . . . . . . . . . 11
2.4 Software Architecture to Support and Maintain Fault-Tolerance Guarantees 13
2.5 Evaluation and Testing of Self-Driving Safety-Critical Automotive Systems 15

3 Selection of Task Execution Parameters 17


3.1 The Schedulability Impact of Harmonization . . . . . . . . . . . . . . . . . . 19


3.2 System Model and Problem Definition . . . . . . . . . . . . . . . . . . . . . . 20


3.3 Optimal Harmonic Search Algorithms . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Sub-Optimal Harmonic Search Heuristics . . . . . . . . . . . . . . . . . . . . 30
3.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4 System Model 48
4.1 Computation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Task Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Fault Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

5 Selection of Replication Parameters 54


5.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Recovery Time Analysis for Passive Backups . . . . . . . . . . . . . . . . . . 58
5.3 Redundant-Task Type Assignment To Tasks . . . . . . . . . . . . . . . . . . . 62
5.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

6 Fault-tolerant Assignment of Tasks to Computing Nodes 67


6.1 Task Partitioning with Known Replication Parameters . . . . . . . . . . . . . 68
6.2 Fault-tolerant Task Allocation for Mixed-Criticality Systems with Over-
loaded Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3 Task Partitioning with Recovery Time Constraints . . . . . . . . . . . . . . . 93
6.4 Applying Simulated Annealing to the Fault-Tolerant Task Allocation Prob-
lem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

7 Software Architecture to Support and Maintain Fault-Tolerance Guarantees 109


7.1 Fault-Tolerant Software Architecture for the AUTOSAR Classic Platform . 110
7.2 Fault-Tolerant Software Architecture for the AUTOSAR Adaptive Platform 125

7.3 SAFFIRE: Software Architecture For Fault-tolerant Imbed Real-time Envi-


ronments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

8 Evaluation and Testing of Self-Driving Safety-Critical Automotive Systems 148


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.2 Connected and Autonomous Vehicle (CAV) Design and Development . . . 150
8.3 SysAnalyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.4 Run-Time Diagnostics Framework . . . . . . . . . . . . . . . . . . . . . . . . 160
8.5 EMulator/simulator for Embedded Real-time autonomous Intelligent Driv-
ing (EMERALD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

9 Conclusions 164
9.1 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

A Glossary 170

B Existing Task Partitioning Heuristics 171


B.1 The BFD-P and R-BFD Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . 171

Bibliography 173
List of Tables

3.1 BFHS Example for cf = FOE . . . . . . . . . . . . . . . . . . . . . . . . . . 25


3.2 BBHS Example for cf = FOE . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Harmonic Search outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.1 Conditions for Redundant Task Selection . . . . . . . . . . . . . . . . . . . . . . 64

7.1 Task-Level Fault Tolerance Library Overhead . . . . . . . . . . . . . . . . . . . . 120


7.2 Control Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

List of Figures

3.1 Evaluation: Branch And Bound Harmonic Search Algorithm (Best Viewed in
Color) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Rational Cost Function Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Brute-Force Geometric Series Search Plot . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Brute-Force Geometric Series Search tables (Best Viewed in Color) . . . . . . . 35
3.5 Discrete Piecewise Function Example . . . . . . . . . . . . . . . . . . . . . . . . 37
3.6 Run-time Performance Evaluation DPHS vs BBHS vs PRHS vs Sr (Best Viewed
in Color) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.7 FOE Evaluation: DPHS vs BBHS vs PRHS vs Sr (Best Viewed in Color) . . . . 44
3.8 Run-time Performance Evaluation: DPHS vs BBHS vs PRHS vs Sr (Best Viewed
in Color) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.9 MPE Evaluation: DPHS vs BBHS vs PRHS vs Sr (Best Viewed in Color) . . . . 46

4.1 Detecting Primary Failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5.1 Defining Recovery Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56


5.2 Motivation for Backup following the Primary . . . . . . . . . . . . . . . . . . . . 58
5.3 Recovery Time Bounds for Hot Standby τ2 . . . . . . . . . . . . . . . . . . . . . 60
5.4 Recovery Time Bounds for Cold Standby τ2 . . . . . . . . . . . . . . . . . . . . . 61
5.5 Support for Multi-Level Backups . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

6.1 Tasks to be allocated and their utilizations . . . . . . . . . . . . . . . . . . . . . 68


6.2 TPCD Task order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69


6.3 TPCD Task Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.4 BFD-P Heuristic Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.5 R-BFD Heuristic Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.6 Tasks allocated using TPCDC and their utilizations (best viewed in color) . . . 73
6.7 Tasks allocated using TPCDC and their utilizations with standby-redistribution
(best viewed in color) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.8 TPCD Solution to 6.1 with backup types highlighted (best viewed in color) . . 75
6.9 TPCD Solution to 6.1 with Primary redistribution (P-Primary, B- Backup, best
viewed in color) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.10 R-BFD vs TPCD Processors saved by TPCD over R-BFD per task set (best
viewed in color) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.11 Umax = 0.3, R-BFD vs TPCD: Percentage of task sets where one technique
does better (i.e., uses at least one less processor) than the other (best viewed
in color) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.12 Umax = 0.5, R-BFD vs TPCD: Percentage of task sets where one technique
does better (i.e., uses at least one less processor) than the other (best viewed
in color) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.13 Umax = 0.7, R-BFD vs TPCD: Percentage of task sets where one technique
does better (i.e., uses at least one less processor) than the other (best viewed
in color) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.14 TPCD vs Optimal Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.15 TPCD vs TPCDC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.16 CAPA-TPCD vs ZSRM-TPCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.17 C-TPCD: Example Taskset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.18 C-TPCD: Criticality Tiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.19 C-TPCD: Highest criticality assignment . . . . . . . . . . . . . . . . . . . . . . . 91
6.20 C-TPCD: Medium criticality assignment . . . . . . . . . . . . . . . . . . . . . . . 91

6.21 C-TPCD: Final assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92


6.22 TPCD vs C-TPCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.23 TPCD-CAPA vs C-TPCD-ZSRM vs TPCD-ZSRM . . . . . . . . . . . . . . . . . . 93
6.24 Example: TPCDC-R+ vs RTT (Best Viewed In Color) . . . . . . . . . . . . . . . 96
6.25 Evaluation: RTT vs TRTI vs TPCDC+R . . . . . . . . . . . . . . . . . . . . . . . 97
6.26 Execution-Time Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.27 Resource Utilization Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

7.1 Group Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111


7.2 Failure Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.3 Dynamic Reconfiguration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.4 Worst-case behavior: Fixed phasing . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.5 Behavior: Variable phasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.6 Single missed heartbeat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.7 Improving the worst-case execution bounds . . . . . . . . . . . . . . . . . . . . 120
7.8 Recovery Time - Fixed Execution Offset . . . . . . . . . . . . . . . . . . . . . . . 122
7.9 Recovery Time - Variable Execution Offset . . . . . . . . . . . . . . . . . . . . . 122
7.10 Adaptive AUTOSAR Platform Architecture . . . . . . . . . . . . . . . . . . . . . 127
7.11 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.12 Reliable Client Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.13 Reliable Server Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.14 Recovery example for non-concurrent replicas providing notification-based
services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.15 Recovery example for non-concurrent replicas providing request-reply-based
services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.16 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.17 Notification Based Services with handled exceptions . . . . . . . . . . . . . . . 139
7.18 Notification Based Services with unhandled exceptions . . . . . . . . . . . . . . 139

7.19 CMU autonomous driving research platform [1] . . . . . . . . . . . . . . . . . . 141


7.20 Standby Types Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.21 Standby Types Trade offs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.22 Effective Clones Operation: Normal Conditions . . . . . . . . . . . . . . . . . . 144
7.23 Effective Clones Operation: Failure Conditions . . . . . . . . . . . . . . . . . . . 144
7.24 SAFFIRE: Mode Change support . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

8.1 A Reference Architecture for CAVs . . . . . . . . . . . . . . . . . . . . . . . . . . 150


8.2 Sensor Installation [1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8.3 Development Cycle for a CAV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
8.4 Tool/Process Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.5 SysAnalyzer WorkFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.6 EMERALD: Scenario Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.7 Weather Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.8 Lidar Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Chapter 1

Introduction

1.1 Motivation

With advances in sensing, machine learning, and computing systems, various semi-
autonomous and autonomous driving applications have become feasible. This has re-
sulted in a dramatic increase in the amount and complexity of computational resources
needed in vehicles. Tasks such as perceiving the environment using sensors like li-
dars, radars, and cameras, fusing data from these sensors to create a road-world model,
route planning, and modeling behaviors, are all computationally intensive and safety-
critical. Conventionally, system reliability in safety-critical applications including avi-
ation is achieved by replicating hardware and running multiple instances of the same
software on different pieces of hardware. Often, a voting mechanism is used to derive
the output, and measures are taken in the system design to ensure that hardware com-
ponents fail independently. However, this approach is extremely inefficient in terms of
cost, weight, and space for many applications, especially for the automotive industry,
where reliability requirements can be diverse. For example, five driving automation
levels (DALs) have been defined in the SAE J3016 standard to characterize the spectrum
of self-driving features. To put such systems in context with redundancy requirements,
consider a Level 2 system active on highways only. In such a system, although the driver
is not in direct control of the vehicle motion, the driver has a supervisory role: the driver
is expected to take over control in case of any subsystem or component failure. In such
systems, only a small subset of all software tasks need redundancy. Now, consider a
Level 4 system active in the same operational domain (highways). The system itself is
now responsible for bringing the vehicle to a safe stop in case of failures. High automation
levels may impose more stringent fault-tolerance requirements in terms of the number
of tasks that need redundancies (standbys), as well as the number of failures that are re-
quired to be tolerated for each task (i.e., the number of standbys for each task). Also,
the operational design domain (ODD) of the automated vehicle has a significant impact
on the fault-tolerance requirements. For example, certain tasks, like pedestrian detection,
are perhaps more safety-critical in urban-driving applications than in highway-driving
applications. Hence, the type and number of standbys required for every task also de-
pends heavily on the ODD. This motivates the need for adaptive cost-optimized software
fault-tolerance solutions to reduce overall resource utilization.

1.2 Thesis Statement

Our thesis statement is as follows.

The fault-tolerance requirements for computing systems in resource-constrained


connected and autonomous vehicles can be met by adopting analyzable approaches
to redundancy management, utilizing demonstrably efficient task execution pa-
rameters, and performing structured testing and validation for different levels of
automation and operating contexts.

1.3 Scope and Approach of the Thesis

In order to achieve this objective, as part of this thesis, we consider the


following aspects of the overall system design together.

1. Selection of Task Execution Parameters:


Real-time systems closely interact with the environments in which they are de-
ployed [2]. Owing to the recurring nature of events in such environments, a peri-
odic task model is widely used in these systems. For example, in the autonomous
driving context, a car needs to sense the environment, perform calculations and
control actuators on a continual basis. This is accomplished by using tasks that run
periodically, using a preemptive real-time scheduling policy like Rate-Monotonic
Scheduling (RMS) [3]. The selection of these periods plays a very important role in
the design, analysis, and schedulability of a real-time system.

The selection of task periods is also driven by the safety and performance speci-
fications of a real-time application [4]. This choice naturally has a direct impact
on system schedulability. For example, a feedback control application can perhaps
produce very accurate control if it runs at a very high frequency, i.e., if the period
assigned to the task running the feedback control application is small. However,
since smaller periods mean higher CPU utilization, system schedulability is re-
duced [5].

There are several advantages to making these periods harmonic, i.e., having every
period in the task set be an integer multiple of every shorter period in the set. It has been
shown that the exact schedulability analysis for the RMS policy is an NP-Complete
problem [6], unless the periods are harmonic [7]. Also, polynomial-time solutions
exist for the response-time analysis of systems with harmonic task sets [8]. Hav-
ing harmonic periods allows for phase optimizations that reduce communication
latencies [9] and also enables energy-saving optimizations [10]. Phase optimiza-
tions can improve recovery latencies [11] and allow for optimal checkpointing for
recovery [12]. Harmonic task sets also play an important role in reducing the com-
plexity in the design of distributed time-triggered embedded systems [13]. Given
all these distinct advantages, in practice, harmonic task sets are widely chosen in
real-time safety-critical systems like automobiles and avionics [14–17]. Given these
wide-ranging implications of the choices, task periods must be carefully selected
to meet all the safety and application requirements while ensuring that the advan-
tages mentioned above are also gained.

2. Selection of Replication Parameters


As mentioned earlier, the system reliability requirements for automotive applica-
tions can be diverse and these requirements vary significantly with DALs and the
ODD. Traditional approaches leverage active and passive replication strategies to
meet fault-tolerance demands of such systems. But, given the diversity of the re-
quirements, there is a need to have more fine-grained control over the resources
used by these traditional replication strategies. It is also important to ensure that
these strategies are flexible since requirements may potentially change at run-time
based on the ODD.

For safety-critical systems employing software redundancy, it is also important to


consider the timing behavior of the run-time fault-tolerance primitives. Hence, the
recovery time, which is the amount of time it takes for a redundant task to take over
execution on the failure of a primary task, becomes a very important design pa-
rameter. The recovery time for a given task depends on various factors such as task
allocation, primary and redundant task priorities, system load and the scheduling
policy. Each task can also have a different recovery time requirement (RTR). For
example, in automobiles with automated driving features, safety-critical tasks like
perception and steering control have strict RTRs, whereas such requirements can
be more relaxed in the case of tasks like heating control and mission planning.
These recovery time requirements in turn influence the type of replication strategy
chosen.

Another aspect to consider is the criticality of a given application. Based on the


context, some applications are more critical than others. In other words, the impact
of the failure of some applications is far greater than the failure of other applica-
tions. Hence, the number of failures that need to be handled for a given application
in an ODD also becomes an important consideration. This in turn directly impacts
the number of replicas to be assigned to a particular application.

3. Fault-tolerant Assignment of Tasks to Computing Nodes


Assigning tasks to processors is a well-known bin-packing problem [18] and it has
been proven to be NP-hard [19]. In the traditional bin-packing problem, the as-
signment is only constrained by the size of the bin, which in this case, corresponds
to the processing capacity of a given node. Replication requirements impose an
additional constraint, where a primary and a replica may not be co-located. This
constraint ensures that, in the case of a failure of the computing node, both the
primary and the replica do not fail together. In [20], Kim et al. defined this as
the Fault-tolerant Partitioned Scheduling problem. This additional fault-tolerance
constraint to the bin packing problem can result in significant waste in terms of
resources if the allocation is not performed well [21]. Hence, the task-to-node as-
signment has to be carefully carried out to ensure that all computational resources
are used efficiently.

Another aspect to consider is that the fault-tolerant assignment of tasks influences


their execution patterns, which in turn affects the recovery latencies of replicas.
Hence, it is important to analyze the impact of allocation on recovery time and
ensure that these parameters are co-optimized.

4. Software Architecture to Support and Maintain Fault-Tolerance Guarantees


In order to support the run-time execution of applications in the presence of faults,
it is essential to have a framework to support the operation of these replicas. Given


the replication strategy, this framework must be able to detect primary failures, and
activate appropriate replicas in a timely and resource efficient manner. This run-
time framework should ensure that the fault-tolerance guarantees of the system
design are satisfied.

The type of run-time framework also heavily depends on the underlying software
architecture in use. Many automotive applications use AUTOSAR (AUTomotive
Open System ARchitecture [22]), an open and standardized automotive software
architecture jointly developed by automobile manufacturers, suppliers, and tool
vendors. The AUTOSAR Classic Platform (CP) [22] standard, which is a widely-
accepted standardized software architecture for automotive electronic control units
(ECUs), addresses the needs of deeply-embedded low-complexity devices. The
AUTOSAR Adaptive Platform provides mainly high-performance computing and
communication mechanisms. It also offers flexible software configuration, e.g., to
support software updates over the air. Features specifically defined for the Classic
Platform, such as access to electrical signals and automotive-specific bus systems,
can be integrated into the Adaptive Platform as well. Compared to the AUTOSAR
Classic platform, the Adaptive platform supports Service-Oriented Architecture
(SOA), which allows the dynamic linking of services and clients during runtime,
making it flexible for application developers. The Adaptive Platform supports
manycore processors and heterogeneous computing platforms that offer parallel
processing, as well as fast and high-bandwidth communication technologies such
as Ethernet. The Adaptive Platform also supports several safety and security
features like priority-based scheduling, execution of authenticated code and con-
trolled allocation of memory and CPU resources. It is therefore important to design
and implement support for fault-tolerance primitives on these platforms.

As mentioned before, the replication requirements for automotive applications are
not static, and vary significantly based on the ODD. It is therefore important for
the system infrastructure to enable dynamic reconfiguration of system resources.
Thus, there is a need for a system-level framework to support these changes in
operational mode and manage their impact on the fault-tolerance requirements.

5. Evaluation and Testing of Self-Driving Safety-Critical Automotive Systems


The challenges and approaches for testing safety-critical applications have been
widely studied in the literature [23], [24]. An integral part of testing safety-critical
applications is simulation/emulation testing. This is especially true in the auto-
motive context, where many safety-critical applications cannot be easily verified
on-road given the safety concerns. Hence, having a robust simulation environ-
ment where these applications can be thoroughly tested before on-road verification
goes a long way toward guaranteeing system safety.
Chapter 2

Background and Related Work

This chapter presents the background and related work on the following
topics: (a) Selection of task execution parameters, (b) Selection of replica-
tion parameters, (c) Fault-tolerant assignment of tasks to computing nodes,
(d) Software architecture to support and maintain fault-tolerance guaran-
tees and (e) Testing of self-driving safety-critical automotive systems. Each
section of this chapter reviews and presents relevant systems techniques
and related prior work.

2.1 Selection of Task Execution Parameters

There has been significant work in the field of task period selection in
hard real-time systems for optimizing various use cases. For example, in
[25], Henriksson et al. attempt to find an optimal period assignment to
distribute computing resources between tasks. In [26], Cervin et al. try to
optimize the system performance of a control system. Neither, however,
focuses on creating harmonic task sets.
In [7], Han et al. have presented the Sr and the DCT algorithms which
attempt to create a harmonic task set to verify the schedulability of a given
input non-harmonic task set. Since there exist linear-time solutions to check
the schedulability of systems with harmonic task sets, both algorithms at-
tempt to find a harmonic task set such that every period assigned is less
than or equal to the original period and the utilization of every task is less
than 1. The schedulability of the original task set is conservatively inferred
from the schedulability of the harmonic task set. However, the Sr algo-
rithm assigns artificial periods which are always multiples of 2, whereas
the DCT algorithm creates a harmonic series using each element in the
original task set. Unlike our work presented in this dissertation, neither
algorithm attempts to find the optimal assignment. Our solution in this
dissertation can be applied readily to solve the above problem. Similarly,
in [27], harmonic deadline assignments were created using the least com-
mon multiple (LCM) of the original task periods to derive a utilization-
based feasibility test for systems with composite deadlines; no attempt is
made to create an optimal assignment.
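To give a flavor of the power-of-two period transformation attributed to Sr above, the Python sketch below folds each period down onto a grid of the form base · 2^k. This is only our rough, illustrative reading of the idea; the actual Sr algorithm in [7] is more involved (it also searches over the base value), so this should not be taken as a faithful reimplementation.

import math

def fold_powers_of_two(periods, base):
    # Map each period T down to the largest base * 2^k that is <= T.
    # Assumes every period is at least as large as the chosen base.
    return [base * 2 ** int(math.log2(t / base)) for t in periods]

print(fold_powers_of_two([20, 45, 136, 415], base=20))  # [20, 40, 80, 320]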
In [28] and [29], Nasri et al. address a problem similar to the one ad-
dressed in this dissertation, but accept a range of period values for each
task as input and target feasible real-number solutions allowing for a user
to choose harmonic sets with high or low utilization by controlling the uti-
lization bounds for a given harmonic range. In contrast, this dissertation
looks to find an optimal integer solution, where chosen values must be be-
low specified thresholds. The solution in this dissertation also allows the
user to select from a variety of cost metrics to optimize. In [30], Mohaqeqi
et al. attempt to minimize the total weighted sum of the periods where the
periods are not restricted.

2.2 Selection of Replication Parameters

Replication is an important technique that deals with permanent crash


faults [31]. Hardware failures, operating system crashes and process crashes
are some examples of crash faults. We assume that these crash faults are
fail-silent [32]. In order to tolerate these crash faults, systems typically em-
ploy fault tolerance by replication [33]. Three major types of redundancies
are as follows:

1. Active Replica: In active replication, all redundant copies are identical and treated
uniformly. Each replica performs all operations, like accepting and processing
application inputs, performing state calculations, performing application calcula-
tions and producing output. This implies that, under normal operation, the system
needs to support duplicate suppression to filter out duplicate outputs.

2. Hot Standby: A hot standby is based on the primary-backup approach. It performs


all the operations of the primary task except for producing outputs. On detection
of primary failure, the hot standby is promoted to become a primary and begins
to produce outputs. Unlike an active replica, a hot standby can run a degraded
version of the primary to optimize resource consumption.

3. Cold Standby: A cold standby is also based on the primary-backup approach. It


can be of two types depending on the type of application. If an application is
stateless, the cold standby does not perform any operations until it detects primary
failure. For applications with state, the cold standby accepts and logs application
inputs but does not perform any other operations. It regularly accepts the state
from the primary to maintain consistency. On detection of primary failure, the
cold standby primes its state first and then begins to produce outputs.
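The operational differences between these three redundancy types can be summarized in code. The following Python sketch is a simplified behavioral model of our own; the class structure, the compute/emit placeholders, and the promotion logic are illustrative assumptions, not an API of any system discussed in this chapter.

from enum import Enum

class ReplicaType(Enum):
    ACTIVE = "active replica"
    HOT = "hot standby"
    COLD = "cold standby"

def compute(state, inputs):
    return (state or 0) + sum(inputs)   # placeholder application step

def emit(state):
    print("output:", state)             # placeholder output/actuation

class Replica:
    def __init__(self, rtype):
        self.rtype, self.is_primary, self.state = rtype, False, None
        self.input_log = []

    def on_cycle(self, inputs, primary_state=None):
        """One periodic job under normal (fault-free) operation."""
        if self.is_primary or self.rtype in (ReplicaType.ACTIVE, ReplicaType.HOT):
            self.state = compute(self.state, inputs)   # full execution
            if self.is_primary or self.rtype is ReplicaType.ACTIVE:
                emit(self.state)  # actives need downstream duplicate suppression
        else:  # cold standby: log inputs, absorb state sent by the primary
            self.input_log.append(inputs)
            if primary_state is not None:
                self.state, self.input_log = primary_state, []

    def promote(self):
        """Called when the failure detector declares the primary dead."""
        if self.rtype is ReplicaType.COLD:
            for inputs in self.input_log:   # replay logged inputs to prime state
                self.state = compute(self.state, inputs)
        self.is_primary = True              # from now on, produce outputs

backup = Replica(ReplicaType.COLD)
backup.on_cycle([1, 2])      # logged only, no computation
backup.promote()             # replays the log, then takes over
backup.on_cycle([3])         # now computes and emits: "output: 6"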

In order to employ effective replication, a system designer must decide


the following replication parameters: (a) the number of replicas assigned
to each task, (b) the type of each replica (active, hot, or cold), and (c) the
allocation scheme determining the replica-to-processor assignment.
The problem of supporting fault tolerance at the level of task schedul-
ing has been widely studied in the literature. A number of real-time task
allocation algorithms have been described in the literature to tackle this
problem in distributed real-time systems [34–36]. In [37], Oh et
al. present an online allocation heuristic to assign replicas to a minimum
number of processors such that all replicas guarantee that task deadlines
are met. They also derive the bound on the number of processors required
to feasibly schedule a task set using their heuristic. These approaches as-
sume the replica type and number of replicas as input parameters. For
example, [34] focuses only on active replication, where the redundant soft-
ware executes regardless of failure modes. In this dissertation, we present
a framework to analyze recovery time for a given task to determine the
replica type assignment.

2.3 Fault-tolerant Assignment of Tasks to Computing

Nodes

Prior research has studied real-time task allocation algorithms in order


to tackle the problem of fault-tolerance in a distributed real-time system
[34–36]. These approaches focus only on active replication, where the backup
software executes regardless of execution or failure modes. The resource
consumption of such approaches is impractical for resource-constrained
systems like cars, especially as the level of automation increases and mul-
tiple failures need to be tolerated. In contrast, the software standby ap-
proach we focus on in this dissertation allows fault-tolerance solutions with
optimized resource usage by activating some backups only in case of fail-
ures. The use of such standbys in software fault-tolerance is also described
in [38]. The authors make use of support from the underlying Linux operat-
ing system to spawn tasks at run-time in case of failures, limiting system
portability.
Fault-tolerant task allocation using a combination of active replication
and the primary-backup approach has been studied in [39] and [40]. Both
techniques introduce phasing delays to support backup overlapping and
backup deallocation techniques. Neither technique leverages the lower
run-time utilization of different types of passive backups to optimize the
number of processors used for deployment. Klobedanz et al. present
an approach for the deployment of real-time software specifically in AU-
TOSAR [41]. The technique assumes a pre-defined number of processors
and aims to produce a feasible task assignment. Similarly, in [42], Ra-
mamritham et al. aim to find a feasible fault-tolerant deployment in a
distributed system. We, however, seek to minimize the number of proces-
sors since we target resource-constrained systems like autonomous vehi-
cles. In [43], the type of standby, i.e., whether it is active or passive, is an
output of the solution, whereas in this dissertation we allow the application
requirements to decide the standby type, i.e., we accept standby type as-
signment as an input. In [20], Kim et al. proposed the R-BFD heuristic for
task assignment while meeting the placement constraint, assuming that ev-
ery task in the system needs to be backed up. In this dissertation, we relax
this assumption since all tasks need not have backups. For example, in the
case of automobiles, the tasks that run the entertainment system need not
have backups. Another example is a driver monitoring system, which does
not necessarily need backups; it may suffice if the driver is notified about
a failure. In this dissertation, we present our fault-tolerant task assignment
heuristic TPCD (Tiered Placement Constraint Decreasing) and compare its


performance against R-BFD. We propose the TPCDC (Tiered Placement
Constraint Decreasing with Cold Standbys) heuristic which uses the run-
time characteristics of cold standbys to reduce the number of processors
used. We also consider the impact of overload operation on the task alloca-
tion. We propose heuristics to determine redundant-task-type assignments
and allocate these tasks to different nodes satisfying the recovery time re-
quirements of all tasks while attempting to optimize resource utilization.
We also apply the Simulated Annealing method [44] to the fault-tolerant
task allocation problem and compare its performance to the heuristics pro-
posed.
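To make the placement constraint concrete, the following Python sketch extends best-fit decreasing with an anti-affinity check. It is purely illustrative and is not the TPCD or TPCDC heuristic developed in this dissertation; the unit-capacity utilization bound and all names are our simplifying assumptions.

def allocate(tasks, capacity=1.0):
    """Best-fit decreasing with a placement constraint: no two copies
    (primary or standby) of the same task may share a node.
    tasks: list of (task_id, utilization, num_copies)."""
    nodes = []  # each node: {"load": float, "ids": set of task ids}
    copies = sorted(((tid, u) for tid, u, k in tasks for _ in range(k)),
                    key=lambda c: c[1], reverse=True)
    for tid, u in copies:
        feasible = [n for n in nodes
                    if tid not in n["ids"] and n["load"] + u <= capacity]
        if feasible:  # best fit: leave the least residual capacity
            best = min(feasible, key=lambda n: capacity - n["load"] - u)
            best["load"] += u
            best["ids"].add(tid)
        else:         # the constraint (or capacity) forces a new node
            nodes.append({"load": u, "ids": {tid}})
    return nodes

# Task "a" has a primary and two standbys: its three copies must land
# on three distinct nodes, even though together they would fit on one.
print(len(allocate([("a", 0.3, 3), ("b", 0.5, 2), ("c", 0.2, 1)])))  # 3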

2.4 Software Architecture to Support and Maintain

Fault-Tolerance Guarantees

Software fault-tolerance in a real-time distributed system has been pre-


viously studied in the literature. The problem of providing all correct
processors with consistent views of the processor-group membership and
guaranteeing bounded processor-failure detection has been described in [45].
Classically, the burden of supporting fault-tolerance primitives has been
laid on the system middleware, e.g., FT-CORBA [46–48]. Middleware so-
lutions like CORBA need to support varied interactions between compo-
nents within an existing architecture. This typically makes them computa-
tionally heavy and non-portable, and CORBA is therefore not appropriate
in resource-constrained safety-critical automotive systems. The software
fault-tolerance library described in this dissertation can be implemented
entirely within the application components. This makes the implemen-
tation extremely portable, which is very important in a framework like


AUTOSAR, given that it is being widely adopted in many automotive ap-
plications.
Several approaches to achieve fault-tolerant execution in the AUTOSAR
Classic Platform have been described in the literature. In [41], Klobedanz et
al. present an approach for deployment of real-time software which aims
to produce a feasible task assignment to processors. They also present a
concept to detect failed nodes and activate reconfigurations in AUTOSAR.
In [20], the authors propose the R-FLOW algorithm for Hot Standby and
Cold Standby processors to support fault tolerance with bounded recov-
ery times. Then, they integrate this algorithm in AUTOSAR to provide
dependability. In [49], Lu et al. suggest a “reflective principle” approach
to achieve fault tolerance in AUTOSAR. This approach introduces defense
software that performs logging of information, checking, and recovery.
This defense software requires access to the OS in order to monitor the
control and data flow at the OS level. In [50], the authors present a frame-
work to support fault-tolerant execution in the AUTOSAR Classic Plat-
form. In [51], Fabre et al. propose a multi-level reflection approach to
achieve fault tolerance and robustness using AUTOSAR as a middleware.
This multi-level layering of functional and non-functional software works
as a self-checking component. Fault-tolerance considerations have also been
made for low-level communication frameworks like FlexRay [52] and Eth-
ernet [53], [54]. This dissertation, on the other hand, for the first time,
presents a system design and analysis for fault-tolerance support in the
Service-Oriented Architecture of the AUTOSAR Adaptive Platform that
uses the SOME/IP framework.

2.5 Evaluation and Testing of Self-Driving Safety-Critical

Automotive Systems

Testing plays a very important role in the safety and verification of safety-
critical systems, especially autonomous driving systems. Autonomous driv-
ing systems have a long history since the NAVLAB project at Carnegie
Mellon where a series of experimental platforms and tools were built [55].
Such tools include simulation/emulation capabilities, system-level testing
capabilities, and fault-injection and system-verification capabilities.
In the 2000s, the Defense Advanced Research Projects Agency (DARPA)
held three competitions for autonomous driving vehicles, and a variety of
technologies were developed and demonstrated. In particular, the third
competition, the DARPA Urban Challenge in 2007, was designed to foster
innovation in autonomous vehicle operation in urban settings. In the Ur-
ban Challenge, the competitors developed full-scale autonomous driving
vehicles to navigate through a mock city environment, including merging
into moving traffic, navigating traffic circles, traversing busy intersections
and avoiding obstacles. The developed vehicle systems and software are
well-explained by each ranked competitor [56–66]. Each of these competi-
tors developed suites of tools and methodologies to test and verify their
capabilities. For example, in [58] the authors describe their Joint Archi-
tecture for Unmanned Systems (JAUS) for inter-process communications
which allowed for configuring software applications in a modular, recon-
figurable and reusable fashion. In our testing framework we adopt a pub-
lish/subscribe based communication infrastructure [56] to provide similar
capabilities. They also developed a data logging and playback system in-
tegrated at the communication level with the JAUS infrastructure which
allowed them to visualize data in a simulator. Our Run-Time Diagnostics


framework also leverages our publish/subscribe architecture to log data
for playback.
Microsoft developed and open-sourced the Aerial Informatics and Robotics
Platform (AirSim), which is designed to simulate drones and autonomous
vehicles [67]. Udacity, an online education company, also developed an
open source self-driving car simulator in order to offer its self-driving car
program to its users [68]. In addition, to help study and understand CAVs,
microscopic traffic simulators and VANET (Vehicular Ad Hoc Networks)
simulators are in high demand. Microscopic traffic simulators such as
SUMO [69] and VISSIM [70] were originally developed to simulate realis-
tic vehicular mobility traces, and have now been extended to provide a general
basis for analyzing the effectiveness of V2X protocols. VANET simulators
such as TraNS [71] and Veins [72] integrate a mobility generator and a
network simulator, and offer a general method to study V2X protocols by
allowing the behavior of simulated vehicles to be changed during each sim-
ulation. Most, if not all, of these tools work only in virtual environments.
Chapter 3

Selection of Task Execution Parameters

Real-time systems closely interact with the environments in which they


are deployed [2]. Owing to the recurring nature of events in such en-
vironments, a periodic task model is widely used in these systems. For
example, in the autonomous driving context, a car needs to sense the en-
vironment, perform calculations and control actuators on a periodic basis.
This is accomplished by using tasks that run periodically, using a preemp-
tive real-time scheduling policy like Rate-Monotonic Scheduling (RMS) [3].
The selection of these periods plays a very important role in the design,
analysis, and schedulability of a real-time system.
The selection of task periods is driven by the safety and performance
specifications of a real-time application [4]. This choice naturally has a
direct impact on system schedulability. For example, a feedback control
application can perhaps produce very accurate control if it runs at a very
high frequency, i.e., if the period assigned to the task running the feedback
control application is small. However, since smaller periods mean higher
CPU utilization, system schedulability is reduced [5]. It has been shown
that the exact schedulability analysis for the RMS policy is an NP-Complete
problem [6], unless the periods are harmonic [7], i.e., every period in the
task set is an integer multiple of its shorter periods. Also, polynomial-time


solutions exist for the response-time analysis of systems with harmonic
task sets [8].
There are also several considerations related to the periodic nature of
tasks. The nature of RM schedulability [3] means that a CPU cannot be
completely utilized, i.e., reach 100% utilization, unless the task periods are
harmonic¹. However, in Section 3.1, we show that, subject to the constraint
of having periods less than or equal to the original periods after transfor-
mation, an unschedulable task set cannot be made schedulable by making
the periods harmonic. Nevertheless, having harmonic periods allows for
phase optimizations that reduce communication latencies [9] and also en-
ables energy-saving optimizations [10]. Finally, harmonic task sets play
an important role in reducing the complexity in the design of distributed
time-triggered embedded systems [13]. In practice, harmonic task sets are
widely used in real-time safety-critical systems like automobiles and avion-
ics [14–17].
In this chapter, we consider a set of independent periodic tasks with
application-specified period values. Considering the benefits of harmonic
task sets listed above, the goal of this chapter is to assign a set of in-
teger harmonic periods to the input task set while optimizing for com-
monly used cost functions like the total percentage error (TPE), total sys-
tem utilization (TSU), first order error (FOE), and maximum percentage
error (MPE). We ensure that every period assigned is less than or equal
to the original application-specified period, while maintaining the RM pri-
orities of the original task set and ensuring that the task set continues to
remain schedulable. This constraint is intended to satisfy the safety and
performance specifications. We consider only integer periods owing to the
discrete nature of time in real-time computer systems.

¹Some nearly-harmonic task sets can also be scheduled up to 100% utilization by RMS.


The rest of this chapter is organized as follows. We presented related
work in Section 2.1. We describe the schedulability impact of harmoniza-
tion in Section 3.1. We present the problem statement and describe our
system model in Section 3.2. We present the brute-force harmonic search
algorithm and our BBHS algorithm in Section 3.3. We also compare the per-
formance of our BBHS algorithm against the brute-force harmonic search
algorithm in terms of the number of operations performed to find the op-
timal solution. In Section 3.4, we present two sub-optimal heuristics that
can be used in contexts that are time-sensitive and can afford sub-optimal
solutions. We apply our approach to harmonize task sets used in real-
world applications to highlight its benefits. In Section 3.5, we conclude
and summarize our findings.

3.1 The Schedulability Impact of Harmonization

As highlighted above, it is tempting to think that harmonization can im-


prove schedulability. However, in this section, we show that an unschedu-
lable task set cannot be made schedulable by harmonization, constrained
to having the periods less than or equal to the original periods, under the
rate-monotonic scheduling policy with implicit deadlines.

Theorem 1. An unschedulable task set cannot be made schedulable by transforming task periods
to values less than or equal to the original periods, under the rate-monotonic scheduling policy with
implicit deadlines.

Proof. For a task set $\tau = \{\tau_1, \tau_2, \ldots, \tau_n\}$ with tasks arranged in non-decreasing period order,
i.e., $T_i \le T_{i+1}$ for all $i$ from 1 to $n-1$, the schedulability under rate-monotonic scheduling can be
determined using the response-time test, which is as follows:


$a_i^{k+1} = C_i + \sum_{j=1}^{i-1} \lceil a_i^k / T_j \rceil C_j, \qquad a_i^0 = \sum_{j=1}^{i} C_j$    (3.1)

where $a_i^k$ represents an estimate of the response time for task $\tau_i$ [73].
This implies that, for an unschedulable task set, at least one task remains unschedulable,
i.e., its response time is greater than its deadline. Hence, if some task $\tau_m$ ($1 \le m \le n$) is
unschedulable, we have

$a_m^{k+1} = C_m + \sum_{j=1}^{m-1} \lceil a_m^k / T_j \rceil C_j > T_m$    (3.2)

From the above equation, we see that the response-time estimate depends only on the
periods of the tasks with $j < m$, i.e., the tasks with smaller periods and hence higher priority
under RMS. Since $T_j$ appears in the denominator of a ceiling term, transforming period values
from $T_j$ to $T_j'$ such that $T_j' \le T_j$ can only result in larger (or equal) response times. Moreover,
the deadline of $\tau_m$ can itself only decrease, since $T_m' \le T_m$. Hence, $\tau_m$ remains unschedulable.
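The iteration in Equation (3.1) is straightforward to transcribe into code, which also lets Theorem 1 be sanity-checked on examples. The following Python sketch is our own illustrative implementation of the standard response-time test; the task sets shown are hypothetical.

import math

def response_time(tasks, m):
    """Response-time test of Eq. (3.1) for task m (0-indexed) under RMS.
    tasks: list of (C, T) pairs sorted by non-decreasing period.
    Returns the converged response time, or None if it exceeds T_m."""
    C_m, T_m = tasks[m]
    a = sum(C for C, _ in tasks[:m + 1])             # a_m^0
    while True:
        a_next = C_m + sum(math.ceil(a / T_j) * C_j
                           for C_j, T_j in tasks[:m])
        if a_next > T_m:
            return None                               # deadline miss
        if a_next == a:
            return a                                  # fixed point reached
        a = a_next

tasks = [(3, 9), (4, 12), (4, 13)]                    # (C_i, T_i)
print(response_time(tasks, 2))        # None: the third task is unschedulable
# Shrinking higher-priority periods, e.g. harmonizing to {9, 9, 9},
# only inflates the interference terms, so it stays unschedulable:
print(response_time([(3, 9), (4, 9), (4, 9)], 2))     # None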

3.2 System Model and Problem Definition

In this section, we describe our system model and present our problem
statement.

3.2.1 System Model

We consider a hard real-time task set τ = {τ1 , τ2 , τ3 , . . . , τn }, which consists


of n independent periodic tasks. Each task τi is described as (Ci , Ti , Di ),
where Ci (Ci ∈ N>0 ) represents the worst-case execution time of the task,
Ti (Ti ∈ N>0 ) is the period of the task and Di (Di ∈ N>0 ) is the deadline
by which the task is expected to complete execution. We assume implicit
deadlines, i.e., Di = Ti . We also assume that the tasks in the task set are
ordered by non-decreasing periods, i.e., T1 ≤ T2 ≤ . . . ≤ Tn .
CHAPTER 3. SELECTION OF TASK EXECUTION PARAMETERS 21

We define a harmonic set as a set of numbers where every number is an


integer multiple of every smaller number from the set. A task set in which
the set of all task periods forms a harmonic set is referred to as a harmonic
task set and the set of task periods is referred to as a harmonic period set.
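This definition can be checked mechanically; a minimal Python sketch (ours, for illustration) follows. Checking adjacent elements of the sorted set suffices because divisibility is transitive.

def is_harmonic(periods):
    """True iff every period is an integer multiple of every smaller one."""
    ps = sorted(periods)
    return all(b % a == 0 for a, b in zip(ps, ps[1:]))

print(is_harmonic([10, 30, 180]))   # True
print(is_harmonic([20, 45, 136]))   # False: 45 is not a multiple of 20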

3.2.2 Problem Statement

Given a task set τ, generate an optimal harmonic task set $\tau'$ with integer
periods (i.e., $T_i' \in \mathbb{N}_{>0}$), if one exists, such that:

1. The period of every task in $\tau'$ is an integer less than or equal to its corresponding
period in τ, i.e., $T_i' \le T_i$.

2. The worst-case execution time of every task $\tau_i$ remains the same, i.e., $C_i' = C_i$,
and the task set after harmonization continues to remain schedulable, i.e., $C_i \le T_i'$.

3. Each task maintains its original RM priority.

A resulting task set is said to be optimal if it optimizes the selected


cost function. Some commonly used cost functions are the total percent-
age error (TPE), total system utilization (TSU), first order error (FOE), and
maximum percentage error (MPE). The cost functions (cf) are as follows.

1. Total percentage error (TPE), i.e.,


$cf = \underset{T'}{\mathrm{Minimize}} \sum_{i=1}^{n} (T_i - T_i')/T_i$    (3.3)

2. Total system utilization (TSU), i.e.,


$cf = \underset{T'}{\mathrm{Minimize}} \sum_{i=1}^{n} C_i / T_i'$    (3.4)

3. First order error (FOE), i.e.,


$cf = \underset{T'}{\mathrm{Minimize}} \sum_{i=1}^{n} (T_i - T_i')$    (3.5)

4. Maximum percentage error (MPE), i.e.,

$cf = \underset{T'}{\mathrm{Minimize}} \; \max_i \left[ (T_i - T_i')/T_i \right]$    (3.6)

The choice of the cost function is application-dependent. For example, a


resource-constrained system could optimize for reduced system utilization,
whereas an application with relatively stricter period requirements could
prefer to minimize the maximum percentage error from the original task
set. The list of cost functions above is not exhaustive, but the algorithms to
be presented in Section 3.3 can be applied to any cost function whose error
increases as the transformed periods decrease in value. We refer to such
cost functions as rational cost functions. We discuss rational cost functions
in detail in Section 3.4.1.
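Each of the four metrics in Equations (3.3)–(3.6) is a direct computation over the original and transformed period sets. The following Python sketch (ours; the execution times used are hypothetical) evaluates all four for a candidate assignment.

def costs(T, Tp, C):
    """Evaluate TPE, TSU, FOE and MPE (Eqs. 3.3-3.6) for a candidate
    harmonic period set Tp against the original periods T."""
    return {
        "TPE": sum((t - tp) / t for t, tp in zip(T, Tp)),
        "TSU": sum(c / tp for c, tp in zip(C, Tp)),
        "FOE": sum(t - tp for t, tp in zip(T, Tp)),
        "MPE": max((t - tp) / t for t, tp in zip(T, Tp)),
    }

# The FOE-optimal set that will appear in Table 3.1 for
# T = {20, 45, 136, 415}, with hypothetical execution times C:
print(costs([20, 45, 136, 415], [15, 45, 135, 405], [1, 2, 10, 40]))
# FOE = 5 + 0 + 1 + 10 = 16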

3.2.3 Mathematical Representation of Harmonic Sets

We now derive a mathematical representation for our problem definition.


First, consider an integer harmonic set $\zeta = \{\zeta_1, \zeta_2, \ldots, \zeta_n\}$ with $n$ numbers.
Since every $\zeta_i$ is an integer multiple of every smaller number in the set, every
$\zeta_i$ can be represented as
$\zeta_i = \zeta_1 \cdot \prod_{j=1}^{i} r_j$    (3.7)

where we refer to $r_j$, a positive integer, as the period ratio, i.e.,

$r_j = \begin{cases} 1, & j = 1 \\ \zeta_j / \zeta_{j-1}, & 1 < j \le n \end{cases}$

That is, every period $T_i'$ in the harmonic period set can be represented as

$T_i' = T_1' \cdot \prod_{j=1}^{i} r_j$    (3.8)

Example: Consider the integer harmonic set {10, 30, 180}. These numbers
correspond to $T_1' = 10$, $r = \{1, 3, 6\}$. Since every number in the harmonic
period set is less than or equal to the corresponding number in the
original task set, we have,

$T_i = T_i' + K_i \text{ where } K_i \in \mathbb{N}_{\ge 0}, \quad \text{i.e.,} \quad T_i = \Big( T_1' \cdot \prod_{j=1}^{i} r_j \Big) + K_i$    (3.9)
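In code, Equation (3.8) says that a candidate harmonic set is generated by a first period and a running product of integer ratios. A small illustrative Python sketch (the original period values chosen below are hypothetical):

from itertools import accumulate
from operator import mul

def harmonic_from_ratios(T1p, ratios):
    """Eq. (3.8): T'_i = T'_1 * r_2 * ... * r_i (r_1 = 1 is implicit)."""
    return list(accumulate([T1p] + list(ratios), mul))

Tp = harmonic_from_ratios(10, [3, 6])
print(Tp)                                  # [10, 30, 180], the example above
T = [12, 35, 200]                          # hypothetical original periods
print([t - tp for t, tp in zip(T, Tp)])    # the K_i of Eq. (3.9): [2, 5, 20]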

3.3 Optimal Harmonic Search Algorithms

In this section, we detail our Brute-Force Harmonic Search (BFHS) algo-


rithm and our Branch-And-Bound Harmonic Search (BBHS) algorithm,
both of which solve the problem defined in Section 3.2.2.

3.3.1 Brute-Force Harmonic Search Algorithm

Lemma 1. Given an input period set T, the valid values for $T_1'$ of the output harmonic period set range from 1 to $\lfloor T_1 \rfloor$.

Proof. Every element of the output harmonic period set is constrained to be an integer less than or equal to the corresponding element in the input period set. Hence, the largest value $T_1'$ can take is $\lfloor T_1 \rfloor$. Since $T_1'$ is a positive integer, its minimum value is 1.

Lemma 2. Given an input period set T, any period ratio $r_i$, $\forall i > 1$, of the output harmonic period set ranges from 1 to $\lfloor T_i / T_{i-1}' \rfloor$.

Proof. We have $T_i' \le T_i$ for all i, and $T_i'$ is constrained to be a positive integer. Hence, $r_i$ must also be a positive integer and, since all tasks are ordered by non-decreasing periods, the maximum value $r_i$ can take is $\lfloor T_i / T_{i-1}' \rfloor$ and the minimum value of $r_i$ is 1.

Algorithm 1 Brute-Force Harmonic Search Algorithm

1: global variables
2:   cf, cost function under consideration
3:   optT′, optimal harmonic series for cf (Output)
4:   optErr, optimal error w.r.t. the input period set
5: end global variables
6: procedure bruteForceHarmonicSearch(τ, cf)
7:   τ ← input task set
8:   T ← period set from τ
9:   cf ← cost function to minimize
10:   optErr ← numMax
11:   T′1max ← ⌊T1⌋
12:   T′ ← initialize temporary period set
13:   for each firstElement from T′1max down to 1 do
14:     T′1 ← firstElement
15:     if !validPeriod(1, T′1, τ) then
16:       break
17:     calculateNextElement(2, T′, τ)
18:   return optT′
19: procedure calculateNextElement(i, T′, τ)
20:   i ← index of element to be calculated
21:   T′ ← harmonic period set to be updated
22:   T ← period set from τ
23:   n ← size of T
24:   rimax ← ⌊Ti / T′i−1⌋
25:   for each ri from rimax down to 1 do
26:     T′i ← T′i−1 ∗ ri
27:     if !validPeriod(i, T′i, τ) then
28:       break
29:     if i == n then    ▷ i.e., this is the last element of the series
30:       error ← calculateError(cf, T′, τ, n)
31:       if error < optErr then
32:         optT′ ← T′
33:         optErr ← error
34:     else
35:       calculateNextElement(i + 1, T′, τ)

From Lemma 1 and Lemma 2, we have bounds on $T_1'$ and $r_i$. We iteratively calculate all possible output harmonic period sets and pick the one that optimizes the cost function under consideration. Algorithm 1 presents this iterative calculation. We start by picking the first element of the output harmonic series from its valid range according to Lemma 1. We then recursively calculate the other elements in the series. This can be visualized in Table 3.1. We repeat the process for each valid first element. At each step, we calculate the error of the output harmonic period set w.r.t.

Table 3.1: BFHS Example for cf = FOE

Original Period Set →   20  45  136  415
Brute-Force Harmonic Search Iterations    Error
20                                        -
20  40                                    -
20  40  120                               -
20  40  120  360                          76
20  40  120  240                          196
20  40  120  120                          316
20  40  80                                -
20  40  80  400                           76
...
15  45  135  405                          16
...

Table 3.2: BBHS Example for cf = FOE

Original Period Set →   20  45  136  415
Branch-And-Bound Harmonic Search Iterations    Error, Index
20                                             0, 1
20  40                                         5, 2
20  40  120                                    76, 3
20  40  120  360                               76, 4
20  40  80                                     -
...
15  45  135  405                               16, 4
...
14  28  126                                    19, 3
...
10  40                                         15, 2
...

the input period set. On completion, we pick the output harmonic period
set that optimizes the selected cost function. For the example in Table 3.1,
we select the First Order Error (FOE) cost function. Hence, for the input
period set T = {20, 45, 136, 415}, we select the output harmonic period set
T 0 = {15, 45, 135, 405}. It is important to note that the priority ordering of
the harmonized task set remains identical to the RM priorities of the orig-
inal task set, if ties are broken in favor of smaller original period values.
The algorithm also checks the feasibility of the solution, i.e., $C_i \le T_i'$, and it returns the optimal solution if one exists, else returns null.
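The following is a compact, runnable Python sketch of BFHS specialized to the FOE cost function (variable names are ours, not the pseudocode's); it enumerates $T_1'$ per Lemma 1 and the ratios per Lemma 2, and checks feasibility $C_i \le T_i'$. Integer periods are assumed.

```python
import math

# A minimal Python sketch of BFHS for the FOE cost function; 'periods'
# are the original integer periods (non-decreasing) and 'wcets' the
# matching worst-case execution times. Names are illustrative.
def bfhs_foe(periods, wcets):
    n = len(periods)
    best, best_err = None, math.inf

    def extend(i, cand):
        nonlocal best, best_err
        if i == n:
            err = sum(t - tp for t, tp in zip(periods, cand))
            if err < best_err:
                best, best_err = cand[:], err
            return
        for r in range(periods[i] // cand[-1], 0, -1):  # Lemma 2 bound
            tp = cand[-1] * r
            if tp < wcets[i]:   # infeasible (C_i > T_i'); smaller r only shrinks tp
                break
            extend(i + 1, cand + [tp])

    for t1 in range(int(periods[0]), 0, -1):            # Lemma 1 bound
        if t1 < wcets[0]:
            break
        extend(1, [t1])
    return best, best_err

# Table 3.1 example: bfhs_foe([20, 45, 136, 415], [1, 1, 1, 1])
# returns ([15, 45, 135, 405], 16).
```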

3.3.2 Branch-And-Bound Harmonic Search Algorithm (BBHS)

The BFHS algorithm, being an exhaustive search, checks a very large number of output harmonic period sets in order to find the optimal solution. The Branch-And-Bound Harmonic Search (BBHS) algorithm reduces this search space by pruning candidate output harmonic period sets according to the following properties.

• Error Bounding: Similar to the BFHS algorithm, the BBHS algorithm keeps track of the minimum error seen up to the current point of execution. The BBHS algorithm also calculates the error of the output harmonic period set w.r.t. the input period set every time a new element gets added to the series. If, at any step, this error exceeds the currently known minimum error value, the BBHS algorithm stops exploring the branches along that path. This is illustrated in Table 3.2. The optimal error when the algorithm is in the process of forming the series T′ = {14, 28, 126, −} is 16, but, in this case, the error with just the first 3 elements is 19, which already exceeds the known minimum error of 16 up to this point. This allows the BBHS algorithm to bound the search space efficiently. It is also interesting to note that this bounding criterion gets stricter as new period sets with lower errors are found.

• Node Bounding: We first present a Lemma.

Lemma 3. Given a harmonic period set T′ arranged in non-decreasing order, the values of the subsequent elements, i.e., $T_j'\ \forall j > i$, for a given element $T_i'$, depend completely on the current element $T_i'$ and not on any of the previous elements, i.e., $T_j'\ \forall j < i$.

Proof. From the mathematical representation of harmonic sets from Section 3.2.3,

Algorithm 2 Branch-and-Bound Harmonic Search

1: global variables
2:   cf, cost function under consideration
3:   optT′, optimal harmonic series for cf (Output)
4:   optErr, optimal error w.r.t. the input period set
5: end global variables
6: procedure branchAndBoundHarmonicSearch(τ, cf)
7:   τ ← input task set
8:   T ← period set from τ
9:   cf ← cost function to minimize
10:   optErr ← numMax
11:   T′1max ← ⌊T1⌋
12:   T′ ← initialize temporary period set
13:   for each firstElement from T′1max down to 1 do
14:     T′1 ← firstElement
15:     if !validPeriod(1, T′1, τ) then
16:       break
17:     calculateNextElement(2, T′, τ)
18:   return optT′
19: procedure calculateNextElement(i, T′, τ)
20:   i ← index of element to be calculated
21:   T′ ← harmonic period set to be updated
22:   T ← input period set
23:   n ← size of T
24:   rimax ← ⌊Ti / T′i−1⌋
25:   for each ri from rimax down to 1 do
26:     T′i ← T′i−1 ∗ ri
27:     if !validPeriod(i, T′i, τ) then
28:       break
29:     if i == n then    ▷ i.e., this is the last element of the series
30:       error ← calculateError(cf, T′, τ, n)
31:       if error < optErr then
32:         optT′ ← T′
33:         optErr ← error
34:       break    ▷ last-element bounding: the largest valid ri is optimal
35:     else
36:       error ← calculateError(cf, T′, τ, i)
37:       if error >= optErr then
38:         break    ▷ error bounding
39:       if nodeVisited(i, T′i, error) then
40:         continue    ▷ node bounding
41:       else
42:         storeVisitedNode(i, T′i, error)
43:         calculateNextElement(i + 1, T′, τ)

we have, from Equation (3.8),

$$T_i' = T_1' \cdot \prod_{j=1}^{i} r_j$$

$$T_2' = T_1' \cdot r_1 \cdot r_2 = T_1' \cdot r_2 \quad (\because r_1 = 1)$$
$$T_3' = T_1' \cdot r_2 \cdot r_3 = T_2' \cdot r_3$$
$$T_4' = T_1' \cdot r_2 \cdot r_3 \cdot r_4 = T_3' \cdot r_4 \quad (3.10)$$

and so on.

Hence, the value of the next element in the harmonic series only depends on the
value of the previous element.

From Lemma 3, we see that subsequent elements of the harmonized series only depend on the current element under consideration, irrespective of the preceding elements. Hence, when BBHS adds a new element, also referred to as a node, we are guaranteed that the elements to follow will be independent of the elements before the current one. BBHS therefore stores each visited node and the current error associated with that node. If a node is revisited, the error is checked. If the error is greater than the stored value, BBHS terminates the series currently under consideration. This can be seen in Table 3.2. When BBHS visits the node 40 for the first time, it is processing the series T′ = {20, 40, −, −}. It stores the corresponding error, which in this case is 5, along with the node value of 40. When BBHS revisits the node 40 while processing the series T′ = {10, 40, −, −}, the error is now greater than 5, so the search with the prefix {10, 40} is pruned.

• Last-Element Bounding: From Lemma 3, we know that subsequent elements of a harmonized series only depend on the current element under consideration. The last element is special in this context, since it does not have any subsequent elements. Hence, the optimal choice for the last element can be made directly based on the cost function. In the case of rational cost functions, which will be discussed in Section 3.4.1, this turns out to be the value closest to the element from the input period set, i.e., the value resulting from the highest period ratio in its range as defined in Lemma 2. This can be seen in Table 3.2. For the series T′ = {20, 40, 120, 360}, BBHS does not check the other values for the last element, namely 240 and 120, since for the FOE cost function the highest value will always be the best.

Algorithm 2 highlights these bounding rules.
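The node-bounding rule can be captured with a small memo table; the sketch below (with hypothetical names) returns True when a node (i, T′i) was already reached with a prefix error no worse than the current one, in which case the current branch can be pruned.

```python
# Sketch of BBHS node bounding: visited[(i, ti_prime)] keeps the
# smallest prefix error seen at this node so far.
visited = {}

def should_prune(i, ti_prime, prefix_err):
    key = (i, ti_prime)
    if key in visited and visited[key] <= prefix_err:
        return True      # a cheaper path already passed through this node
    visited[key] = prefix_err
    return False
```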



Figure 3.1: Evaluation: Branch And Bound Harmonic Search Algorithm (Best Viewed in
Color)

3.3.3 Evaluation: Branch And Bound Harmonic Search

In this subsection, we evaluate the run-time performance of the BBHS algorithm against the BFHS algorithm. We plot the number of elements checked per iteration against the cardinality of the task set. We randomly generate 5000 task sets for a given cardinality and plot the average number of elements checked. For this experiment, we consider the FOE cost function. Figure 3.1 shows the results. It is important to note that the plot is in log scale. As the figure indicates, BBHS outperforms BFHS by up to four orders of magnitude.

3.4 Sub-Optimal Harmonic Search Heuristics

The algorithms described in Section 3.3 find optimal solutions for the problem statement from Section 3.2.2. Specifically, offline task-set design can greatly benefit from an optimal harmonic assignment. However, run-time applications may benefit from a sub-optimal solution that can be obtained very quickly. In this section, we present two such algorithms that can produce a sub-optimal solution to the problem statement in Section 3.2.2 quickly enough to be used during run-time admission control of task sets.

3.4.1 Discrete Piecewise Harmonic Search

Let us first consider a relationship between a geometric series and a harmonic series.

Lemma 4. An integer geometric series is always harmonic [7].

Proof. An integer geometric series can be represented as follows:

$$\zeta_i = m \cdot b^{x_i} \quad (3.11)$$

where we refer to $m \in \mathbb{Z}$ as the multiplier, $b \in \mathbb{N}_{>0}$ as the base, and $x_i \in \mathbb{N}_{\ge 0}$ as the exponent. From the mathematical representation of harmonic sets from Section 3.2.3, we have

$$T_i' = T_1' \cdot \prod_{j=1}^{i} r_j, \qquad T_i' = T_1' \cdot r^{\,i-1} \text{ if all period ratios are identical.} \quad (3.12)$$

Equation (3.12) is a geometric series; hence, an integer geometric series is always harmonic.

The Discrete Piecewise Harmonic Search (DPHS) algorithm that we present next attempts to find an optimal geometric series for a given period set with respect to a given cost function. For example, consider the input set {10, 31, 92, 183} with the FOE cost function: the optimal solution obtained from the BBHS algorithm is {10, 30, 90, 180}, with FOE = 6. DPHS outputs the series {10, 20, 80, 160}, with FOE = 46, which is a sub-optimal solution. However, in Section 3.4.3, we show that DPHS can calculate sub-optimal solutions significantly quicker than BBHS.

Mathematical Representation of a Harmonic Set as a Geometric Series

First, consider an integer harmonic set $\zeta = \{\zeta_1, \zeta_2, ..., \zeta_n\}$ with n numbers. Since every $\zeta_i$ is an integer multiple of a smaller number in the set, every $\zeta_i$ can be represented as

$$\zeta_i = m \cdot b^{x_i} \quad (3.13)$$

That is, every period $T_i'$ in the harmonic period set can be represented as

$$T_i' = m \cdot b^{x_i} \quad (3.14)$$

Given that every $T_i'$ represents a task period, the multiplier is a positive integer, i.e., $m \in \mathbb{N}_{>0}$.

Example: Consider the integer harmonic set {10, 20, 40}. These numbers correspond to m = 10, b = 2, x = {0, 1, 2}. Since every number in the harmonic period set is less than or equal to the corresponding number in the original task set, we have,

$$T_i = T_i' + K_i \text{ where } K_i \in \mathbb{N}_{\ge 0}, \qquad T_i = m \cdot b^{x_i} + K_i \quad (3.15)$$

Rational Cost Functions

Definition 1. A cost function for harmonization is said to be rational if its value increases as the deviation of any period in the resultant harmonic task set from the original task set increases.

Figure 3.2: Rational Cost Function Example — for the original task set [10, 20, 40, 81], harmonic assignment 1 = [10, 20, 40, 80] yields TPE = 1.2%, while harmonic assignment 2 = [10, 20, 40, 40] yields TPE = 50.6%. TPE increases as the deviation Δ increases; hence, TPE is rational.

Our goal is to find $m^*$ and $b^*$, which represent the values of the multiplier and the base optimizing a given rational cost function, as defined above. DPHS works for all rational cost functions. The commonly used error cost functions presented in Section 3.2.2 are rational. Below is an example. The total percentage error (TPE) cost function is given by

$$cf(m, b, x) = \min_{T'} \sum_{i=1}^{n} \frac{T_i - T_i'}{T_i} \;\Rightarrow\; cf(m, b, x) = \min_{m, b} \sum_{i=1}^{n} \frac{T_i - m \cdot b^{x_i}}{T_i} \quad (3.16)$$

Figure 3.2 illustrates the rationality of the TPE cost function. The ratio-
nality of the other cost functions from Section 3.2.2 can be inferred in the
same way.

Bounds on the multiplier, the base, and the exponent

In this section, we present some properties of the period values we are working with, in terms of the mathematical representation from Section 3.4.1.

Lemma 5. Given a set of periods T, the valid values for the multiplier m of the output harmonic period set range from 1 to $T_1$.

Proof. Every element of the output harmonic period set is constrained to be less than or equal to the corresponding element in the input period set. Consider the first element $T_1'$, which can be represented as follows:

$$T_1' = m \cdot b^{x_1} \le T_1 \;\Rightarrow\; m_{max} \le T_1 \text{ when } b = 1 \text{ or } x_1 = 0$$

The value of the multiplier is maximized when the second term, $b^{x_1}$, is 1, i.e., either $b = 1$ or $x_1 = 0$. Hence, the maximum value of the multiplier is $T_1$ and, since the multiplier is a positive integer, the minimum value for the multiplier is 1.

Lemma 6. Given a set of periods T, the base b of the output harmonic period set, for a given multiplier m, ranges from 1 to $\lfloor T_n/m \rfloor$.

Proof. We have $T_i' \le T_i$ for all i. Consider the last element $T_n'$, which can be represented as follows:

$$T_n' = m \cdot b^{x_n} \le T_n \;\Rightarrow\; b^{x_n} \le T_n/m \quad (\text{since } m \in \mathbb{N}_{>0})$$
$$b_{max} \le T_n/m \text{ when } x_n = 1 \;\Rightarrow\; b_{max} = \lfloor T_n/m \rfloor \quad (\text{since } b \in \mathbb{N}_{>0})$$

For a given multiplier m, the value of the base is maximized when $x_n = 1$. Hence, the maximum value of the base is $\lfloor T_n/m \rfloor$ and, since the base is a positive integer, the minimum value for the base is 1.

Lemma 7. Given an input set of periods T, the exponent $x_i$ for each element of the output harmonic period set ranges from 0 to $\lfloor \log_2(T_i/m) \rfloor$. For $b = 1$, the exponent is irrelevant (i.e., any value of $x_i \ge 0$ has the same effect).

Figure 3.3: Brute-Force Geometric Series Search Plot

Proof. We have $T_i' \le T_i$ for all i. That is,

$$T_i' = m \cdot b^{x_i} \le T_i \;\Rightarrow\; x_i \le \log_b(T_i/m)$$

Since $b \in \mathbb{N}_{>1}$ and $m \in \mathbb{N}_{>0}$, we have

$$(x_i)_{max} \le \log_2(T_i/m) \text{ when } b = 2, \qquad (x_i)_{max} = \lfloor \log_2(T_i/m) \rfloor \quad (\text{since } x_i \in \mathbb{N}_{\ge 0})$$

As the base decreases, the value of the exponent increases. For a given multiplier m, the value of the exponent is maximized when the value of the base is at its minimum, i.e., $b = 2$ in this case since $b \in \mathbb{N}_{>1}$. Hence, the maximum value of the exponent is $\lfloor \log_2(T_i/m) \rfloor$ for $b \in \mathbb{N}_{>1}$ and, since the exponent is a non-negative integer, the minimum value for the exponent is 0. When $b = 1$, the second term $b^{x_i}$ always equals 1, i.e., every $T_i' = m$, so any value of $x_i \ge 0$ has the same effect.

From Lemma 5 and Lemma 6, we have the bounds on the multiplier and the base of the output harmonic period set T′ for an input period set T. We need to iterate over all combinations of the multiplier and base, and find the optimal values $m^*$ and $b^*$ for the chosen cost function from Section 3.4.1. In each iteration, the values of m and b are known; the only unknown is x, which is efficiently calculated as the highest exponent to which the base b can be raised before $T_i'$ becomes greater than $T_i$.

Figure 3.4: Brute-Force Geometric Series Search tables (Best Viewed in Color)

This algorithm is captured in the findClosestHarmonicSeries function in Algorithm 3. As an example, consider a simple task set τ = {(2, 12), (3, 35), (2, 112)}
as an input to the Brute-Force Geometric Series Search Algorithm. Sup-
pose the cost function of interest is minimizing the overall percentage er-
ror, defined in Section 3.2.2. Figure 3.3 plots the scaled total percentage
error for every combination of the multiplier and base as calculated by the
Brute-Force Geometric Series Search Algorithm. From Lemma 5, the multiplier values range from 1 to 12 since $T_1 = 12$. For each multiplier, the bases have
a different range as given by Lemma 6 and represented in the plot. For a
given multiplier and a range of bases, the plot shows a constant decreasing
trend in the total percentage error. Also, the trend repeats for different
multipliers and base ranges. Figure 3.4 highlights the values correspond-
ing to one such trend, i.e., for m = 5 and b from 8 to 22. The figure shows
the output harmonic period set and the exponent values corresponding to

Algorithm 3 Brute-Force Geometric Series Search Algorithm

1: procedure findClosestHarmonicSeries(τ, m, b)
2:   m ← multiplier
3:   b ← base
4:   τ ← input task set
5:   T ← period set from τ
6:   for each τi in τ do
7:     xi ← ⌊log_b(Ti/m)⌋
8:     T′i ← m ∗ b^xi
9:   return T′
10: procedure bruteForceGeometricSeriesSearch(τ, cf)
11:   τ ← input task set
12:   T ← period set from τ
13:   cf ← cost function to minimize
14:   mmax ← T1
15:   error ← numMax
16:   for each multiplier m from 1 to mmax do
17:     bmax ← ⌊Tn/m⌋
18:     for each base b from 1 to bmax do
19:       T′temp ← findClosestHarmonicSeries(τ, m, b)
20:       errorTemp ← calculateError(cf, τ, T′temp)
21:       if errorTemp < error and ∀i, Ci ≤ T′temp,i then
22:         T′ ← T′temp
23:         error ← errorTemp
24:   return T′    ▷ Return the closest feasible harmonic period set, or null to indicate an infeasible result

each element in this set, for a given multiplier and base combination. It also shows the corresponding scaled cumulative percent error (SCTPE) of the output harmonic period set with respect to the original task set T. Notice that, for this range, the values of $x_i$ remain constant across all entries, i.e., x = {0, 0, 1}. As can be seen, only the last entry in this range is of interest in the search for an optimal assignment. This is the key insight on which the Discrete Piecewise Harmonic Search algorithm presented in Algorithm 5 is based.
As before, we will compare our approach with a brute-force search. Algorithm 3 presents the brute-force geometric series search approach. Notice that the brute-force algorithm also checks the feasibility of the solution, i.e., $C_i \le T_i'$, and it returns a solution if one exists, else returns null.
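A runnable Python sketch of Algorithm 3 for the TPE cost function follows (names are ours); integer arithmetic is used so the exponent is simply the largest x with $m \cdot b^x \le T_i$, matching findClosestHarmonicSeries.

```python
import math

# Sketch of Algorithm 3 (brute-force geometric series search) with the
# TPE cost function; 'periods' are integer periods in non-decreasing
# order and 'wcets' the matching execution times.
def find_closest_harmonic_series(periods, m, b):
    out = []
    for t in periods:
        tp = m
        if b > 1:
            while tp * b <= t:   # largest exponent with m * b**x <= t
                tp *= b
        out.append(tp)
    return out

def brute_force_geometric_search(periods, wcets):
    best, best_err = None, math.inf
    for m in range(1, int(periods[0]) + 1):        # Lemma 5 bound
        for b in range(1, periods[-1] // m + 1):   # Lemma 6 bound
            cand = find_closest_harmonic_series(periods, m, b)
            if any(c > tp for c, tp in zip(wcets, cand)):
                continue                           # infeasible: some C_i > T_i'
            err = sum((t - tp) / t for t, tp in zip(periods, cand))
            if err < best_err:
                best, best_err = cand, err
    return best, best_err
```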

Figure 3.5: Discrete Piecewise Function Example — f(x) = x for integer x, 0 ≤ x ≤ 15, and f(x) = 2x − 31 for integer x, 16 ≤ x ≤ 30.

Discrete Piecewise Optimization

In this section, we present the discrete piecewise optimization approach.

Definition 2. A function is said to be discrete piecewise if it can be represented as follows:

$$f(x) = \begin{cases} g_1(x), & x \in S_1 \\ g_2(x), & x \in S_2 \\ \quad \vdots \end{cases}$$

Figure 3.5 illustrates an example of a discrete piecewise function.

The optimal value of a given discrete piecewise function f(x) can be obtained by finding the best among the local optimal values of each piecewise segment, i.e.,

$$f(x)_{opt} = Opt\{g_1(x)_{opt}, g_2(x)_{opt}, ...\}$$



Properties and Description

In this section, we first express the harmonization problem as a discrete piecewise optimization problem and highlight its properties. We then describe the Discrete Piecewise Harmonic Search algorithm in detail. In Section 3.4.1, we showed that the multiplier and the base fall within fixed ranges of values. The TPE cost function over these ranges can be expressed as follows:

$$cf(m, b, x) = \min_{T'} \sum_{i=1}^{n} (T_i - m \cdot b^{x_i})/T_i \quad \forall m, b$$

We decompose the above equation over ranges with a fixed multiplier and a range of bases such that all values of $x_i$ remain identical:

$$cf(b) = \begin{cases} g_1(b) = \min_{T'} \sum_{i=1}^{n} (T_i - m_1 \cdot b^{x_i})/T_i, & 0 < b \le b_{j_1} \\ g_2(b) = \min_{T'} \sum_{i=1}^{n} (T_i - m_1 \cdot b^{x_i})/T_i, & b_{j_1} < b \le b_{j_2} \\ \quad \vdots \\ g_{k-1}(b) = \min_{T'} \sum_{i=1}^{n} (T_i - m_1 \cdot b^{x_i})/T_i, & b_{j_l} < b \le \lfloor T_n/m_1 \rfloor \\ g_k(b) = \min_{T'} \sum_{i=1}^{n} (T_i - m_2 \cdot b^{x_i})/T_i, & 0 < b \le b_{m_1} \\ \quad \vdots \end{cases}$$
As can be seen above, the harmonization cost function takes the form
of a discrete piecewise function. We now find the local optimal values of

each piecewise segment.

Lemma 8. Given a set of harmonic period sets with a fixed multiplier and a range of bases such that all values of $x_i$ remain identical, the harmonic period set with the largest base will be the local optimum (i.e., within the range of bases specified) w.r.t. any rational cost function.

Proof. Since the multiplier and exponents for all the periods in each set remain constant, the value of each period depends solely on the value of the base. Hence, the greater the value of the base, the greater the value of the second term, i.e., $b^{x_i}$, and the closer the period is to the corresponding period in the original task set, resulting in a lower error value. Hence, under the above conditions, any cost function that can be reduced to the form

$$cf(m, b, x) = \max_{b} b$$

is rational and has a local optimum at $b = b_{max}$ in the given base range. This can also be thought of as follows: every base b before which the exponent $x_i$ of any term changes is a locally optimal solution. Consider the cost function used in Figure 3.3:

$$cf(m, b, x) = \min_{T'} \sum_{i=1}^{n} (T_i - m \cdot b^{x_i})/T_i \;\Rightarrow\; cf(b) = \min_{b} \sum_{i=1}^{n} (T_i - m \cdot b^{x_i})/T_i$$

Since m and $x_i$ are constants,

$$cf(b) = \max_{b} \sum_{i=1}^{n} (m \cdot b^{x_i}) \;\Rightarrow\; cf(b) = \max_{b} b$$

Thus, the total percentage error cost function's local optimal value will be at $b = b_{max}$ in the given base range. The property of Lemma 8 can be clearly seen in Figure 3.4, where the multiplier is fixed at 5, the base b increases from 8 to 22, and x = {0, 0, 1}. $T_3'$ increases from 40 to 110, which is the highest value it can take for $x_3 = 1$. Hence, under these conditions, b = 22 will always be the local optimal solution for m = 5 and b ranging from 8 to 22. Corresponding results also apply to other rational cost functions.

Lemma 9. Given a harmonic period set represented as $m \cdot b^{x_i}$, for a given multiplier m, the bases where the local optimal solutions occur are $\lfloor \sqrt[p]{T_i/m} \rfloor$, where p ranges from 1 to $(x_i)_{max}$, together with $b = 1$.

Proof. From Lemma 6, we know that, for a fixed multiplier m, the base varies from 1 to $\lfloor T_n/m \rfloor$. From Lemma 7, we know that, as the base increases, the exponent decreases from $\lfloor \log_2(T_i/m) \rfloor$ to 0. We know, from Lemma 8, that every base before which the exponent changes is a local optimal solution. This power flip occurs when the base is just big enough that the exponent has to be decreased, which happens when a base reaches the maximum possible value a given $x_i$ can support:

$$T_i' = m \cdot b^{x_i} \le T_i \;\Rightarrow\; b^{x_i} \le T_i/m \quad (m \in \mathbb{N}_{>0})$$

$$b_{max} = \begin{cases} \big\lfloor \sqrt[x_i]{T_i/m} \big\rfloor, & 0 < x_i \le \lfloor \log_2(T_i/m) \rfloor \\ 1, & x_i = 0 \end{cases}$$

For any value of $x_i$ from 1 to $\lfloor \log_2(T_i/m) \rfloor$, $\lfloor \sqrt[x_i]{T_i/m} \rfloor$ will be in the set of local optimal bases and, for $x_i = 0$, it is equivalent to $b = 1$.

The Discrete Piecewise Harmonic Search (DPHS) Algorithm

Given the above context, we now present the Discrete Piecewise Harmonic Search algorithm in Algorithm 5. We consider the entire multiplier range from 1 to $T_1$ (from Lemma 5), but prune the number of bases using Lemma 9.
Algorithm 4 GetLocalMinima

1: procedure getLocalMinima(m, T)
2:   m ← multiplier
3:   T ← input period set
4:   B ← {1}
5:   for each Ti in T do
6:     for each x from 1 to ⌊log₂(Ti/m)⌋ do
7:       B ← B ∪ {⌊(Ti/m)^(1/x)⌋}
8:   return unique(B)

Algorithm 5 Discrete Piecewise Harmonic Search Algorithm

1: procedure DPHS(τ, cf)
2:   τ ← input task set
3:   T ← period set from τ
4:   cf ← cost function to minimize
5:   mmax ← T1
6:   error ← numMax
7:   for each multiplier m from 1 to mmax do
8:     B ← getLocalMinima(m, T)
9:     for each base b in B do
10:       T′temp ← findClosestHarmonicSeries(τ, m, b)
11:       errorTemp ← calculateError(cf, τ, T′temp)
12:       if errorTemp < error and ∀i, Ci ≤ T′temp,i then
13:         T′ ← T′temp
14:         error ← errorTemp
15:   return T′    ▷ Return the closest feasible harmonic period set, or null to indicate an infeasible result

The getLocalMinima function (Algorithm 4) takes an input period set and a multiplier value and calculates the local optima. The DPHS algorithm then searches only these local optima to optimize the given cost function.
Consider the example in Section 3.4.1, where m = 5 and x = {0, 0, 1} results in a range of bases from 8 to 22. DPHS will search only $b = \lfloor \sqrt[p]{T_i/m} \rfloor$, i.e., $b = \lfloor \sqrt[1]{112/5} \rfloor = 22$. As can be seen from Figure 3.4, b = 22 is a local minimum.
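A small Python sketch of getLocalMinima follows (names are ours); per Lemma 9, for a fixed multiplier only the floors of the p-th roots of $T_i/m$, plus b = 1, need to be tried. Floating-point rounding of the root is ignored in this sketch.

```python
import math

# Sketch of GetLocalMinima (Algorithm 4): for a fixed multiplier m, only
# b = floor((Ti/m)**(1/x)) for x = 1..floor(log2(Ti/m)), plus b = 1,
# can be local optima (Lemma 9).
def get_local_minima(m, periods):
    bases = {1}
    for t in periods:
        ratio = t / m
        x_max = int(math.log2(ratio)) if ratio >= 2 else 0
        for x in range(1, x_max + 1):
            bases.add(int(ratio ** (1.0 / x)))  # rounding edge cases ignored
    return sorted(bases)

# With m = 5 and T = [12, 35, 112], the candidates include
# floor((112/5)**(1/1)) = 22, the local minimum shown in Figure 3.4.
```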
It is important to note that the priority ordering of the harmonized task set continues to remain identical to the RM priorities of the original task set, if ties are broken in favor of smaller original period values. This is a property of the findClosestHarmonicSeries method, which assigns the harmonic closest to the original period value. This limits all tasks with periods larger than a given harmonic to the value of that harmonic, resulting in priorities identical to the RM priorities of the original task set.

3.4.2 Period-Ratio Harmonic Search (PRHS)

In this section, we describe another sub-optimal search algorithm, the Period-Ratio Harmonic Search. From Lemma 2, we have seen that the range of period ratios can be significantly large. Also, the period ratios depend on the current element under consideration, i.e., $T_i'$. The PRHS algorithm attempts to keep the period ratios as close as possible to those of the input period set. Hence, it limits the period ratios checked to $\lceil T_{i+1}/T_i \rceil + 1$, $\lceil T_{i+1}/T_i \rceil$, $\lfloor T_{i+1}/T_i \rfloor$ and $\lfloor T_{i+1}/T_i \rfloor - 1$, if they are valid for a given $T_i'$. Hence, instead of checking all possible period ratios, the PRHS algorithm checks at most four, as sketched below. The motivation for this algorithm comes from observations of the optimal solutions obtained from the BFHS and BBHS algorithms in Section 3.3, which often have period ratios close to those of the original period set.
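The candidate-ratio selection can be written as a few lines of Python (an illustrative helper, not the full search):

```python
import math

# Sketch of PRHS candidate-ratio selection: for consecutive original
# periods, only four integer ratios around T_{i+1}/T_i are tried.
def candidate_ratios(t_cur, t_next):
    q = t_next / t_cur
    cands = {math.ceil(q) + 1, math.ceil(q), math.floor(q), math.floor(q) - 1}
    return sorted(r for r in cands if r >= 1)

# candidate_ratios(45, 136) -> [2, 3, 4, 5]
# candidate_ratios(20, 40)  -> [1, 2, 3]  (ceil == floor for exact ratios)
```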

Table 3.3: Harmonic Search outputs


C 5 2 1 5 3 8 2 9 5 3 1 1 3 1 3 1 1
Input Period Set 25 25 40 50 50 59 80 80 100 200 200 200 200 200 200 1000 1000
Utilizations 0.2 0.08 0.025 0.1 0.06 0.13559 0.025 0.1125 0.05 0.015 0.005 0.005 0.015 0.005 0.015 0.001 0.001 Error
BBHS 25 25 25 50 50 50 50 50 100 200 200 200 200 200 200 1000 1000 0.972
TSU DPHS 25 25 25 50 50 59 80 80 100 200 200 200 200 200 200 800 800 0.9725
PRHS 25 25 25 50 50 50 50 50 100 200 200 200 200 200 200 1000 1000 0.972
BBHS 25 25 25 50 50 50 50 50 100 200 200 200 200 200 200 1000 1000 84
FOE DPHS 8 8 40 40 40 40 40 40 40 200 200 200 200 200 200 1000 1000 213
PRHS 25 25 25 50 50 50 50 50 100 200 200 200 200 200 200 1000 1000 84
BBHS 25 25 25 50 50 50 50 50 100 200 200 200 200 200 200 1000 1000 127
TPE DPHS 25 25 25 50 50 59 80 80 100 200 200 200 200 200 200 800 800 244
PRHS 25 25 25 50 50 50 50 50 100 200 200 200 200 200 200 1000 1000 127
BBHS 20 20 40 40 40 40 80 80 80 160 160 160 160 160 160 960 960 32
MPE DPHS 20 20 40 40 40 40 80 80 80 160 160 160 160 160 160 640 640 36
PRHS 20 20 40 40 40 40 80 80 80 160 160 160 160 160 160 960 960 32

Figure 3.6: Run-time Performance Evaluation: DPHS vs BBHS vs PRHS vs Sr (Best Viewed in Color)

3.4.3 Evaluation

In this sub-section, we evaluate the run-time performance of the sub-optimal algorithms DPHS, PRHS and Sr [7] against the optimal BBHS algorithm.
In Figure 3.6, we plot the execution time per iteration against the cardinal-
ity of the task set. Results presented are averaged across 5000 randomly
generated task sets for a given task set cardinality. When we consider the
FOE cost function, both the DPHS and PRHS algorithms are significantly
faster than the BBHS algorithm. Figure 3.7 plots the average error per task
set against the task set cardinality. While, as expected, both the DPHS
and PRHS algorithms produce sub-optimal results, their average errors are
close to the best obtained by BBHS. In Figure 3.8, we plot the execution
time per iteration against the range of values of the task periods. Results
presented are also averaged across 5000 randomly generated task sets for
a given task set cardinality. When we consider the MPE cost function,

Figure 3.7: FOE Evaluation: DPHS vs BBHS vs PRHS vs Sr (Best Viewed in Color)

both the DPHS and PRHS algorithms are again significantly faster than
the BBHS algorithm. Figure 3.9 plots the average error per task set against
the task set cardinality. Again, the DPHS and PRHS algorithms produce
sub-optimal results but only slightly worse. It is interesting to note that,
between DPHS and PRHS, there is also a trade-off between accuracy and
run-time duration. On average, DPHS is faster but has greater error. Sr, on the other hand, is significantly faster than the other approaches, but its error performance is also significantly worse.
We next show the benefits of the harmonic search algorithms by con-
sidering real-world applications. We consider an avionics task set used by
Locke et al. [74]. Table 3.3 presents the outputs of harmonic search algo-
rithms proposed in this chapter.
Let us first consider the TSU cost function. The goal of the TSU cost
function is to reduce the total utilization of the resultant harmonic period
set. Hence, the period assignment will be biased towards maintaining the

Figure 3.8: Run-time Performance Evaluation: DPHS vs BBHS vs PRHS vs Sr (Best Viewed in Color)

periods of tasks with relatively high utilization. This is evidenced by the fact that the DPHS algorithm leaves the period of the highest-utilization
task τ1 unchanged. Next, we consider the FOE cost function. The algo-
rithms ensure that the period values of τ16 and τ17 are not reduced, as
they have the largest period values and the maximum potential for caus-
ing first-order errors. Similar arguments can be made for the MPE and
TPE cost functions. It is interesting to note here that, in this example, the
PRHS algorithm finds the optimal solutions for all the cost functions con-
sidered. This is because the original task set was close to being harmonic
and, hence, the optimal output maintains similar period ratios as the orig-
inal task set. Also, since, the original task set was not close to a geometric
series, the DPHS algorithm yields sub-optimal solutions.
It is also noteworthy that there can exist significantly different optimal
solutions based on the choice of the cost function.

Figure 3.9: MPE Evaluation: DPHS vs BBHS vs PRHS vs Sr (Best Viewed in Color)

3.5 Chapter Summary

The first step in the deployment of autonomous driving applications is selecting task execution parameters. In this chapter, we presented the design and resource optimization benefits of harmonic task sets. We then addressed the problem of assigning harmonic periods to an arbitrary task
set such that every task gets assigned an integer period less than or equal
to its application-specified upper bound and the task utilization of every
task is less than 100%. Using a mathematical framework to represent in-
teger harmonic task sets, we first presented a brute-force harmonic search
algorithm. Next, we presented the Branch-and-Bound Harmonic Search
algorithm which improves on the performance of the brute-force harmonic
search by a few orders of magnitude. We also presented two algorithms,
the “Discrete Piecewise Harmonic Search” (DPHS) and the “Period Ratio Har-
monic Search” (PRHS), to find sub-optimal solutions which can be used by

run-time admission control and other contexts that are time-sensitive and
can afford sub-optimal solutions. We also demonstrated the benefits of our
approach by considering real-world task sets.
Chapter 4

System Model

4.1 Computation Model

In this dissertation, we consider a distributed system consisting of N computational nodes, where each node can communicate with every other node in the system by sending messages. We assume a set of n tasks,
node in the system by sending messages. We assume a set of n tasks,
(τ1 , τ2 , ..., τn ), where each task is assigned a unique priority based on the
Rate-Monotonic Scheduling (RMS) [3] policy. We assume that the tasks
are sorted in non-increasing order of priorities. We assume that a higher-
priority task can immediately preempt a lower-priority task. Each task τi
is assumed to have a worst-case execution time (WCET) of Ci , a period
of Ti and an implicit deadline Di = Ti . The analysis can be adapted to
other scheduling policies and deadline models (e.g., D < T), as long as a
response-time analysis is available. Each task τi may be blocked by lower-
priority tasks for at most Bi units of time as a result of the operation of
a concurrency control protocol like the Priority Ceiling Protocol [75]. We
assume that the worst-case release jitter, the worst-case time a task τi can
spend waiting to be released after arrival, is Ji [76].
The schedulability of a task can be evaluated using the response-time


analysis presented in [76]:

$$r_i^{n+1} = C_i + B_i + \sum_{j=1}^{i-1} \left\lceil \frac{r_i^n + J_j}{T_j} \right\rceil C_j, \qquad r_i^0 = \sum_{j=1}^{i} C_j \quad (4.1)$$

Equation (4.1) represents an iterative solution, which starts at $r_i^0$ and terminates when either $r_i^{n+1} = r_i^n$ (in which case $R_i = r_i^n$) or $r_i^{n+1} > D_i$. We refer to $R_i$ as the worst-case response time of task τi. $R_i$ is measured from the instant the task is released to its completion. The worst-case time from arrival to completion of task τi [76], also known as the worst-case completion time (WCCT), is given by

$$WCCT_i = R_i + J_i \quad (4.2)$$

A task is said to be schedulable if its $WCCT_i \le D_i$.
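Equation (4.1) translates directly into a fixed-point iteration; a minimal Python sketch (tasks given as (C, T, B, J) tuples sorted by decreasing RM priority, names illustrative) is shown below.

```python
import math

# Fixed-point iteration of Equation (4.1); tasks are (C, T, B, J) tuples
# sorted by decreasing RM priority, with implicit deadlines D_i = T_i.
def wcct(tasks, i):
    C, T, B, J = tasks[i]
    r = sum(tj[0] for tj in tasks[: i + 1])       # r_i^0 = sum of C_j, j <= i
    while True:
        r_next = C + B + sum(
            math.ceil((r + Jj) / Tj) * Cj for (Cj, Tj, Bj, Jj) in tasks[:i]
        )
        if r_next == r:
            return r + J                          # WCCT_i = R_i + J_i  (4.2)
        if r_next > T:                            # exceeds D_i: unschedulable
            return None
        r = r_next
```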

4.2 Task Model

We consider that the life-cycle of every execution of a task is divided into three parts:

1. Accept inputs,

a) Application-level inputs like sensor readings

b) Fault-Tolerance library inputs (e.g., status messages)

2. Perform calculations, and

3. Produce outputs (a minimal skeleton of this life-cycle is sketched below)
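The following Python skeleton illustrates the three phases; all function bodies are placeholders of our own, not part of any framework named in this dissertation.

```python
# Placeholder three-phase task body matching the life-cycle above.
def read_inputs():
    return {"sensor": 0.0}, []          # application inputs, FT status messages

def compute(app_inputs):
    return {"state": app_inputs}, [0]   # updated state, outputs to produce

def write_outputs(outputs):
    print(outputs)                      # stand-in for driving actuators

def task_body():
    app_inputs, ft_msgs = read_inputs()    # 1) accept inputs
    state, outputs = compute(app_inputs)   # 2) perform calculations
    write_outputs(outputs)                 # 3) produce outputs at the end
```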

In our system model, we assume that all tasks are independent from one
another and they do not self-suspend during execution. In other words, the

lifecycle of every task follows the above steps. All I/O operations, includ-
ing reading sensor values, are completed before performing calculations
and outputs including those driving actuators are produced at the end
of the task life-cycle. This architecture is consistent with the AUTOSAR
standard [22]. A task that performs all of the above operations is called a
primary. A fail-stop failure of a primary is tolerated by the system by pro-
visioning one or more standbys corresponding to each primary. Byzantine
faults caused by security breaches are beyond the scope of this dissertation.

4.3 Fault Model

In our model, we are primarily concerned about permanent crash faults


[31]. Hardware failures, operating system crashes and process crashes are
some examples of crash faults. We assume that these crash faults are fail-
silent [32]. In order to tolerate these crash faults, we employ fault tolerance
by replication [33]. We consider five types of redundancies:

1. Active Replica: In active replication, all redundant copies are identical and treated
uniformly. Each replica performs all operations, like accepting and processing
application inputs, performing state calculations, performing application calcula-
tions and producing output. This implies that, under normal operation, the system
needs to support duplicate suppression to filter out duplicate outputs.

2. Hot Standby: A hot standby is based on the primary-backup approach. It performs


all the operations of the primary task except for producing outputs. On detection
of primary failure, the hot standby is promoted to become a primary and begins
to produce outputs. Unlike an active replica, a hot standby can run a degraded
version of the primary to optimize resource consumption.

3. Warm Standby: A warm standby is also based on the primary-backup approach. A warm standby is designed to leverage the ability of software applications to perform state and output calculations separately. It performs all the operations of the primary task except for performing output calculations and producing outputs, i.e., it does maintain its own state. On detection of primary failure, the warm standby is promoted to become a primary and begins to produce outputs.

4. Cold Standby: A cold standby is also based on the primary-backup approach. It


can be of two types depending on the type of application. If an application is
stateless, the cold standby does not perform any operations until it detects primary
failure. For applications with state, the cold standby accepts and logs application
inputs but does not perform any other operations. It regularly accepts the state
from the primary to maintain consistency. On detection of primary failure, the
cold standby primes its state first and then begins to produce outputs.

5. Frigid Standby: A frigid standby is also based on the primary-backup approach.


A frigid standby is not scheduled to execute by default. On detection of primary
failure, a frigid standby is launched at run-time, it then primes its state and then
begins to produce outputs.

Transient and intermittent faults can be overcome by techniques like


simple re-execution, forward recovery [77], and recovery blocks [78]. The
impact of these solutions can be accounted for by modifying the analysis of
task response times to include additional fault-induced processing require-
ments [77]. In this dissertation, we focus on permanent faults, though the
analysis for transient faults from [77] can be incorporated into our frame-
work. Similar fault models have been used in the automotive context in [79]
and [80].
In our model, a task and its replica have the same period and a task can
be assigned one or more replicas based on the application requirements.

Figure 4.1: Detecting Primary Failure — after a missed heartbeat (no HB within the delivery bound δ), the backup is promoted to primary and produces output.

The system designer can decide which tasks are considered critical for the
application and which are considered non-critical. In this dissertation, we
assume that non-critical tasks do not have replicas and can be terminated
in order to allow a cold standby to execute when a primary fails. For
fault detection, we assume that the replicas monitor the status and health
of the primary, for example, by using heartbeats and producing outputs
when necessary [79, 81]. This is illustrated in Figure 4.1. We assume that
the underlying communication framework is reliable1 , i.e., it guarantees
that a message will either be delivered within a fixed message delivery
bound δ or not be delivered at all. Common communications protocols
like CAN/CAN-FD [82], FlexRay [52] and many variants of real-time Eth-
ernet [53, 54] can support these guarantees. The successful reception of a
heartbeat indicates to the replica that the primary is operational.
1 Safety-critical real-time systems must deal with communication failures. The communication layer can utilize solutions like redundant CAN bus links, dual FlexRay configurations with built-in support for fault tolerance, and replicated Ethernet switches. In the interests of brevity, we abstract away the details of such solutions with our assumption of a reliable communication layer in this dissertation.
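As a sketch of this detection logic, under the reliable-delivery assumption a backup whose release follows the primary's (see Section 5.2.1) can declare failure after a single missed heartbeat window; the parameter names below are illustrative.

```python
import time

# One missed heartbeat window signals primary failure when the backup's
# release follows the primary's and delivery is bounded by delta.
def primary_failed(last_heartbeat, period, delta):
    return time.monotonic() - last_heartbeat > period + delta
```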

4.4 Chapter Summary

In this chapter, we presented our system, task and fault models. We assume
a distributed system consisting of several computational nodes, where each
node can communicate with every other node in the system by sending
messages. We assume that applications are independent and that they ac-
cept inputs, perform calculations and produce outputs. We assume that
tasks are replicated by using either the active or passive replication strate-
gies to support fault-tolerance. Our primary fault-model is fail-stop, which
is commonly used in the automotive industry. We use the system, task and
fault model presented in this chapter throughout the rest of the disserta-
tion.
Chapter 5

Selection of Replication Parameters

As mentioned in Chapter 1, due to the complexity of computational resources used in real-time systems, many application domains, such as industrial control, aviation and automobiles [83], achieve fault tolerance by
dustrial control, aviation and automobiles [83], achieve fault tolerance by
replicating hardware and running multiple instances of the same software
on different pieces of hardware, often using a voting mechanism to derive
the output [84]. Unfortunately, this approach is extremely inefficient in
terms of cost, weight, space and energy needs for many applications. This
is especially true for the automotive industry where the system reliability
requirements can be diverse and cost constraints are stringent. Similarly,
other diverse needs are also evident from the fact that different tasks run-
ning in an automobile have different levels of safety criticality. For exam-
ple, the braking control task is far more safety-critical than (say) an audio
playback task. This motivates the need for adaptive cost-optimized fault-
tolerance solutions to reduce overall resource utilization. Hence, software
fault tolerance techniques like active replication [85,86] or the primary-backup
approach [87,88] using hot and cold standbys are more applicable to systems
like automobiles [79, 81].
The diversity in the reliability requirements for tasks in a system using


software fault tolerance techniques is captured by the recovery time requirement (RTR) of each task. The RTR specifies the number of consecutive deadlines of the primary task a redundancy can afford to miss without the system being considered to have failed.
varies depending on its safety criticality. Tasks that are safety-critical have
a strict (and very low) upper bound on RTR, while others can afford more
relaxed values. The recovery time, i.e., the time a redundancy takes to suc-
cessfully take over execution on primary task failure, is influenced by a
number of factors like redundancy type, redundancy priority and network
delays. The goal of this chapter is to analyze the recovery times achieved by
different types of redundant tasks (active/passive) used in software fault
tolerance techniques for real-time systems. The major contributions of this
chapter are as follows:

1. We derive the bounds on the recovery time of different types of redundancies, i.e,
active or passive, used in software fault tolerance techniques for real-time systems.

2. We derive conditions to map the recovery time requirements of a task to a redun-


dancy type assignment.

In Section 2.2, we described related work. The rest of this chapter is organized as follows. In Chapter 4, we defined our system model and fault model and described the different types of redundancies we consider for our analysis. In Section 5.2, we quantify recovery time and derive bounds on the recovery time for each redundancy type. In Section 5.3, we derive conditions to assign a redundancy type to a task given its recovery time requirements. We summarize and conclude our findings in Section 5.4.

Figure 5.1: Defining Recovery Time — the interval from the instant of primary failure (signaled by a missed heartbeat) to the instant the backup is promoted to primary and produces output.

5.1 Problem Statement

Definition 3. Recovery time (RT) is the time elapsed from the instant of primary failure to the
instant when a redundant task is able to produce the desired output. This duration is shown in
Figure 5.1.

The choice of the type of redundant task to be used has a major impact
on a task’s recovery time. For example, an active replica can virtually
provide seamless recovery since it runs alongside the primary. The hot
and cold standbys, on the other hand, have to first detect primary failure.
In addition, the cold standby needs to then prime its state, which results in
an even longer recovery time.
The number of redundant copies assigned to each task is also an impor-
tant design parameter. Every task can be assigned m ( m ∈ N ) redundant
copies. It is important to note that different tasks can utilize different re-
dundancy types (i.e., active, hot or cold). The number of replicas and
their types are system parameters which are application-specific. Their
choice determines the number of failures a given task can tolerate and how
quickly a task can recover from a failure. The former is a system designer’s
choice and the latter can be captured by specifying a recovery-time require-

ment for each task.

Definition 4. Recovery time requirement (RTR) is the maximum number of consecutive deadlines of the primary task that the system can afford to miss before the redundant task must recover in accordance with Definition 3.

We first determine which type of redundant task is appropriate for a


given task to meet its recovery-time requirement. The benefit of using
replicas is maximal when a task and any of its redundant copies obey the
placement constraint of not being co-located on the same node. To this end, in [20], Kim et al. defined the Fault-tolerant Partitioned Scheduling problem as one of assigning independent tasks to nodes where every member of a
group, i.e., a primary task and its copies, would not be co-located on the
same node. This ensures that, when nodes fail independently, they do not
result in application failures. The bin-packing problem [89] of allocating
fault-tolerant tasks is known to be NP-hard [19], and heuristics were pro-
posed in [81] to address this problem. In this chapter, we extend these
heuristics to ensure that the recovery-time requirements of tasks are also
satisfied.
We assume that task I/O dependencies1 and ensuring input consistency
between a primary and its redundant copies are considered by the system
designer in assigning the RTR of each task in the system.
To summarize, the goals of this chapter are as follows. Given N nodes and a task set $\tau = \{\tau_1, \tau_2, ..., \tau_n\}$, where every task has an application-dependent recovery-time requirement $RTR_i$,

1. derive bounds on the recovery time for each redundant-task type,


1 Detailed task models capturing I/O dependencies are certainly needed, and will be part of our future
work. For example, task I/O dependencies can be factored into our analysis by constructing composite
(virtual) tasks formed by combining tasks with I/O dependencies.
Figure 5.2: Motivation for Backup following the Primary — (a) Backup not following the Primary: up to two heartbeats can be missed without primary failure; (b) Backup following the Primary (with δ = w + C + QJ): no heartbeat is missed while the primary is alive.
2. decide a redundant-task type, i.e., active, hot or cold, and

3. find an allocation where all tasks satisfy their recovery-time requirements while
minimizing the number of nodes used for allocation.

5.2 Recovery Time Analysis for Passive Backups

In the previous section, we saw that an active replica can be seamlessly recovered from, since other replicas are running in parallel. In this section,

we derive the recovery time bounds for hot and cold standbys.

5.2.1 Backup Following the Primary

Previous work [81] has shown that the bounds on the recovery time for
passive backups can be reduced if the backup task execution follows the
execution of the primary. The intuition for this can be seen in Figures
5.2a and 5.2b. As seen in Figure 5.2a, if the backup can execute at any
time independent of its primary, it is possible for a backup to miss up to
two heartbeats without primary failure. Hence, the backup must wait for
three consecutive missed heartbeats to declare failure of the primary and
initiate recovery, resulting in a longer recovery time. In contrast, when the
backup follows the execution of the primary, it needs only a single missed
heartbeat to detect primary failure.
For the backup to follow the primary, the following requirements must
be satisfied:

1. Global Time Synchronization: To ensure that the backup follows the primary, the
release time of the backup w.r.t. that of the primary must be explicitly controlled.
Since fault-tolerant task allocation requires primaries and replicas to run on distinct
nodes, the nodes must be time-synchronized. This constraint can be relaxed for a
system which allows tasks to be released with offsets at boot up and has negligible
clock drift.

2. Network Schedulability Analysis: In order to calculate the optimal release instant


for the backup, network delays must be characterized. 2 The worst-case network
response time δm for message m can be represented as,
2 Popular automotive network technologies, like CAN [82] and FlexRay [52], have response-time analyses to bound the worst-case message delivery time.

Figure 5.3: Recovery Time Bounds for Hot Standby τ2 (example tasks τ1 = (2, 5) and τ2 = (1, 10); the backup is released WCCT_PRI + δ after the primary).

$$\delta_m = w_m + QJ_m + C_m \quad (5.1)$$

where,

• The queuing jitter QJm corresponds to the longest time between the initiating
event and the message being queued, ready to be transmitted on the network.

• The queuing delay wm corresponds to the longest time that the message can
remain in the device queue, before commencing successful transmission on
the network.

• The transmission time Cm corresponds to the longest time that the message
can take to be transmitted. In the case of standbys, the transmission time de-
pends on the standby type. Cold standbys need to accept state, and normally
require longer transmission times than hot standbys.

5.2.2 Recovery Time Bounds for Hot Standbys

A hot standby produces an output immediately after it detects primary failure, as described in Chapter 4 and shown in Figure 5.3. Let $WCCT_{pri}$ be the WCCT for the primary. Let $WCCT_{hot}$ be the completion time of

Figure 5.4: Recovery Time Bounds for Cold Standby τ2 (example tasks τ1 = (2, 5) and τ2 = (1, 10); the cold standby primes state for p = 1 period before producing output).

the backup corresponding to the failure of the primary. The total time
from the release of the primary to the execution of the backup would be
$WCCT_{pri} + \delta_{hot} + WCCT_{hot}$. Hence, the recovery time is

$$RT_{Hot} = WCCT_{pri} + \delta_{hot} + WCCT_{hot} \quad (5.2)$$

5.2.3 Recovery Time Bounds for Cold Standbys

A cold standby takes longer to recover from a failed primary, as described in Chapter 4, since it does not produce any state of its own, but instead receives regular state updates from the primary. This is illustrated in Figure 5.4. It only logs application inputs, which it uses to prime state for future use. These logs can be cleared once a primary state is applied. Let p denote
the number of periods the cold standby needs to prime state and produce
output. A cold standby for a stateless application does not need to prime
any state, and hence, in this case, p = 0. For applications with state, the
value of p depends on two factors:

1. The frequency of state transfer from the primary to the standby: The higher the
frequency of state transfer, the fresher is the state of the cold standby and hence
lower is the number of periods required for state priming (i.e., a lower value for p).

2. Priming state is highly application-dependent. Some applications may make tem-


poral corrections of the most recent state using appropriate extrapolations. Other
applications may iterate through all the logged inputs between the last received
state and the time instant the failure is detected, and, in each iteration, re-calculate
the state to finally produce output based on fresh state. In this section, we as-
sume that, for applications with state, the value of p is provided by the application
designer.

Thus, for a cold standby to recover from a primary failure, the recovery
time would be

$$RT_{Cold} = WCCT_{pri} + \delta_{cold} + pT + WCCT_{cold} \quad (5.3)$$
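Equations (5.2) and (5.3) are straightforward to evaluate; the helpers below (illustrative names, all arguments in the same time unit as the period T) are reused when reasoning about standby selection in Section 5.3.

```python
# Recovery-time bounds from Equations (5.2) and (5.3).
def rt_hot(wcct_pri, delta_hot, wcct_hot):
    return wcct_pri + delta_hot + wcct_hot            # Eq. (5.2)

def rt_cold(wcct_pri, delta_cold, p, T, wcct_cold):
    return wcct_pri + delta_cold + p * T + wcct_cold  # Eq. (5.3)
```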

5.3 Redundant-Task Type Assignment To Tasks

In this section, we identify the types of redundant-task assignments that can satisfy a given RTR constraint. As described in Section 5.1, an active replica can be seamlessly recovered from, since other replicas are running in parallel; hence, it can satisfy any RTR.

5.3.1 Hot standby

RTR = 0

For a hot standby to recover from primary failure and maintain RTR = 0,
the recovery time, RThot , should be less than or equal to T, i.e., the redun-
dant task must recover before the deadline of its primary. Hence, from
Equation (5.2) we have,

$$RT_{Hot} \le T \;\Rightarrow\; WCCT_{pri} + \delta_{hot} + WCCT_{hot} \le T \quad (5.4)$$



With the worst-case values for the terms in Equation (5.4),

$$T + \delta_{hot} + T > T$$

Hence, with the worst-case values for WCCT, a hot standby cannot satisfy RTR = 0.

RTR > 0

Consider the case where RTR = n and n ∈ N>0 , allowing the task to
tolerate up to n missed deadlines when the primary fails.
In the case of a hot standby, RTR = n can be satisfied if $RT_{Hot} < (n + 1)T$. From Equation (5.2),

$$WCCT_{pri} + \delta_{hot} + WCCT_{hot} \le (n + 1)T \quad (5.5)$$

Considering n ≥ 2 and the worst-case values for the terms in the above equation,

$$WCCT_{pri} + \delta + WCCT_{bkp} \le 3T \;\Rightarrow\; T + \delta + T \le 3T \quad (5.6)$$

Assuming δ < T, a hot standby can meet RTR ≥ 2 (if it is schedulable).

5.3.2 Cold standby

RTR = 0

For a cold standby to satisfy RTR = 0, the recovery time should be less
than or equal to T. From Equation (5.3),

$$WCCT_{pri} + \delta_{cold} + pT + WCCT_{cold} \le T \quad (5.7)$$



Table 5.1: Conditions for Redundant Task Selection (Standby Selection)

RTR (n) | Condition                                    | Standby Assignment
0       | WCCT_pri + δ_cold + WCCT_cold ≤ T            | Cold (p = 0)
0       | WCCT_pri + δ_hot + WCCT_hot ≤ T              | Hot
0       | WCCT_pri + δ_hot + WCCT_hot > T              | Active
> 0     | WCCT_pri + δ + WCCT_bkp + pT ≤ (n + 1)T      | Cold
> 0     | WCCT_pri + δ_hot + WCCT_hot ≤ (n + 1)T       | Hot
> 0     | WCCT_pri + δ_hot + WCCT_hot > (n + 1)T       | Active

Equation (5.7) must be satisfied for a cold standby to meet RTR = 0. However, if p ≠ 0, a cold standby cannot satisfy RTR = 0.

RTR > 0

In the case of a cold standby, RTR = n can be satisfied if $RT_{Cold} < (n + 1)T$ in the worst case. From Equation (5.3),

$$WCCT_{pri} + \delta + WCCT_{bkp} + pT \le (n + 1)T \quad (5.8)$$

Table 5.1 summarizes all the conditions above for standby selection. We see that, for certain conditions, multiple options are available for redundant-task type assignment. We describe our approach to redundant-task selection in the case of multiple available options in Section 6.3.
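A Python sketch of the selection rules in Table 5.1 follows; it assumes, purely for illustration, a tie-breaking policy that prefers the least resource-hungry type (cold, then hot, then active) — the actual policy for multiple options is the subject of Section 6.3.

```python
# Standby-type selection per Table 5.1 for RTR = n; prefers cold over
# hot over active when several types meet the recovery-time budget.
def select_standby(n, T, wcct_pri, wcct_cold, delta_cold, p,
                   wcct_hot, delta_hot):
    budget = (n + 1) * T
    if wcct_pri + delta_cold + p * T + wcct_cold <= budget:
        return "cold"        # for n = 0 this only holds when p = 0
    if wcct_pri + delta_hot + wcct_hot <= budget:
        return "hot"
    return "active"          # seamless recovery satisfies any RTR
```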

5.3.3 Multi-Level Backups

As shown in Figure 5.5a, a single primary can have more than one backup.
Both backups in the figure are released such that they follow the primary
to satisfy the primary’s recovery-time requirement. We assume that the
order of promotion to primary is statically configured.

[Figure 5.5: Support for Multi-Level Backups. (a) Multi-Level Backups; (b) Release Time Correction]

Suppose that the

first backup in Figure 5.5a is designated to take over execution first after
primary failure. On primary failure, it is not guaranteed that the current
second backup would always satisfy the recovery-time requirement of the
first backup, which now becomes the new primary. In order for the second-
level backup to satisfy the RTR of the new primary, the release time of
the task needs to be corrected, as shown in Figure 5.5b. Also, since we
delay the release time of the task, and its deadlines are therefore
correspondingly postponed, the deferred start does not affect the overall
schedulability of the task set [90].

5.4 Chapter Summary

Autonomous driving systems are inherently safety-critical and, hence, safety
must be guaranteed in the presence of faults. To this end, an important
step in the deployment of software for autonomous driving applications
is to employ techniques like replication to support fault-tolerance require-
ments, which in turn require the selection of task replication parameters. In
this chapter, we considered software fault-tolerance techniques for safety-
critical real-time systems and derived bounds on the recovery time of
different types of redundant tasks, such as active tasks, passive tasks and
their variants. We also derived conditions that map the recovery-time
requirement (RTR) of a task to a redundant-task type and the related
replication parameters.
Chapter 6

Fault-tolerant Assignment of Tasks to


Computing Nodes

We have discussed the selection of task execution and replication param-
eters in Chapters 3 and 5. Given these parameters, the next step in the
system design process is to assign these tasks to the different distributed
nodes in our system. In [20], Kim et al. defined the Fault-tolerant Partitioned
Scheduling problem as one of assigning independent tasks to processors
where every member of a group, i.e., a primary task and its corresponding
standbys, obeys the placement constraint of not being co-located on
the same processor. This ensures that the failure of a single node cannot
take down both the primary and its replica, thus eliminating some common-
mode failures. They also proposed the R-BFD heuristic, which tries to mini-
mize the number of processors used to produce a feasible assignment while
satisfying the placement constraint.
In Section 6.1, we present our fault-tolerant task assignment heuristic
TPCD (Tiered Placement Constraint Decreasing) and compare its perfor-
mance against R-BFD (Appendix B.1). We empirically compare the perfor-
mance of TPCD to an optimal allocation that we construct for evaluation

purposes. We propose the TPCDC (Tiered Placement Constraint Decreas-


ing with Cold Standbys) heuristic which uses the run-time characteristics
of cold standbys to reduce the number of processors used. We also con-
sider the impact of overload operation on the task allocation in Section 6.2.
For these heuristics we assume that the replication parameters are known.
We then relax this requirement to accept only recovery-time requirements
for tasks. In Section 6.3, we propose heuristics to determine redundant-
task-type assignments and allocate these tasks to different nodes satisfying
the recovery time requirements of all tasks while attempting to optimize
resource utilization. We apply the Simulated Annealing method [44] to the
fault-tolerant task allocation problem and compare its performance to the
proposed heuristics in Section 6.4.

[Figure 6.1: Tasks to be allocated and their utilizations. Tier ψ0 (primaries): Video Playback task (0.55), Audio Playback Task (0.5), HVAC Task (0.4), Steering Control (0.2), Throttle Control (0.16), Brake Control (0.1), Safety Audio (0.1). Tier ψ1 (first backups): Steering Control (0.2), Throttle Control (0.16), Brake Control (0.1), Safety Audio (0.1). Tier ψ2 (second backups): Brake Control (0.1), Safety Audio (0.1)]

6.1 Task Partitioning with Known Replication Parameters

To aid the discussion in the sections that follow, we will use Figure 6.1
for reference.

[Figure 6.2: TPCD Task Order]

[Figure 6.3: TPCD Task Mapping (the task set of Figure 6.1 packed onto three processors)]

6.1.1 Tiered Placement Constraint Decreasing (TPCD)

A typical task assignment heuristic has two steps: task ordering and map-
ping tasks to processors. For TPCD, the task ordering is derived from three
pieces of intuition:

1. The members of a group should be placed as far away from each other in the task
order as possible. This reduces the chances of a task facing a placement conflict.

2. Members of larger groups have a greater probability of running into a placement
conflict.

3. Having group members close to each other in the task order is more expensive
towards the end than at the beginning of the task order. This is because tasks
towards the end may result in new processors being allocated while having only a
few tasks left in the queue to populate them.

[Figure 6.4: BFD-P Heuristic Output (the same task set packed onto five processors)]

[Figure 6.5: R-BFD Heuristic Output (the same task set packed onto five processors)]

It is important to note here that these intuitions solely target the fault-
tolerant placement constraints and are independent of the system pa-
rameters used to determine schedulability. In the following sections, we
consider task utilization while deciding schedulability for task allocation.
Other system resources, like memory and bandwidth requirements, and the
effects of jitter in execution and message delivery, can be incorporated at
this stage. These aspects are outside the scope of the discussion in this
section, but the intuitions above would still apply.
Based on the above observations, we now describe the TPCD heuristic.
Let f(j) represent the number of replicas for τ_j, capturing the fact that
different tasks have different application-dependent redundancy require-
ments. Consider a task set Γ = {τ_0^0, τ_0^1, ..., τ_1^0, ..., τ_n^0, ...}, where τ_j^i represents
a task with group number j (j = 0, 1, ..., n) and replica order i (i = 0, 1, ..., f(j)).
For example, the task τ_5^0 has group number 5 and is a
0th-order replica, which means that it is a primary task. Similarly, a task
with replica order 1, such as τ_3^1, is the first backup, and so on.
Let Ψ_p represent the set of all pth-order replicas. Each Ψ_p represents a tier,
and we call p the tier order. Thus, Ψ_0 = {τ_j^0 ∀j} represents the set of
primary tasks, and Ψ_1 = {τ_j^1 ∀j} represents the set of all first-
order replicas. Let m represent the highest tier order, i.e., max(p). TPCD
arranges these tiers in decreasing order of their placement constraint,
hence the name "tiered placement constraint decreasing". This means the mth
tier, Ψ_m, is allocated first, followed by Ψ_{m−1}, and so on. It then uses the
well-known "Best Fit Decreasing" heuristic with placement constraints [20]
(referred to as BFD-P henceforth) to allocate tasks in each tier (Algorithm
6). This means that the TPCD heuristic allocates tasks with the highest
order first. In case of a tie, a task with greater processor utilization (i.e.,
the ratio C_i/T_i) is allocated first. Since task utilization is independent of
the type of deadlines considered, the task ordering will remain identical
for task sets with non-implicit deadlines. The impact of non-implicit deadlines
would need to be considered only while determining the schedulability of
a task on a given processor.
CHAPTER 6. FAULT-TOLERANT ASSIGNMENT OF TASKS TO COMPUTING NODES
72

Algorithm 6 TPCD
1: procedure TPCD(Γ = {τ_0^0, τ_0^1, ..., τ_1^0, ..., τ_n^0, ...})
2:   for each task τ_j^i in Γ do
3:     Ψ_i ← τ_j^i                ▷ Create tiers consisting of tasks of the same order
4:   for each tier Ψ_i in Ψ do
5:     Sort tasks in descending order of utilization
6:     Task Assignment(α) ← BFD-P(Ψ_i)
7:   return α                     ▷ Return the task set assignment

Algorithm 7 TPCDC
1: procedure TPCDC(Γ = {τ_0^0, τ_0^1, ..., τ_1^0, ..., τ_n^0, ...})
2:   α ← TPCD(τ_j^i ∀ f(j) ≠ 0)       ▷ Step 1: Treat all standbys as hot standbys
3:   Change all cold-standby utilizations to their default normal-operation
     utilizations                      ▷ Step 2: Stop treating cold standbys as hot standbys
4:   Task Assignment(α′) ← BFD-P(α, τ_j^i ∀ f(j) == 0)
5:   return α′
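As a concrete illustration of Algorithms 6 and 7, the following Python sketch implements the tiering step, BFD-P and the two-step TPCDC procedure. The Task class, the unit bin capacity and the optional cold_util attribute are simplifying assumptions for illustration, not the thesis implementation.

from collections import defaultdict

class Task:
    def __init__(self, group, order, util, cold_util=None):
        self.group = group          # replica group id j
        self.order = order          # replica order i (0 = primary)
        self.util = util            # processor utilization C_i / T_i
        self.cold_util = cold_util  # lower normal-mode utilization, if a cold standby

def bfd_p(tasks, bins):
    # Best Fit Decreasing with the placement constraint that no two
    # members of the same replica group share a bin of capacity 1.0.
    for t in sorted(tasks, key=lambda t: t.util, reverse=True):
        candidates = [b for b in bins
                      if t.group not in {x.group for x in b}
                      and sum(x.util for x in b) + t.util <= 1.0]
        best = max(candidates, key=lambda b: sum(x.util for x in b), default=None)
        if best is None:
            best = []
            bins.append(best)       # open a new processor
        best.append(t)
    return bins

def tpcd(task_set, bins=None):
    bins = [] if bins is None else bins
    tiers = defaultdict(list)       # tier order -> tasks of that order
    for t in task_set:
        tiers[t.order].append(t)
    for order in sorted(tiers, reverse=True):   # highest-order tier first
        bfd_p(tiers[order], bins)
    return bins

def tpcdc(placement_critical, non_critical):
    bins = tpcd(placement_critical)     # Step 1: treat every standby as hot
    for b in bins:                      # Step 2: apply normal-mode cold utilization
        for t in b:
            if t.cold_util is not None:
                t.util = t.cold_util
    return bfd_p(non_critical, bins)    # pack non-placement-critical tasks into the slack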

The worst-case running time of Algorithm 6 is O(N²), where N rep-
resents the total number of tasks in Γ. Figures 6.1-6.3 help visualize the
TPCD heuristic in action. Figure 6.1 represents a sample task set where
each primary has 0, 1 or 2 backups. As can be seen in Figure 6.3, TPCD
is able to pack all the tasks in the task set from Figure 6.1 using only 3
processors, whereas BFD-P (Figure 6.4) and R-BFD (Figure 6.5) both take
5 processors each. Note that, since TPCD assigns tasks in tiers, it could
result in a dominant processor running only primary tasks, as seen in Figure
6.3. Section 6.1.2 discusses a solution to avoid such dominant pro-
cessors, though the standby assignment obeying the placement constraint
ensures that, even in the case of failure of this processor, the placement-
critical tasks remain operational.
[Figure 6.6: Tasks allocated using TPCDC and their utilizations (a harmonic task set with hot and cold standbys; the panels show the allocations after TPCDC Step 1 and Step 2)]

6.1.2 TPCD Primary Redistribution

As highlighted in Section 6.1.1, the TPCD algorithm can result in an alloca-
tion where a processor becomes dominant and runs only primaries. This
can be seen in the allocation produced by the example in Section 6.1.1.
Figure 6.8 shows the TPCD allocation with backup types annotated. As
can be seen, the last processor runs all primaries and becomes the dominant
processor. This is not ideal in a safety-critical system. This type of
allocation is a result of the fact that TPCD allocates tasks in tiers.
This same allocation scheme also allows backups to be clubbed together.

[Figure 6.7: Tasks allocated using TPCDC and their utilizations with standby redistribution]

If

the backups of some of the primaries on the dominant processor run identical
copies of the primary, it is possible to swap their positions with the
primary to produce a more balanced distribution of primaries and standbys.
This is highlighted in Figure 6.9.
[Figure 6.8: TPCD solution to the task set of Figure 6.1 with backup types highlighted]

[Figure 6.9: TPCD solution to the task set of Figure 6.1 with primary redistribution (P = Primary, B = Backup)]

6.1.3 Tiered Placement Constraint Decreasing with Cold Standbys

(TPCDC)

The TPCD heuristic allocates primaries and standbys enforcing the place-
ment constraint while trying to minimize the number of processors used.
However, it does not leverage the fact that a cold standby under normal
operation runs with much lower processor utilization, allowing other non-
critical tasks to be scheduled along with the cold standby. This section
describes the TPCDC heuristic.
TPCDC divides all tasks into two categories. A task is considered
placement-critical if it has at least one backup or is deemed application-
critical; otherwise, it is non-placement-critical. Any non-placement-critical
task can be terminated in order to allow a cold standby to execute when a
primary fails. TPCDC initially treats all standbys as hot standbys, and uses
the TPCD heuristic to allocate all the placement-critical tasks (Algorithm 7).
It then begins to assign the non-placement-critical tasks, but no longer
treats the cold standbys as hot standbys (i.e., it considers the default lower
cold-standby utilization during normal operation). As highlighted in
Section 6.1.1, TPCD assigns tasks in a tiered fashion starting with the tier
which has the highest-order standbys. This tier typically consists of cold
standbys, allowing multiple cold standbys of different tasks to be assigned
to the same processor. This leaves larger available spaces for assigning the
non-placement-critical tasks once the cold standbys' normal run-time
utilization is considered for task assignment. Also, since the assignment
was feasible when the cold standbys were treated as hot standbys, it is
guaranteed that, on primary failure, the corresponding cold standbys will
be able to execute once the non-placement-critical tasks on that processor
are terminated. This is illustrated in Figure 6.6. This allocation can be
further optimized using a version of the standby redistribution technique
presented in Section 6.1.2 to minimize the utilization in a bin after applying
the cold-standby utilization values. This improves the chances for the
non-critical tasks to be assigned to the bin. This is illustrated in Figure 6.7.
When treating all cold standbys as hot standbys, TPCD uses 4 processors,
but TPCDC, making use of the run-time cold-standby utilizations, is able
to produce a feasible assignment using only 3 processors. The worst-case
running time of Algorithm 7 is again O(N²), where N represents the total
number of tasks in Γ.

6.1.4 Evaluation

In this section, we evaluate the performance benefits of using TPCD and
TPCDC. We apply these heuristics to randomly generated task sets. We
assume that each processor can be completely packed¹.

¹ This is feasible with EDF (Earliest Deadline First) and with RMS (Rate-Monotonic Scheduling) for harmonic task sets.

TPCD Experimental Setup

We evaluate our heuristic against task sets with different characteristics.
First, we vary the maximum single-task utilization (U_max); the results
presented here are at U_max values of 0.3, 0.5 and 0.7. We also vary the
number of primary tasks in the task set, starting from task sets with a
single primary up to task sets with 40 primaries. We randomly assign 0 to
2 backups to each primary. Each data point is obtained by averaging the
results over 10,000 task sets. Many automotive applications that are
prevalent today, especially up to Level 2 of driving automation [91], have
task sets with only a few computationally inexpensive placement-critical
tasks along with other non-placement-critical ones. We refer to such a
task set as an "L2 Task Set" and evaluate the performance of our heuristic
against randomly generated task sets of this type.

TPCD: Discussion

Figure 6.10 highlights the savings in the number of processors per iteration
obtained when using TPCD over R-BFD. As the figure indicates, TPCD
saves up to 0.43 processors per iteration for a purely random task set and
up to 0.91 processors per iteration for an L2 task set. Figures 6.11-6.13
compare the performance of TPCD and R-BFD directly by plotting the
percentage of cases under which they outperform each other, i.e., use at
least one fewer processor than the other for a feasible task assignment
while enforcing the placement constraint. As Figures 6.11-6.13 show,
TPCD outperforms R-BFD as both U_max (0.3, 0.5 and 0.7) and the number
of primaries (1 to 40) are varied.

[Figure 6.10: R-BFD vs TPCD (processors saved by TPCD over R-BFD per task set)]

The performance of TPCD against R-BFD is primarily influenced by two
factors:

1. Presence of placement-critical tasks of low utilization: Any task with a low uti-
lization and a backup has a very high placement cost, as it can leave a processor
relatively empty on assignment, retaining the potential to cause placement con-
straints for its group members. However, a task with high utilization would have a
lower placement cost. This can be intuitively visualized as the extreme case where
we have a task with utilization value of 1.0 with two backups. Any allocation
scheme would result in each of these tasks ending up on different processors
without causing any placement conflicts. TPCD performs much better when a task set
contains placement-critical tasks (i.e., tasks that have one or more backups) with
low utilization because, unlike R-BFD, TPCD prioritizes the allocation of higher-
order tasks first. The effect of this can be seen in Figures 6.10 and 6.13, where
TPCD does much better than R-BFD when U_max is low.

[Figure 6.11: U_max = 0.3, R-BFD vs TPCD: percentage of task sets where one technique does better (i.e., uses at least one fewer processor) than the other]

[Figure 6.12: U_max = 0.5, R-BFD vs TPCD: percentage of task sets where one technique does better (i.e., uses at least one fewer processor) than the other]

[Figure 6.13: U_max = 0.7, R-BFD vs TPCD: percentage of task sets where one technique does better (i.e., uses at least one fewer processor) than the other]

2. Performance of BFD-P when single-task utilizations are high and when the number
of tasks increases: The significance of placement constraints decreases as U_max
increases or when the number of tasks in the system is high. This results in the
curves tapering off where the number of primaries is large.

Algorithm 8 Generating an Optimal Task Assignment
1: procedure OptTaskAsgn(NumberOfPrimaries)
2:   for each primary i do
3:     util_τi ← RandomDouble(0, 1)      ▷ Assign a random utilization between 0 and 1
4:     B ← τ                             ▷ Add the task to the current processor
5:     if B_util > 1 then
6:       util_τi ← 1 − (B_util − util_τi) ▷ Trim the last task so that B_util = 1
7:       replicas ← RandomInteger[1, 3]   ▷ Pick a random number of instances of this
         processor; the number of backups per task on this processor = (replicas − 1)
8:       for each replica j do
9:         α ← B
10:        B ← B_new                      ▷ Open a new processor
11:  return α                            ▷ Return the optimal task assignment

TPCD: Comparing to an Optimal Task Allocation

In this section, we describe the technique we use to generate a task
set that will produce an optimal assignment (i.e., where all processors are
completely packed) while obeying the placement constraint². We start by
adding a single processor. We then randomly generate tasks with different
utilization values, and assign the tasks to this processor. Once the total
utilization of the processor, B_util, becomes greater than 1, we change the
utilization of the task that was added last so that B_util = 1. We then
randomly pick either 0, 1, or 2 replicas to be created for this processor.
If we pick 0, the result is a processor of non-placement-critical tasks
without any backups. However, if we pick 1 or 2 replicas, this step causes
the tasks on the processor to have the corresponding number of backups.
We then add the processor and its replica(s), if any, to the final assignment
and then open a new processor and repeat the above process until we have
assigned the required number of primaries (Algorithm 8). We compare the
number of processors used by the optimal solution against the number of
processors needed by TPCD to allocate the same (unpacked) task set.

² Since creating an optimal allocation given an arbitrary task set is NP-hard, we instead explicitly construct a perfectly-packed solution that by definition represents an optimal allocation.

[Figure 6.14: TPCD vs Optimal Allocation]
Figure 6.14 compares the number of processors used per task set by
TPCD against the optimal allocation. TPCD, on average, takes at most
1 extra processor for allocation as compared to the optimal. This loss
of a processor can be attributed to the fact that BFD itself, which is
used as a part of TPCD, is not optimal.
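For reference, a small Python sketch of the generation procedure of Algorithm 8 follows; the utilization range, the seed handling and the (group, utilization) tuple representation are illustrative assumptions.

import random

def opt_task_assignment(num_primaries, seed=None):
    rng = random.Random(seed)
    allocation, group = [], 0
    while group < num_primaries:
        bin_tasks, load = [], 0.0
        while load < 1.0 and group < num_primaries:
            u = rng.uniform(0.05, 1.0)
            if load + u > 1.0:
                u = 1.0 - load              # trim the last task so B_util == 1
            bin_tasks.append((group, u))    # (task group id, utilization)
            load += u
            group += 1
        backups = rng.randint(0, 2)         # 0, 1 or 2 replicas of this processor
        for _ in range(backups + 1):        # the processor itself plus its replicas
            allocation.append(list(bin_tasks))
    return allocation                       # a packed, constraint-obeying assignment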
[Figure 6.15: TPCD vs TPCDC (bins used per iteration vs. number of primaries, 1 to 40, for U_max = 0.3, 0.5 and 0.7)]

TPCDC: Evaluation

In order to evaluate TPCDC, we use a setup similar to the one used for
evaluating TPCD. We vary the maximum single-task utilization (U_max); the
results presented here are at U_max values of 0.3, 0.5 and 0.7. We plot
the number of processors used by each technique in Figure 6.15. As shown,
TPCDC is able to reduce the number of processors required for a feasible
task assignment while obeying the placement constraint. The gains for
TPCDC increase as U_max increases, because cold standbys can then have
larger utilizations on initial assignment, freeing up larger spaces when the
lower normal-mode utilizations of the cold standbys are applied while
assigning the non-placement-critical tasks.
We now have a technique for assigning tasks with standbys to different
processors. In the following sections, we present a run-time framework
based on AUTOSAR that supports such allocation of hot and cold standbys,
and show how the system recovers from failures. We also look at practical
considerations such as the worst-case behaviors and overheads of our
solution. We complement our analysis with experimental results for failure-
recovery latencies and overheads from a test bench running AUTOSAR-
compliant code.

6.2 Fault-tolerant Task Allocation for Mixed-Criticality

Systems with Overloaded Operation

Until now, we have considered the (C_i, T_i, D_i) task model for independent
tasks, where C_i represents the worst-case execution time of a task. We
have assumed RM scheduling, a fixed-priority preemptive scheduling policy
which allows schedulability to be maximized while respecting the deadlines
of all tasks. This approach assumes that every task is equally important,
and that its worst-case execution time never exceeds C_i.
Mixed-criticality systems offer more flexibility. In particular, tasks can
have different criticality levels. For example, in automobiles, actuation con-
trol tasks like brake, steering and throttle control are more critical than
tasks that control the HVAC system. Hence, if there is ever a situation
where we can only satisfy the deadline of some tasks, we should meet
those with higher criticality.
Calculating the worst-case execution time for many real-time cyber-
physical systems, including automobiles, is unfortunately difficult. This
is in part due to the run-time dependencies on the operating environment
that can influence the execution time of a given task. For example, consider
the task that detects, classifies and tracks objects in a vehicle’s environment.
As the number of objects increases, the computational resources needed to

successfully perform all necessary calculations also increase. In cases like


these, there are two basic approaches to decide the worst-case execution
time: set a practical worst-case execution time determined by experimental
evidence or choose a theoretical worst case. Depending on the applica-
tion, this theoretical worst case can be quite large. Mapping this to our
example task, let us assume that the theoretical worst-case utilization for
our object detection and tracking task is 80%, whereas experimentally, the
worst-case utilization is at most 30%. This implies that, at almost all times,
the task's utilization is at most 30%, but it could on occasion go up to the
theoretical maximum value of 80%. If we consider the worst-case utiliza-
tion for this task as 80%, then essentially 50% of the node is left unused
all of the time, which is very inefficient for a resource-constrained environ-
ment. Conversely, if we assign the worst-case execution time as 30%, then
on the occasion that the task runs up to the theoretical maximum of 80%,
deadlines for potentially other safety-critical tasks may be missed.
In order to account for the above considerations, we adopt the over-
loaded task model [92], which consists of two mutually-exclusive operating
states with a criticality level:

τ_i = (C_i, C_i^o, T_i, D_i, ζ_i)   (6.1)

where C_i^o is the overload execution budget, and ζ_i is the criticality of the task.

In this model, each task is assumed to be in one of the following oper-
ating states (a minimal encoding is sketched after the list):

1. Normal, with an execution budget of Ci which represents the average-case load.



2. Overloaded, with an execution budget of Cio which represents the load during over-
loaded operation.
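A minimal encoding of this task model in an illustrative Python dataclass follows; the field names are assumptions for the sketch, not an API from the thesis.

from dataclasses import dataclass

@dataclass
class OverloadedTask:
    C: float        # normal execution budget (average-case load)
    C_o: float      # overload execution budget (C_o >= C)
    T: float        # period
    D: float        # deadline
    zeta: int       # criticality level (1 = highest)

    def normal_util(self) -> float:
        return self.C / self.T

    def overload_util(self) -> float:
        return self.C_o / self.T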

One straightforward approach for dealing with such mixed-criticality
tasks is to assign scheduling priorities based on criticality, also known as
Criticality As Priority Assignment (CAPA). This ensures that tasks with
higher criticality always have higher priorities than tasks with lower
criticality. Hence, if a higher-criticality task gets overloaded, only tasks
with lower criticality will miss their deadlines. However, CAPA can
result in significantly low schedulable utilization due to priority inversion,
which arises when tasks with low rate-monotonic scheduling priority have
high criticality.
Another approach is to use the Zero-Slack RMS (ZSRM) scheduling
algorithm [92, 93]. Under ZSRM scheduling, the execution of each task τi
is divided into two different modes: N (normal) and C (critical). In the
N mode, all active and otherwise non-suspended tasks in the system are
considered to be ready for scheduling purposes, whereas in the C mode of
task τi , all the tasks with lower criticality than τi are considered suspended
or blocked for scheduling purposes. The admission control algorithm then
calculates the execution time available for each mode.
Under ZSRM, the switch from the N mode to the C mode of a task
happens at its zero-slack instant. The zero-slack instant of any task τi is
defined as the latest time instant relative to the start of its period at which
suspending tasks with lower criticality than τi will ensure that jobs of τi
meet their zero-slack scheduling guarantee. Hence, with ZSRM scheduling,
admission control guarantees schedulability of the higher-criticality tasks
[94].
[Figure 6.16: CAPA-TPCD vs ZSRM-TPCD (bins used per iteration vs. number of primaries, for TPCD with ZSRM and with CAPA admission control)]

Another approach is to use period transformation, where the period

and execution time of a higher-criticality task which has a longer period


than a lower-criticality task is sliced equally into smaller sections such that
the transformed period is smaller than that of the lower-criticality task. Pe-
riod transformation might not always be feasible in the presence of shared
resources and hardware accelerators. So, in this section, we limit our dis-
cussion to CAPA and ZSRM.

6.2.1 Fault-tolerant Task Allocation

As described in the previous section, in the uniprocessor context, mixed-


criticality tasks can be scheduled using techniques like CAPA and ZSRM,
since they guarantee that the low-criticality tasks cannot interfere with
high-criticality tasks under overload conditions. Schedulability on multi-
core processors can be further improved by appropriate task allocation [95].
To this end, we extend the TPCD heuristic to incorporate overload con-
ditions while ensuring that the fault-tolerant task allocation condition is

satisfied.

Impact of admission control on resource utilization

In this section, we compare the performance of CAPA and ZSRM in terms
of resource utilization by introducing their respective admission-control
schemes while allocating tasks using the TPCD algorithm. Algorithm 9
presents this overload extension to the TPCD algorithm. As highlighted in
Algorithm 9, we maintain the task ordering while modifying the admission-
control technique. We evaluate the resource utilization of both techniques
by allocating randomly-generated task sets using both approaches
and comparing the number of bins used by each approach per task set.
The results are presented in Figure 6.16, where we vary the number of
primaries in each task set and plot the average number of bins used to
allocate the task sets across 50,000 task sets. We randomly generate
overloaded utilization values that are up to 100% greater than the normal
utilization values, ensuring that no task has utilization greater than 1.
Each task is also assigned 0, 1 or 2 backups. We also randomly assign a
criticality level of 1, 2 or 3 to each task, where criticality level 1 represents
the highest criticality and level 3 the lowest.
From Figure 6.16, we see that the ZSRM approach does consistently
better on average, i.e., it uses a smaller number of bins for allocating the
same task set compared to the CAPA approach.

Impact of task order on resource utilization

Task ordering can influence the allocation of certain tasks to certain nodes,
which in turn affects schedulability. To this end, we propose the C-TPCD
heuristic to evaluate the impact of the task order on resource utilization.
Since higher-criticality tasks cannot afford to miss deadlines, we prioritize
the placement of higher-criticality tasks first.

Algorithm 9 TPCD with Overload Considerations
1: procedure TPCD(Γ = {τ_0^0, τ_0^1, ..., τ_1^0, ..., τ_n^0, ...})
2:   for each task τ_j^i in Γ do
3:     Ψ_i ← τ_j^i                ▷ Create tiers consisting of tasks of the same order
4:   for each tier Ψ_i in Ψ do
5:     Sort tasks in descending order
6:     Task Assignment(α) ← BFD-P(Ψ_i) with ZSRM/CAPA admission control
7:   return α                     ▷ Return the task set assignment

Algorithm 10 C-TPCD
1: procedure C-TPCD(Γ = {τ_0^0, τ_0^1, ..., τ_1^0, ..., τ_n^0, ...})
2:   for each task τ_j^i in Γ do
3:     ζ_k ← τ_j^i                ▷ Create tiers consisting of tasks of the same criticality
4:   for each criticality tier ζ_k in ζ do
5:     for each task τ_j^i in ζ_k do
6:       Ψ_i ← τ_j^i              ▷ Create tiers consisting of tasks of the same order
7:     for each tier Ψ_i in Ψ do
8:       Sort tasks in descending order
9:       Task Assignment(α) ← BFD-P(Ψ_i) with ZSRM admission control
10:  return α                     ▷ Return the task set assignment

In the first stage, C-TPCD


breaks the task set into different criticality tiers. Next, it applies the TPCD
algorithm to each tier. At every criticality stage, the C-TPCD algorithm uses
ZSRM for admission control. Algorithm 10 presents the C-TPCD heuristic.
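A compact Python sketch of this criticality-tiered structure is shown below; it reuses the illustrative bfd_p helper and Task attributes from the earlier sketch in Section 6.1.1, and treating the admission test as bfd_p's plain utilization check is a simplifying assumption (a ZSRM schedulability test would replace it).

from collections import defaultdict

def c_tpcd(task_set):
    # Stage 1: split the task set by criticality level; Stage 2: apply the
    # TPCD order-tiering within each criticality tier onto the shared bins.
    bins = []
    crit_tiers = defaultdict(list)
    for t in task_set:
        crit_tiers[t.zeta].append(t)
    for zeta in sorted(crit_tiers):                 # zeta = 1 (highest) first
        order_tiers = defaultdict(list)
        for t in crit_tiers[zeta]:
            order_tiers[t.order].append(t)
        for order in sorted(order_tiers, reverse=True):
            bfd_p(order_tiers[order], bins)         # ZSRM test would go here
    return bins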
Figures 6.17-6.21 highlight these stages with an example. Figure 6.17
shows an example task set where ζ represents the criticality level of each
task in the system. The tasks are also assigned their computation times in
normal and overload modes. Tasks with lower ζ values (i.e., tasks of
higher criticality) have more backups, in this case 2 each, as depicted
in Figure 6.18. On the other hand, tasks with higher values of ζ
(i.e., tasks of lower criticality) have fewer backups; in this case, tasks
with ζ = 3 have no backups (Figure 6.18). In the C-TPCD heuristic, the
highest-criticality tier is allocated first. At each allocation, ZSRM is used to
verify admission. This ensures that, in the case of overload, all the highest-
criticality tasks will meet their deadlines. Once the highest tier is allocated,
the next criticality tier is allocated in the same way. This is shown in Figures
6.19 and 6.20. Figure 6.21 depicts the final allocation.

[Figure 6.17: C-TPCD: Example Taskset (tasks with criticality levels ζ, normal and overload computation budgets, and the corresponding utilizations)]

[Figure 6.18: C-TPCD: Criticality Tiers]

[Figure 6.19: C-TPCD: Highest-criticality assignment]

[Figure 6.20: C-TPCD: Medium-criticality assignment]

[Figure 6.21: C-TPCD: Final assignment]

We use an experimental setup identical to the one described in Section 6.2.1
to evaluate the performance of C-TPCD-ZSRM against TPCD-ZSRM. As
Figure 6.22 depicts, TPCD-ZSRM still performs better in terms of resource
utilization. This implies that prioritizing the fault-tolerant packing insights
from Section 6.1.1 has a greater benefit in terms of resource utilization as
compared to task criticality.

[Figure 6.22: TPCD vs C-TPCD (bins used per iteration vs. number of primaries, both with ZSRM admission control)]
As our final experiment, we evaluate all three approaches, C-TPCD-ZSRM,
TPCD-ZSRM and TPCD-CAPA, to see which one produces a better
allocation than the others. Figure 6.23 presents the results. As seen from
the figure, the TPCD-ZSRM approach performs the best. However, it is
important to note that, up to 20% of the time, C-TPCD does outperform
the other approaches. It is also interesting to note that CAPA-TPCD can
produce a better allocation in a small but sizable fraction of the cases. This
is because, for a particular task set, CAPA admission control can result in
a node assignment that improves the fault-tolerant allocation. Hence, in
practice, an ensemble approach [96] that applies all approaches and picks
the best allocation would be ideal.
[Figure 6.23: TPCD-CAPA vs C-TPCD-ZSRM vs TPCD-ZSRM (percentage of task sets where each approach produces an allocation with at least one fewer bin)]

6.3 Task Partitioning with Recovery Time Constraints

In Section 6.1, we presented the fault-tolerant task allocation problem. We
now extend this problem to include the constraint that every backup task
satisfies the recovery-time requirement of its primary. Given our focus
on resource-constrained environments, we present heuristics that address
this problem while trying to minimize the number of processors used for
allocation. Based on the recovery-time bounds of Section 5.2.2, we derived
conditions to determine the standby type in Section 5.3. In this section, we
look at how the redundant-task-type assignment can be incorporated into
the task allocation scheme to satisfy the recovery-time requirement of each
primary.
The TPCD heuristic [81] produces an allocation satisfying the fault-
tolerant placement constraint while attempting to minimize the number
of nodes used. TPCD breaks the task set into tiers based on the backup
order to place members of a replica group as far away from each other
in the task order as possible. This reduces the chances of a task facing a
placement conflict. In each tier, TPCD arranges tasks in descending order
of utilization values, since members of larger groups have a greater prob-
ability of running into a placement conflict. TPCD then allocates the tiers
from the highest-order tier to the lowest-order tier. The TPCDC heuristic
extends TPCD to leverage lower cold-standby utilizations. Any non-critical
task can be terminated in order to allow a cold standby to execute when a
primary fails. TPCDC initially treats all standbys as hot standbys from a
utilization standpoint.

6.3.1 The TPCDC+R Heuristic

We now extend TPCDC by introducing an explicit check for RTR. The
resulting TPCDC+R heuristic is shown in Algorithm 11. Before assigning a
task to a node, we ensure that every task (primary or copy) on that node
satisfies its RTR constraint. In order to determine the recovery time of a
redundant task, we must first assign the redundant-task type using Table 5.1.

Algorithm 11 TPCDC+R
1: procedure TPCDC+R(Γ = {τ_0^0, τ_0^1, ..., τ_1^0, ..., τ_n^0, ...})   ▷ τ_j^i: j → TaskId, i → TierOrder
2:   for each task τ_j^i in Γ do
3:     Ψ_i ← τ_j^i              ▷ Create tiers consisting of tasks with redundancies of the same order
4:   for each tier Ψ_i in Ψ do
5:     Sort tasks in descending order of their utilizations
6:     for each task τ_i in Ψ_i do
7:       Check the recovery time to the primary and assign the redundant-task type
8:       Task Assignment(α) ← BFD-P(τ_i)
9:   Apply lower run-time utilizations for cold standbys
10:  Allocate the tasks that do not have redundancies
11:  return α                   ▷ Return the task set assignment

Algorithm 12 TRTI
1: procedure TRTI(Γ = {τ_0^0, τ_0^1, ..., τ_1^0, ..., τ_n^0, ...})   ▷ τ_j^i: j → TaskId, i → TierOrder
2:   for each task τ_j^i in Γ do
3:     Ψ_i ← τ_j^i              ▷ Create tiers consisting of tasks with redundancies of the same order
4:   for each tier Ψ_i in Ψ do
5:     Sort tasks in ascending order of RTR constraints
6:     for each task τ_i in Ψ_i do
7:       Check the recovery time to the primary and assign the redundant-task type
8:       Task Assignment(α) ← BFD-P(τ_i)
9:   Apply lower run-time utilizations for cold standbys
10:  Allocate the tasks that do not have redundancies
11:  return α                   ▷ Return the task set assignment

Since cold standbys have very low utilization values at run time, an
optimization becomes possible: non-safety-critical tasks can be assigned to
processors hosting cold standbys and terminated in case the cold standby
needs to take over primary execution. Hence, if multiple redundant-task
options are available, we prioritize cold standbys over hot standbys and
active replicas because they are the most resource-efficient. Next in
preference are hot standbys: they do not normally produce outputs, so the
overhead of duplicate suppression is avoided, and they can potentially run
a degraded version of the primary with lower utilization values. However,
they may incur a scheduling penalty since they need to satisfy RTR
constraints. Therefore, the heuristic first checks whether a hot standby
satisfies the RTR constraint of the task. If so, it assigns a hot standby;
otherwise, it chooses an active replica instead of opening a new node for
assignment.
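In code, this preference order is a simple lookup over the feasible set computed earlier; this is a sketch whose type names mirror the illustrative helper from Section 5.3, not the thesis implementation.

def choose_redundant_type(feasible):
    # Most to least resource-efficient, per the discussion above.
    for kind in ("cold", "hot", "active"):
        if kind in feasible:
            return kind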
It must be noted that the choices among the three redundant-task types
would be different if the goal were different. For example, if communication
bandwidth is constrained, the cold-standby overheads for state transfer
need to be factored in.³

³ We will consider this overall system resource optimization problem as part of our future work.
As stated before, we prioritize cold standbys over hot standbys or active
replicas. Figure 6.25a shows the distribution of standby types produced
by TPCDC+R. We plot the percentage of active, hot or cold redundant-task
assignments against the number of primary tasks in each task set.
The results are averaged across 50,000 task sets, where tasks are randomly
generated. Each task is randomly assigned 0, 1 or 2 redundancies, an RTR
constraint from 0 to 5, and a value for p (i.e., the number of periods for
cold-standby priming) from 0 to 5.

[Figure 6.24: Example: TPCDC+R vs RTT. (a) Input Task Set; (b) TPCDC+R Tiering; (c) RTT Tiering; (d) TPCDC+R critical task allocation; (e) RTT critical task allocation; (f) TPCDC+R non-critical task allocation; (g) RTT non-critical task allocation]

Algorithm 13 RTT
1: procedure RTT(Γ = {τ_0^0, τ_0^1, ..., τ_1^0, ..., τ_n^0, ...})
2:   for each task τ_j^i in Γ do
3:     Ψ_i ← τ_j^i              ▷ Create tiers consisting of tasks of the same RTR
4:   for each tier Ψ_i in Ψ do
5:     TPCDC+R(Ψ_i)
6:   return α                   ▷ Return the task set assignment

[Figure 6.25: Evaluation: RTT vs TRTI vs TPCDC+R. (a) TPCDC+R standby distribution; (b) TRTI standby distribution; (c) RTT standby distribution; (d) comparative evaluation]
TPCDC+R prioritizes tasks with higher utilization values by assigning
them first in the task allocation order for each tier. This introduces addi-
tional placement constraints for tasks which have tight RTR requirements.

An example occurs when a task with low utilization with strict RTR re-
quirements gets placed later in the allocation order. As a result, cold stand-
bys may become unschedulable forcing the use of active replicas, which in
turn can cause new nodes to be added.
To address this problem, we introduce two new heuristics based on
TPCDC+R that prioritize RTR constraints in the task allocation order.

1. In the first heuristic, we order tasks within each tier of TPCDC by their RTR
requirements instead of their utilization values. We refer to this extension as the
Tiered RTR-constraint Increasing (TRTI) heuristic. Algorithm 12 captures the TRTI
heuristic.

2. In the second heuristic, we divide tasks into groups with different RTR require-
ments and allocate each group separately using the TPCDC heuristic. We refer to
this as the RTR-Tiered (RTT) heuristic. Algorithm 13 presents this heuristic; a brief
sketch follows the list.
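The sketch below shows the RTT wrapper in the same illustrative Python style as the earlier examples; tpcdc_r is a hypothetical stand-in for the TPCDC+R procedure of Algorithm 11, and the rtr attribute is an assumed field on tasks.

from collections import defaultdict

def rtt(task_set):
    # Group tasks by their RTR requirement and allocate each group
    # with TPCDC+R onto the shared set of bins (Algorithm 13).
    bins = []
    rtr_tiers = defaultdict(list)
    for t in task_set:
        rtr_tiers[t.rtr].append(t)
    for rtr in sorted(rtr_tiers):       # tightest RTR constraints first
        tpcdc_r(rtr_tiers[rtr], bins)   # hypothetical TPCDC+R helper
    return bins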

Figure 6.24 depicts an example highlighting how prioritizing RTR con-
straints in the task allocation order can improve resource utilization, by
comparing the outputs of the TPCDC+R and RTT heuristics for the in-
put task set in Figure 6.24a. As shown in Figure 6.24b, TPCDC+R breaks
the critical tasks into tiers based on the number of backups and orders tasks
within a tier based on their utilization values. In contrast, RTT breaks the
tasks into tiers based on their RTR constraints. Figures 6.24d and 6.24e
show that the RTT heuristic allocates a greater number of cold standbys
compared to the TPCDC+R heuristic. This, in turn, results in an allocation
with fewer nodes, as seen in Figures 6.24f and 6.24g. Notice that, when al-
locating the non-critical tasks, we consider the lower utilization values for
the cold standbys.
Figures 6.25b and 6.25c show the standby distributions for the TRTI and
RTT heuristics. Both heuristics result in a larger number of cold-standby
allocations than the TPCDC+R heuristic.

6.3.2 Evaluation and Discussion

In this section, we evaluate and compare the performance of the TPCDC+R,
TRTI and RTT heuristics. We also evaluate the impact of the increased cold-
standby allocation on the number of nodes used by the new heuristics. We
plot the percentage of task sets for which a heuristic produces an allocation
with fewer nodes, i.e., uses at least one node less for allocation compared
to the other two heuristics. Figure 6.25d presents the results for 50,000
randomly-generated task sets generated using Stafford's Randfixedsum
algorithm [97], with random total utilization values ranging from 0.1 to the
number of primaries and random periods ranging from 0 to 10⁴. Each task
is randomly assigned 0, 1 or 2 copies, an RTR constraint from 0 to 5, and a
value for p (i.e., the number of periods for cold-standby priming) from 0
to 5. As the figure shows, RTT produces an allocation with fewer nodes
on average when compared to TRTI and TPCDC+R. For task sets with
24 primaries, it produces an allocation with fewer nodes than TRTI and
TPCDC+R for almost 23% of the task sets. This is consistent with the in-
tuition that increasing the number of cold standbys reduces CPU resource
utilization. Also, as the number of primaries increases, this trend becomes
more significant, as we have more cold-standby assignments to leverage.
Moreover, both heuristics that prioritize the RTR constraints perform bet-
ter than the TPCDC+R heuristic. It is important to note that increasing the
number of cold standbys will result in additional network latencies, since
they need to have state information sent to them from their primaries. For
the purpose of these experiments, we assume that the delays incurred for
state transfer are short. For a network-constrained system, it may prove
more advantageous to have a smaller number of cold standbys.

6.4 Applying Simulated Annealing to the Fault-Tolerant

Task Allocation Problem

In the previous section, we saw that the RTT heuristic on average pro-
duces a better solution than TPCDC+R and TRTI. In this section, we look
at improving further on the RTT solution by applying the simulated
annealing method to the fault-tolerant task allocation problem.
Simulated annealing is a general-purpose combinatorial optimization
technique first proposed by Kirkpatrick et al. [44]. The fault-tolerant task
assignment problem can be stated as an optimization problem as follows.
Given n tasks (τ_1, τ_2, ..., τ_n) with utilizations (u_1, u_2, ..., u_n), where u_i ≤ 1,
find the number of nodes M of size 1 needed to pack all tasks
such that a primary task and its corresponding redundant copies obey
the placement constraint of not being co-located on the same node, while
optimizing the following cost function [98]:

cf = Maximize Σ_{j=1}^{M} ( Σ_{i ∈ k_j} u_i )²   (6.2)

where k_j represents the set of tasks in bin j.


The simulated annealing algorithm for fault-tolerant task allocation is
shown in Algorithm 14. The algorithm starts by using the RTT heuristic to
create an initial allocation α, which we use as the initial state of the system.
To obtain a new state α′ from the current state, we randomly perform one
of the two operations described in Section 6.4.1. While performing either
of these operations, we ensure that the placement constraints for all tasks
remain satisfied. We also ensure that the new allocation is schedulable.
Here, we apply a greedy optimization: if a valid operation results in an

Algorithm 14 Simulated Annealing
1: procedure anneal(Γ = {τ_0^0, τ_0^1, ..., τ_1^0, ..., τ_n^0, ...})
2:   Task Assignment(α) ← RTT(Γ)
3:   T ← T_∞
4:   while T > T_0 do
5:     repeat
6:       α′ ← randomlyModifyCurrentSolution(α)
7:       ΔC ← cf(α) − cf(α′)            ▷ From Equation (6.2)
8:       η ← RANDOM(0, 1)
9:       P(ΔC) ← e^(−ΔC/T)
10:      if ΔC < 0 or P(ΔC) > η then
11:        α ← α′
12:    until thermal equilibrium
13:    T ← F(T)
14:  return α                           ▷ Return the task set assignment

empty bin, we remove it from the allocation⁴. The value of the objective
function is calculated for this new state. Let ΔC represent the change in
the cost function, i.e., ΔC = cf(α) − cf(α′). The new state is unconditionally
accepted if ΔC < 0. If not, the Metropolis condition [99] is applied and the
state is accepted with a probability given by the acceptance function
P(ΔC) = e^(−ΔC/T). We start with a large value for the initial temperature,
T = T_∞. When there is no appreciable change in the value of the cost
function across a few chains of computation, or a maximum number of
iterations is reached, we lower the temperature. The annealing terminates
when the temperature T reaches a low-enough value T_0, and the current
best α is returned as the solution. We derive the values of T_∞ and T_0 for
the fault-tolerant task allocation problem in Section 6.4.2.
⁴ In our experiments, we found no significant improvement in the quality of solutions obtained by retaining an empty bin.
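A minimal Python skeleton of Algorithm 14 is sketched below, under the assumption that a supplied neighbor function returns a placement- and schedulability-valid variant of the current allocation; the parameter names and defaults are illustrative.

import math
import random

def cost(bins):
    # Objective from Equation (6.2): sum over bins of (total utilization)^2.
    return sum(sum(t.util for t in b) ** 2 for b in bins)

def anneal(initial_bins, neighbor, t_inf, t_0, chain_len=200, cooling=0.9):
    current, T = initial_bins, t_inf
    while T > t_0:
        for _ in range(chain_len):              # one chain per temperature
            candidate = neighbor(current)
            delta_c = cost(current) - cost(candidate)
            if delta_c < 0 or math.exp(-delta_c / T) > random.random():
                current = candidate             # accept (Metropolis condition)
        T *= cooling                            # F(T) = 0.9 * T
    return current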

6.4.1 Generating Random Solutions

In order to create random solutions from a given solution, we apply the


following two operations [100].

1. We randomly move a single task from a randomly-selected node k to another
randomly-selected node l.

Lemma 10. The maximum reduction ΔC_max of the cost function in Equation (6.2), for a
system of two nodes k and l, obtained by moving a task from node k to l, occurs when U_k = 1 and
U_l = 0, where U_k and U_l are the total utilization values of the respective nodes.

Proof. Let u_t represent the utilization of the task that is moved from bin k to l. Let
U_k′ and U_l′ be the transformed utilization values after the task is moved from node
k to l. Hence, U_k′ = U_k − u_t and U_l′ = U_l + u_t, and ΔC for this operation can be
represented as

ΔC = U_k² + U_l² − U_k′² − U_l′² = U_k² + U_l² − (U_k − u_t)² − (U_l + u_t)²
   = 2 ∗ U_k ∗ u_t − 2 ∗ U_l ∗ u_t − 2 ∗ u_t²   (6.3)

From Equation (6.3), ΔC is maximum when the positive terms are maximized and
the negative terms are minimized. U_l appears only in the second term, which
is negative, and U_k appears only in the first term, which is positive. Hence, ΔC
is maximized when U_k = 1 and U_l = 0, corresponding to their maximum and
minimum possible values.

For the fault-tolerant task allocation problem, moving a task from one bin to an-
other can result in a different redundant-task-type assignment, and hence in different
run-time utilizations. Let the factor s capture this utilization change. The associ-
ated change in the cost function for this operation is given by

ΔC = U_k² + U_l² − [(U_k − u_t)² + (U_l + s ∗ u_t)²]
   = 2 ∗ U_k ∗ u_t − u_t² − 2 ∗ U_l ∗ s ∗ u_t − (s ∗ u_t)²   (6.4)

From Lemma 10, the maximum value of ΔC, which represents the largest reduction
in the cost function, occurs when a task is moved from a completely-packed node
to a completely-empty node. Since we apply the greedy optimization of removing
empty bins, we consider U_l = ε. Hence,

ΔC_max1 ≈ 2 ∗ u_t − u_t² − (s ∗ u_t)²   (6.5)

2. We randomly select two tasks currently located in two different bins and swap
them.

Lemma 11. The maximum reduction ΔC_max of the cost function in Equation (6.2), for a
system of two nodes k and l, obtained by swapping two tasks, occurs when one of the nodes has
U = 1 and the other has U = ε.

Proof. Let U_k and U_l be the total utilization values of the respective nodes. Let u_t1
represent the utilization of the task that is moved from bin k to l, and u_t2 the
utilization of the task that is moved from bin l to k. Let U_k′ and U_l′ be the
transformed utilization values after the tasks are swapped. Hence, U_k′ = U_k −
u_t1 + u_t2 and U_l′ = U_l + u_t1 − u_t2, and ΔC for this operation can be represented as

ΔC = U_k² + U_l² − U_k′² − U_l′²
   = U_k² + U_l² − (U_k − u_t1 + u_t2)² − (U_l + u_t1 − u_t2)²
   = 2 ∗ U_k ∗ (u_t1 − u_t2) − 2 ∗ U_l ∗ (u_t1 − u_t2) − 2 ∗ (u_t1 − u_t2)²
   = 2 ∗ (U_k − U_l) ∗ (u_t1 − u_t2) − 2 ∗ (u_t1 − u_t2)²   (6.6)



From Equation (6.6), ∆C is maximum when Uk − Ul ≈ 1, since 0 < Uk , Ul ≤ 1.
Since we are swapping tasks between two nodes, a node cannot be empty. Hence,
∆C is maximized when one node has U = 1 and the other has U = e.

For our fault-tolerant task allocation problem, let the factors st1 and st2 capture the
utilization changes after the swap. The associated change in the cost function for
this operation is given by,

∆C = Uk² + Ul² − [(Uk − ut1 + st2 ∗ ut2 )² + (Ul + st1 ∗ ut1 − ut2 )²]
   = 2 ∗ Uk ∗ ut1 − 2 ∗ Uk ∗ st2 ∗ ut2 + 2 ∗ ut1 ∗ st2 ∗ ut2 − ut1² − (st2 ∗ ut2 )²   (6.7)
     + 2 ∗ Ul ∗ ut2 − 2 ∗ Ul ∗ st1 ∗ ut1 + 2 ∗ ut2 ∗ st1 ∗ ut1 − ut2² − (st1 ∗ ut1 )²

From Lemma 11, the reduction in the cost function is maximized when one bin has
U = 1 and the other has U = e. Hence,

∆Cmax2 = 2 ∗ ut1 − 2 ∗ st2 ∗ ut2 + 2 ∗ ut1 ∗ st2 ∗ ut2 − ut1² − (st2 ∗ ut2 )²
         + 2 ∗ e ∗ ut2 − 2 ∗ e ∗ st1 ∗ ut1 + 2 ∗ ut2 ∗ st1 ∗ ut1 − ut2² − (st1 ∗ ut1 )²   (6.8)

obtained by substituting Uk = 1 and Ul = e into Equation (6.7).

Given a task set, the value of ∆Cmax = max(∆Cmax1 , ∆Cmax2 ) can be easily calcu-
lated by substituting actual values into Equations (6.5) and (6.8) for all combina-
tions of tasks, as the sketch below illustrates.
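For concreteness, ∆Cmax can be computed directly from Equations (6.5) and (6.8). The sketch below is illustrative: the Task struct and all names are ours, and e denotes the small non-zero utilization of the least-loaded non-empty node.

#include <algorithm>
#include <vector>

// Illustrative task descriptor: 'u' is the task utilization ut and 's' the
// utilization-change factor caused by a redundant-task-type reassignment.
struct Task { double u; double s; };

// Evaluates Equations (6.5) and (6.8) over all tasks and task pairs and
// returns deltaCmax = max(deltaCmax1, deltaCmax2).
double delta_c_max(const std::vector<Task>& tasks, double e) {
    double best = 0.0;
    // Equation (6.5): a single-task move with Uk = 1 and Ul = e.
    for (const Task& t : tasks) {
        best = std::max(best,
            2*t.u - t.u*t.u - 2*e*t.s*t.u - (t.s*t.u)*(t.s*t.u));
    }
    // Equation (6.8): a task swap with Uk = 1 and Ul = e.
    for (const Task& t1 : tasks) {
        for (const Task& t2 : tasks) {
            const double a = t1.s * t1.u;  // st1 * ut1
            const double b = t2.s * t2.u;  // st2 * ut2
            best = std::max(best,
                2*t1.u - 2*b + 2*t1.u*b - t1.u*t1.u - b*b
                + 2*e*t2.u - 2*e*a + 2*t2.u*a - t2.u*t2.u - a*a);
        }
    }
    return best;
}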

6.4.2 Selecting an Annealing Schedule

The annealing schedule is described by quantitative choices for three
parameters: the starting value of the temperature T∞ , the stopping value of
the temperature To , and the decrement function F ( T ), which determines the
profile of the temperature from the beginning to the end of the annealing
process.

The starting temperature T∞ for a good annealing schedule is usually
determined by monitoring the acceptance ratio at each temperature. The
upper bound for the acceptance ratio ah (the fraction of generated states that
are accepted) is arbitrarily fixed at some high value such as 0.9, and the
temperature is increased to a value where this acceptance ratio is achieved
[100]. Given that we can calculate ∆Cmax for a given task set, we can cal-
culate the value of T∞ , which can accommodate even the largest reduction in
the cost function at high temperatures, as follows.

ah = e(−∆Cmax /T∞ ) ⇒ ln(1/ah ) = ∆Cmax /T∞ ⇒ T∞ = ∆Cmax /ln(1/ah )         (6.9)

Similarly, To can be calculated for the lower bound of the acceptance
ratio al .

To = ∆Cmax /ln(1/al )                                                       (6.10)

In our experiments, we also found that F ( T ) = 0.9 ∗ T works well for
the problem at hand.
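Combining Equations (6.9) and (6.10) with the decrement function gives the complete schedule. The sketch below encodes it; the struct and the example value of al are illustrative choices on our part.

#include <cmath>

// Annealing schedule per Equations (6.9) and (6.10): derive the starting
// and stopping temperatures from deltaCmax and the acceptance-ratio
// bounds a_h and a_l, and cool geometrically with F(T) = 0.9 * T.
struct Schedule { double T_inf; double T_o; };

Schedule make_schedule(double deltaCmax, double a_h, double a_l) {
    return Schedule{
        deltaCmax / std::log(1.0 / a_h),  // Equation (6.9)
        deltaCmax / std::log(1.0 / a_l)   // Equation (6.10)
    };
}

// Usage: with a_h = 0.9 and an illustrative a_l = 0.01, the cooling loop is
//   for (double T = s.T_inf; T > s.T_o; T *= 0.9) { /* anneal one chain */ }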

6.4.3 Evaluation

In this section, we compare the performance of the simulated annealing
approach with that of the RTT heuristic. We plot the execution time of the
simulated annealing approach and the RTT heuristic against the number
of primaries in the task set. Figure 6.26 presents the results averaged
across 5,000 randomly-generated task sets. Each task is randomly assigned
0, 1, or 2 redundancies, an RTR constraint from 0 to 5, and a value for p (i.e.,
periods for cold standby priming) from 0 to 5. Note that the Y-axis is in
log scale. Our heuristics are faster than the simulated annealing approach
Figure 6.26: Execution-Time Evaluation

Figure 6.27: Resource Utilization Evaluation

by more than two orders of magnitude. We also plot the number of nodes
utilized by each technique per iteration against the number of primaries in
the task set. Figure 6.27 presents the results for 5,000 randomly-generated
task sets generated using Stafford's Randfixedsum algorithm [97] with ran-
dom total utilization values ranging from 0.1 to the number of primaries and
random periods ranging from 0 to 10⁴. As Figures 6.26 and 6.27 show, though
the simulated annealing algorithm takes longer to complete, it produces an
allocation with fewer nodes on average when compared to RTT. This ap-
proach can be used for generating offline static allocations and in other
non-time-sensitive contexts. In contrast, our heuristics can be used for run-
time admission control and other environments that are time-sensitive.

6.5 Chapter Summary

In this chapter, we proposed a new heuristic, called Tiered Placement Constraint
Decreasing (TPCD), for assigning a set of replicated tasks to specific
processors. The number of processors used needs to be minimized without
compromising fault-tolerance. Our approach saves at least one processor
up to 40% of the time for a random task set and up to 90% of the time for
an L2 task set (task sets typical of Level 2 driving automation), relative
to the best-known heuristic in the literature. Also, on average, it uses
only up to one processor more than a carefully-constructed optimal allo-
cation. We also proposed TPCDC, a variant of TPCD which leverages the
run-time characteristics of cold standbys, further improving the resource
utilization of the system. We then derived the worst-case response-time
bounds of our task model and presented a solution to improve this bound.
We also analyzed the impact of overload conditions on our task allocation
scheme. We extended

the fault-tolerant task allocation problem to include these RTR constraints,


and proposed the TPCDC+R heuristic to satisfy these constraints. Having identified
a core weakness in TPCDC+R, we then presented two additional heuristics
called Recovery-Time Tiered (RTT) and Tiered Recovery-Time Constraint
Increasing (TRTI) which prioritize the RTR constraints in the task alloca-
tion sequence. These two heuristics on average produce allocations with
fewer nodes than the TPCDC+R heuristic because they yield more assign-
ments of resource-efficient cold standbys. Overall, the RTT heuristic, which
tiers tasks based on their RTR values to prioritize the allocation of tasks
with strict RTR requirements first, performs the best. Finally, we used the
simulated annealing method to solve the fault-tolerant task allocation op-
timization problem and showed that it produces allocations utilizing fewer
computing resources than the proposed heuristics, but at the cost of sub-
stantial run-time.
Chapter 7

Software Architecture to Support and Maintain Fault-Tolerance Guarantees

Providing all correct processors with consistent views of the processor-
group membership and guaranteeing bounded processor-failure detection are
important to ensure reliable and safe operation of automotive systems. In
this chapter, we present software architectures to support the safe execution
of tasks on platforms such as the AUTOSAR Classic Platform [22], the
AUTOSAR Adaptive Platform [101] and the CMU Autonomous Driving
Platform [102].
In Section 7.1, we present the software architecture to support the ex-
ecution of different standby types in the AUTOSAR Classic Platform, fol-
lowed by a software framework built on a service-oriented architecture for
the AUTOSAR Adaptive Platform in Section 7.2. Finally, in Section 7.3, we
present SAFFIRE (Software Architecture For Fault-tolerant Imbed Real-
time Environments), designed to support fault-tolerant execution on the
CMU Autonomous Driving Platform. In each section, we present our
implementation and an experimental evaluation of the architecture.

7.1 Fault-Tolerant Software Architecture for the AUTOSAR Classic Platform

AUTOSAR (AUTomotive Open System ARchitecture [22]) is an open and


standardized automotive software architecture jointly developed by auto-
mobile manufacturers, suppliers, and tool vendors. AUTOSAR standard-
izes system functions, providing scalability across differ-
ent vehicle and platform variants. It also enables the integration of soft-
ware from multiple suppliers, maintainability throughout the entire prod-
uct life-cycle, as well as software updates and upgrades over the vehicle’s
lifetime. Hence, AUTOSAR is an excellent platform to verify the feasibil-
ity, performance, and timing behavior of our fault-tolerance primitives. In
this context, we design and implement a software library which supports
the creation and management of task-level software fault-tolerance prim-
itives. The core of this library is independent of any AUTOSAR-specific
constructs, so it can be easily ported to other real-time frameworks. The li-
brary allows creation of hot and cold standbys corresponding to each task
in the system. The library also defines primitives that each task can use
to form a group with its corresponding standbys. This allows the mem-
bers within the group to communicate their health, monitor status, detect
failures, and reconfigure their execution in case of failures.
In this section, we present our run-time framework to support hot and
cold standbys in the AUTOSAR Classic Platform. We consider the example
of a traction control application for our discussion.

Figure 7.1: Group Formation (the traction control primary on Node1 with a hot standby
on Node2 and a cold standby on Node3; HB = heartbeat, GI = group information, ST =
state)

Figure 7.2: Failure Recovery (on primary failure, the hot standby on Node2 takes over as
the new primary)

7.1.1 Group Formation and Health Monitoring

It is our central goal that a standby can take over execution when the
primary fails. This is achieved by the formation of task-level groups. A
group includes a single primary and all of its corresponding standbys. On
startup, the designated primary for each application initiates a group for-
mation protocol [45], [103] with its standbys resulting in the formation of
a task-level group for which the primary acts as the leader. For example,

Figure 7.3: Dynamic Reconfiguration (the re-launched task rejoins the group as a hot
standby while Node2 continues as the primary)

Figure 7.1 shows a group consisting of one primary traction control task,


supported by one hot standby and one cold standby. Most fault-tolerant
group membership protocols can be adapted for our architecture. We high-
light some of the messages exchanged as part of our architecture in Section
7.1.4.
The members of the group monitor each other’s health by exchanging
regular status messages (heartbeats). The reception of a heartbeat indicates
that a task is running. If a given number of consecutive heartbeats is
missed, then this is considered an indication of task failure. The value
for the number of consecutive heartbeats to be monitored depends on var-
ious factors like the parameters of the task itself (Ci , Ti , Di ) and the relative
phasing of the executions of the primary and the standbys. The calcula-
tion of these parameters is detailed in Section 7.1.7. A task running on a
particular node could fail for one of many reasons which may include the
loss of power, hardware component faults, memory faults and unexpected
inputs. The problem of error detection has been studied extensively in
the literature and there are various detection mechanisms for each of these
failure types as highlighted in [104]. Our approach is generic and we can
substitute any error detection mechanism into our framework. For exam-
ple, if the operating system were to detect a memory fault, the AUTOSAR
BSW can communicate this information to a standby which can take over
primary operation.
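As an illustration of this monitoring logic, the per-period check performed by a standby can be sketched as follows; the class and all names are ours, and k is the task-specific threshold derived in Section 7.1.7.

// Illustrative heartbeat monitor executed once per standby period. 'k' is
// the number of consecutive missed heartbeats that indicates primary
// failure, derived from (Ci, Ti, Di) and the relative phasing.
class HeartbeatMonitor {
public:
    explicit HeartbeatMonitor(int k) : k_(k) {}

    // Called each period with whether a fresh primary heartbeat arrived
    // since the last check; returns true once failure should be declared.
    bool primary_failed(bool heartbeat_received) {
        missed_ = heartbeat_received ? 0 : missed_ + 1;
        return missed_ >= k_;
    }

private:
    int k_;
    int missed_ = 0;
};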

7.1.2 Hot and Cold Standby Operation

The standbys in the group have a precedence order dictated by the primary
to decide which standby would take over in the case of a primary failure.
A standby failure while the primary is still running results in the standbys
next in line being promoted to a higher position in the precedence queue.
As one would expect, a hot standby sits higher in the precedence order
than a cold standby.
When a hot standby has detected ‘k’ consecutive missed heartbeats,
it declares that the primary has failed and begins to produce outputs as
shown in Figure 7.2. It also maintains the group information that it regu-
larly received from the previous primary. This allows the new primary to
take over as the leader of the group and start producing heartbeats and
state messages for the other standbys.
As pointed out in Section 4.2, a typical use case of a cold standby
would involve it being scheduled to run on a node which runs other lower-
priority, non-placement-critical tasks. Since the cold standby does not
perform any calculations, it would be computationally very inexpensive,
thus allowing the other tasks to execute without missing any deadlines. In
case of the failure of a placement-critical primary task, some of the lower-
priority non-placement-critical tasks would be degraded or switched off to
make CPU resources available to allow the cold standby to take over for the
failed primary. The cold standby can also be configured to run as a pure
backup of a hot standby: once it detects a primary failure and the hot
standby takes over execution, the cold standby is immediately promoted
to a hot standby.

Figure 7.4: Worst-case behavior: Fixed phasing

Figure 7.5: Behavior: Variable phasing

Figure 7.6: Single missed heartbeat


7.1.3 State Maintenance

We divide user-level applications into two types:

1. Applications With State: These applications maintain state information derived from
previous inputs, which they use to perform calculations on current inputs. In our
case, the traction control task employs digital filters that require some state
information to process current inputs. The use of such filters is very common in
automotive applications. A hot standby calculates and maintains all the state
information necessary, whereas the cold standby does not produce any state in-
formation. Instead, the primary sends this state information to the cold
standby. In our case, state transfer occurs once every primary period, but the frequency
of state transfer can be optimized based on the data freshness and recovery-time
requirements of the cold standby. Quantifying this parameter is out of the scope
of this section, and we will consider this problem as part of our future work.
The cold standby also maintains a log of the input messages, which allows it to re-
cover state after a primary failure, as illustrated in the sketch after this list. The
cold standby prunes this log every time it successfully receives state from the primary.

2. Applications Without State: These applications do not need any state to be main-
tained for their operation. In this case, no state information needs to be exchanged
between the primary and the cold standby.
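A minimal sketch of the cold standby's input log and its pruning on state reception is shown below; the types and the replay-queue organization are illustrative assumptions, not the data structures of our library.

#include <cstdint>
#include <deque>
#include <vector>

// Illustrative cold-standby log: inputs are buffered so that state can be
// rebuilt by replay after a primary failure, and the log is pruned each
// time a state snapshot is successfully received from the primary.
struct Input { uint64_t seq; std::vector<uint8_t> payload; };

class ColdStandbyLog {
public:
    void on_input(Input in) { log_.push_back(std::move(in)); }

    // The primary sent a state snapshot covering all inputs up to 'seq';
    // earlier inputs are no longer needed for recovery.
    void on_state(uint64_t seq, std::vector<uint8_t> snapshot) {
        state_ = std::move(snapshot);
        while (!log_.empty() && log_.front().seq <= seq) log_.pop_front();
    }

    // On primary failure: start from the last snapshot and replay these
    // inputs to catch up before producing outputs.
    const std::deque<Input>& replay_queue() const { return log_; }

private:
    std::vector<uint8_t> state_;
    std::deque<Input> log_;
};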

7.1.4 Group Maintenance

In this section, we describe some of the messages that are passed between
the primary and its backups to maintain group membership. The primary
acts as the leader of the group and has information about all its members.
It shares this information with its group members so that, in the case of
a primary failure, the remaining group members and the new leader can
continue to maintain the group.
Since the task-to-node assignment is done beforehand, each task is pre-
defined to run as either a primary, hot or cold standby. On startup, all
nodes enter a group formation phase. In this phase, a primary periodically
broadcasts a Groupcreate message along with its heartbeat. It then waits for
a pre-configured interval Ttimeout pri (typically a multiple of its period Tpri )
to receive any Groupcreated responses, which would indicate that a primary
already exists. If there is no such response in this interval, the primary
moves out of the startup phase and enters normal operation, where it starts
producing application outputs, listening for heartbeats from its standbys
and producing heartbeats, state and group information (i.e., the normal life
cycle outlined in Section 4.2).
Each standby on startup periodically produces a Groupjoin message.
On receiving the Groupcreate or Groupcreated message from the primary, it
transitions out of the startup phase and starts its normal operation,
producing heartbeat messages. If a standby does not receive a Groupcreate
message for a period of Ttimeoutsb units, then it declares primary failure and
transitions to the normal mode of operation as the new primary. Figure 7.1
represents the final stable state of the system after the groups are created
and also shows the messages exchanged within the group during normal
operation.
If a primary fails and is later restarted, it will broadcast a Groupcreate
message along with its heartbeat. This time, the standby will have taken
over as the new primary and it will respond to the Groupcreate message
with a Groupcreated message indicating to the re-launched primary that it
should run as a standby. Figure 7.3 represents the final stable state of the
system after such a dynamic reconfiguration.
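The startup behavior described above amounts to a small per-task state machine. The sketch below captures the role-resolution step using the message names from the text; everything else is an illustrative assumption.

// Illustrative role resolution for the group-formation protocol. A
// designated primary broadcasts Groupcreate and waits Ttimeout_pri for a
// Groupcreated response; a standby waits Ttimeout_sb for a Groupcreate
// and otherwise promotes itself to primary.
enum class Role { Primary, Standby };
enum class Msg { Groupcreate, Groupcreated, Groupjoin, None };

Role resolve_startup(Role configured, Msg received, bool timeout_expired) {
    if (configured == Role::Primary) {
        // A Groupcreated response means a primary already exists, so the
        // re-launched primary rejoins the group as a standby.
        if (received == Msg::Groupcreated) return Role::Standby;
        if (timeout_expired) return Role::Primary;  // enter normal operation
    } else if (timeout_expired && received == Msg::None) {
        return Role::Primary;  // no Groupcreate seen: declare primary failure
    }
    return configured;  // otherwise keep the configured role
}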

7.1.5 Standby Reconfiguration

Problems like power or hardware failures cause permanent task failures


which are compatible with the assumption of the fail-stop model. There are
some other frequent causes of failures like memory faults or unexpected
inputs. In case of such a failure, the standby will take over operation from
the primary but the failed primary can be re-launched. On restart, the
primary rejoins the group as a standby. Figure 7.3 represents the final stable
state of the system after such a dynamic reconfiguration. In case the backup
runs a degraded copy, full operation can be recovered by using the group
membership framework to switch the primary role to the recently re-launched
task or, application permitting, by switching the operation mode on the backup
from degraded to fully operational when the old primary recovers.

7.1.6 Comparison To Existing Fault-Tolerant Distributed Architectures

In this section, we compare our architecture to that presented in [38], which


attempts to achieve similar goals. The architecture in this section makes
use of task-level groups where the primary and its backups form a group
among themselves and in case of failures the group members trigger re-
covery operations. In [38], Kim et al. make use of node-level groups
where a daemon running on every node keeps track of the health of all
the tasks running on its node. The master daemon in the system similarly
keeps track of the status of every other daemon and on task failure noti-
fies appropriate slave daemons to trigger recovery operations. Since only
the daemons communicate with each other, the communication overhead
for fault-tolerance protocol messages is lower compared to our approach.
However, in our approach, this performance overhead can be mitigated by
techniques like piggybacking fault-tolerance protocol messages with out-
put messages where possible. Another trade-off exists in the distribution
of CPU resources for the fault-tolerance protocols: our architecture in this
section distributes the cycles across all tasks, while the architecture in [38]
concentrates this work in the daemons. This makes the daemons a critical
point of failure, and the failure of a daemon could result in an entire node
being declared failed. Hence, we trade communication and CPU resources
for improved fault-tolerance capabilities.

7.1.7 The Worst-Case Failure & Recovery Times

An AUTOSAR runnable execution involves three separate phases: accept-
ing inputs, performing calculations, and producing outputs. We imple-
mented the task-level fault-tolerance features as a library that an AU-
TOSAR runnable corresponding to each software component can use, keep-
ing in mind the execution phases of the AUTOSAR runnable. This library
allows receiving, processing and producing all the fault-tolerance related
messages (FTmsg ) including heartbeats, states and group information as
shown in Figure 7.4. As described in Section 4, each task is defined by the
parameters (Ci , Ti , Di ). Let δ represent the worst-case delivery latency of
a message in the system [105, 106]. The phasing of execution of the pri-
mary relative to its standbys dictates the delay for a standby to detect a
heartbeat message from the primary and, hence, we look at different rel-
ative phasing between the primary and its standbys and their impact on
system performance. A fixed relative phasing between the primary and its
standbys can be readily achieved if (a) both run at the highest priority on
their respective nodes, (b) the nodes use synchronized clocks, and (c) the
tasks are scheduled at absolute times, or a distributed startup protocol is
used. Under normal operation, the worst-case phasing of a primary and a
standby occurs when the primary produces a heartbeat which reaches the
standby after the latter has just checked for the presence of the heartbeat as
shown in Figure 7.4. Assuming that the duration to check for heartbeat is
negligible, the standby would need up to Tpri + C + δ units to detect a fail-
ure and Tpri + 2C + δ to completely recover from a primary failure. Also,
as the figure shows, a standby will normally receive a heartbeat produced
by its primary at least once every cycle. Since we assume that the underly-
ing communication medium is fault tolerant, a single missed heartbeat is
sufficient to indicate the failure of a primary under normal operation. For
the case of variable relative phasing under normal operation, the largest
interval between two primary heartbeats is 2Tpri . This is usually caused
by preemptions from higher priority tasks. As shown in Figure 7.5, there
can be at most two unsuccessful heartbeat receptions during this inter-
val, which means that we need at least three consecutive heartbeats to be
missed before a standby can safely declare a primary to have failed. Un-
der variable relative phasing, the worst-case response also occurs when
the primary produces a heartbeat right after the standby has checked for
the presence of the heartbeat. The worst-case response time in this case is
2Tpri + C + δ as shown in Figure 7.6. Extending the result obtained in this
section for ‘n’ missed heartbeats, we need (n + 1) Tpri + C + δ units of time
for recovery. Hence, for three missed heartbeats, the worst-case recovery
time is 4Tpri + C + δ.
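These bounds are straightforward to evaluate; the helpers below simply encode the expressions derived above (the function names are ours).

// Worst-case recovery bounds from Section 7.1.7, all in the same time
// unit. T_pri is the primary's period, C its execution time, and delta
// the worst-case message-delivery latency.

// Fixed relative phasing: Tpri + 2C + delta to fully recover.
double recovery_fixed_phasing(double T_pri, double C, double delta) {
    return T_pri + 2.0 * C + delta;
}

// Variable relative phasing with n consecutive missed heartbeats
// required: (n + 1) * Tpri + C + delta; for n = 3 this is 4Tpri + C + delta.
double recovery_variable_phasing(int n, double T_pri, double C, double delta) {
    return (n + 1) * T_pri + C + delta;
}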
Table 7.1: Task-Level Fault-Tolerance Library Overhead

Overhead Factor              Time (µs)
Library API Overhead         15.51
Message Passing Overhead     374.14
Cfthot                       399.65

Figure 7.7: Improving the worst-case execution bounds

7.1.8 Improving the Worst-Case Recovery Times

The worst-case performance in the case of variable phasing can be im-


proved by controlling the release times of the primary and its backups. A
possible solution is to first calculate the worst-case response time of the pri-
mary task and to release the first instance of the standby only after the worst-
case completion time of the primary, which is given by Tpri + δ, where δ
represents the worst-case communication delay bound. This is illustrated
in Figure 7.7. A single missed heartbeat is sufficient to detect a primary
failure in this configuration. The figure also highlights the worst-case fault
recovery time of 2Tpri + δ. We plan to study this family of solutions in more
detail as part of our future work.
7.1.9 Experimental Evaluation

Now that we have studied the overheads and the worst-case behavior of
our implementation, in this section we present an experimental evaluation
to demonstrate the feasibility and performance of our solution.

Experimental Setup and Test Bench

We build our test applications and the fault-tolerant library using AU-
TOSAR Version 4.2.1-compliant software architecture from ARCCORE [107].
Every task is configured as an AUTOSAR runnable. Each task invokes
methods from the fault-tolerance library to produce heartbeats, form groups
and monitor the health of its group members. The assignment of the type
of standby is made statically offline. We use three STM3210C boards con-
nected to each other over a 100 Mbps Ethernet connection [54] through a Fast
Ethernet switch. We ensure that the primary and the standbys are assigned
to different ECUs. We create different phase offsets between the tasks by
powering on the boards at different offset times.

Testing and Measurement

We simulate different workloads (Cpri ) by busy-looping within each runnable.


We capture all messages transmitted over the network and analyze the
timing behavior of the system. We measure the phase offset between the
primary and any of its standbys by measuring the difference in the heart-
beat times for the first few cycles of execution, where each task is in the
startup phase and is designed to only produce group formation messages
and perform no computation. We programmatically inject failures into the
primary task just before it produces its output after which the primary
sends out one network packet indicating its failure and then it stops pro-
Figure 7.8: Recovery Time - Fixed Execution Offset (recovery time vs. execution phase
offset for task periods of 1 s, 300 ms, and 100 ms)

Figure 7.9: Recovery Time - Variable Execution Offset (percentage of experiments vs.
recovery time for two multi-task, multi-period configurations)

ducing heartbeats or performing any computation. We use this packet to


capture the time of primary task failure. In the experiments to follow, we
measure the recovery time as the time from the primary failure to the time
the backup takes over as the primary and produces its first output. We
have also performed functional verification of our implementation by dis-
connecting power or Ethernet cables from our boards. Once the primary
fails, the standby takes over and starts producing output messages. We
calculate the difference between the time of the first output message and
the time of primary failure to report recovery time.

Experimental Results

Fixed relative execution offsets: In this section, we evaluate the perfor-


mance of our system when the primary and the backup run with fixed
relative execution offsets. We measure the recovery time by varying the
time period of the tasks and the number of tasks using our fault-tolerance
library. In our first experiment (Figure 7.8a), we run a single task on each
node such that it runs as the highest priority task on its node. Figure 7.8a
shows that recovery time is almost linearly proportional to the execution
phase offset. In our second experiment (Figure 7.8b), we run three identical
tasks on each node but only have the highest-priority task using our fault-
tolerance primitives. Figure 7.8b shows that the recovery time is almost
linearly proportional to the execution phase offset even when we shrink
the task period by a factor of ten. In our third experiment (Figure 7.8c), we
run two identical tasks on each node, but have both use our fault-tolerance
primitives. Every task in the third experiment has an identical priority on
each node and a common period. Figure 7.8c shows that it is possible for
multiple tasks to have fixed relative phasing. The experiments above show
that fixed relative execution offsets result in fixed recovery times. Also, it
is important to note that the recovery times can be shorter than the period
of the tasks. Hence, if tasks have a fixed execution offset shorter than the
slack available before the primary's deadline, it is possible for the backup to not
miss any deadlines even during failovers.
Variable relative execution offsets: We next evaluate the performance
of our system when the primary and the backup run with variable rela-
tive execution offsets. In our experiments (Figure 7.9), we vary the startup
offsets of the primary and the backups. We have a primary and a standby
assigned to two different nodes, running at different priorities alongside other
tasks that have different periods. We declare primary failure on three con-
secutive missed heartbeats as described in Section 7.1.7. In Figure 7.9, we
plot the histogram of the recovery times as the percentage of experiments
that resulted in a particular range of recovery times. As can be seen from
the figure, the recovery time can have a range of values based on the start-
ing offsets and the instant of failure. Also, since we need three consecutive
heartbeats to detect primary failures, the recovery times are greater than
the task periods, resulting in multiple missed deadlines. Hence, uncon-
trolled execution offsets should only be used for tasks with soft real-time re-
quirements. Section 7.1.8 describes a technique where variable relative
execution offsets can be controlled to reduce the bounds on the recovery
time.

Overhead measurement

We measure the computational overhead incurred by a task while using


the fault-tolerance primitives of our library. In this experiment, we have
a primary task (C=20ms, T=100ms) that has been configured with one hot
and one cold standby. Table 7.1 shows the maximum observed computa-
tional overheads incurred by the primary while using the fault-tolerance
primitives of our library. We measure both the overheads due to the execution
of library methods and those due to sending and receiving fault-tolerance-related
messages. As the table shows, the total computational overhead of our library
is quite low compared to the computational resource demands of the tasks
themselves.

Discussion

As the above experiments highlight, the type of execution offset plays an


important role in determining the recovery time bounds for a given task.
Careful allocation of tasks and assignment of execution offsets can enable
tasksets with a combination of hard and soft real-time requirements to
meet their recovery-time requirements. We will consider this more general
situation as part of our future work.
7.2 Fault-Tolerant Software Architecture for the AUTOSAR Adaptive Platform

The AUTomotive Open System ARchitecture (AUTOSAR) Classic Platform


(CP) [22] standard, which is a widely-accepted standardized software ar-
chitecture for automotive electronic control units (ECUs), only addresses
the needs of deeply-embedded low-complexity devices. Hence, the com-
plex needs of modern automotive systems described above cannot be ful-
filled by the Classic Platform. Therefore, AUTOSAR specifies a second
software platform, the AUTOSAR Adaptive Platform [101], to meet these
industry requirements.
The AUTOSAR Adaptive Platform provides mainly high-performance
computing and communication mechanisms. It also offers flexible soft-
ware configuration, e.g., to support software updates over the air. Fea-
tures specifically defined for the Classic Platform, such as access to elec-
trical signals and automotive-specific bus systems, can be integrated into
the Adaptive Platform as well [108]. Compared to the AUTOSAR Classic
platform, the Adaptive platform supports Service-Oriented Architecture
(SOA), which allows dynamic linking of services and clients during run-
time, making it flexible for the application developers. The Adaptive Plat-
form supports manycore processors and heterogeneous computing plat-
forms that offer parallel processing, as well as fast and high-bandwidth
communication technologies such as Ethernet. The Adaptive Platform also
supports several safety and security features like priority-based schedul-
ing, execution of authenticated code, and controlled allocation of memory
and CPU resources.
In order to support scalability across various devices, the AUTOSAR
Adaptive platform leverages the Scalable service-Oriented MiddlewarE over
IP (SOME/IP) [109]. SOME/IP is an automotive middleware solution that
can be used for sending and receiving control messages. It was designed
to support devices of different sizes and different operating systems. This
includes small devices like cameras, AUTOSAR devices, telematics de-
vices and infotainment devices. While typical middleware solutions often
only support single features (e.g., RPC or Publish/Subscribe), SOME/IP
supports a wide range of features including serialization, remote proce-
dure call (RPC), service discovery (SD), publish/subscribe (Pub/Sub), and
segmentation of UDP messages. Unlike other middleware solutions like
DDS [110], SOME/IP does not support the notion of reliability through
abstractions like Quality of Service. Data delivery reliability and data time-
liness are important to guarantee system safety requirements.
Typically, in order to guarantee safe execution of software, automobiles
employ redundancy for crucial software tasks to tolerate permanent crash
faults. Redundancy can be of many types, but can be broadly categorized
into concurrent and non-concurrent redundancy. Concurrent redundancy
implies that both the primary application and the redundant version are
functionally operational at the same time, whereas non-concurrent redun-
dancy implies that the redundant component monitors the state of the pri-
mary application and begins fully functional execution only on the failure
of the primary.
The Adaptive AUTOSAR standard does not specify any fault-tolerance
requirements. In this section, we highlight some gaps in the current AU-
TOSAR Adaptive Platform standard (version 18.10) and provide sugges-
tions to address them. We describe our software fault-tolerance framework
designed to support these replication strategies for the AUTOSAR Adap-
tive Platform. We analyze the fault detection and recovery time bounds of
Figure 7.10: Adaptive AUTOSAR Platform Architecture

our framework for applications using SOME/IP for communication. The


rest of this section is organized as follows. In Section 7.2.1, we describe our
framework to support fault-tolerant execution in the AUTOSAR Adaptive
Platform and analyze the fault detection and recovery time performance of
different replication strategies for applications using SOME/IP. In Section
7.2.2, we describe our experimental setup and present the results of our
evaluation.
Figure 7.10 presents the architecture of the AUTOSAR Adaptive Plat-
form. The Adaptive Applications (AA) run on top of ARA, AUTOSAR
Runtime for Adaptive applications. ARA consists of application interfaces
provided by functional Clusters, which belong to either the Adaptive Plat-
form Foundation (APF) or Adaptive Platform Services (APS). APF provides
the core functions of AP, and APS provides supporting services. Any AA
can also provide services to other AAs, illustrated as Non-platform service
in the figure. The Operating System is responsible for run-time resource
management (including time) for all Applications on the Adaptive Plat-
form. Execution Management (EM) is responsible for all aspects of sys-
tem execution management including platform initialization and startup
/ shutdown of applications. The Communication Management (CM) is

Figure 7.11: System Design

responsible for all aspects of communication between applications in a dis-


tributed real-time embedded environment [108]. In this section, we assume
that the CM leverages the SOME/IP stack to manage communication be-
tween applications.
We assume that all tasks in the system offer and subscribe to services
using the SOME/IP Service Discovery (SOME/IP-SD), which allows all
applications to dynamically set up communication paths at startup.

7.2.1 System Design for Fault-Tolerance Support

This section describes our main contribution: a system de-
sign for fault-tolerance support in the AUTOSAR Adaptive Platform. Fig-
ure 7.11 presents a high-level view of our design. We design and imple-
ment a fault-tolerance wrapper for both clients and servers to make them
agnostic of fault-tolerance considerations. Maintaining the fault-tolerance
support in the application layer allows for portability across different types
of platforms and applications. We consider support for the following SOA
elements / abstractions that in combination make up an offered service:

1. Events: Notifications sent from servers to clients.

2. Fields: Fields represent data elements. The fields abstraction supports three op-
erations: set (assigning values to data elements), get (retrieving values of data
elements), and notify (notifying clients of changes in the values of the data elements).
3. Methods: Remote procedure calls executed by the server on behalf of the client.

These SOA abstractions can be broadly divided into two categories based
on the unidirectional or bidirectional nature of messages exchanged be-
tween servers and clients.

1. Notification-Based Abstractions: Events and Field-Notify abstractions only result


in unidirectional messages from servers to clients.

2. Request-Reply-Based Abstractions: Field-Set, Field-Get and Method abstractions


result in bidirectional messages between servers and clients.

As part of our design, we consider the following replication require-
ments in order to meet the safety guarantees of the system.

1. Servers can fail: Multiple tasks must provide the same service. These services can
be identical copies or one can be a degraded copy of the other. Clients must be able
to locate, subscribe and use all copies of the service, allowing for implicit failure
handling.

2. Clients can fail: Multiple tasks must act as clients. Servers must be able to handle
commands and requests from multiple identical clients.

The current Adaptive AUTOSAR standard (version 18.10) does not spec-
ify requirements for fault-tolerance support. Having fault-tolerance re-
quirements defined in the standard is especially important for platform-
functional clusters like Execution Management and Communication Man-
agement since failures of these components can be catastrophic. Mapping
these replication requirements to the SOA abstractions described above, we
have the following considerations:
1. Client-side handling: Events, Fields-Get and Fields-Notify abstractions only modify
data on the client side, hence the fault-tolerance support only applies to the client
side.

2. Server-side handling: The Fields-Set abstraction only modifies data on the server
side, hence the fault-tolerance support only applies to the server side.

3. Client and Server-side handling: The Method abstraction can modify data on both
the server side and the client side; hence, the fault-tolerance support applies to both
the server side and the client side.

Fault-Tolerant API Design

Given that different abstractions require different types of handling on


the client side and the server side, we develop two new abstractions: the
ReliableClient and the ReliableServer that clients and servers can use
respectively to support fault-tolerant operation. We implement these ab-
stractions as part of a FT-Library that all software applications can use.
The FT-Library supports the following features.

1. Data Timestamping: Every request and reply from a server or client is time-
stamped with the latest global time value.

2. Replication Information: Every request and reply from a server or client is ap-
pended with data essential for fault-tolerance support, such as the replica type,
status, and task and node IDs.

3. Unique Identifier Interface: FT-Library supports a sequence number interface


which can be overridden by client and server applications to provide unique se-
quence numbers for requests and replies. In the automotive context, since most
applications deal with data from sensors, the sensor packet sequence number cou-
Figure 7.12: Reliable Client Abstraction

Figure 7.13: Reliable Server Abstraction

pled with replication information can be used to create a unique identifier for each
request or reply.
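As an illustration, the replication metadata carried by the FT-Library could be organized as in the sketch below; the field layout is our assumption for exposition, not a SOME/IP- or AUTOSAR-defined format.

#include <cstdint>

// Illustrative metadata appended by the FT-Library to every request and
// reply (the exact layout is an assumption for exposition).
enum class ReplicaType : uint8_t { Primary, Hot, Cold };

struct FtHeader {
    uint64_t    timestamp;  // latest global time value at send
    uint64_t    sequence;   // unique identifier, e.g., sensor packet number
    uint16_t    node_id;    // identifies the sending node
    uint16_t    task_id;    // identifies the sending task
    ReplicaType replica;    // primary, hot standby, or cold standby
    uint8_t     status;     // health / operational status
};

// The (sequence, node_id, task_id) tuple, together with the timestamp,
// lets a receiver detect duplicates produced by replicated senders.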

We now present the Reliable Client and Reliable Server abstractions and
describe them in detail.

Reliable Client

The ReliableClient abstraction exposes the following three APIs to the


clients.

1. Service Discovery Interface: This interface allows a client to initiate a service


discovery call to connect to a particular serviceID. The ReliableClient establishes
connections to all services with the specified serviceID and maintains their status.
2. Fusion Interface: This interface allows a client to supply a custom fusion strategy
to apply if duplicate servers for a requested serviceID exist. The ReliableClient
uses the unique identifier and time stamp primitives to identify duplicate inputs if
the fusion strategy requires it.

3. Request Interface: This interface allows a client to make a single request call which
is relayed to all servers maintained by the ReliableClient.

Figure 7.12 presents an example of a client application using the ReliableClient


API. The client application initiates the service discovery through the Ser-
vice Discovery Interface. The ReliableClient connects to all available ser-
vices and maintains their status. The client application then sets a fusion
strategy and makes a request through the request interface. On replies
from the servers, the ReliableClient applies the fusion strategy and
returns the result.
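One possible shape for this abstraction is sketched below; the signatures are illustrative assumptions (the standard does not define such an API), with the fusion strategy supplied as a callable.

#include <cstdint>
#include <functional>
#include <vector>

// Illustrative ReliableClient interface: one discovery call connects to
// all servers offering 'service_id'; one request fans out to all of them,
// and the user-supplied fusion strategy reduces the replies to one result.
using Bytes = std::vector<uint8_t>;
using FusionStrategy =
    std::function<Bytes(const std::vector<Bytes>& replies)>;

class ReliableClient {
public:
    // Service Discovery Interface: connect to every instance of the service.
    void discover(uint16_t service_id);

    // Fusion Interface: strategy applied when duplicate servers reply.
    void set_fusion_strategy(FusionStrategy strategy);

    // Request Interface: relay one request to all connected servers and
    // return the fused reply.
    Bytes request(const Bytes& payload);
};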

Reliable Server

The ReliableServer abstraction exposes the following API to the server


applications.

1. Filter Interface: This interface allows a server to supply a custom filter strategy
to apply if duplicate clients request an identical service. The ReliableServer uses
the unique identifier and time stamp primitives to identify duplicate inputs if the
filter strategy requires it.

Figure 7.13 presents an example of a server application using the ReliableServer
API. The server accepts a filter strategy and, on reception of requests from
clients, applies the filter strategy if necessary and returns the output
to all identical clients.
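Correspondingly, the server-side abstraction reduces to a filter hook; again, this is an illustrative sketch rather than a standardized interface.

#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Illustrative ReliableServer interface: the user-supplied filter decides,
// per incoming request, whether it is a duplicate from a replicated client
// (using the FT-Library identifiers) and should be dropped.
using RequestBytes = std::vector<uint8_t>;
using FilterStrategy =
    std::function<bool(const RequestBytes& request)>;  // true = process it

class ReliableServer {
public:
    // Filter Interface: strategy applied to requests from identical clients.
    void set_filter_strategy(FilterStrategy strategy);

    // Process a request if the filter admits it; the reply is returned to
    // all identical clients. An empty result means the request was filtered.
    std::optional<RequestBytes> handle(const RequestBytes& request);
};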
As mentioned earlier, replicas can be either concurrent or non-concurrent.
Both types of replication strategies can leverage the ReliableClient and
ReliableServer abstractions. With concurrent replicas, which are always
operational, the APIs can be used unchanged. For non-concurrent replicas,
which become operational only on the failure of the primary, it is neces-
sary to detect primary failures. For servers, we do so leveraging SOME-
IP/SD by having the replicas subscribe to the primary service. For pure
clients that do not offer a service, the ReliableClient offers a service just
to exchange health messages with the replicas. When a primary fails, the
replica is able to detect the failure of the service and can start
providing the service instead. It is important to note that, even though non-
concurrent replicas are non-operational, they do participate in the service
discovery; they just do not implement the service until primary failure.

Timing Analysis for Failure Detection and Recovery for Service failures

Safety-critical applications running on automobiles must recover from fail-


ures in a timely fashion to meet required safety guarantees. In this section,
we analyze the fault-tolerance behavior of different replica types and iden-
tify the control knobs to achieve required safety levels in the presence of
faults.
For a given task τ, let Trecthres represent the maximum time a service
offered can afford to remain unavailable before safety guarantees are vio-
lated. For request-reply-based abstractions, this is the maximum time that
can be afforded between when a request is made to when it is executed.
For notification-based abstractions this is the maximum time a server is
unavailable to produce a notification. Let Trecmax represent the maximum
value of the recovery time for a task. Hence if Trecmax < Trecthres ∀τ, we
assume that the system meets its safety requirements in the presence of
faults.
We first present an overview of the SOME-IP/SD phases to describe
some of the important system control parameters. In SOME-IP/SD, a
service sends out an offer message to offer services within the network,
whereas the client sends out the find message to locate services within a
network. This happens in three phases: the Initial Wait Phase, the Repetition
Phase and the Main Phase.

1. Initial Wait Phase (IWP): In this phase, the client remains silent for a random time
between control parameters CliDel IWPmin and CliDel IWPmax . The server also remains
silent for a random time between control parameters SerDel IWPmin and
SerDel IWPmax . If a client receives an offer of a service in this phase, it directly jumps
to the Main Phase.

2. Repetition Phase (RP): In this phase, the client starts sending out find messages.
The time between every two find messages increases successively up to a maxi-
mum number of find messages given by CliFindMsgMax. The delay between two
find messages can be controlled by setting a control parameter CliRepDel. A client
transitions to the Main Phase on reception of an offer message. A service behaves
almost exactly like the client with the exception of the reaction to a received find
message. In this case, the server waits a random amount of time between control
parameters SerDel RPmin and SerDel RPmax and sends out
a unicast offer message. A service transitions into the Main Phase after it has sent
the number of offer messages set by the control parameter SerFindMsgMax at a rate
given by SerRepDel RP .

3. Main Phase (MP): In this phase, the client remains silent and the service continues
to send out offer messages to indicate its availability at a period set by the control
parameter SerRepDel MP .
The above SOME-IP/SD phases can be categorized into two distinct
stages for the purpose of analyzing the fault recovery behavior as follows.

No connection has been established between the client and server

If the service fails before a connection has been established, the service
must be restarted. In the Adaptive Platform, this can be done using the
Platform Health Management and Execution Management Functional clus-
ters. The Platform Health management cluster sets a HW watchdog at
process launch. If the process launch fails and the HW watchdog expires,
the Execution Management is triggered to restart the process. Let Twd be
the expiration duration of the watchdog timer and let TEM represent the
time taken by the Execution Management Cluster to restart a process. Let
Tconnect represent the start-up delay associated with establishing a connec-
tion between a server and a client, the formal derivation of the startup
latency bound is derived in [111]. This latency bound can be controlled
using the control parameters described above. Hence, the maximum time
required for recovery is,

Trecmax = Tconnect + Twd + TEM (7.1)

It is possible that the task fails to start on multiple retries. In this case,
the Platform Health Management (PHM) Functional Cluster should be able
to trigger appropriate actions with the help of Execution Management if
Trecovery > Trecmax . An important addition to the PHM definition would be
to allow for different actions based on the number of times the watchdog
timer expires.
Figure 7.14: Recovery example for non-concurrent replicas providing notification-based
services

Figure 7.15: Recovery example for non-concurrent replicas providing request-reply-
based services

After a successful connection between the client and server is formed

For concurrent replicas, if at least one replica has a successful connection,


the failure handling is implicit. For non-concurrent replicas, if a service
fails after a connection is established, there are two possible outcomes based
on whether the failure is handled or unhandled. If a process can handle the
failures, for example by catching a SegFault signal, it can trigger appropri-
ate backup mechanisms. If a process cannot handle a failure, for example,
in the case of power failures, no backup actions can be triggered.
Figure 7.14a presents an example of recovery for notification-based ab-
stractions with unhandled failures for non-concurrent replicas. The worst-
case recovery occurs when the service fails just after it produces an offer
message in its Main Phase indicating its availability. Each service offer mes-
sage in SOME-IP/SD is configured with a Time to Live (TTL) flag. On
expiry of this TTL flag, the replica can deem the primary server failed and
take over as the new server. If Tcomm is the time to send a packet from
server to the replica, the recovery time is,

Trecmax = TTTL + Tcomm (7.2)

In the case of handled failures, as presented in Figure 7.14b, the server


can send out a stop offer message indicating to the replica that the server is
about to fail so it can take over execution. Let Tsig represent the amount of
time required to handle the exception signal.

Trecmax = Tcomm + Tsig (7.3)

Figure 7.15a presents an example of recovery for request-reply-based
abstractions with unhandled failures for non-concurrent replicas. Each
ReliableClient starts a timer set to Ttout associated with each request.
This timer value is selected such that it allows for the primary to process
the request and reply to the client. If the timer expires, the replica can deem
the server failed and take over as the new server. Hence,

Trecmax = Ttout                                                             (7.4)

In the case of handled failures, as presented in Figure 7.15b, the server
can send out a stop offer message indicating to the replica that the server is
about to fail so it can take over execution; hence,

Trecmax = Tcomm + Tsig                                                      (7.5)
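The four bounds can be summarized in code; the sketch below simply encodes Equations (7.2) through (7.5), with parameter names following the text. With the control parameters of Table 7.2, for example, the unhandled request-reply bound evaluates to Ttout = 5 ms.

// Worst-case recovery bounds for non-concurrent replicas, per Equations
// (7.2)-(7.5). All parameters are expressed in the same time unit.

// Notification-based service, unhandled failure: offer-TTL expiry.
double rec_notify_unhandled(double T_ttl, double T_comm) {
    return T_ttl + T_comm;               // Equation (7.2)
}

// Notification-based service, handled failure: stop-offer plus signal handling.
double rec_notify_handled(double T_comm, double T_sig) {
    return T_comm + T_sig;               // Equation (7.3)
}

// Request-reply service, unhandled failure: per-request timer expiry.
double rec_request_unhandled(double T_tout) {
    return T_tout;                       // Equation (7.4)
}

// Request-reply service, handled failure: same form as the notification case.
double rec_request_handled(double T_comm, double T_sig) {
    return T_comm + T_sig;               // Equation (7.5)
}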


Figure 7.16: Experimental Setup (a Renesas R-Car-H3 board and a QEMU instance
connected through an Ethernet switch)

Table 7.2: Control Parameters

Server Parameter      Value       Client Parameter      Value
SerDel IWPmin         0 ms        CliDel IWPmin         0 ms
SerDel IWPmax         2 ms        CliDel IWPmax         2 ms
SerDel RPmin          0.01 ms     CliRepDel             0.05 ms
SerDel RPmax          0.02 ms     CliFindMsgMax         5
SerFindMsgMax         5
SerRepDel RP          0.05 ms
SerRepDel MP          3 ms

System Parameter      Value
Ttout                 5 ms
Tttl                  3 ms

7.2.2 Experimental Evaluation

In this section, we describe our experimental setup and present our eval-
uation results. As Figure 7.16 depicts, we have a two-node system, where
each node is running the AUTOSAR Adaptive Platform implementation
from Electrobit, compliant in most parts with the 18.03 Adaptive AU-
TOSAR [101] Standard. One of the nodes is a Renesas R-Car-H3 devel-
opment kit [112] and the other is a virtual instance running on a QEMU
emulator on an x86 laptop. The two nodes communicate over a 100 Mbps
Ethernet switch. We create a server-client example for testing. Both the
server and client use the ReliableServer and ReliableClient APIs. The
service is replicated by a non-co-located replica. The control parameters
selected for the experiments are presented in Table 7.2.

Figure 7.17: Notification-Based Services with handled exceptions

Figure 7.18: Notification-Based Services with unhandled exceptions

7.2.3 Notification-Based Services

In our first experiment, the offered service produces periodic events and
supports mechanisms to handle exceptions. We inject artificial faults that
present as segmentation faults. Figure 7.17 presents the results. Since the
faults are handled, the primary is able to trigger the replica to take over
before it fails.
In our next experiment, the offered service produces periodic events,
but does not support mechanisms to handle exceptions. We inject artificial
faults by disconnecting the RCar-H3 board from its power source. Figure
7.18 presents the results. Since the faults are not handled by the primary, it
is unable to trigger the replica. The replica has to incur an additional delay
of Tttl before it can detect the failure of the service. Hence, the recovery
takes longer compared to handled failures.
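On the replica side, the unhandled-failure case reduces to a watchdog on
the service-offer TTL: every offer message from the primary re-arms a timer
of length TTTL, and if the timer fires with no fresh offer, the replica deems
the primary failed and takes over. A minimal sketch, with a hypothetical
take_over callback:

    import threading

    class OfferWatchdog:
        # Deem the primary failed if no service offer arrives within the TTL.
        def __init__(self, ttl_seconds, take_over):
            self._ttl = ttl_seconds
            self._take_over = take_over  # callback that promotes this replica
            self._timer = None

        def on_offer_received(self):
            # Called whenever a SOME/IP-SD offer from the primary arrives.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._ttl, self._take_over)
            self._timer.daemon = True
            self._timer.start()

    # Usage: watchdog = OfferWatchdog(0.003, take_over=promote_to_primary),
    # with on_offer_received() invoked from the SD message handler.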
7.2.4 Request-Reply-Based Services

In our next experiment, the offered service supports remote procedure calls
and mechanisms to handle exceptions. The client task periodically sends
out RPC requests. We inject artificial faults into the service
that present as segmentation faults. Since the faults are handled, the pri-
mary is able to trigger the replica to take over before it fails. We observe
that, for most practical cases, the recovery is implicit since the replica takes
over from the primary before a new request from the client is made.
In our next experiment, the offered service supports remote procedure
calls, but does not support mechanisms to handle exceptions. We inject
artificial faults by disconnecting the RCar-H3 board from its power source.
Since the faults are not handled by the primary, it is unable to trigger the
replica. The replica waits until the timer Ttout expires before it can declare
the failure of the service and reply to the RPC request.
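The replica-side logic for request-reply services can be sketched in the same
style: every observed request arms a timer of length Ttout, a timely reply from
the primary disarms it, and an expired timer triggers the replica to answer the
request itself. The handle_request callback, and the assumption that the replica
can observe both requests and replies as in the non-concurrent replica setup
above, are illustrative.

    import threading

    class ReplicaRpcGuard:
        # If the primary does not answer a request within Ttout,
        # the replica answers it instead.
        def __init__(self, t_tout, handle_request):
            self._t_tout = t_tout
            self._handle = handle_request  # the replicated service logic
            self._pending = {}             # request id -> takeover timer

        def on_request(self, req_id, payload):
            # Mirror of a client request; arm a takeover timer.
            timer = threading.Timer(self._t_tout, self._take_over,
                                    args=(req_id, payload))
            timer.daemon = True
            self._pending[req_id] = timer
            timer.start()

        def on_primary_reply(self, req_id):
            # The primary answered in time; disarm the timer.
            timer = self._pending.pop(req_id, None)
            if timer is not None:
                timer.cancel()

        def _take_over(self, req_id, payload):
            self._pending.pop(req_id, None)
            self._handle(req_id, payload)  # reply for the failed primary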

7.3 SAFFIRE: Software Architecture For Fault-tolerant

Imbed Real-time Environments

In this section, we present SAFFIRE (Software Architecture For Fault-tolerant


Imbed Real-time Environments), a framework to support fault-tolerant
execution on the Linux-based CMU Autonomous driving platform. We
also present the dynamic resource reallocation capabilities of the SAF-
FIRE framework to support different application and fault-tolerance re-
quirements in different ODDs and DALs.

Figure 7.19: CMU autonomous driving research platform [1]

7.3.1 Overview of the CMU Autonomous Driving Platform

Infrastructure

The CMU autonomous driving research platform (Figure 7.19) is capable


of a wide range of autonomous driving applications, including smooth and
comfortable trajectory generation and following; lane keeping, lane chang-
ing and lane merging; intersection handling with or without V2I and V2V;
and pedestrian, bicyclist, and workzone detection [1]. In order to achieve
these autonomous driving behaviors, the software infrastructure for the
platform was designed to support the perception, behavior, planning, and
other artificial intelligence components by providing an operator interface,
inter-process communications system, process launching and management
system, data logging system, configuration system, and task framework
libraries. The underlying inter-process communication framework on the
platform is based on a publish/subscribe model [102]. The platform also
supports some hardware fault-tolerance features. For example, the plat-
form allows the migration of algorithms running on the vehicle from a
given compute unit to other units in the case of failure, by supporting identical
hardware components. Network failure is also considered: to increase
reliability, the platform supports two Gigabit Ethernet switches so that the
compute units have two isolated networks. If one switch fails, the other
switch should still be able to operate so that the platform can be driven
to a safe place and stopped for maintenance.

Figure 7.20: Standby Types Comparison (Primary, Active Replica, Effective
Clone, Hot Standby, Warm Standby, Cold Standby and Frigid Standby, compared
on whether each is scheduled, processes fault-tolerance-related inputs,
calculates state, performs calculations, produces output, and maintains
replica consistency)

7.3.2 Effective Clones

In Chapter 5 we presented various replication strategies, including
active replicas and hot and cold standbys. Figure 7.20 summarizes
these replication strategies and introduces a few more. Replication strategies
can be differentiated based on the following criteria: (a) whether they are
scheduled, (b) whether they process fault-tolerance-related inputs, (c) whether
they calculate state, (d) whether they perform calculations, (e) whether they
produce application outputs in normal operation, and (f) whether the replicas
are consistent.
As Figure 7.20 shows, a primary is scheduled; it processes fault-tolerance-related
inputs if the replication strategy is active, and in the case of a passive
strategy it does not process any such inputs. It calculates state, performs
calculations and produces outputs. The primary is consistent in the case of
active replication and non-consistent in the case of a passive strategy. An
active replica is identical to the primary, whereas a hot standby, for example,
does not produce outputs, and a cold standby neither performs calculations
nor produces any outputs; instead, it copies state regularly from the primary.

Type of Replication      Computational Load    Communication Load    Recovery Latency
Active Replication       High                  High                  Low
Effective Clones         Medium                Medium                Low
Passive Replication      Low                   Low                   High

Figure 7.21: Standby Types Trade-offs

We now introduce a new standby strategy called effective clones. Effective
clones are similar to active replicas except that they do not co-ordinate
with the primary to maintain input consistency. They execute in parallel
with the primary, but since the inputs are not co-ordinated, they are not
consistent with the primary. In the case of autonomous driving applications,
the environment is sensed continuously, i.e., the lidars, radars and
cameras capture the state of the environment around the vehicle at a given
instant of time. Hence, even though the primary and the effective clone do
not ensure identical inputs, they work with the latest snapshot of the environment
available to them at the time of execution. Hence, we consider
the outputs of both the primary and the effective clone to be correct, and
we refer to their states as being delta-consistent, as they differ only in value
but are identical in their impact on the system. Given that effective clones
need not maintain input consistency, the computational load and the communication
overhead are reduced compared to active replicas; however,
since the effective clones also run in parallel with the primary, the recovery
latency continues to remain low. Figure 7.21 highlights the trade-offs between
effective clones and the other replication strategies.

Figure 7.22: Effective Clones Operation: Normal Conditions

Figure 7.23: Effective Clones Operation: Failure Conditions

Figures 7.22 and 7.23 show the operation of effective clones. Under
normal conditions (Figure 7.22), the effective clones run alongside the primary.
In this example, let us consider the BehaviorTask, which produces behavioral
decisions that are used by the planning subsystem to plan trajectories
for driving. It is interesting to note here that the frequency of the behavioral
outputs is effectively doubled, because the effective clone of the
BehaviorTask also produces outputs. However, this does not affect the
downstream module, in this case the planning subsystem, which simply
chooses the latest behavioral decision as its input. Figure 7.23 shows the
applications continuing to run successfully without the loss of any
functionality even when the primaries fail.
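Because both the primary and its effective clone publish, a downstream
subscriber only has to keep the freshest message, which is what makes the
doubled output rate harmless. A minimal sketch of such a latest-value
consumer (the timestamp/decision message fields are assumptions):

    class LatestDecisionConsumer:
        # Keeps only the freshest behavioral decision, regardless of whether
        # it came from the primary BehaviorTask or its effective clone.
        def __init__(self):
            self._latest = None  # (timestamp, decision)

        def on_message(self, timestamp, decision):
            # Older or duplicate messages are discarded, so the
            # delta-consistent outputs of primary and clone never conflict.
            if self._latest is None or timestamp > self._latest[0]:
                self._latest = (timestamp, decision)

        def current_decision(self):
            return None if self._latest is None else self._latest[1]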

Figure 7.24: SAFFIRE: Mode Change support

7.3.3 SAFFIRE: Support for Dynamic Resource Reallocation

As previously mentioned in Chapter 1, high automation levels may impose
more stringent fault-tolerance requirements in terms of the number of tasks
that need redundancies (standbys), as well as the number of failures that must
be tolerated for each task (i.e., the number of standbys for each task). Also,
the operational design domain (ODD) of the automated vehicle has a significant
impact on the fault-tolerance requirements. For example, in urban-driving
applications, certain tasks, like pedestrian detection, are perhaps more
safety-critical than in highway-driving applications. Hence, the type and
number of standbys required for every task also depend heavily on the ODD.
This motivates the need for adaptive, cost-optimized software fault-tolerance
solutions to reduce overall resource utilization. The SAFFIRE framework
leverages the Linux-RK [113] framework to support this adaptive reallocation
of resources. The SAFFIRE framework allows the system designer to specify
the frequency, CPU reservation and the fault-tolerance requirement of every
task for a given mode of operation. For example, as shown in Figure 7.24,
the Traffic Light Detector Task has lower priority in a highway environment
compared to the Motion Planner Task, given the significantly higher speeds.
In this case, the SAFFIRE framework allows the user to specify that the
Traffic Light Detector Task should run at a lower frequency, allowing the
Motion Planner Task to increase its frequency.
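A per-mode task table of the kind SAFFIRE consumes might look like the
sketch below; the task names, frequencies, reservation values and standby
counts are illustrative assumptions, not the actual SAFFIRE configuration
format.

    # Hypothetical per-ODD mode table: frequency (Hz), CPU reservation
    # (fraction of one core under Linux-RK), and standbys per task.
    MODE_TABLE = {
        "urban": {
            "traffic_light_detector": {"freq_hz": 20, "cpu": 0.30, "standbys": 2},
            "motion_planner":         {"freq_hz": 20, "cpu": 0.30, "standbys": 2},
        },
        "highway": {
            # At highway speeds the planner runs faster, while the
            # traffic-light detector is demoted to free up resources.
            "traffic_light_detector": {"freq_hz": 5,  "cpu": 0.10, "standbys": 1},
            "motion_planner":         {"freq_hz": 40, "cpu": 0.50, "standbys": 2},
        },
    }

    def apply_mode(mode):
        # Sketch of a mode change: re-issue reservations per the new table.
        for task, p in MODE_TABLE[mode].items():
            print(f"{task}: {p['freq_hz']} Hz, CPU {p['cpu']:.0%}, "
                  f"{p['standbys']} standby(s)")

    apply_mode("highway")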

7.4 Chapter Summary

Software architectures play a very important role in the operation and


maintenance of the different replica types to guarantee safe operation. In
this chapter, we showed that a high level of dependable operation can be
achieved in a real-time environment using software fault-tolerance schemes
like hot and cold standbys with relatively low resource utilization. We
also presented an experimental evaluation of our implementation on an
AUTOSAR Classic-compliant test bench to demonstrate the feasibility and
practicality of our approach. We conclude that, by using our approach,
Classic AUTOSAR-based automotive fail-operational systems can be realized
with reduced resource consumption compared to traditional fault-tolerance
approaches. In this chapter, we also presented a new framework
to support fault-tolerant execution using different replication strategies
on the AUTOSAR Adaptive Platform. We analyzed the fault detection
and recovery time bounds of our solution for applications using
SOME/IP. We highlighted some gaps in the current AUTOSAR Adaptive
Platform standard (version 18.10) and provided suggestions to address
them. We validated our model experimentally and presented our evaluation
results, which confirm the effectiveness and practicality of our fault-tolerance
framework. Finally, we presented the SAFFIRE framework, designed
to support fault-tolerant execution on the CMU Autonomous Driving
Platform.
Chapter 8

Evaluation and Testing of Self-Driving


Safety-Critical Automotive Systems

8.1 Introduction

With advances in sensing, machine learning and computing systems, com-


plex autonomous driving applications have become feasible. Moreover,
with the advent of communication technologies like Dedicated Short Range
Communications (DSRC), autonomous vehicles can also communicate with
each other, pedestrians and the infrastructure in the environment around
them. As the number of applications and levels of autonomy increase, the
complexity of CAVs continues to increase.
A typical CAV system consists of a large number of sensors that sense
its environment. The data from these sensors are fed into a computing
system that runs a variety of software components. These software com-
ponents process the data, perform calculations and control actuators to
achieve autonomous driving applications. The design and modeling of the
various software components constitute an important step in the develop-
ment of CAV systems. This step also includes defining the interactions

of these software components with each other and the hardware compo-
nents. The next step in the development process involves the deployment
of these components onto a real system. This deployment must ensure the
feasibility of correct operation of all components under myriad conditions.
CAVs are inherently safety-critical cyber-physical systems that interact
directly with the environment that they operate in. Testing requirements to
ensure correct system behavior add another significant layer of complex-
ity to the CAV development process. For any CAV, it is highly desirable
that the functional behavior of all hardware and software components is
thoroughly verified before on-road testing. In order to ensure safety, CAVs
also need to meet para-functional requirements like reliability, safety and
timeliness. The final level of testing involves on-road testing, often under
controlled environments first. During this testing phase, run-time monitoring
and data-collection utilities are very useful, as they allow system verification
and diagnosis in the case of a deviation from system objectives.
Given the large number of hardware and software components and the
complexity of testing requirements, a well-defined methodology supported
by a suite of tools is needed for validating correct and safe behavior. In this
chapter, we describe the tools and methodologies that we use to model,
design, develop and test applications for a CAV.
The organization of the rest of this chapter is as follows. In Section 8.2,
we discuss various elements in our design and development of CAVs. We
present a standard reference architecture for a CAV. We describe the var-
ious components within the reference architecture and their interactions.
We also present our development cycle and the process and tool flow that
we follow for developing CAV features. From Section 8.3 to Section 8.5, we
describe in detail the tools that we use as part of our development cycle. In
Section 8.4, we describe our run-time mechanisms for system monitoring
and diagnosis. Related work was presented in Section 2.5. We end with
concluding remarks in Section 8.6.

Figure 8.1: A Reference Architecture for CAVs

Figure 8.2: Sensor Installation [1]

8.2 Connected and Autonomous Vehicle (CAV) Design

and Development

8.2.1 A Reference Architecture for CAVs

Figure 8.1 presents a reference architecture for a CAV. As shown in the fig-
ure, a CAV consists of various sensors (such as cameras, radars and lidars),
processors, actuators (such as controllers for braking, throttling and steer-
ing) and software components (e.g., behaviors and planning) that interact
with each other. Next, we present a brief overview of each component in
the reference architecture.

1. Sensors: A CAV uses a number of sensors to sense its environment. For example,
a CAV may use lidars and radars to detect and track objects. It may also use sensors
like GPS, accelerometers, gyroscopes and wheel-speed sensors for localization.
Figure 8.2 highlights some of the sensors installed in our CAV.

2. V2X Communication Interfaces: A CAV can communicate and interact with V2X-enabled
vehicles, pedestrians and infrastructure. (V2X stands for Vehicle-to-X, where
X can represent infrastructure (V2I), other vehicles (V2V), pedestrians (V2P),
bicyclists, the cloud, or other possibilities.) These V2X interfaces may make use
of DSRC, which is a two-way short-to-medium-range wireless communications
technology [114]. DSRC messages can be used to alert CAVs about imminent
hazards like vehicles stopped ahead; potential collisions at intersections or during
merging; and sharp curves or slippery patches of roadway ahead [114]. DSRC
messages can also be used to communicate intersection and traffic-light information
to CAVs. CAVs can also co-ordinate and execute complex intersection protocols [115]
by exchanging DSRC messages.

3. Perception: A CAV continually needs to make decisions based on its surrounding
environment. The perception component is responsible for accepting and fusing
data from the various sensors on the CAV. It is also the responsibility of the perception
system to interpret the data, and to identify and track various objects with a very
high degree of reliability.

4. Road-World Model: The road-world model accepts the processed sensor informa-
tion from the perception system, and using predefined map information, creates a
composite model of the world around the CAV for use by other sub-systems [1].
The composite model can be divided into several discrete interfaces: static obstacle
maps, dynamic obstacle maps, visibility, health, current vehicle pose, and road
structure [102].

5. Route Planning: The route planning component is responsible for utilizing map
information to generate an optimized route for travel. This route can be optimized
for various metrics like distance from current CAV location to a desired location,
time and fuel economy.

6. Behaviors: The behaviors component is responsible for high-level decisions the CAV
needs to make to safely navigate its environment and reach its intended destina-
tion. Decisions like slowing down at a red light or a stop sign, safely yielding
to a pedestrian, changing lanes to maintain the global route plan are some of the
decisions the behaviors component is responsible for.

7. Short-term path planning: Unlike route planning that attempts to find an end-to-end
route for the CAV, the short-term path planner is responsible for determining the
immediate near-term path on the road that the CAV must take. For example, the
short-term path planner is responsible for determining a safe path to follow while
avoiding static and dynamic obstacles on the road, maintaining a safe distance
from a bicyclist sharing the road or moving into or out of a parking spot.

8. Health Monitor: As a safety-critical system, a CAV must at all times ensure the
safety of its passengers and everything in its environment. The health monitor
component tracks the health status of all hardware and software components on
the CAV. It is responsible for reporting failures and any necessary actions that must
be taken by the user.

9. Vehicle By-Wire Controls: The drive-by-wire controls allow the software components
to control the motion and safe operation of the CAV. Primary controls include
braking, throttling, steering and gearing. Secondary controls may include turn
signals, hazard indicators, wipers and door locks.
10. Data Logger: The data logger logs state information from all software components.
These logs allow for diagnosis in the case of faults or excessive deviation from
expected behavior.

11. Embedded Computing Platform: The embedded computing platform hosts and ex-
ecutes all the software components. The embedded computing system consists
of multiple inter-connected computers. Typically, a CAV supports communication
technologies like CAN [82], Ethernet [54] and FlexRay [?], which enable the processors
to communicate with each other and to interface with the CAV sensors and
actuators.

12. User Interface: The user interface allows users to interact with the CAV, and ac-
cept inputs like destination location and preferred route. It also allows the user
to choose from different configuration settings for various components like the
route planner and behaviors. The user interface is responsible for communicating
information to the user such as the road-world model as seen by the CAV, the
short-term path, long-term route and system health statuses. The user interface is
also responsible for providing timely alerts to the user.

8.2.2 Para-Functional Requirements for a CAV

CAVs are safety-critical cyber-physical systems. The reference architecture


shown in Figure 8.1 captures the various component-level requirements to
meet the functional requirements for a CAV. In addition, CAVs need to meet
para-functional requirements such as:

1. Reliability: A CAV must ensure reliable operation for prolonged periods of time
under varied load and failure conditions. For example, a CAV may have a reliabil-
ity requirement of maintaining normal operation in the presence of independent
software crash faults with a minimum inter-arrival period of two minutes.
Figure 8.3: Development Cycle for a CAV (design: System Requirements and
Architecture, and System Modeling and Design; development: Implementation;
analysis & testing: Simulation, Emulation and On-Road Testing and Analysis)

2. Safety: A CAV must ensure at all times the safety of its passengers and its sur-
rounding environment. Safety must also be ensured in the case of faults. This
is typically achieved by a combination of hardware and software fault-tolerance
techniques. For example, a CAV may have a safety requirement of maintaining a
minimum inter-vehicular distance of 5 meters in urban driving conditions.

3. Timeliness: The end-to-end latency from sensing the environment to actuating must
be small enough to be responsive to the speed and distance constraints of the
operating context. The detection and recovery from faults should also be timely.
For example, a CAV may have a timeliness requirement of processing sensor inputs
and producing control outputs once every 33ms.

8.2.3 Development Cycle for a CAV

The development cycle for a CAV consists of three phases: a design phase,
a development phase and a testing phase. These stages are depicted in
Figure 8.3 and described next.

1. System Requirements and Architecture: This phase involves the gathering of system-
level requirements and defining a system architecture that meets the objectives of
the reference architecture in Section 8.2.1.

2. System Modeling and Design: In this phase, a system designer models various software
and hardware components and chooses specific components to implement the
CAV architecture. In addition, the software task interfaces must be designed and
mapped to the hardware configuration.

3. Implementation: In the implementation phase, the hardware and software system


components are developed and integrated.

4. Simulation testing: Since CAVs are safety-critical, in this phase, the software com-
ponents are first tested in a simulated environment for correct behavior. A wide
range of scenarios that mimic real-world situations are tested to ensure the func-
tionality of the software implementation. Simulation testing must include injecting
artificial faults into the system to test system behavior in the presence of faults.

5. Emulation testing: The emulation testing phase involves hardware-in-the-loop testing,
where the functional integration and the adequacy of the performance of the
hardware and software components are tested.

6. On-Road testing: The on-road testing phase involves testing the system initially
in controlled real-world environments to verify the integrity of the entire system.
Information from such tests is used to validate design choices and fine-tune con-
troller parameters, task allocation and frequencies. After confidence in correct sys-
tem behavior is established, on-road testing is commenced on public roads with
uncontrolled traffic. Complex scenarios are introduced gradually and considerable
caution is exercised throughout.

8.2.4 Tool/Process Flow

In the previous sub-sections, we described the system requirements and the


development cycle for a CAV. In this sub-section, we introduce the various
tools and processes we use as part of our development cycle. The usage
flow of these tools is illustrated in Figure 8.4.

Figure 8.4: Tool/Process Flow (SysWeaver for design & development; SysAnalyzer;
the Run-Time Diagnosis Framework; AutoSim and TROCS for analysis & testing)

1. SysWeaver: SysWeaver is a model-based design, integration, and analysis frame-


work introduced by de Niz et al. [116] for embedded real-time systems. SysWeaver
allows software components and their interactions to be modeled. It also allows
communication interfaces between hardware and software components to be mod-
eled. Once the system is modeled and verified, SysWeaver generates infrastructure
and middleware code that binds all the software components together.

2. SysAnalyzer: SysAnalyzer is used to analyze the schedulability of various software


components and ensure timeliness of software components executed in a real-time
computing environment. Given a task set, SysAnalyzer can produce feasible as-
signments of tasks to computing nodes. Also, in order to ensure the safety in the
presence of faults, SysAnalyzer can assign and deploy software backups.

3. Tartan Racing On-board Computer System (TROCS): TROCS is a system-level hybrid


emulator and simulator for autonomous vehicles. It can simulate various driving
scenarios that allow for software functional logic to be verified. It also allows the
injection of faults that can be used to verify the performance of the system in the
presence of faults. TROCS is also an emulator. It can be used to verify the integrity
of various hardware and software component interactions.

4. AutoSim: AutoSim is an application-level hybrid emulator and simulator. It can be
used to generate various application-level scenarios like lane changes, merges and
overtaking. AutoSim additionally supports V2V and V2I simulations, enabling the
development and testing of V2X protocols. AutoSim can also act as an emulator
that interfaces with the real vehicle, in turn enabling the vehicle to sense and
interact with different virtual on-road test scenarios.

5. Run-Time Diagnostics Service: A run-time diagnostic service runs on the vehicle


during on-road tests, allowing the user to collect valuable diagnostic information
like CPU and memory resources being consumed, recovery time obtained during
faults and end-to-end latencies being encountered. This information can be fed
back into SysWeaver and SysAnalyzer to further refine and enhance the system
implementation.

6. EMulator/simulator for Embedded Real-time autonomous Intelligent Driving (EMERALD):
EMERALD is a system-level 3D graphical simulator for autonomous vehicles.
Unlike TROCS, it can simulate physical sensors like lidars, radars, cameras
and ultrasound sensors. It allows the injection of faults, including different sensor
noise models, malfunctioning lights, etc., that can be used to verify the performance
of the system in the presence of faults.

Various combinations of these tools are used to develop and test each
subsystem from the reference architecture. For example, the path planning
subsystem uses SysWeaver and SysAnalyzer to generate and deploy code,
and TROCS for testing functional behavior.
In this section, we presented the various pieces in our CAV reference
architecture, capturing our development and tool flow. Detailed descriptions
of SysWeaver, TROCS and AutoSim can be found in [24]. In the sections
to follow, we describe SysAnalyzer, the Run-Time Diagnostics Framework
and EMERALD in more detail.
Figure 8.5: SysAnalyzer Workflow (a task set of Brake Control, Safety Audio,
Steering Control, Throttle Control, HVAC, Audio Playback and Video Playback,
each with its period, deadline, utilization and replica count, is given to
SysDeployer with the objective of minimizing the number of hardware
components, producing a three-node deployment in which no task is co-located
with its own replicas)

8.3 SysAnalyzer

System deployment involves the mapping of software components to
hardware components while ensuring that the system remains schedulable. As
described in the previous section, SysWeaver is able to design, model and
analyze software. Though SysWeaver has some basic task deployment features,
SysAnalyzer is a tool that can produce mappings of software components
to hardware components while trying to optimize various parameters.
8.3.1 SysAnalyzer Workflow

SysAnalyzer accepts various software component parameters, like the period
(T), the deadline (D), the WCET, the number of replicas assigned for
each component, and the cost function. Currently, SysAnalyzer supports
two cost functions: minimizing the number of hardware components used
or, given a fixed number of hardware components, producing the most-balanced
deployment. If the number of replicas assigned is non-zero, then
SysAnalyzer ensures that no two identical components are assigned to the
same hardware component. This is referred to as the Fault-tolerant Partitioned
Scheduling problem as defined by Kim et al. in [20]. This ensures that,
when nodes fail independently, they do not result in application failures.
The problem of assigning tasks to processors is a well-known bin-packing
problem [18] and has been proven to be NP-hard [19]. Hence, SysAnalyzer
uses an ensemble [96] approach where it implements a large number
of heuristics (including TPCD [81] and R-BFD [20]) and selects the best deployment
optimizing the given cost function. Figure 8.5 depicts a standard
SysAnalyzer workflow. It accepts the component parameters and the
number of replicas for each component, along with an optimization
function. SysAnalyzer then produces a deployment which ensures that
all components remain schedulable while optimizing the cost function
(SysAnalyzer obtains a good near-optimal solution but cannot guarantee
the absolute optimum). SysAnalyzer supports various fixed-priority
scheduling policies, including RMS and Deadline-Monotonic Scheduling
(DMS) [117]. SysAnalyzer can also accept constraints on fault recovery, such
as a bound on the failure recovery time, based on which SysAnalyzer will
decide the nature (i.e., either Hot or Cold [81]) and number of replicas. It then
uses this information to find an optimized deployment for the software
components.
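The ensemble approach described above can be sketched as follows: run every
available packing heuristic on the task set and keep the deployment that scores
best under the chosen cost function. The two toy packers below stand in for
TPCD, R-BFD and the other implemented heuristics; they are illustrative only
and ignore the placement constraint.

    def first_fit(utils, cap=1.0):
        # Toy first-fit packer: returns a list of bins of task utilizations.
        bins = []
        for u in utils:
            for b in bins:
                if sum(b) + u <= cap:
                    b.append(u)
                    break
            else:
                bins.append([u])
        return bins

    def first_fit_decreasing(utils, cap=1.0):
        return first_fit(sorted(utils, reverse=True), cap)

    def best_deployment(utils, heuristics, cost=len):
        # Ensemble selection: run every heuristic, keep the lowest-cost
        # result; with cost=len this minimizes the number of processors.
        return min((h(utils) for h in heuristics), key=cost)

    tasks = [0.55, 0.5, 0.4, 0.2, 0.16, 0.16, 0.1, 0.1]
    print(best_deployment(tasks, [first_fit, first_fit_decreasing]))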
8.4 Run-Time Diagnostics Framework

In the previous sections, we described various tools that we used to model,


design and test CAV capabilities to prepare for on-road tests. In this sec-
tion, we describe our run-time diagnostics framework that we employ dur-
ing on-road tests.
The run-time diagnostics framework has two main functions:

1. Monitoring: The run-time diagnostics framework constantly monitors system performance.
It tracks the status of all active hardware and software components.
For example, the run-time diagnostics framework tracks the memory consumption
of each software component. It also tracks the CPU resources utilized by each
software component. In the case of abnormal memory or CPU usage, the run-time
diagnostics framework notifies the user interface, which in turn warns the user
appropriately (a minimal sketch of such a monitor follows this list).

2. Data Logging: The run-time diagnostics framework is also responsible for logging
important information during on-road tests. Not only does the framework log
exceptional conditions, it is also used to create logs that can be used to re-create
tests using the playback feature in TROCS. The playback tests in combination with
detailed logs can be used to fix problems in the functional behavior of the CAV.
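A minimal version of the monitoring function might look like the sketch below,
assuming the third-party psutil package and hypothetical warning thresholds:

    import psutil

    CPU_LIMIT_PERCENT = 80.0  # hypothetical per-process thresholds
    MEM_LIMIT_MB = 512.0

    def check_components(pids):
        # Poll each monitored component and report abnormal resource usage;
        # the real framework forwards these warnings to the user interface.
        warnings = []
        for pid in pids:
            try:
                proc = psutil.Process(pid)
                cpu = proc.cpu_percent(interval=0.1)
                mem_mb = proc.memory_info().rss / (1024 * 1024)
            except psutil.NoSuchProcess:
                warnings.append((pid, "component is no longer running"))
                continue
            if cpu > CPU_LIMIT_PERCENT:
                warnings.append((pid, f"CPU usage {cpu:.0f}% above limit"))
            if mem_mb > MEM_LIMIT_MB:
                warnings.append((pid, f"memory {mem_mb:.0f} MB above limit"))
        return warnings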

8.5 EMulator/simulator for Embedded Real-time

autonomous Intelligent Driving (EMERALD)

EMERALD is a system-level 3D graphical simulator for autonomous vehi-


cles. EMERALD is a highly configurable platform in which all features are
procedurally generated from configuration files. It supports the following
features:
1. Sensor Simulation: EMERALD supports the simulation of several sensors frequently
used for autonomous driving applications, like lidars, radars, ultrasound and camera
sensors. The user can test these sensors in various configurations and modes.
For example, the placement, type and range of the sensors are configurable. A
system designer could test various combinations of sensors to ensure the system is
able to meet its requirements.

2. Weather Simulation: EMERALD supports the simulation of varied weather patterns,
including rain, snow and sandstorms. A user can also control the intensity of these
weather phenomena. The EMERALD simulation environment accounts for the
effects of weather patterns on the autonomous driving applications by controlling
parameters like the surface friction coefficients, camera visibility, and the effects of
these weather patterns on the lidar and radar sensors.

3. Fault Injection: EMERALD supports a large suite of fault injection features, like
varying sensor noise models for different weather patterns, failures in system
and application processes, faulty sensor outputs, etc.

4. Exception Logging: The EMERALD logging framework logs exception conditions


on every simulation run. For example, it logs cases where the CAV is too close to
a curb or a pedestrian, if it hits an obstacle or is unable to reach its goal.

5. Scenario Generation: Simulation testing is a very important step in ensuring the safe
operation of various autonomous driving applications. Many scenarios are quite
dangerous to test in the real world, for example testing CAV behavior in the presence
of unexpected jaywalking pedestrians. EMERALD allows a system designer
to generate and test autonomous driving applications under various conditions in
simulation by simply editing a set of configuration files that expose all the editable
parameters (a sketch of such a configuration follows this list).
6. Automated Testing: EMERALD can also generate random scenarios to continuously
stress-test the system to identify edge cases and gaps in the application logic.
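A scenario configuration of the kind EMERALD generates its runs from might
look like the following sketch; the keys and values are illustrative assumptions,
not EMERALD's actual file format.

    import json
    import random

    # Hypothetical scenario description; EMERALD-style runs are assumed to
    # be driven entirely by configuration rather than hand-written code.
    scenario = {
        "weather": {"type": "rain", "intensity": 0.7},
        "sensors": [
            {"type": "lidar", "mount": "roof", "range_m": 120,
             "fault": {"noise_model": "rain_attenuation"}},
            {"type": "camera", "mount": "windshield", "fov_deg": 90},
        ],
        "actors": [
            {"type": "pedestrian", "behavior": "jaywalk",
             "trigger_distance_m": random.uniform(10, 30)},
        ],
        "pass_criteria": {"min_clearance_m": 1.0, "reach_goal": True},
    }

    with open("scenario_rain_jaywalk.json", "w") as f:
        json.dump(scenario, f, indent=2)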

Figure 8.6: EMERALD: Scenario Generation

Figure 8.7: Weather Simulation

Figure 8.8: Lidar Simulation

8.6 Chapter Summary

In this chapter, we presented a reference architecture for a Connected and
Autonomous Vehicle (CAV) and described the various components in the reference
architecture along with their interactions. We described the functional
and para-functional requirements for the design and development
of a CAV. We detailed our development process and the tool flow that we
adopt for building and testing our CAV. We described the tools we created
in detail. We conclude that, given the scale and the nature of the increasing
complexity in CAV systems, having the right tools and methodologies in
place is critical to developing functionally correct, safe and reliable CAV
applications.
Chapter 9

Conclusions

9.1 Research Contributions

In this dissertation, we presented several analyzable approaches to redun-


dancy management, techniques to derive demonstrably-efficient task execution
and redundancy parameters, and approaches to structured testing
and validation for different levels of automation and operating contexts.
Specifically, we looked at different aspects of the system design together to
meet the safety and reliability requirements of connected and autonomous
driving vehicles. The following summarizes our contributions.

9.1.1 Selection of Task Execution Parameters

The first step in the deployment of autonomous driving applications is


selecting the task execution parameters. In Chapter 3, given the benefits
of harmonic task sets, we addressed the problem of assigning harmonic
periods to an arbitrary task set such that every task gets assigned an integer
period less than or equal to its application-specified upper bound
and the utilization of every task is less than 100%. Using a mathe-
matical framework to represent integer harmonic task sets, we first pre-


sented a brute-force harmonic search algorithm. Next, we presented the


Branch-and-Bound Harmonic Search algorithm which improves on the per-
formance of the brute-force harmonic search by a few orders of magnitude.
We also presented two algorithms, the “Discrete Piecewise Harmonic Search”
(DPHS) and the “Period Ratio Harmonic Search” (PRHS), to find sub-optimal
solutions which can be used by run-time admission control and other con-
texts that are time-sensitive and can afford sub-optimal solutions. We also
demonstrated the benefits of our approach by considering real-world task
sets.

9.1.2 Selection of Replication Parameters

Autonomous driving systems are inherently safety-critical and hence, safety


must be guaranteed in the presence of faults. To this end, the next step in
the deployment process is to employ techniques like replication to sup-
port fault-tolerance requirements, which in turn require the selection of
task replication parameters. In Chapter 5, we considered software fault-tolerance
techniques for safety-critical real-time systems and derived
bounds on the recovery time of different types of redundant tasks, such
as active tasks, passive tasks and their variants. We also derived conditions
that map the recovery time requirement (RTR) of a task to the appropriate
redundant task type and related replication parameters.

9.1.3 Fault-tolerant Assignment of Tasks to Computing Nodes

Autonomous driving systems are not only safety-critical, but they are also
resource-constrained. Hence, once we have selected the task execution and
replication parameters, the next step is to assign these tasks to comput-
ing nodes such that we minimize the system resource utilization. To this

end, in Chapter 6, we proposed a new heuristic, called Tiered Placement


Constraint Decreasing (TPCD), for task assignment. It saves at least one
processor up to 40% of the time for a random task set and up to 90% of the
time for an L2 task set, relative to the best-known heuristic in the literature.
Also, on average, it uses only up to one processor more than a carefully-constructed
optimal allocation. We also proposed a heuristic called Tiered
Placement Constraint Decreasing with Cold Standbys (TPCDC), which leverages
the run-time characteristics of cold standbys, further improving the resource utilization
of the system. We then derived the worst-case response time bounds of
our task model and presented a solution to improve this bound. We also
analyzed the impact of overload conditions on our task allocation scheme.
We extended the fault-tolerant task allocation problem to include these
RTR constraints, and proposed the TPCDC+R heuristic to satisfy these
constraints. Finding a core weakness in TPCDC+R, we then presented
two additional heuristics called Recovery-Time Tiered (RTT) and Tiered
Recovery-Time Constraint Increasing (TRTI) which prioritize the RTR con-
straints in the task allocation sequence. These two heuristics on average
produce allocations with fewer nodes than the TPCDC+R heuristic because
they yield more assignments of resource-efficient cold standbys. Overall,
the RTT heuristic, which tiers tasks based on their RTR values to prioritize
the allocation of tasks with strict RTR requirements first, performs the best.
Finally, we used the simulated annealing method to solve the fault-tolerant
task allocation optimization problem and showed that it produces alloca-
tions utilizing fewer computing resources than the proposed heuristics, at
the cost of substantial run-time.

9.1.4 Software Architecture to Support and Maintain Fault-Tolerance

Guarantees

With the tasks assigned to different computing nodes, we now need a


software architecture that can support the operation and maintenance of
the different replica types to guarantee safe operation. In Section 7.1, we
presented the software architecture to support the execution of different
standby types on the AUTOSAR Classic Platform, followed by a software
framework built on a service-oriented architecture for the AUTOSAR Adaptive
Platform in Section 7.2. Finally, in Section 7.3, we presented the SAFFIRE
(Software Architecture For Fault-tolerant Imbed Real-time Environments)
framework designed to support fault-tolerant execution on the CMU
Autonomous Driving Platform. In each section of the chapter, we presented
our implementation and an experimental evaluation of the corresponding
architecture.

9.1.5 Evaluation and Testing of Self-Driving Safety-Critical

Automotive Systems

Finally, the last stage in the deployment of an autonomous driving system is
testing and verification to ensure that the system meets its requirements. In
Chapter 8, we presented a reference architecture for a Connected and Autonomous
Vehicle (CAV) and described the various components in the reference
architecture along with their interactions. We described the functional
and para-functional requirements for the design and development
of a CAV. We detailed our development process and the tool flow that we
adopt for building and testing a CAV. We also described each of these tools
in detail. We concluded that, given the scale and the nature of the increasing
complexity in CAV systems, having the right tools and methodologies
in place is critical to developing functionally correct, safe and reliable CAV
applications.

9.2 Future Work

As autonomous driving systems become increasingly prevalent, there
are many possible research directions for future work to improve
safety guarantees for these systems. We discuss below some topics for
future research.

9.2.1 Application-Specific Fault-tolerance schemes

This dissertation focused mainly on the system-level aspects of fault-tolerant


design for connected and autonomous driving systems. These can be
strategically coupled with application-specific fault-tolerance schemes. For
example, tasks can be specifically designed to be purely stateless, increasing
the gains obtained from the redundancy-type allocation.

9.2.2 Photo-Realistic Simulation Environment

Many modern machine-learning algorithms have been developed to pro-


cess video streams to detect, classify and track objects in the environment.
Adding photo-realistic simulation capabilities to EMERALD will allow
these algorithms to be tested in simulation, instead of using techniques like
transfer learning. Moreover, these techniques perform better given larger
and more diverse training sets, which can be obtained easily if the simulation
were photo-realistic.

9.2.3 Hardware Architectures to Support Fault-Tolerant Computation

Software techniques like replication face significant bottlenecks given hard-


ware limitations, such as when a specific sensor can only interface with a
single computing unit. A failure in this computing unit then results in
the failure of the entire sensor subsystem. In order to avoid these single
points of failure, the hardware architecture and organization must be care-
fully selected to guarantee that the software primitives can meet system
requirements.

9.2.4 Fault-tolerance Primitives for High-Performance Computing

Devices

Complex computing platforms are essential to support the increase in the


complexity of autonomous driving applications. High-performance com-
puting platforms with multiple GPUs, many-core systems and machine-
learning accelerators are now becoming more prevalent. These platforms
allow for the consolidation of software components in the vehicle, thus
bringing down costs. However, this results in additional single points of
failure. Hence, it is important to understand the failure modalities of these
high-performance computing platforms and develop fault-tolerance prim-
itives for these platforms.
Appendix A

Glossary

Acronyms Full Names


AUTOSAR AUTomotive Open System ARchitecture
BBHS Branch-And-Bound Harmonic Search Algorithm
BFHS Brute-Force Harmonic Search Algorithm
CAPA Criticality As Priority Assignment
DAL Driving Automation Level
DPHS Discrete Piecewise Harmonic Search
EMERALD EMulator/simulator for Embedded Real-time autonomous Intelligent Driving
FOE First Order Error
MPE Maximum Percentage Error
ODD Operational Design Domain
PRHS Period-Ratio Harmonic Search
R-BFD Reliable Best-Fit Decreasing
RMS Rate-Monotonic Scheduling
RTR Recovery Time Requirement
RTT RTR-tiered
SAFFIRE Software Architecture For Fault-tolerant Imbed Real-time Environments
SOA Service-Oriented Architecture
SOME/IP Scalable service-Oriented MiddlewarE over IP
TPCD Tiered Placement Constraint Decreasing
TPCDC Tiered Placement Constraint Decreasing with Cold Standbys
TPE Total Percentage Error
TROCS Tartan Racing On-board Computer System
TRTI Tiered RTR constraint Increasing
TSU Total System Utilization
WCCT Worst-Case Completion Time
WCET Worst-Case Execution Time
ZSRM Zero-Slack RMS

Appendix B

Existing Task Partitioning Heuristics

B.1 The BFD-P and R-BFD Heuristics

In this section, we provide a brief overview of the BFD-P and R-BFD heuris-
tics [20].
The BFD-P algorithm follows the steps below (a code sketch follows this list):

1. Sort tasks including replicas in the decreasing order of utilization.

2. Fit every task into the best-fit processor, obeying the placement constraint, i.e., no
task should be co-located with its replica.

3. Add a new processor if a task does not fit any bin.

4. Iterate until no tasks remain.
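A compact sketch of BFD-P under these steps, with task identities tracked so
that a task and its replicas never share a bin (the utilizations and IDs in the
example are illustrative):

    def bfd_p(tasks, cap=1.0):
        # tasks: list of (task_id, utilization); replicas share a task_id.
        bins = []
        for tid, u in sorted(tasks, key=lambda t: t[1], reverse=True):
            # Best fit: the feasible bin left with the least spare capacity,
            # excluding bins already holding this task or one of its replicas.
            fits = [b for b in bins
                    if sum(x[1] for x in b) + u <= cap
                    and all(x[0] != tid for x in b)]
            if fits:
                best = min(fits, key=lambda b: cap - sum(x[1] for x in b) - u)
                best.append((tid, u))
            else:
                bins.append([(tid, u)])  # open a new processor
        return bins

    # A primary and one replica per task; replicas carry the same task_id.
    print(bfd_p([("BC", 0.1), ("BC", 0.1), ("SC", 0.2), ("SC", 0.2), ("TC", 0.16)]))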

The R-BFD algorithm follows the steps below:

1. The given tasks are sorted in decreasing order of utilization.

2. The primary tasks are extracted and allocated first using the BFD-P heuristic.


3. The replicas are then allocated one by one, highest order replicas first, i.e., opposite
to the TPCD approach.

4. Add a new processor if a task does not fit any bin.

5. Iterate until no tasks remain.


Bibliography

[1] J. Wei, J. M. Snider, J. Kim, J. M. Dolan, R. Rajkumar, and B. Litkouhi. Towards


a viable autonomous driving research platform. In 2013 IEEE Intelligent Vehicles
Symposium (IV), pages 763–770, June 2013. xv, 141, 150, 151

[2] Ragunathan (Raj) Rajkumar, Insup Lee, Lui Sha, and John Stankovic. Cyber-
physical systems: The next computing revolution. In Proceedings of the 47th Design
Automation Conference, DAC ’10, pages 731–736, New York, NY, USA, 2010. ACM.
3, 17

[3] C. L. Liu and James W. Layland. Scheduling algorithms for multiprogramming in


a hard-real-time environment. J. ACM, 20(1):46–61, January 1973. 3, 17, 18, 48

[4] Minsoo Ryu and Seongsoo Hong. A period assignment algorithm for real-time
system design. In W. Rance Cleaveland, editor, Tools and Algorithms for the Con-
struction and Analysis of Systems, pages 34–43, Berlin, Heidelberg, 1999. Springer
Berlin Heidelberg. 3, 17

[5] D. Seto, J. P. Lehoczky, L. Sha, and K. G. Shin. On task schedulability in real-time


control systems. In 17th IEEE Real-Time Systems Symposium, pages 13–21, Dec 1996.
3, 17

[6] Joseph Y. T. Leung. A new algorithm for scheduling periodic, real-time tasks.
Algorithmica, 4(1):209, Jun 1989. 3, 17


[7] C. C. Han and H. Y. Tyan. A better polynomial-time schedulability test for real-
time fixed-priority scheduling algorithms. In Proceedings Real-Time Systems Sympo-
sium, pages 36–45, Dec 1997. 3, 8, 17, 30, 43

[8] V. Bonifaci, A. Marchetti-Spaccamela, N. Megow, and A. Wiese. Polynomial-time


exact schedulability tests for harmonic real-time tasks. In 2013 IEEE 34th Real-Time
Systems Symposium, pages 236–245, Dec 2013. 3, 18

[9] R. Gerber, Seongsoo Hong, and M. Saksena. Guaranteeing real-time requirements


with resource-based calibration of periodic processes. IEEE Transactions on Software
Engineering, 21(7):579–592, Jul 1995. 3, 18

[10] S. DSouza, A. Bhat, and R. Rajkumar. Sleep scheduling for energy-savings in multi-
core processors. In 2016 28th Euromicro Conference on Real-Time Systems (ECRTS),
pages 226–236, July 2016. 3, 18

[11] Anand Bhat, Soheil Samii, and Ragunathan (Raj) Rajkumar. Recovery Time Con-
siderations in Real-Time Systems Employing Software Fault Tolerance. In Sebas-
tian Altmeyer, editor, 30th Euromicro Conference on Real-Time Systems (ECRTS 2018),
volume 106 of Leibniz International Proceedings in Informatics (LIPIcs), pages 23:1–
23:22, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Infor-
matik. 4

[12] Seong Woo Kwak and Jung-Min Yang. Optimal checkpoint placement on real-time
tasks with harmonic periods. Journal of Computer Science and Technology, 27(1):105–
112, Jan 2012. 4

[13] H. Kopetz. On the design of distributed time-triggered embedded systems. Journal


of Computing Science and Engineering, 2(4):340–356, 2008. 4, 18

[14] R.B. Dodd, Defence Science and Technology Organisation (Australia), Air Operations
Division. An analysis of task-scheduling for a generic avionics mission
computer. 2006. 4, 18

[15] J. V. Busquets-Mataix, J. J. Serrano, R. Ors, P. Gil, and A. Wellings. Using harmonic


task-sets to increase the schedulable utilization of cache-based preemptive real-
time systems. In Proceedings of 3rd International Workshop on Real-Time Computing
Systems and Applications, pages 195–202, Oct 1996. 4, 18

[16] T. King. An overview of arinc 653 part 4. In 2012 IEEE/AIAA 31st Digital Avionics
Systems Conference (DASC), pages 1–7, Oct 2012. 4, 18

[17] A. Easwaran, I. Lee, O. Sokolsky, and S. Vestal. A compositional scheduling frame-


work for digital avionics systems. In 2009 15th IEEE International Conference on Em-
bedded and Real-Time Computing Systems and Applications, pages 371–380, Aug 2009.
4, 18

[18] D. Oh and T. Baker. Utilization bounds for n-processor rate monotonic scheduling
with static processor assignment. In Real-Time System, pages 15:183–192, 1998. 5,
159

[19] D. Johnson. Near optimal allocation algorithms. Ph.D. Dissertation, MIT, MA. 5,
57, 159

[20] J. Kim et al. R-BATCH: task partitioning for fault-tolerant multiprocessor real-
time systems. In CIT 2010, Bradford, West Yorkshire, UK, June 29-July 1, 2010, pages
1872–1879, 2010. 5, 12, 14, 57, 67, 71, 159, 171

[21] A. Bhat, S. Samii, and R. Rajkumar. Practical task allocation for software fault-
tolerance and its implementation in embedded automotive systems. In 2017 IEEE
Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 87–98,
April 2017. 5

[22] Autosar. http://www.autosar.org. 6, 50, 109, 110, 125



[23] Lingfeng Wang and K. C. Tan. Software testing for safety critical applications.
IEEE Instrumentation Measurement Magazine, 8(2):38–47, June 2005. 7

[24] A. Bhat, S. Aoki, and R. Rajkumar. Tools and methodologies for autonomous
driving systems. Proceedings of the IEEE, 106(9):1700–1716, Sep. 2018. 7, 157

[25] D. Henriksson and A. Cervin. Optimal on-line sampling period assignment for
real-time control tasks based on plant state information. In Proceedings of the 44th
IEEE Conference on Decision and Control, pages 4469–4474, Dec 2005. 8

[26] A. Cervin, M. Velasco, P. Marti, and A. Camacho. Optimal online sampling pe-
riod assignment: Theory and experiments. IEEE Transactions on Control Systems
Technology, 19(4):902–910, July 2011. 8

[27] Nasro Min-Allah, Samee Ullah Khan, and Wang Yongji. Optimal task execution
times for periodic tasks using nonlinear constrained optimization. The Journal of
Supercomputing, 59(3):1120–1138, Mar 2012. 9

[28] M. Nasri, G. Fohler, and M. Kargahi. A framework to construct customized har-


monic periods for real-time systems. In 2014 26th Euromicro Conference on Real-Time
Systems, pages 211–220, July 2014. 9

[29] M. Nasri and G. Fohler. An efficient method for assigning harmonic periods to
hard real-time tasks with period ranges. In 2015 27th Euromicro Conference on Real-
Time Systems, pages 149–159, July 2015. 9

[30] Morteza Mohaqeqi, Mitra Nasri, Yang Xu, Anton Cervin, and Karl-Erik Årzén.
Optimal harmonic period assignment: complexity results and approximation al-
gorithms. Real-Time Systems, Apr 2018. 9

[31] Jean-Claude Laprie and Brian Randell. Fundamental concepts of computer systems
dependability. In Proceedings of the 3rd IEEE Information Survivability, Boston,
Massachusetts, USA, October 2000, pages 24–26, 2001. 10, 50

[32] F. V. Brasileiro, P. D. Ezhilchelvan, S. K. Shrivastava, N. A. Speirs, and S. Tao.


Implementing fail-silent nodes for distributed systems. IEEE Transactions on Com-
puters, 45(11):1226–1238, Nov 1996. 10, 50

[33] Rachid Guerraoui and André Schiper. Fault-tolerance by replication in distributed


systems. In Alfred Strohmeier, editor, Reliable Software Technologies — Ada-Europe
’96, pages 38–57, Berlin, Heidelberg, 1996. Springer Berlin Heidelberg. 10, 50

[34] J. J. Chen, C. Y. Yang, T. W. Kuo, and S. Y. Tseng. Real-time task replication for fault
tolerance in identical multiprocessor systems. In 13th IEEE Real Time and Embedded
Technology and Applications Symposium (RTAS’07), pages 249–258, April 2007. 11

[35] S. Gopalakrishnan and M. Caccamo. Task partitioning with replication upon het-
erogeneous multiprocessor systems. In 12th IEEE Real-Time and Embedded Technol-
ogy and Applications Symposium (RTAS’06), pages 199–207, April 2006. 11

[36] C. Pinello, L. P. Carloni, and A. L. Sangiovanni-Vincentelli. Fault-tolerant dis-


tributed deployment of embedded control software. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 27(5):906–919, May 2008. 11

[37] Yingfeng Oh and Sang H. Son. Enhancing fault-tolerance in rate-monotonic


scheduling. Real-Time Systems, 7(3):315–329, Nov 1994. 11

[38] J. Kim et al. Safer: System-level architecture for failure evasion in real-time ap-
plications. In Real-Time Systems Symposium (RTSS), 2012 IEEE 33rd, 2012. 12, 117,
118

[39] P. Guo and Z. Xue. Improved task partition based fault-tolerant rate-monotonic
scheduling algorithm. In 2016 International Conference on Security of Smart Cities,
Industrial Control System and Communications (SSIC), pages 1–5, July 2016. 12

[40] A. A. Bertossi, L. V. Mancini, and A. Menapace. Scheduling hard-real-time tasks


with backup phasing delay. In 2006 Tenth IEEE International Symposium on Dis-
tributed Simulation and Real-Time Applications, pages 107–118, Oct 2006. 12

[41] Kay Klobedanz et al. Embedded Systems: Design, Analysis and Verification: 4th IFIP
TC 10, IESS 2013, Paderborn, Germany, June 17-19, 2013. Proceedings, pages 238–249.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2013. 12, 14

[42] Krithi Ramamritham. Allocation and scheduling of precedence-related periodic


tasks. IEEE Transactions on Parallel and Distributed Systems, 6:412–420, 1995. 12

[43] Ping Zhu, Fumin Yang, and Gang Tu. Fault-tolerant rate-monotonic compact-
factor-driven scheduling in hard-real-time systems. Wuhan University Journal of
Natural Sciences, 15(3):217–221, 2010. 12

[44] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated anneal-


ing. SCIENCE, 220(4598):671–680, 1983. 13, 68, 100

[45] F. Cristian. Reaching agreement on processor-group membership in synchronous


distributed systems. Distributed Computing, 4(4):175–187, 1991. 13, 111

[46] J. Balasubramanian et al. Middleware for resource-aware deployment and configuration of fault-tolerant real-time systems. In RTAS '10, pages 69–78, 2010. 13
[47] P. Felber and P. Narasimhan. Experiences, strategies, and challenges in building fault-tolerant CORBA systems. IEEE Trans. Computers, 53(5):497–511, 2004. 13
[48] P. Narasimhan et al. MEAD: support for real-time fault-tolerant CORBA. Concurrency and Computation: Practice and Experience, 17(12):1527–1545, 2005. 13

[49] Caroline Lu, Jean-Charles Fabre, and Marc-Olivier Killijian. An approach for improving fault-tolerance in automotive modular embedded software. 14
[50] A. Bhat, S. Samii, and R. Rajkumar. Practical task allocation for software fault-tolerance and its implementation in embedded automotive systems. In 2017 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 87–98, April 2017. 14

[51] Jean-Charles Fabre, Marc-Olivier Killijian, and François Taïani. Robustness of automotive applications using reflective computing: lessons learnt. In SAC, 2011. 14

[52] Traian Pop, Paul Pop, Petru Eles, Zebo Peng, and Alexandru Andrei. Timing analysis of the FlexRay communication protocol. Real-Time Systems, 39(1):205–235, Aug 2008. 14, 52, 59
[53] IEEE 802.1CB: Frame Replication and Elimination for Reliability. http://www.ieee802.org/1/pages/802.1cb.html. Accessed: 2018-01-12. 14, 52

[54] D. Thiele, P. Axer, and R. Ernst. Improving formal timing analysis of switched Ethernet by exploiting FIFO scheduling. In 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–6, June 2015. 14, 52, 121, 153
[55] Charles Thorpe, Martial H. Hebert, Takeo Kanade, and Steven A. Shafer. Vision and navigation for the Carnegie-Mellon Navlab. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(3):362–373, 1988. 15

[56] Chris Urmson, Joshua Anhalt, Drew Bagnell, Christopher Baker, Robert Bittner, M. N. Clark, John Dolan, Dave Duggins, Tugrul Galatali, Chris Geyer, et al. Autonomous driving in urban environments: Boss and the Urban Challenge. Journal of Field Robotics, 25(8):425–466, 2008. 15

[57] Michael Montemerlo, Jan Becker, Suhrid Bhat, Hendrik Dahlkamp, Dmitri Dolgov, Scott Ettinger, Dirk Haehnel, Tim Hilden, Gabe Hoffmann, Burkhard Huhnke, et al. Junior: The Stanford entry in the Urban Challenge. Journal of Field Robotics, 25(9):569–597, 2008. 15
[58] Andrew Bacha, Cheryl Bauman, Ruel Faruque, Michael Fleming, Chris Terwelp, Charles Reinholtz, Dennis Hong, Al Wicks, Thomas Alberi, David Anderson, et al. Odin: Team VictorTango's entry in the DARPA Urban Challenge. Journal of Field Robotics, 25(8):467–492, 2008. 15

[59] John Leonard, Jonathan How, Seth Teller, Mitch Berger, Stefan Campbell, Gaston Fiore, Luke Fletcher, Emilio Frazzoli, Albert Huang, Sertac Karaman, et al. A perception-driven autonomous urban vehicle. Journal of Field Robotics, 25(10):727–774, 2008. 15

[60] Jonathan Bohren, Tully Foote, Jim Keller, Alex Kushleyev, Daniel Lee, Alex Stewart, Paul Vernaza, Jason Derenick, John Spletzer, and Brian Satterfield. Little Ben: The Ben Franklin Racing Team's entry in the 2007 DARPA Urban Challenge. Journal of Field Robotics, 25(9):598–614, 2008. 15

[61] Isaac Miller, Mark Campbell, Dan Huttenlocher, Frank-Robert Kline, Aaron Nathan, Sergei Lupashin, Jason Catlin, Brian Schimpf, Pete Moran, Noah Zych, et al. Team Cornell's Skynet: Robust perception and planning in an urban environment. Journal of Field Robotics, 25(8):493–527, 2008. 15

[62] Fred W. Rauskolb, Kai Berger, Christian Lipski, Marcus Magnor, Karsten Cornelsen, Jan Effertz, Thomas Form, Fabian Graefe, Sebastian Ohl, Walter Schumacher, et al. Caroline: An autonomously driving vehicle for urban environments. Journal of Field Robotics, 25(9):674–724, 2008. 15

[63] Benjamin J. Patz, Yiannis Papelis, Remo Pillat, Gary Stein, and Don Harper. A practical approach to robotic design for the DARPA Urban Challenge. Journal of Field Robotics, 25(8):528–566, 2008. 15
[64] Yi-Liang Chen, Venkataraman Sundareswaran, Craig Anderson, Alberto Broggi, Paolo Grisleri, Pier Paolo Porta, Paolo Zani, and John Beck. TerraMax™: Team Oshkosh urban robot. Journal of Field Robotics, 25(10):841–860, 2008. 15

[65] James R. McBride, Jerome C. Ivan, Doug S. Rhode, Jeffrey D. Rupp, Matthew Y. Rupp, Jeffrey D. Higgins, Doug D. Turner, and Ryan M. Eustice. A perspective on emerging automotive safety applications, derived from lessons learned through participation in the DARPA Grand Challenges. Journal of Field Robotics, 25(10):808–840, 2008. 15

[66] Felix von Hundelshausen, Michael Himmelsbach, Falk Hecker, Andre Mueller, and Hans-Joachim Wuensche. Driving with tentacles: Integral structures for sensing and motion. Journal of Field Robotics, 25(9):640–673, 2008. 15

[67] S. Shah, D. Dey, C. Lovett, and A. Kapoor. Aerial informatics and robotics platform. Technical Report MSR-TR-9, Microsoft Research, 2017. 16

[68] Udacity, An Open Source Self-Driving Car. https://www.udacity.com/self-driving-car. 16
[69] Daniel Krajzewicz, Georg Hertkorn, Christian Rössel, and Peter Wagner. SUMO (Simulation of Urban Mobility): an open-source traffic simulation. In Proceedings of the 4th Middle East Symposium on Simulation and Modelling (MESM2002), pages 183–187, 2002. 16

[70] Martin Fellendorf and Peter Vortisch. Validation of the microscopic traffic flow model VISSIM in different real-world situations. In Transportation Research Board 80th Annual Meeting, 2001. 16

[71] Michal Piorkowski, Maxim Raya, A. Lezama Lugo, Panagiotis Papadimitratos, Matthias Grossglauser, and J.-P. Hubaux. TraNS: realistic joint traffic and network simulator for VANETs. ACM SIGMOBILE Mobile Computing and Communications Review, 12(1):31–33, 2008. 16
[72] Christoph Sommer, Reinhard German, and Falko Dressler. Bidirectionally coupled network and road traffic simulation for improved IVC analysis. IEEE Transactions on Mobile Computing, 10(1):3–15, 2011. 16

[73] M. Joseph and P. Pandya. Finding response times in a real-time system. The Computer Journal, 29(5):390–395, 1986. 20

[74] C. D. Locke, D. R. Vogel, and T. J. Mesler. Building a predictable avionics platform in Ada: a case study. In Proceedings of the Twelfth Real-Time Systems Symposium, pages 181–189, Dec 1991. 44
[75] L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority inheritance protocols: an approach to real-time synchronization. IEEE Transactions on Computers, 39(9):1175–1185, Sep 1990. 48

[76] N. Audsley. Applying new scheduling theory to static priority pre-emptive scheduling. Software Engineering Journal, 8:284–292(8), September 1993. 48, 49

[77] A. Burns, R. Davis, and S. Punnekkat. Feasibility analysis of fault-tolerant real-time task sets. In Proceedings of the Eighth Euromicro Workshop on Real-Time Systems, pages 29–33, Jun 1996. 51

[78] J. J. Horning, H. C. Lauer, P. M. Melliar-Smith, and B. Randell. A Program Structure for Error Detection and Recovery, pages 53–68. Springer Berlin Heidelberg, Berlin, Heidelberg, 1985. 51

[79] J. Kim, G. Bhatia, R. Rajkumar, and M. Jochim. SAFER: System-level architecture for failure evasion in real-time applications. In 2012 IEEE 33rd Real-Time Systems Symposium, pages 227–236, Dec 2012. 51, 52, 54
[80] Kay Klobedanz, Jan Jatzkowski, Achim Rettberg, and Wolfgang Mueller. Fault-tolerant deployment of real-time software in AUTOSAR ECU networks. In Gunar Schirner, Marcelo Götz, Achim Rettberg, Mauro C. Zanella, and Franz J. Rammig, editors, Embedded Systems: Design, Analysis and Verification, pages 238–249, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. 51

[81] A. Bhat, S. Samii, and R. Rajkumar. Practical task allocation for software fault-tolerance and its implementation in embedded automotive systems. In 2017 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 87–98, April 2017. 52, 54, 57, 59, 93, 159

[82] Robert I. Davis, Alan Burns, Reinder J. Bril, and Johan J. Lukkien. Controller area network (CAN) schedulability analysis: Refuted, revisited and revised. Real-Time Systems, 35(3):239–272, Apr 2007. 52, 59, 153

[83] Chris Urmson, Joshua Anhalt, Drew Bagnell, Christopher Baker, Robert Bittner, M. N. Clark, John Dolan, Dave Duggins, Tugrul Galatali, Chris Geyer, Michele Gittleman, Sam Harbaugh, Martial Hebert, Thomas M. Howard, Sascha Kolski, Alonzo Kelly, Maxim Likhachev, Matt McNaughton, Nick Miller, Kevin Peterson, Brian Pilnick, Raj Rajkumar, Paul Rybski, Bryan Salesky, Young-Woo Seo, Sanjiv Singh, Jarrod Snider, Anthony Stentz, William “Red” Whittaker, Ziv Wolkowicki, Jason Ziglar, Hong Bae, Thomas Brown, Daniel Demitrish, Bakhtiar Litkouhi, Jim Nickolaou, Varsha Sadekar, Wende Zhang, Joshua Struble, Michael Taylor, Michael Darms, and Dave Ferguson. Autonomous Driving in Urban Environments: Boss and the Urban Challenge, pages 1–59. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009. 54
[84] C. Schonfeld. Redundancy approaches in spacecraft computers. In 28th Israel Annual Conference on Aviation and Astronautics, pages 148–156, 1986. 54
[85] Thomas Wolf and Alfred Strohmeier. Fault tolerance by transparent replication for distributed Ada 95. In Michael González Harbour and Juan A. de la Puente, editors, Reliable Software Technologies — Ada-Europe '99, pages 412–424, Berlin, Heidelberg, 1999. Springer Berlin Heidelberg. 54

[86] K. Hashimoto, Tatsuhiro Tsuchiya, and T. Kikuno. Effective scheduling of duplicated tasks for fault tolerance in multiprocessor systems. IEICE Transactions on Information and Systems, E85-D(3):525–534, March 2002. 54
[87] Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg. The primary-backup approach. In Distributed Systems (2nd ed.), pages 199–216. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 1993. 54

[88] KapDae Ahn, Jong Kim, and SungJe Hong. Fault-tolerant real-time scheduling using passive replicas. In Proceedings of the Pacific Rim International Symposium on Fault-Tolerant Systems, pages 98–103, Dec 1997. 54

[89] Dong-Ik Oh and T. P. Baker. Utilization bounds for N-processor rate monotone scheduling with static processor assignment. Real-Time Systems, 15(2):183–192, Sep 1998. 57

[90] Jorge Real and Alfons Crespo. Mode change protocols for real-time systems: A survey and a new proposal. Real-Time Systems, 26(2):161–197, Mar 2004. 65

[91] SAE J3016: Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems. 77
[92] Karthik Lakshmanan, Dionisio de Niz, Ragunathan (Raj) Rajkumar, and Gabriel Moreno. Overload provisioning in mixed-criticality cyber-physical systems. ACM Trans. Embed. Comput. Syst., 11(4):83:1–83:24, January 2013. 85, 86

[93] H. Huang, C. Gill, and C. Lu. Implementation and evaluation of mixed-criticality scheduling approaches for periodic tasks. In 2012 IEEE 18th Real-Time and Embedded Technology and Applications Symposium, pages 23–32, April 2012. 86
[94] D. de Niz, K. Lakshmanan, and R. Rajkumar. On the scheduling of mixed-criticality real-time task sets. In 2009 30th IEEE Real-Time Systems Symposium, pages 291–300, Dec 2009. 86

[95] K. Lakshmanan, D. de Niz, R. Rajkumar, and G. Moreno. Resource allocation in distributed mixed-criticality cyber-physical systems. In 2010 IEEE 30th International Conference on Distributed Computing Systems, pages 169–178, June 2010. 87
[96] Mike Phillips, Venkatraman Narayanan, Sandip Aine, and Maxim Likhachev. Efficient search with an ensemble of heuristics. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 784–791. AAAI Press, 2015. 92, 159

[97] Paul Emberson, Roger Stafford, and Robert I. Davis. Techniques for the synthesis of multiprocessor tasksets. In Proceedings of the 1st International Workshop on Analysis Tools and Methodologies for Embedded and Real-time Systems (WATERS 2010), pages 6–11, 2010. 99, 107
[98] Krzysztof Fleszar and Khalil S. Hindi. New heuristics for one-dimensional bin-packing. Comput. Oper. Res., 29(7):821–839, June 2002. 100

[99] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, June 1953. 101

[100] R. L. Rao and S. S. Iyengar. Bin-packing by simulated annealing. Computers and Mathematics with Applications, 27(5):71–82, 1994. 102, 105
[101] AUTOSAR Adaptive Platform. https://www.autosar.org/standards/adaptive-platform/. 109, 125, 138
[102] Christopher Urmson, Joshua Anhalt, J. Andrew (Drew) Bagnell, Christopher R. Baker, Robert E. Bittner, John M. Dolan, David Duggins, David Ferguson, Tugrul Galatali, Hartmut Geyer, Michele Gittleman, Sam Harbaugh, Martial Hebert, Thomas Howard, Alonzo Kelly, David Kohanbash, Maxim Likhachev, Nick Miller, Kevin Peterson, Raj Rajkumar, Paul Rybski, Bryan Salesky, Sebastian Scherer, Young-Woo Seo, Reid Simmons, Sanjiv Singh, Jarrod M. Snider, Anthony (Tony) Stentz, William (Red) L. Whittaker, and Jason Ziglar. Tartan Racing: A multi-modal approach to the DARPA Urban Challenge. Technical Report CMU-RI-TR-, Pittsburgh, PA, April 2007. 109, 141, 152
[103] R. Rajkumar and M. Gagliardi. High availability in the real-time publisher/subscriber inter-process communication model. In 17th IEEE Real-Time Systems Symposium, pages 136–141, Dec 1996. 111

[104] A. Avizienis et al. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 2004. 113

[105] Robert I. Davis, Alan Burns, Reinder J. Bril, and Johan J. Lukkien. Controller area network (CAN) schedulability analysis: Refuted, revisited and revised. Real-Time Systems, 35(3):239–272, Apr 2007. 118
[106] T. Pop, P. Pop, P. Eles, Z. Peng, and A. Andrei. Timing analysis of the FlexRay communication protocol. In 18th Euromicro Conference on Real-Time Systems (ECRTS'06), July 2006. 118

[107] ArcCore. http://www.arccore.com. 121
[108] AUTOSAR Adaptive Platform standard documentation. https://www.autosar.org/standards/adaptive-platform/adaptive-platform-1803/. 125, 128

[109] SOME/IP. http://some-ip.com/. 126

[110] Object Management Group (OMG). Data Distribution Service for Real-Time Systems. https://www.omg.org/spec/DDS/1.4/. 126
[111] J. R. Seyler, T. Streichert, M. Glaß, N. Navet, and J. Teich. Formal analysis of the startup delay of SOME/IP service discovery. In 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 49–54, 2015. 135

[112] Renesas R-Car H3. https://www.renesas.com/us/en/solutions/automotive/soc/r-car-h3.html. 138

[113] S. Oikawa and R. Rajkumar. Portable RK: a portable resource kernel for guaranteed and enforced timing behavior. In Proceedings of the Fifth IEEE Real-Time Technology and Applications Symposium, pages 111–120, 1999. 146

[114] USDOT, Intelligent Transportation Systems. https://www.its.dot.gov/factsheets/dsrc_factsheet.htm. 151
[115] Reza Azimi. Co-operative Driving at Intersections using Vehicular Networks and Vehicle-Resident Sensing. PhD dissertation, Carnegie Mellon University, 2015. 151

[116] Dionisio de Niz, Raj Rajkumar, and Gaurav Bhatia. Model-based development of embedded systems: The SysWeaver approach. In 12th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'06), pages 231–242, 2006. 156
[117] C. L. Liu and James W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. J. ACM, 20(1):46–61, January 1973. 159