A

r
c
h
i
t
e
c
t
i
n
g

F
a
u
l
t
-
T
o
l
e
r
a
n
t

S
o
f
t
w
a
r
e

S
y
s
t
e
m
s

























H
a
s
a
n

S
ö
z
e
r

Hasan Sözer
Architecting
Fault-Tolerant
Software Systems
ISBN 978-90-365-2788-0
The increasing size and complexity of software systems
makes it hard to prevent or remove all possible faults. Faults
that remain in the system can eventually lead to a system
failure. Fault tolerance techniques are introduced for
enabling systems to recover and continue operation when
they are subject to faults. Many fault tolerance techniques
are available but incorporating them in a system is not
always trivial. In this thesis, we introduce methods and tools
for the application of fault tolerance techniques to increase
the reliability and availability of software systems.
Architecting Fault-Tolerant Software Systems Invitation
to the public defense
of my thesis
Architecting
Fault-Tolerant
Software Systems
on Thursday,
January 29, 2009
at 16:45 in Collegezaal 2
of the Spiegel building at the
University of Twente.
At 16:30 I will give a brief
introduction to the subject of
my thesis.
The defense will be followed
by a reception in the
same building.
Hasan Sözer
Architecting Fault-Tolerant Software Systems
Hasan S¨ozer
Architecting Fault-Tolerant Software Systems
Hasan S¨ozer
Ph.D. di ssertation committee:
Chairman and secretary:
Prof. Dr. Ir. A.J. Mouthaan, University of Twente, The Netherlands
Promoter:
Prof. Dr. Ir. M. Ak¸sit, University of Twente, The Netherlands
Assistant promoter:
Dr. Ir. B. Tekinerdo˘gan, Bilkent University, Turkey
Members:
Dr. Ir. J. Broenink, University of Twente, The Netherlands
Prof. Dr. Ir. A. van Gemund, Delft University of Technology, The Netherlands
Dr. R. de Lemos, University of Kent, United Kingdom
Prof. Dr. A. Romanovsky, Newcastle University, United Kingdom
Prof. Dr. Ir. G. Smit, University of Twente, The Netherlands
CTIT Ph.D. thesis series no. 09-135. Centre for Telematics and Information Tech-
nology (CTIT), P.O. Box 217 - 7500 AE Enschede, The Netherlands.
This work has been carried out as part of the Trader project under the responsibil-
ity of the Embedded Systems Institute. This project is partially supported by the
Dutch Government under the Bsik program. The work in this thesis has been car-
ried out under the auspices of the research school IPA (Institute for Programming
research and Algorithmics).
ISBN 978-90-365-2788-0
ISSN 1381-36-17 (CTIT Ph.D. thesis series no. 09-135)
IPA Dissertation Series 2009-05
Cover design by Hasan S¨ozer
Printed by PrintPartners Ipskamp, Enschede, The Netherlands
Copyright c 2009, Hasan S¨ozer, Enschede, The Netherlands
Architecting Fault-Tolerant Software Systems
DISSERTATION
to obtain
the degree of doctor at the University of Twente,
on the authority of the rector magnificus,
Prof. Dr. H. Brinksma,
on account of the decision of the graduation committee,
to be publicly defended
on Thursday the 29th of January 2009 at 16.45
by
Hasan S¨ozer
born on the 21st of August 1980
in Bursa, Turkey
This dissertation is approved by
Prof. Dr. Ir. M. Ak¸sit (promoter)
Dr. Ir. Bedir Tekinerdo˘gan (assistant promoter)
“The more you know, the more you realize you know nothing.”
- Socrates
Acknowledgements
When I was a M.Sc. student at Bilkent University, I have met with Bedir Tekin-
erdo˘gan. He was a visiting assistant professor there at that time. Towards the end
of my M.Sc. studies, he has notified me about the vacancy for a Ph.D. position at
the University of Twente. He has also recommended me for this position. First of
all, I would like to thank him for the faith he had in me. Following my admission
to this position, he became my daily supervisor and we have been working very
closely thereafter. I have always been impressed by his ability to abstract away key
points out of details and his writing/presentation skills based on a true empathy
towards the intended audience. I would like to thank him for his contributions to
my intellectual growth and for his continuous encouragement, which has been an
important source of motivation for me.
I have carried out my Ph.D. studies at the software engineering group lead by
Mehmet Ak¸sit. We have had regular meetings with him to discuss my progress
and future research directions. In these meetings, I have sometimes been exposed
to challenging critics but always with a positive, optimistic attitude and encourage-
ment. Over the years, I have witnessed his ability to foresee pitfalls and I have been
convinced about the accuracy of his predictions in research. I would like to thank
him for his reliable guidance.
During my studies, I have also had the opportunity to work together with Hichem
Boudali and Mari¨elle Stoelinga from the formal methods group. I have learned a lot
from them and an important part of this thesis (Section 5.10) presents the results
of our collaboration. I would like to thank them for their contribution.
I would like to thank to the members of my Ph.D. committee: Jan Broenink, Ar-
jan van Gemund, Rog´erio de Lemos, Alexander Romanovsky, and Gerard Smit for
spending their valuable time and energy to evaluate my work. Their useful com-
ments enabled me to dramatically improve this thesis.
vii
I would like to thank to the members of the Trader project for their useful feedback
during our regular project meetings. In particular, David Watts, Jozef Hooman and
Teun Hendriks have reviewed my work closely. Ben Pronk brought up the research
direction on local recovery, which later happened to be the main focus of my work.
He has also spent his valuable time to provide us TV domain knowledge together
with Rob Golsteijn. Previously we had several discussions with Iulian Nitescu, Paul
L. Janson and Pierre van de Laar on failure scenarios, fault/error/failure classes and
recovery strategies. These discussions have also directly or indirectly contributed to
this thesis.
The members of the software engineering group have provided me useful feedback
during our regular seminars. I would like to thank them also for the comfortable
working environment I have had. In particular, I thank my roommates over the
years: Joost, Christian and Somayeh. In addition to the Dutch courses provided
by the university, Joost has given me a ‘complementary’ course on Dutch language
and Dutch culture. He has also read and corrected my official Dutch letters, which
would have caused quite some trouble if they were not corrected. Christian and
Somayeh have always been open to give their opinion about any issue I may bring
up and help me if necessary.
I would like to thank Ellen Roberts-Tieke, Joke Lammerink, Elvira Dijkhuis, Hilda
Ferweda and Nathalie van Zetten for their invaluable administrative support.
To be able to finish this work, first of all I had to feel secure and comfortable in my
social environment. In the following, I would like to extend my gratitude to people,
who have provided me such an environment during the last four years.
When I first arrived in Enschede, G¨ urcan was one of the few people I knew at the
university. He has helped me a lot to get acquainted with the new environment. He
has provided me a useful set of survival strategies to deal with never-ending official
procedures. The set of strategies has been later extended for surviving at mountains
and at the military service as well.
I have been sharing an apartment with Espen during the last three years. The
life is a lot easier if you always have a reliable friend around to talk to. Espen is
very effective in killing stress and boosting courage in any circumstance (almost like
alcohol, but almost healthy at the same time). Besides Espen, I had the chance
to meet with several other good friends while I was living at a student house (the
infamous 399) in the campus. I am sure that we will keep in touch in the future, in
one way or another.
Although I have been living abroad, I have also had the chance to meet with many
new Turkish friends during my studies. They have became very close friends of mine
and they have helped me not to feel so much homesick. There are many people to
count in this group and I will shortly refer to them as ‘¨oztwenteliler’. Selim and
Emre are also in this group and I especially thank them for “supporting” me during
my defense.
I would like to thank all the people who have contributed to Tusat. Similarly, I am
grateful to people who have volunteered to work for making our life more social and
enjoyable in the university, for example, members of Esn Twente over the years.
Stichting Kleurrijke Dans has also made my life more colorful lately. In addition,
I would like to thank all my friends who accompany me in recreational trips and
various other activities during the last four years.
I have had endless love and support from my family throughout my life. I thank
foremost my parents for always standing by me regardless of the geographic distance
between us.
Abstract
The increasing size and complexity of software systems makes it hard to prevent or
remove all possible faults. Faults that remain in the system can eventually lead to
a system failure. Fault tolerance techniques are introduced for enabling systems to
recover and continue operation when they are subject to faults. Many fault tolerance
techniques are available but incorporating them in a system is not always trivial. We
consider the following problems in designing a fault-tolerant system. First, existing
reliability analysis techniques generally do not prioritize potential failures from the
end-user perspective and accordingly do not identify sensitivity points of a system.
Second, existing architecture styles are not well-suited for specifying, communicating
and analyzing design decisions that are particularly related to the fault-tolerant
aspects of a system. Third, there are no adequate analysis techniques that evaluate
the impact of fault tolerance techniques on the functional decomposition of software
architecture. Fourth, realizing a fault-tolerant design usually requires a substantial
development and maintenance effort.
To tackle the first problem, we propose a scenario-based software architecture reli-
ability analysis method, called SARAH that benefits from mature reliability engi-
neering techniques (i.e. FMEA, FTA) to provide an early reliability analysis of the
software architecture design. SARAH evaluates potential failures from the end-user
perspective to identify sensitive points of a system without requiring an implemen-
tation.
As a new architectural style, we introduce Recovery Style for specifying fault-tolerant
aspects of software architecture. Recovery Style is used for communicating and
analyzing architectural design decisions and for supporting detailed design with
respect to recovery.
As a solution for the third problem, we propose a systematic method for optimizing
the decomposition of software architecture for local recovery, which is an effective
fault tolerance technique to attain high system availability. To support the method,
we have developed an integrated set of tools that employ optimization techniques,
state-based analytical models (i.e. CTMCs) and dynamic analysis on the system.
xi
The method enables the following: i ) modeling the design space of the possible
decomposition alternatives, ii ) reducing the design space with respect to domain
and stakeholder constraints and iii ) making the desired trade-off between availability
and performance metrics.
To reduce the development and maintenance effort, we propose a framework, FLORA
that supports the decomposition and implementation of software architecture for lo-
cal recovery. The framework provides reusable abstractions for defining recoverable
units and for incorporating the necessary coordination and communication protocols
for recovery.
Contents
1 Introduction 1
1.1 Thesis Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Software architecture reliability analysis using failure scenarios 5
1.3.2 Architectural style for recovery . . . . . . . . . . . . . . . . . 5
1.3.3 Quantitative analysis and optimization of software architec-
ture decomposition for recovery . . . . . . . . . . . . . . . . . 6
1.3.4 Framework for the realization of software architecture recovery
design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background and Definitions 9
2.1 Dependability and Fault Tolerance . . . . . . . . . . . . . . . . . . . 9
2.1.1 Dependability and Related Quality Attributes . . . . . . . . . 10
2.1.2 Dependability Means . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Fault Tolerance and Error Handling . . . . . . . . . . . . . . . 12
2.2 Software Architecture Design and Analysis . . . . . . . . . . . . . . . 14
2.2.1 Software Architecture Descriptions . . . . . . . . . . . . . . . 14
2.2.2 Software Architecture Analysis . . . . . . . . . . . . . . . . . 16
xiii
2.2.3 Architectural Tactics, Patterns and Styles . . . . . . . . . . . 17
3 Scenario-Based Software Architecture Reliability Analysis 19
3.1 Scenario-Based Software Architecture
Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 FMEA and FTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 SARAH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Case Study: Digital TV . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 The Top-Level Process . . . . . . . . . . . . . . . . . . . . . . 26
3.3.3 Software Architecture and Failure Scenario Definition . . . . . 28
3.3.4 Software Architecture Reliability Analysis . . . . . . . . . . . 38
3.3.5 Architectural Adjustment . . . . . . . . . . . . . . . . . . . . 47
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4 Software Architecture Recovery Style 57
4.1 Software Architecture Views and Models . . . . . . . . . . . . . . . . 58
4.2 The Need for a Quality-Based View . . . . . . . . . . . . . . . . . . . 59
4.3 Case Study: MPlayer . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.1 Refactoring MPlayer for Recovery . . . . . . . . . . . . . . . . 61
4.3.2 Documenting the Recovery Design . . . . . . . . . . . . . . . 62
4.4 Recovery Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5 Local Recovery Style . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.6 Using the Recovery Style . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.8 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5 Quantitative Analysis and Optimization of Software Architecture
Decomposition for Recovery 73
5.1 Source Code and Run-time Analysis . . . . . . . . . . . . . . . . . . . 74
5.2 Analysis based on Analytical Models . . . . . . . . . . . . . . . . . . 75
5.2.1 The I/O-IMC formalism . . . . . . . . . . . . . . . . . . . . . 77
5.3 Software Architecture Decomposition for Local Recovery . . . . . . . 78
5.3.1 Design Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.2 Criteria for Selecting Decomposition Alternatives . . . . . . . 80
5.4 The Overall Process and the Analysis Tool . . . . . . . . . . . . . . . 81
5.5 Software Architecture Definition . . . . . . . . . . . . . . . . . . . . . 86
5.6 Constraint Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.7 Design Alternative Generation . . . . . . . . . . . . . . . . . . . . . . 91
5.8 Performance Overhead Analysis . . . . . . . . . . . . . . . . . . . . . 92
5.8.1 Function Dependency Analysis . . . . . . . . . . . . . . . . . 92
5.8.2 Data Dependency Analysis . . . . . . . . . . . . . . . . . . . . 95
5.8.3 Depicting Analysis Results . . . . . . . . . . . . . . . . . . . . 98
5.9 Decomposition Alternative Selection . . . . . . . . . . . . . . . . . . 98
5.10 Availability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.10.1 Modeling Approach . . . . . . . . . . . . . . . . . . . . . . . . 101
5.10.2 Analysis Results . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.11 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.12 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.13 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.14 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.15 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6 Realization of Software Architecture Recovery Design 117
6.1 Requirements for Local Recovery . . . . . . . . . . . . . . . . . . . . 118
6.2 FLORA: A Framework for Local Recovery . . . . . . . . . . . . . . . 118
6.3 Application of FLORA . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7 Conclusion 133
7.1 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.2 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.2.1 Early Reliability Analysis . . . . . . . . . . . . . . . . . . . . 135
7.2.2 Modeling Software Architecture for Recovery . . . . . . . . . . 135
7.2.3 Optimization of Software Decomposition for Recovery . . . . . 136
7.2.4 Realization of Local Recovery . . . . . . . . . . . . . . . . . . 136
7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
A SARAH Fault Tree Set Calculations 139
B I/O-IMC Model Specification and Generation 141
B.1 Example I/O-IMC Models . . . . . . . . . . . . . . . . . . . . . . . . 141
B.1.1 The module I/O-IMC . . . . . . . . . . . . . . . . . . . . . . . 142
B.1.2 The failure interface I/O-IMC . . . . . . . . . . . . . . . . . . 142
B.1.3 The recovery interface I/O-IMC . . . . . . . . . . . . . . . . . 143
B.1.4 The recovery manager I/O-IMC . . . . . . . . . . . . . . . . . 145
B.2 I/O-IMC Model Specification with MIOA . . . . . . . . . . . . . . . . 146
B.3 I/O-IMC Model Generation . . . . . . . . . . . . . . . . . . . . . . . 149
B.4 Composition and Analysis Script Generation . . . . . . . . . . . . . . 150
Bibliography 151
Samenvatting 164
Index 166
Chapter 1
Introduction
A system is said to be reliable [3] if it can continue to provide the correct service,
which implements the required system function. A failure occurs when the delivered
service deviates from the correct service. The system state that leads to a failure is
defined as an error and the cause of an error is called a fault [3]. It becomes harder
to prevent or remove all possible faults in a system as the size and complexity
of software increases. Moreover, the behavior of current systems are affected by an
increasing number of external factors since they are generally integrated in networked
environments, interacting with many systems and users. It may therefore not be
economically and/or technically feasible to implement a fault-free system. As a
consequence, the system needs to be able to tolerate faults to increase its reliability.
Potential failures can be prevented by designing the system to be recoverable from
errors regardless of the faults that cause them. This is the motivation for fault-
tolerant design, which aims at enabling a system to continue operation in case of
an error. By this way, the system can remain available to its users, possibly with
reduced functionality and performance rather than failing completely. In this thesis,
we introduce methods and tools for the application of fault tolerance techniques to
increase the reliability and availability of software systems.
1
2 Chapter 1. Introduction
1.1 Thesis Scope
The work presented in this thesis has been carried out as a part of the TRADER
1
[120]
project. The objective of the project is to develop methods and tools for ensuring
reliability of digital television (DTV) sets. A number of important trends can be
observed in the development of embedded systems like DTVs. First, due to the high
industrial competition and the advances in hardware and software technology, there
is a continuous demand for products with more functionality. Second, the imple-
mentation of functionality is shifting from hardware to software. Third, products
are not solely developed by just one manufacturer only but it is host to multiple
parties. Finally, embedded systems are more and more integrated in networked
environments that affect these systems in ways that might not have been foreseen
during their construction. Altogether, these trends increase the size and complexity
of software in embedded systems and as such make software faults a primary threat
for reliability.
For a long period, reliability and fault-tolerant aspects of embedded systems have
been basically addressed at hardware level or source code. However, in face of the
current trends it has now been recognized that reliability analysis should focus more
on software components. In addition, incorporation of some fault tolerance tech-
niques should be considered at a higher abstraction level than source code. It must
be ensured that the design of the software architecture supports the application
of necessary fault tolerance techniques. Software architecture represents the gross-
level structure of the system that directly influences the subsequent analysis, design
and implementation. Hence, it is important to evaluate the impact of fault tolerance
techniques on the software architecture. By this way, the quality of the system can
be assessed before realizing the fault-tolerant design. This is essential to identify
potential risks and avoid costly redesigns and reimplementations.
Different type of fault tolerance techniques are employed in different application
domains depending on their requirements. For safety-critical systems, such as the
ones used in nuclear power plants and airplanes, safety is the primary concern and
any failure that can cause harm to people and environment must be prevented. In
that context, the additional cost of the fault-tolerant design due to the required
hardware/software resources is a minor issue. In the TRADER project [120], we
have focused on the consumer electronics domain, in particular, DTV systems. For
such systems, which are probably less subjected to catastrophic failures, the cost
and the perception of the user turn out to be the primary concerns, instead.
These systems are very cost-sensitive and failures that are not directly perceived
by the user can be accepted to some extent, whereas failures that can be directly
1
TRADER stands for Television Related Architecture and Design to Enhance Reliability.
Chapter 1. Introduction 3
observed by the user require a special attention. In this context, the fault tolerance
techniques to be applied must be evaluated with respect to their additional costs
and their effectiveness based on the user perception.
1.2 Motivation
Traditional software fault tolerance techniques are mostly based on design diversity
and replication because software failures are generally caused by design and coding
faults, which are permanent [58]. So, the erroneous system state that is caused by
such faults has also been assumed to be permanent. On the other hand, software
systems are also exposed to so-called transient faults. These faults are mostly ac-
tivated by timing issues and peak conditions in workload that could not have been
anticipated before. Errors that are caused by such faults are likely to be resolved
when the software is re-executed after a clean-up and initialization [58]. As a re-
sult, it is possible to design a system that can recover from a significant fraction of
errors [16] without replication and design diversity and as such without requiring
substantial hardware/software resources [58]. Many such fault tolerance techniques
are available but developing a fault-tolerant system is not always trivial.
First of all, we need to know which fault tolerance techniques to select and where
in the system to apply the selected set of techniques. Due to the cost-sensitivity
of consumer electronics products like DTVs, it is not feasible to design a system
that can tolerate all the faults of each of its elements. Thus, we need to analyze
potential failures and prioritize them based on user perception. Accordingly, we
should identify sensitive elements of the system, whose failures might cause the
most critical system failures. The set of fault tolerance techniques should be then
selected based on the type of faults activated by the identified sensitive elements.
Existing reliability analysis techniques do not generally prioritize potential failures
from the end-user perspective and they do not identify sensitive points of a system
accordingly.
After we analyze the system and select the set of fault tolerance techniques accord-
ingly, we should adapt the software architecture description to specify the fault-
tolerant design. The software architecture of a system is usually described using
more than one architectural view. Each view supports the modeling, understand-
ing, communication and analysis of the software architecture for different concerns.
This is because current software systems are too complex to represent all the con-
cerns in one model. An analysis of the current practice for representing architectural
views reveals that they focus mainly on functional concerns and are not well-suited
for communicating and analyzing design decisions that are particularly related to
4 Chapter 1. Introduction
the fault-tolerant aspects of a system. New architectural styles are required to be
able to document the software architecture from the fault tolerance point of view.
Fault tolerance techniques may influence the decomposition of software architec-
ture. Local recovery is such a fault tolerance technique, which aims at making the
system ready for correct service as much as possible, and as such attaining high
system availability [3]. For achieving local recovery the architecture needs to be
decomposed into separate units (i.e. recoverable units) that can be recovered in iso-
lation. Usually there are many different alternative ways to decompose the system
for local recovery. Increasing the number of recoverable units can provide higher
availability. However, this will also introduce an additional performance overhead
since more modules will be isolated from each other. On the other hand, keeping
the modules together in one recoverable unit will increase the performance, but will
result in a lower availability since the failure of one module will affect the others as
well. As a result, for selecting a decomposition alternative we have to cope with a
trade-off between availability and performance. There are no adequate integrated
set of analysis techniques to directly support this trade-off analysis, which requires
optimization techniques, construction and analysis of quality models, and analysis
of the existing code base to automatically derive dependencies between modules of
the system. We need the utilization and integration of several analysis techniques
to optimize the decomposition of software architecture for recovery.
The optimal decomposition for recovery is usually not aligned with the existing
decomposition of the system. As a result, the realization of local recovery, while
preserving the existing decomposition, is not trivial and requires a substantial de-
velopment and maintenance effort [26]. Developers need to be supported for the
implementation of the selected recovery design.
Accordingly, this thesis provides software architecture modeling, analysis and real-
ization techniques to improve the reliability and availability of software systems by
introducing fault tolerance techniques.
1.3 The Approach
In the following subsections, we summarize the approaches that we have taken for
supporting the design and implementation of fault-tolerant software systems. The
overall goal is to employ fault tolerance techniques that can mainly tolerate transient
faults and as such to improve the reliability and availability of cost-sensitive systems
from the user point of view.
Chapter 1. Introduction 5
1.3.1 Software architecture reliability analysis using failure
scenarios
Our first approach aims at analyzing the potential failures and the sensitivity points
at the software architecture design phase before the fault-tolerant design is im-
plemented. Since implementing the software architecture is a costly process, it is
important to predict the quality of the system and identify potential risks, before
committing enormous organizational resources [31]. Similarly, it is of importance
to analyze the hazards that can lead to failures and to analyze their impact on the
reliability of the system before we select and implement fault tolerance techniques.
For this purpose, we introduce a software architecture reliability analysis approach (
SARAH) that benefits from mature reliability engineering techniques and scenario-
based software architecture analysis to provide an early software reliability analysis.
SARAH defines the notion of failure scenario model that is based on the Failure
Modes and Effects Analysis method (FMEA) in the reliability engineering domain.
The failure scenario model is applied to represent so-called failure scenarios that
define a Fault Tree Set (FTS). FTS is used for providing a severity analysis for
the overall software architecture and the individual architectural elements. Despite
conventional reliability analysis techniques which prioritize failures based on criteria
such as safety concerns, in SARAH failure scenarios are prioritized based on sever-
ity from the end-user perspective. The analysis results can be used for identifying
so-called architectural tactics [5] to improve the reliability. Hereby, architectural
tactics form building blocks of design patterns for fault tolerance.
1.3.2 Architectural style for recovery
Once we have selected the appropriate fault tolerance techniques and related archi-
tectural tactics, they should be incorporated into the existing software architecture.
Introduction of fault tolerance mechanisms usually requires dedicated architectural
elements and relations that impact the software architecture decomposition. Our
second approach aims at modeling the resulting decomposition explicitly by pro-
viding a practical and easy-to-use method to document the software architecture
from a recovery point of view. For this purpose, we introduce the recovery style for
modeling the structure of the system related to the recovery concern. It is used for
communicating and analyzing architectural design decisions and supporting detailed
design with respect to recovery. The recovery style considers recoverable units as
first class architectural elements, which represent the units of isolation, error con-
tainment and recovery control. The style defines basic relations for coordination
and application of recovery actions. As a further specialization of the recovery style,
6 Chapter 1. Introduction
the local recovery style is provided, which is used for documenting a local recovery
design including the decomposition of software architecture into recoverable units
and the way that these units are controlled.
1.3.3 Quantitative analysis and optimization of software ar-
chitecture decomposition for recovery
To introduce local recovery to the system, first we need to select a decomposition
among many alternatives. We propose a systematic approach dedicated to opti-
mizing the decomposition of software architecture for local recovery. To support
the approach, we have developed an integrated set of tools that employ i ) dynamic
program analysis to estimate the performance overhead introduced by different de-
composition alternatives, ii ) state-based analytical models (i.e. CTMCs) to estimate
the availability achieved by different decomposition alternatives, and iii ) optimiza-
tion techniques for automatic evaluation of decomposition alternatives with respect
to performance and availability metrics. The approach enables the following.
• modeling the design space of the possible decomposition alternatives
• reducing the design space with respect to domain and stakeholder constraints
• making the desired trade-off between availability and performance metrics
With this approach, the designer can systematically evaluate and compare decom-
position alternatives, and select an optimal decomposition.
1.3.4 Framework for the realization of software architecture
recovery design
After the optimal decomposition for recovery is selected, the software architecture
should be partitioned accordingly. In addition, new supplementary architectural el-
ements and relations should be implemented to enable local recovery. To reduce the
resulting development and maintenance efforts we introduce a framework, FLORA
that supports the decomposition and implementation of software architecture for
local recovery. The framework provides reusable abstractions for defining recover-
able units and the necessary coordination and communication protocols for recov-
ery. Using our framework, we have introduced local recovery to the open-source
media player called MPlayer for several decomposition alternatives. We have then
Chapter 1. Introduction 7
performed measurements on these implementations to validate the results of our
analysis approaches.
1.4 Thesis Overview
The thesis is organized as follows.
Chapter 2 provides background information and a set of definitions that is used
throughout this thesis. It introduces the basic concepts of reliability, fault tolerance
and software architectures.
Chapter 3 presents the software architecture reliability analysis method (SARAH).
SARAH is a scenario-based analysis method, which aims at providing an early eval-
uation and feedback at the architecture design phase. It utilizes mature reliability
engineering techniques to prioritize failure scenarios from the user perspective and
identifying sensitive elements of the architecture accordingly. The output of SARAH
can be utilized as an input by the techniques that are introduced in Chapter 4 and
Chapter 5. This chapter is a revised version of the work described in [111], [117],
and [118].
Chapter 4 introduces a new architectural style, called Recovery Style for modeling
the structure of the software architecture that is related to the fault tolerance prop-
erties of a system. The style is used for communicating and analyzing architectural
design decisions and supporting detailed design with respect to recovery. This style
is used in Chapter 6 to represent the designs to be realized. It also supports the
understanding of Chapter 5. This chapter is a revised version of the work described
in [110].
Chapter 5 proposes a systematic approach dedicated to optimizing the decompo-
sition of software architecture for local recovery. In this chapter, we explain several
analysis techniques and tools that employ dynamic analysis, analytical models and
optimization techniques. These are all integrated to support the approach.
Chapter 6 presents the framework FLORA that supports the decomposition and
implementation of software architecture for local recovery. The framework provides
reusable abstractions for defining recoverable units and the necessary coordination
and communication protocols for recovery. This chapter is a revised and extended
version of the work described in [112].
Chapter 7 provides our conclusions. The evaluations, discussions and related work
for the particular contributions are provided in the corresponding chapters.
8 Chapter 1. Introduction
An overview of the main chapters is depicted in Figure 1.1. Hereby, the rounded
rectangles represent the chapters of the thesis. The solid arrows represent the rec-
ommended reading order of these chapters. After reading Chapter 2, the reader
can immediately start reading Chapter 3, 4 or 5. Chapter 4 should be read be-
fore Chapter 6. All the other chapters are self-contained. All the chapters provide
complementary work and the works that are presented through chapters 4 to 6 are
directly related.
KEY
dependent
chapter
recommended
reading flow
Chapter 3.
Scenario-Based
Software Architecture
Reliability Analysis
Chapter 4.
Software Architecture
Recovery Style
Chapter 5.
Quantitative Analysis
and Optimization of
Software Architecture
Decomposition for
Recovery
Chapter 6.
Realization of
Software Architecture
Recovery Design
self-contained
chapter
dependency
Figure 1.1: The main chapters of the thesis and the recommended reading order.
Chapter 2
Background and Definitions
In our work, we utilize concepts and techniques from both the areas of dependability
and software architectures. In this chapter, we provide background information on
these two areas and we introduce a set of definitions that will be used throughout
the thesis.
2.1 Dependability and Fault Tolerance
Dependability is the ability of a system to deliver service that can justifiably be
trusted [3]. It is an integrative concept that encompasses several quality attributes
including reliability, availability, safety, integrity and maintainability. A system is
considered to be dependable if it can avoid failures (service failures) that are more
frequent or more severe than is acceptable [3]. A failure occurs when the delivered
service of a system deviates from the required system function [3]. An error is
defined as the system state that is liable to lead to a failure and the cause of an
error is called a fault [3]. Figure 2.1 depicts the fundamental chain of these concepts
that leads to a failure. As an example, assume that a software developer allocates an
insufficient amount of memory for an input buffer. This is the fault. At some point
during the execution of the software, the size of the incoming data overflows this
buffer. This is the error. As a result, the operating system kills the corresponding
process and the user observes that the software crashes. This is the failure.
Fault Error Failure
activation propagation
Figure 2.1: The fundamental chain of dependability threats leading to a failure
9
10 Chapter 2. Background and Definitions
Figure 2.1 shows the simplest possible chain of dependability threats. Usually, there
are multiple errors involved in the chain, where an error propagates to other errors
and finally leads to a failure [3].
2.1.1 Dependability and Related Quality Attributes
Dependability encompasses the following quality attributes [3]:
• reliability: continuity of correct service.
• availability: readiness for correct service.
• safety: absence of catastrophic consequences on the user(s) and the environ-
ment.
• integrity: absence of improper system alterations.
• maintainability: ability to undergo modifications and repairs.
Depending on the application domain, different emphasis might be put on different
attributes. In this thesis, we have considered reliability and availability, whereas
safety, integrity and maintainability are out of the scope of our work. Reliability
and availability are very important quality attributes in the context of fault-tolerant
systems and they are closely related. Reliability is the ability of a system to perform
its required functions under stated conditions for a specified period of time. That is
the ability of a system to function without a failure. Availability is the proportion of
time, where a system is in a functioning condition. Ideally, a fault-tolerant system
can recover from errors before any failure is observed by the user (i.e. reliability).
However, this is practically not always possible and the system can be unavailable
during its recovery. After a fault is activated, a fault-tolerant system must become
operational again as soon as possible to increase its availability.
Even if there are no faults activated, fault tolerance techniques introduce a perfor-
mance overhead during the operational time of the system. The overhead can be
caused, for instance, by monitoring of the system for error detection, collecting and
logging system traces for diagnosis, saving data for recovery and wrapping system
elements for isolation. For this reason, in addition to reliability and availability, we
consider performance as a relevant quality attribute in this work although it is not
a dependability attribute. Performance is defined as the degree to which a system
accomplishes its designated functions within given constraints, such as speed and
accuracy [60].
Chapter 2. Background and Definitions 11
In traditional reliability and availability analysis, a system is assumed to be either
up and running, or it is not. However, some fault-tolerant systems can also be par-
tially available (also known as performance degradation). This fact is taken into
account by performability [52], which combines reliability, availability and perfor-
mance quality attributes. The quantification of performability rewards the system
for the performance that is delivered not only during its normal operational time
but also during partial failures and their recovery [52]. Essentially, it measures how
well the system performs in the presence of failures over a specified period of time.
2.1.2 Dependability Means
To prevent a failure, the chain of dependability threats as shown in Figure 2.1 must
be broken. This is possible through i ) preventing occurrence of faults, ii ) removing
existing faults, or iii ) tolerating faults. In the last approach, we accept that faults
may occur but we deal with their consequences before they lead to a failure, if
possible. Error detection is the first necessary step for fault tolerance. In addition,
detected errors must be recovered. Based on [3], Figure 2.2 depicts the dependability
means and the features of fault tolerance as a simple feature diagram.
Dependability
Means
Fault Prevention Fault Tolerance Fault Removal
Error Detection Recovery
KEY
feature
optional
sub-feature
mandatory
sub-feature
Fault Forecasting
Error Handling Fault Handling
Figure 2.2: Basic features of dependability means and fault tolerance
A feature diagram is a tree, where the root element represents the domain or concept.
12 Chapter 2. Background and Definitions
The other nodes represent its features. The solid circles indicate mandatory features
(i.e. the feature is required if its parent feature is selected). The empty circles
indicate optional features. Mandatory features that are connected through an arc
decorated edge (See Figure 2.3) are alternatives to each other (i.e. exactly one of
them is required if their parent feature is selected).
In addition to fault prevention, fault removal and fault tolerance, fault forecasting
is also included in Figure 2.2 as a dependability means. Fault forecasting aims at
evaluating the system behavior with respect to fault occurrence or activation [3].
Figure 2.2 also shows the two mandatory features of fault tolerance: error detection
and recovery. Recovery has one mandatory feature, error handling, which eliminates
errors from the system state [3]. The fault handling feature prevents faults from
being activated again. This requires further features such as diagnosis, which reveals
and localizes the cause(s) of error(s) [3]. Diagnosis can also enable a more effective
error handling. If the cause of the error is localized, the recovery procedure can take
actions concerning the associated components without impacting the other parts of
the system and related system functionality.
2.1.3 Fault Tolerance and Error Handling
When faults manifest themselves during system operations, fault tolerance tech-
niques provide the necessary mechanisms to detect and recover from errors, if pos-
sible, before they propagate and cause a system failure. Error recovery is generally
defined as the action, with which the system is set to a correct state from an er-
roneous state [3]. The domain of recovery is quite broad due to different type of
faults (e.g. transient, permanent) to be tolerated and different requirements (e.g.
cost-effectiveness, high performance, high availability) imposed by different type of
systems (e.g. safety-critical systems, consumer electronics). Figure 2.3 shows a
partial view of error handling features for fault tolerance. We have derived the fea-
tures of the recovery domain through a domain analysis based on the corresponding
literature [3, 29, 36, 58]
Chapter 2. Background and Definitions 13
error handling
compensation forward recovery backward recovery
stable
storage
log-based check-pointing
coordinated uncoordinated communication
induced
blocking non-blocking pessimistic optimistic causal
replication graceful
degradation
KEY
feature
optional
sub-feature
mandatory
sub-feature
alternative
sub-features
exception
handling
N-version
programming
recovery
blocks
Figure 2.3: A partial view of error handling features for fault tolerance
As shown in Figure 2.3, error handling can be organized into three categories; com-
pensation, backward recovery and forward recovery [3]. Compensation means that
the system continues to operate without any loss of function or data in case of an
error. This requires replication of system functionality and data. N-version program-
ming is a compensation technique, where N independently developed functionally
equivalent versions of a software are executed in parallel. All the outputs of these
versions are compared to determine the correct, or best output, if one exists [81].
Backward recovery (i.e. rollback) puts the system in a previous state, which was
known to be error free. Recovery blocks approach uses multiple versions of a soft-
ware for backward recovery. After the execution of the first version, the output is
tested. If the output is not acceptable, the state of the system is rolled back to the
state before the first version is executed. Similarly, several versions are executed
and tested sequentially until the output is acceptable [81]. The system fails if no
acceptable output is obtained after all the versions are tried. Restarting a system
is also an example for backward recovery, in which the system is put back to its
initial state. Backward recovery can employ different features for saving data to be
restored after recovery (i.e. check-pointing) or for saving messages and events to
replay them after recovery (i.e. log-based recovery). In either case, a stable storage
is required to store recovery-related information (data, messages etc.). Forward re-
covery (i.e. rollforward) puts the system in a new state to recover from an error.
14 Chapter 2. Background and Definitions
Exception handling is an example forward recovery technique, where the execution is
transfered to the corresponding handler when an exception occurs. Graceful degra-
dation [107] is a forward recovery approach that puts the system in a state with
reduced functionality and performance.
Different error handling features can be utilized based on the fault assumptions and
system characteristics. The granularity of the error handling in recovery can differ
as well. In the case of global recovery, the recovery mechanism can take actions on
the system as a whole (e.g. restart the whole system). In the case of local recovery,
erroneous parts can be isolated and recovered while the rest of the system is available.
Thus, a system with local recovery can provide a higher system availability to its
users in case of component failures.
2.2 Software Architecture Design and Analysis
A software architecture for a program or computing system consists of the struc-
ture or structures of that system, which comprise elements, the externally visible
properties of those elements, and the relationships among them [5].
Software architecture represents a common abstraction of a system [5] and as such
it forms a basis for mutual understanding and communication among architects,
developers, system engineers and anybody who has an interest in the construction
of the software system. As one of the earliest artifact of the software development
life cycle, software architecture embodies early design decisions, which impacts the
system’s detailed design, implementation, deployment and maintenance. Hence, it
must be carefully documented and analyzed. Software architecture also promotes
large-scale reuse by transferring architectural models across systems that exhibit
common quality attributes and functional requirements [5].
In the following subsections, we introduce basic techniques and concepts that are
used for i ) describing architectures ii ) analyzing quality properties of an architecture
and iii ) achieving or supporting qualities in the architecture design.
2.2.1 Software Architecture Descriptions
An architecture description is a collection of documents to describe a system’s archi-
tecture [78]. The IEEE 1471 standard [78] is a recommended practice for architec-
tural description of software-intensive systems. It introduces a set of concepts and
relations among them as depicted in Figure 2.4 with a UML (The Unified Modeling
Chapter 2. Background and Definitions 15
Language) [100] diagram. Hereby, the key concepts are marked with bold rectangles
and two types of relations are defined: association and aggregation. Associations are
labeled with a role and cardinality. For example, Figure 2.4 shows that a concern is
important to (the role) 1 or more (the cardinality) stakeholders and a stakeholder
has 1 or more concerns. Aggregations are identified with a diamond shape at the
end of an edge and they represent part-whole relationships. For example, Figure 2.4
shows that a view is a part of an architectural description.
KEY concept
association
<role>
<cardinality>
aggregation
<role>
<cardinality>
Figure 2.4: Basic concepts of architecture description (IEEE 1471 [78])
The people or organizations that are interested in the construction of the software
system are called stakeholders. These might include, for instance, end users, ar-
chitects, developers, system engineers and maintainers. A concern is an interest,
which pertain to the systems development, its operation or any other aspects that
are critical or otherwise important to one or more stakeholders. Stakeholders may
have different, possibly conflicting, concerns that they wish the system to provide
or optimize. These might include, for instance, certain run-time behavior, perfor-
mance, reliability and evolvability. A view is a representation of the whole system
from the perspective of a related set of concerns. A viewpoint is a specification of
16 Chapter 2. Background and Definitions
the conventions for constructing and using a view.
To summarize the key concepts as described in the IEEE 1471 framework:
• A system has an architecture.
• An architecture is described by one or more architectural descriptions.
• An architectural description selects one or more viewpoints.
• A viewpoint covers one or more concerns of stakeholders.
• A view conforms to a viewpoint and it consists of a set of models that represent
one aspect of an entire system.
This conceptual framework provides a set of definitions for key terms and outlines
the content requirements for describing a software architecture. However, it does
not standardize or put restrictions to how an architecture is designed and how its
description is produced. There are several software design processes and architecture
design methods proposed in the literature, such as, the Rational Unified Process [64],
Attribute-Driven Design [5] and Synthesis-Based Software Architecture Design [115].
Software architecture design processes and methods are out of cope of this thesis.
Another issue that is not standardized by IEEE 1471 [78] is the notation or for-
mat that is used for describing architectures. UML [103] is an example standard
notation for modeling object-oriented designs, which can be utilized for describing
architectures as well. Similarly, several architecture description languages (ADLs)
have been introduced as modeling notations to support architecture-based develop-
ment. There have been both general-purpose and domain-specific ADLs proposed.
Some ADLs are designed to have a simple, understandable, and possibly graphical
syntax, but not necessarily formally defined semantics. Some other ADLs encompass
formal syntax and semantics, supported by powerful analysis tools, model checkers,
parsers, compilers and code synthesis tools [82].
2.2.2 Software Architecture Analysis
Software architecture forms one of the key artifacts in software development life
cycle since it embodies early design decisions. Accordingly, it is important that
the architecture design supports the required qualities of a software system. Soft-
ware architecture analysis helps to predict the risks and the quality of a system
before it is built, thereby reducing unnecessary maintenance costs. On the other
hand, usually it is also necessary to evaluate the architecture of a legacy system if
Chapter 2. Background and Definitions 17
it is subject to major modification, porting or integration with other systems [5].
Software architecture constitutes an abstraction of the system, which enables to
suppress the unnecessary details and focus only on the relevant aspects for analysis.
Basically, there are two complementary software architecture analysis techniques:
i ) questioning techniques and ii ) measuring techniques [5].
Questioning techniques use scenarios, questionnaires and check-lists to review how
the architecture responds to various situations [19]. Most of the software architec-
ture analysis methods that are based on questioning techniques use scenarios for
evaluating architectures [31]. These methods take as input the architecture design
and estimate the impact of predefined scenarios on it to identify the potential risks
and the sensitivity points of the architecture. Questioning techniques sometimes
employ measurements as well but these are mostly intuitive estimations relying on
hypothetical models without formal and detailed semantics.
Measuring techniques use architectural metrics, simulations and static analysis of
formal architectural models [82] to provide quantitative measures of qualities such
as performance and availability. The type of analysis depends on the underlying
semantic model of the ADL, where usually a quality model is applied, such as
queuing networks [22].
In general, measuring techniques provide more objective results compared to ques-
tioning techniques. As a drawback, they require the presence of a working artifact
(e.g. a prototype implementation, a model with enough semantics) for measure-
ment. On the other hand, questioning techniques can be applied on hypothetical
architectures much earlier in the life cycle [5]. However, (possibly quantitative) re-
sults of questioning techniques are inherently subjective. In this thesis, we explore
both analysis approaches. In Chapter 3, we present SARAH, which is an analy-
sis method based on questioning techniques. In Chapter 5, we present an analysis
approach based on measuring techniques.
2.2.3 Architectural Tactics, Patterns and Styles
A software architect makes a wide range of design decisions, while designing the
software architecture of a system. Depending on the application domain, many of
these design decisions are made to provide required functionalities. On the other
hand, there are also several design decisions made for supporting a desired qual-
ity attribute (e.g. to use redundancy for providing fault tolerance and in turn to
increase system dependability). Such architectural decisions are characterized as
architectural tactics [4]. Architectural tactics are viewed as basic design decisions
and building blocks of patterns and styles [5].
18 Chapter 2. Background and Definitions
An architectural pattern is a description of element and relation types together with
a set of constraints on how they may be used [5]. The term architectural style
is also used for describing the same concept. Similar to Object-Oriented design
patterns [41], architectural patterns/styles provide a common design vocabulary
(e.g. clients and servers, pipes and filters, etc.) [12]. They capture recurring idioms,
which constrain the design of the system to support certain qualities [109]. Many
styles are also equipped with semantic models, analysis tools and methods that
enable style-specific analysis and property checks.
Chapter 3
Scenario-Based Software
Architecture Reliability Analysis
To select and apply appropriate fault tolerance techniques, we need to analyze po-
tential system failures and identify architectural elements that cause system failures.
In this chapter, we propose the Software Architecture Reliability Analysis Approach
(SARAH), which prioritizes failure scenarios based on user perception and provides
an early software reliability analysis of the architecture design. It is a scenario-
based software architecture analysis method that benefits from mature reliability
engineering techniques, FMEA and FTA.
The chapter is organized as follows. In the following two sections, we introduce
background information on scenario-based software architecture analysis, FMEA
and FTA. In section 3.3, we present SARAH and illustrate it for analyzing reliability
of the software architecture of the next release of a Digital TV. We conclude the
chapter after discussing lessons learned and related work in sections 3.4 and 3.5,
respectively.
19
20 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
3.1 Scenario-Based Software Architecture
Analysis
Scenario-based software architecture analysis methods take as input a model of the
architecture design and measure the impact of predefined scenarios on it to identify
the potential risks and the sensitivity points of the architecture [31]. Different
analysis methods use different type of scenarios (e.g. usage scenarios [19], change
scenarios [7]) depending on the quality attributes that they focus on. Some methods
define scenarios just as brief descriptions, while some other methods define them in
a more structured way with annotations [19].
Software Architecture Analysis Method (SAAM) can be considered as the first
scenario-based architecture analysis method. It is simple, practical and a mature
method, which has been validated in various cases studies [19]. Most of the other
scenario-based analysis methods are proposed as extensions to SAAM or in some
way they adopt the concepts used in this method [31]. The basic activities of SAAM
are illustrated with a UML [100] activity diagram in Figure 3.1. The filled circle
is the starting point and the filled circle with a border is the ending point. The
rounded rectangles represent activities and arrows (i.e. flows) represent transitions
between activities. The beginning of parallel activities are denoted with a black bar
with one flow going into it and several leaving it. In the following, SAAM activities
as depicted in Figure 3.1 are explained.
• Describe architectures: The candidate architecture designs are described, which
include the systems’ computation/data components and their relationships.
• Define scenarios: Scenarios are developed for stakeholders to illustrate the
kinds of activities the system must support (usage scenarios) and the antici-
pated changes that will be made to the system over time (change scenarios).
• Classify/Prioritize scenarios: Scenarios are prioritized according to their im-
portance as defined by the stakeholders.
• Individually evaluate indirect scenarios: Scenarios that can be directly sup-
ported by the architecture are called direct scenarios. Scenarios that require
the redesign of the architecture are called indirect scenarios. The required
changes for the architecture in case of indirect scenarios are attributed to the
fact that the architecture has not been appropriately designed to meet the
given requirements. For each indirect scenario the required changes to the
architecture are listed and the cost for performing these changes is estimated.
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 21
Describe Architecture(s) Define Scenarios
Classify/Prioritize Scenarios
Ìndividually Evaluate
Ìndirect Scenarios
Assess Scenario Ìnteraction
Create Overall Evaluation
Figure 3.1: SAAM activities [19]
• Assess scenario interaction: Determining scenario interaction is a process of
identifying scenarios that affect a common set of components. Scenario inter-
action measures the extent to which the architecture supports an appropriate
separation of concerns. Semantically close scenarios should interact at the
same component. Semantically distinct scenarios that interact point out a
wrong decomposition.
• Create overall evaluation: Each scenario and the scenario interactions are
weighted in terms of their relative importance and this weighting determines
an overall ranking.
SAAM was originally developed to analyze the modifiability of an architecture [19].
Later, numerous scenario-based architecture analysis methods have been developed
each focusing on a particular quality attribute or attributes [31]. For example,
SAAMCS [75] has focused on analyzing complexity of an architecture, SAAMER [77]
on reusability and evolution, ALPSM [8] on maintainability, ALMA [7] on modifia-
bility, ESAAMI [84] on reuse from existing component libraries, and ASAAM [116]
on identifying aspects for increasing maintainability.
Hereby, it is implicitly assumed that scenarios correspond to the particular qual-
ity attributes that need to be analyzed. Some methods such as SAEM [33] and
ATAM [19] have considered the need for a specific quality model for deriving the
22 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
corresponding scenarios. ATAM has also addressed the interactions among multiple
quality attributes and trade-off issues emerging from these interactions.
In this chapter, we propose a reliability analysis method, which uses failure scenarios
for analysis. We define a failure scenario model that is based on the established
Failure Modes and Effects Analysis method (FMEA) in the reliability engineering
domain as explained in the next subsection.
3.2 FMEA and FTA
Failure Modes and Effects Analysis method (FMEA) [99] is a well-known and mature
reliability analysis method for eliciting and evaluating potential risks of a system
systematically. The basic operations of the method are i ) to question the ways that
each component fails (failure modes) and ii ) to consider the reaction of the system
to these failures (effects). The analysis results are organized by means of a work-
sheet, which comprises information about each failure mode, its causes, its local and
global effects (concerning other parts of the product and the environment) and the
associated component. Failure Modes, Effects and Criticality Analysis (FMECA)
extends FMEA with severity and probability assessments of failure occurrence. A
simplified FMECA worksheet template is presented in Figure 3.2.







System: Car Engine
Date: 10-10-2000
Compiled by: J. Smith
Approved by: D. Green
ID Item ID
Failure
Mode
Failure Causes Failure Effects
Severity
Class
1
CE5 fails to
operate
Motor shorted Motor overheats
and burns
V
2 … … … … …

Figure 3.2: An example FMECA worksheet based on MIL-STD-1629A [30]
In FMECA, 6 attributes of a failure scenario are identified; failure id, related com-
ponent, failure mode, failure cause, failure effect and severity. A failure mode is
defined as the manner in which the element fails. A failure cause is the possible
cause of a failure mode. A failure effect is the (undesirable) consequence of a failure
mode. Severity is associated with the cost of repair.
FMEA and FMECA can be employed for risk assessment and for discovering poten-
tial single-point failures. Systematic analysis increases the insight in the system and
the analysis results can be used for guiding the design, its evaluation and improve-
ment. At the downside, the analysis is subjective [97]. Some components failure
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 23
modes can be overlooked and some information (e.g. failure probability, severity)
regarding the failure modes can be incorrectly estimated at early design phases.
Since these techniques focus on individual components at a time, combined effects
and coordination failures can also be missed. In addition, the analysis is effort and
time consuming.
FMEA is usually applied together with Fault Tree Analysis (FTA) [34]. FTA is
based on a graphical model, fault tree, which defines causal relationships between
faults. An example fault tree can be seen in Figure 3.3.

Figure 3.3: An example fault tree
The top node (i.e. root) of the fault tree represents the system failure and the leaf
nodes represent faults. Faults, which are assumed to be provided, are defined as
undesirable system states or events that can lead to a system failure. The nodes
of the fault tree are interconnected with logical connectors (e.g. AND, OR gates)
that infer propagation and contribution of faults to the failure. Once the fault
tree is constructed, it can be processed in a bottom-up manner to calculate the
probability that a failure would take place. This calculation is done based on the
probabilities of fault occurrences and interconnections between the faults and the
failure [34]. Additionally, the tree can be processed in a top-down manner for
diagnosis to determine the potential faults that may cause the failure.
24 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
3.3 SARAH
We propose the Software Architecture Reliability Analysis (SARAH) approach that
benefits from both reliability analysis and scenario-based software architecture anal-
ysis to provide an early reliability analysis of next product releases. SARAH defines
the notion of failure scenario model that is inspired from FMEA. Failure scenarios
define potential component failures in the software system and they are used for
deriving a fault tree set (FTS). Similar to a fault tree in FTA, FTS shows the causal
and logical connections among the failure scenarios.
To a large extent SARAH integrates the best practices of the conventional and
stable reliability analysis techniques with the scenario-based software architecture
analysis approaches. Besides this, SARAH provides another distinguishing property
by focusing on user perceived reliability. Conventional reliability analysis techniques
prioritize failures according to how serious their consequences are with respect to
safety. In SARAH, the prioritization and analysis of failure scenarios are based on
user perception [28]. The structure of FTS and related analysis techniques are also
adapted accordingly.
SARAH results in a failure analysis report that defines the sensitive elements of the
architecture and provides information on the type of failures that might frequently
happen. The reliability analysis forms the key input to identify architectural tactics
([4]) for adjusting the architecture and improving its dependability, which forms the
last phase in SARAH. The approach is illustrated using an industrial case for an-
alyzing user-perceived reliability of future releases of Digital TVs. In the following
subsection, we present this industrial case, in which a Digital TV architecture is in-
troduced. This example will be used throughout the remainder of the section, where
the activities of SARAH are explained and illustrated. As such while explaining the
approach we also discuss our experience and obstacles in applying the approach.
3.3.1 Case Study: Digital TV
A conceptual architecture of Digital TV (DTV) is depicted in Figure 3.4, which will
be referred throughout the section. The design mainly comprises two layers. The
bottom layer, namely the streaming layer, involves modules taking part in streaming
of audio/video information. The upper layer consists of applications, utilities and
modules that control the streaming process. In the following, we briefly explain
some of the important modules that are part of the architecture. For brevity, the
modules for decoding and processing audio/video signals are not explained here.
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 25
«subsystem»
Command HandIer
«subsystem»
AppIication Manager
«subsystem»
Program Manager
«subsystem»
Program InstaIIer
«subsystem»
Content Browser
«subsystem»
TeIetext
«subsystem»
EPG
«subsystem»
Graphics ControIIer
«subsystem»
Last State Manager
«subsystem»
ConditionaI Access
«subsystem»
Audio ControIIer
«subsystem»
Video ControIIer
«subsystem»
Tuner
«subsystem»
Video Processor
«subsystem»
Data Decoder & Interpreter
«subsystem»
Audio Processor
«subsystem»
Graphics
«subsystem»
Audio Out
«subsystem»
Video Out
«subsystem»
Communication Manager
Control Layer
Streaming Layer
Module Dependency (uses)
KEY «subsystem»
Layer
Figure 3.4: Conceptual Architecture of DTV
• Application Manager (AMR), located at the top middle of the figure, initi-
ates and controls execution of both resident and downloaded applications in
the system. It keeps track of application states, user modes and redirects
commands/information to specific applications or controllers accordingly.
• Audio Controller (AC), located at the bottom right of the figure, controls
26 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
audio features like volume level, bass and treble based on commands received
from AMR.
• Command Handler (CH), located at the top left of the figure, interprets ex-
ternally received signals (i.e. through keypad or remote control) and sends
corresponding commands to AMR.
• Communication Manager (CMR), located at the top left of the figure, employs
protocols for providing communication with external devices.
• Conditional Access (CA), located at the bottom left of the figure, authorizes
information that is presented to the user.
• Content Browser (CB), located at the middle of the figure, presents and pro-
vides navigation of content residing in a connected external device.
• Electronic Program Guide (EPG), located at the middle right of the figure,
presents and provides navigation of electronic program guide regarding a chan-
nel.
• Graphics Controller (GC), located at the bottom right of the figure, is re-
sponsible for generation of graphical images corresponding to user interface
elements.
• Last State Manager (LSM), located at the middle of the figure, keeps track of
last state of user preferences such as volume level and selected program.
• Program Installer (PI), located at the middle of the figure, searches and reg-
isters programs together with channel information (i.e. frequency).
• Program Manager (PM), located at the middle left of the figure, tunes to a
specific program based on commands received from AMR.
• Teletext (TXT), located at the middle of the figure, handles acquisition, inter-
pretation and presentation of teletext pages.
• Video Controller (VC), located at the bottom middle of the figure, controls
video features like scaling of the video frames based on commands received
from AMR.
3.3.2 The Top-Level Process
For understanding and predicting quality requirements of the architectural design [4],
Bachman et al. identify four important requirements: i ) provide a specification
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 27
of the quality attribute requirements, ii ) enumerate the architectural decisions to
achieve the quality requirements, iii ) couple the architectural decisions to the qual-
ity attribute requirements, and iv) provide the means to compose the architectural
decisions into a design. SARAH is in alignment with these key assumptions. The
focus in SARAH is the specification of the reliability quality attribute, the analysis
of the architecture based on this specification and the identification of architectural
tactics to adjust the architecture.
The steps of SARAH are presented as a UML activity diagram in Figure 3.5. The
approach consists of three basic processes: i ) Definition ii ) Analysis and iii ) Ad-
justment. In the definition process the architecture, the failure domain model, the
failure scenarios, the fault trees and the severity values for failures are defined.
Based on this input, in the analysis process, an architectural level analysis and an
architectural element level analysis are performed. The results are presented in the
failure analysis report. The failure analysis report is used in the adjustment process
to identify the architectural tactics and adapt the software architecture. In the fol-
lowing subsections the main steps of the method will be explained in detail using
the industrial case study.
Adjustment Analysis Definition
Derive Fault Tree Set
Define Severity Values
in Fault Tree Set
Perform Architectural Level Analysis
Perform Architectural Element
Level Analysis
Provide Failure Analysis Report
Develop Failure Scenarios
Define Relevant
Failure Domain Model
Define Failure Scenarios
Define Architectural Element Spots
Ìdentify Architectural Tactics
Apply Architectural Tactics
Describe the Architecture
Figure 3.5: Activity Diagram of the Software Architecture Analysis Method
28 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
3.3.3 Software Architecture and Failure Scenario Definition
Describe the Architecture
Similar to existing software architecture analysis methods SARAH starts with de-
scribing the software architecture. The description includes the architectural el-
ements and their relationships. Currently, the method itself does not presume a
particular architectural view [18] to be provided but in our project we have basi-
cally applied it to the module view. The architecture that we analyzed is depicted
in Figure 3.4.
Develop Failure Scenarios
SARAH is a scenario-based architecture analysis method, that is, scenarios are the
basic means to analyze the architecture. SARAH defines the concept of failure
scenario to analyze the architecture with respect to reliability. A failure scenario
defines a chain of dependability threats (i.e. fault, error and failure) for a component
of the system. To specify the failure scenarios in a uniform and consistent manner
a failure scenario template, as defined in Table 3.1 is adopted for specifying failure
scenarios.
Table 3.1: Template for Defining Failure Scenarios
FID A numerical value to identify the failures (i.e. Failure ID)
AEID An acronym defining the architectural element for which the
failure scenario applies (i.e. Architectural Element ID)
Fault The cause of the failure defining both the description of the
cause and its features
Error Description of the state of the element that leads to the
failure together with its features
Failure The description of the failure, its features, user/element(s)
that are affected by the failure
The template is inspired from FMEA [99]. For clarity in SARAH fault, error and
failure are used instead of the concepts failure cause, failure mode and failure effect,
respectively. In SARAH, failure scenarios are derived in two steps. First the relevant
failure domain model is defined, then failure scenarios are derived from this failure
domain. The following subsections describe these steps in detail.
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 29
Define Relevant Failure Domain Model
The failure scenario template can be adopted to derive scenarios in an ad hoc manner
using free brainstorming sessions. However, it is not trivial to define fault classes,
error types or failure modes. Hence, there is a high risk that several potential and
relevant failure scenarios are missed or that other irrelevant failure scenarios are
included. To define the space of relevant failures SARAH defines relevant domain
model for faults, errors and failures using a systematic domain analysis process [2].
These domain models provide a first scoping of the potential scenarios. In fact, sev-
eral researchers have already focused on modeling and classifying failures. Avizienis
et al., for example, provide a nice overview of this related work and provide a
comprehensive classification of faults, errors and failures [3]. The provided domain
classification by Avizienis et al., however, is rather broad, and one can assume that
for a given reliability analysis project not all the potential failures in this overall
domain are relevant. Therefore, the given domain is further scoped by focusing only
on the faults, errors and failures that are considered relevant for the actual project.
Figure 3.6, for example, defines the derived domain model that is considered relevant
for our project.
In Figure 3.6(a), a feature diagram is presented, where faults are identified according
to their source, dimension and persistence. In SARAH, failure scenarios are defined
per architectural element. For that reason, the source of the fault can be either i )
internal to the element in consideration, ii ) caused by other element(s) of the system
that interact(s) with the element in consideration or iii ) caused by external entities
with respect to the system. Faults could be caused by software or hardware, and be
transient or persistent. In Figure 3.6(b), the relevant features of an error are shown,
which comprise the type of error together with its detectability and reversibility
properties. Figure 3.6(c) defines the features for failures, which includes the features
type and target. The target of a failure defines what is/are affected by the failure.
In this case, the target can be the user or other element(s) of the system.
The failure domain model of Figure 3.6 has been derived after a thorough domain
analysis and in cooperation with the domain experts in the project. In principle,
for different project requirements one may come up with a slightly different domain
model, but as we will show in the next sections this does not impact the steps in the
analysis method itself. The key issue here is that failure scenarios are defined based
on the FMEA model, in which their properties are represented by domain models
that provide the scope for the project requirements.
30 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
internal (w.r.t. element)
external (w.r.t. system)
hardware
software
permanent
transient
other element(s)
Fault
Persistence
Dimension
Source
(a) Feature Diagram of Fault
data corruption
wrong value
detectable
undetectable
irreversible
reversible
deadlock
out of resources
too early/late
wrong exec. path
Error Detectability
Reversibility
Type
(b) Feature Diagram of Error
user
other element(s)
timing
presentation quality
behavior
wrong value/presentation
Failure
Type
Target
(c) Feature Diagram of Failure
Figure 3.6: Failure Domain Model
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 31
Define Failure Scenarios
The domain model defines a system-independent specification of the space of failures
that could occur. The number and type of failure scenarios are implicitly defined
by the failure domain model, which defines the scope of the relevant scenarios. In
the fault domain model (feature diagram in Figure 3.6(a)), for example, we can
define faults based on three features, namely Source, Dimension and Persistence.
The feature Source can have 3 different values, the features Dimension and Persis-
tence 2 values. This means that the fault model captures 3 × 2 × 2 = 12 different
faults. Similarly, from the error domain model we can derive 6 × 2 × 2 = 24 dif-
ferent errors, and 4 × 2 = 8 different failures are captured by the failure domain
model. Since a scenario is a composition of selection of features from the failure
domain model, we can state that for the given failure domain model in Figure 3.6
in theory, 12 × 8 × 24 = 2304 failure scenarios can be defined. However, not all
the combinations instantiated from the failure domain are possible in practice. For
example, in case the error type is “too late” then the error cannot be “reversible”. If
the failure type is “presentation quality” then the failure target can only be “user”.
To specify such kind of constraints for the given model in Figure 3.6 usually mutex
and requires predicates are used [66]. The given two example constraints are, for
example, specified as follows:
Error.type.too late mutex
Error.type.reversibility.reversible
Failure.type.presentation quality requires
Failure.target.user
Once the failure domain model together with the necessary constraints have been
specified we can start defining failure scenarios for the architectural elements. For
each architectural element we check the possible failure scenarios and define an ad-
ditional description of the specific fault, error or failure. Figure 3.7 provides, for
example, a list of nine selected failure scenarios that have been derived for the reli-
ability analysis of the DTV. In Figure 3.7 the five elements FID, AEID, fault, error
and failure are represented as columns headings. Failure scenarios are represented
in rows.
32 Chapter 3. Scenario-Based Software Architecture Reliability Analysis

FID AEID Fault Error Failure
F1 AMR
description: Reception of
irrelevant signals.
source: CH(F4) OR
CMR(F6)
dimension: software
persistence: permanent
description: Working mode is
changed when it is not
desired.
type: wrong path
detectability: undetectable
reversibility: reversible
description: Switching to
an undesired mode.
type: behavior
target: user
F2 AMR
description: Can not
acquire information.
source: CMR(F5)
dimension: software
persistence: permanent
description: Information can
not be acquired from the
connected device.
type: too early/late
detectability: detectable
reversibility: irreversible
description: Can not
provide information.
type: timing
target: CB(F3)
F3 CB
description: Can not
acquire information.
source: AMR(F2)
dimension: software
persistence: permanent
description: Information can
not be presented due to lack
of information.
type: too early/late
detectability: detectable
reversibility: irreversible
description: Can not
present content of the
connected device.
type: behavior
target: user
F4 CH
description: Software fault.
source: internal
dimension: software
persistence: permanent
description: Signals are
interpreted in a wrong way.
type: wrong value
detectability: undetectable
reversibility: reversible
description: Provide
irrelevant information.
type: wrong
value/presentation
target: AMR(F1)
F5 CMR
description: Protocol
mismatch.
source: external
dimension: software
persistence: permanent
description: Communication
can not be sustained with the
connected device.
type: too early/late
detectability: detectable
reversibility: irreversible
description: Can not
provide information.
type: timing
target: AMR(F2)
F6 CMR
description: Software fault.
source: internal
dimension: software
persistence: permanent
description: Signals are
interpreted in a wrong way.
type: wrong value
detectability: undetectable
reversibility: reversible
description: Provide
irrelevant information.
type: wrong
value/presentation
target: AMR(F1)
F7 DDI
description: Reception of
out-of-spec signals.
source: external
dimension: software
persistence: permanent
description: Scaling
information can not be
interpreted from meta-data.
type: wrong value
detectability: detectable
reversibility: reversible
description: Can not
provide data.
type: wrong
value/presentation
target: VP(F8)
F8 VP
description: Inaccurate
scaling ratio information.
source: DDI(F7) AND
VC(F9)
dimension: software
persistence: permanent
description: Video image can
not be scaled appropriately.
type: wrong value
detectability: undetectable
reversibility: reversible
description: Provide
distorted video image.
type: presentation quality
target: user
F9 VC
description: Software fault.
source: internal
dimension: software
persistence: permanent
description: Correct scaling
ratio can not be calculated
from the video signal.
type: wrong value
detectability: detectable
reversibility: reversible
description: Provide
inaccurate information.
type: wrong
value/presentation
target: VP(F8)

Figure 3.7: Selected failure scenarios for the analysis of the DTV architecture
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 33
The failure ids represented by FID do not have a specific ordering but are only used
to identify the failure scenarios. The column AEID includes acronyms of names of
architectural elements to which the identifiers refer. The type of the architectural
elements that are analyzed can vary depending on the architectural view utilized [18].
In case of a deployment view, for instance, the architectural elements that will be
analyzed will be the nodes. For component and connector view, the architectural
elements will be components and connectors. In this paper, we focused on module
view of the architecture, where the architectural elements are the implementation
units (i.e. modules). In principle, when an architectural element is represented as a
first-class entity in the model and it affects the failure behavior, then it can be used
in SARAH for analysis.
The columns Fault, Error and Failure describe the specific faults, errors and failures.
Note that the separate features of the corresponding domain models are represented
as keywords in the cells. For example, Fault includes the features source, dimension
and persistence as defined in Figure 3.6(a), and likewise these are represented as
keywords. Besides the different features, each column also includes the keyword de-
scription, which is used to explain the domain specific details of the failure scenarios.
Typically these descriptions are obtained from domain experts. The columns Fault
and Failure include the fields source and target respectively. Both of these refer to
failure scenarios of other architectural elements. For example, in failure scenario F2,
the fault source is defined as CMR(F5), indicating that the fault in F2 occurs due
to a failure in CMR as defined in F5. The source of the fault can be caused by a
combination of failures. This is expressed by logical connectives. For example, the
source of F1 is defined as CH(F4) OR CMR(F6) indicating that F1 occurs due to a
failure in CH as defined in F4 or due to a failure in CMR as defined in F6.
Derive Fault Tree Set from Failure Scenarios
A close analysis of the failure scenarios shows that they are connected to each other.
For example, the failure scenario F1 is caused by the failure scenario F4 or F6. To
make all these connections explicit, in SARAH fault trees are defined. Typically, a
given set of failure scenarios leads to a set of fault trees, which are together defined
as a Fault Tree Set (FTS). FTS is a graph G(V, E) consisting of the set of fault
trees. G has the following properties.
34 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
1. V = F ∪ A
2. F is the set of failure scenarios each of which is associated with an architectural
element.
3. F
u
⊆ F is the set of failure scenarios comprising failures that are perceived
by the user (i.e. system failures). Vertices residing in this set constitute root
nodes of fault trees.
4. A is the set of gates representing logical connectives.
5. ∀g ∈ A,
outdegree(g) = 1∧
indegree(g) ≥ 1
6. A = A
AND
∪ A
OR
such that,
A
AND
is the set of AND gates
A
OR
is the set of OR gates
7. E is the set of directed edges (u, v) where u, v ∈ V .
For the example case, based on the failure scenarios in Figure 3.7 we can derive the
FTS as depicted in Figure 3.8.
F6 F5 F4
F3
F2
F1
F7 F9
F8
F
u
: set of
failure
scenarios
comprising
failures that
are perceived
by the user.
Figure 3.8: Fault trees derived from failure scenarios in Figure 3.7
Here, the FTS consists of three fault trees. On the left the fault tree shows that
F1 is caused by F4 or F6. The middle column indicates that failure scenario F3 is
caused by F2 which is on its turn caused by F5. Finally, in the last column the fault
tree shows that F8 is caused by both F7 and F9.
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 35
Define Severity Values in Fault Tree Set
As we have indicated before we are focused on failure analysis in the context of
consumer electronics domain. As such we need to analyze in particular those failures
which have the largest impact on the user perception. For example, a complete black
screen will have a larger impact on the user than a temporary distortion in the
image. These different user perceptions on the failures have a clear impact on the
way how we process the fault trees. Before computing the failure probabilities of the
individual leaf nodes, we first assign severity values to the root failures based on their
impact on the user. In our case the severity values are based on the prioritization
of the failure scenarios as defined in Table 3.2.
Table 3.2: Prioritization of severity categories
Severity Type Annoyance Description
1. very low User hardly perceives failure.
2. low A failure is perceived but not really annoying.
3. moderate Annoying performance degradation is perceived.
4. high User perceives serious loss of functions.
5. very high Basic user-perceived functions fail. System locks
up and does not respond.
The severity degrees range from 1 to 5 and are provided by domain experts. For
example, the severity values for failures F1, F3 and F8 are depicted in Figure 3.9.
Here we can see that F5 has been assigned the severity value 5 indicating that it
has a very high impact on the user perception. In fact, the impact of a failure can
differ from user to user. In this paper, we consider the average user type. To define
user perceived failure severities, an elaborate user model can be developed based on
experiments with subjects from different age groups and education levels [28]. We
consider the user model as an external input and any such model can be incorporated
to the method.
36 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
F8 F5 F6
F3
F2
F1
4 5
4 4
5
5
F7 F9
F8
4
2 2
Figure 3.9: Fault trees with severity values
The severity values of the root nodes are used for determining the severity values
of the other nodes of the FTS. These values are calculated based on the following
equations:
∀u ∈ F
u
, s(u) = s
u
(3.1)
∀f ∈ F ∧ f / ∈ F
u
, s(f) =

∀vs.t.(f,v)∈E
s(v) ×P(v|f) (3.2)
Equation 3.1 defines the assignment of severity values to the root nodes. Equa-
tion 3.2 defines the calculation of the severity values for the other nodes. Here,
P(v|f) denotes the probability that the occurrence of f will cause the occurrence
of v [34]. We multiply this value with the severity of v to calculate the sever-
ity of f. According to the probability theory, P(v|f) = P(v ∩ f)/P(f). If v is
an OR gate, then the output of v will always occur whenever f occurs. That is,
P(v ∩ f) = P(f). As a result, P(v|f) = P(f)/P(f) = 1. This makes sense because
P(v|f) denotes the probability that v will occur, provided that f occurs. If we take
the OR gate at the left hand side of Figure 3.9 for example, P(v|F6) = 1. If we
know that F6 occurs, v will definitely occur (the probability is equal to 1). This
is also the case for F8. So, s(F6) = s(F8) = s(F1) = 4. If v is an AND gate,
the situation is different. In this case, P(v ∩ f) = ΠP(x) for all vertices x that is
connected to v, including f. Since P(v|f) = P(v ∩ f)/P(f), to calculate P(v|f)
we need to know P(x) for all x except f. If we take the AND gate at the right
hand side of Figure 3.9 for example, P(v|F7) = P(F7) × P(F9)/P(F7) = P(F9)
and similarly P(v|F7) = P(F9). So, s(F7) = s(F8) ×P(F8|F7) = 4 ×P(F9) and
s(F9) = s(F8) ×P(F8|F9) = 4 ×P(F7).
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 37
To define the final severity values for all the nodes obviously we need to know the
probability of each failure. In principle there are three strategies that can be used
to determine these required probability values:
• Using Fixed values: All probability values can be fixed to a certain value. An
example is to assume that each failure will have equal weight and likewise the
probabilities of individual failures are basically defined by the AND and OR
gates.
• What-if analysis: A range of probability values can be considered, where fault
probabilities are varied and their effect on the probability of user perceived
failures are observed.
• Measurement of actual failure rate: The actual failure rate can be measured
based on historical data or execution probabilities of elements that are ob-
tained from scenario-based test runs [126].
In SARAH all these three strategies can be used depending on the available knowl-
edge on probabilities. In case the probabilities are not known or they do not have a
relevant impact on the final outcome then fixed probability values can be adopted.
If probabilities are not equal or have different impacts on the severity values then
either the second or third strategy can be used. In the second strategy, the what-if
analysis can be useful if no information is available on an existing system. The
measurement of actual failure rates is usually the most accurate way to define the
severity values. However, the historical data can be missing or not accessible, and
deriving execution probabilities is cumbersome.
To identify the appropriate strategy a preliminary sensitivity analysis can be per-
formed. In conventional FTA, sensitivity analysis is based on cut-sets in a fault tree
and the probability values of fault occurrences [34]. However, this analysis leads to
complex formulas and it requires that the probability values are known a priori. In
our case we propose the following model (based on [14] and [126]) for estimating the
sensitivity with respect to a fault probability even if the probability values are not
known.
root ∈ F
u
, node ∈ F,
∀n ∈ F ∧ n = node ∧ n = root, P(n) = p (3.3)
P(node) = p

, P(root) = f(p

, p),
sensitivity(node) =

1
0
(

∂p

P(root))dp (3.4)
38 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
Equations 3.3 and 3.4 show the calculation of the sensitivity of probability of a user
perceived failure (P(root)) to the probability of a node (P(node)) of the fault tree.
Here, we assign p

to P(node) and fix the probability values of all the other nodes
to p. Thus P(root) turns out to be a function of p and p

. For example, if we are
interested in the sensitivity of P(F8) to P(F9), P(F8) = P(F7) ×P(F9) = p

×p.
Then, we take a partial derivative of P(root) with respect to p

. This gives the rate
of change of P(root) with respect to p

(P(F9)). For our example, this will yield
to ∂/∂p

P(F8) = p. Finally, the result of the partial derivation is integrated with
respect to p for all possible probability values (range from 0 to 1) to calculate the
overall impact. For the example case,

(∂/∂p

P(F8))dp =

pdp = p
2
/2. So, the
result of the integration from 0 to 1 will be 0.5. In this model we use the basic
sensitivity analysis approach, where we change a parameter one at a time and fix
the others ([126]). Additionally, we use derivation to calculate the rate of change
([14]) and we use integration to take the range of probability values into account.
The resulting formulas are simple enough to be computed with spreadsheets.
For the analysis presented here, we skip the sensitivity analysis and assume equal
probabilities (i.e. P(x) = p for all x), which simplifies the severity assignment
formula for AND gates as s(v) × P(v|f) = s(v) × p
indegree(v)−1
. Based on this
assumption, the severity calculation for intermediate failures is as shown in the
following equation.
s(f) =

∀vs.t.
(u,v)∈E∧
(v∈F∨v∈A
OR
)
s(v) +

∀vs.t.
(u,v)∈E∧
v∈A
AND
s(v) ×p
indegree(v)−1
(3.5)
In our analysis, we fixed the value of p to be 0.5. In case of F7 and F9, for instance,
the severity value is 4/2 because F8 has the severity value of 4 and it is connected to
an AND gate with indegree = 2. On the other hand, F1 has the assigned severity
value of 4 and this is also assigned to the failures F6 or F8 that are connected to
F1 through an OR gate. A failure scenario can be connected to multiple gates and
other failures, in which the severity value is derived as the sum of the severity values
calculated for all these connections.
3.3.4 Software Architecture Reliability Analysis
In our example case, we defined a total of 37 failure scenarios including the scenarios
presented in Figure 3.7. We completed the definition process by deriving the corre-
sponding fault tree set and calculating severity values as explained in section 3.3.3
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 39
(See Appendix A). The results that were obtained during the definition process are
utilized by the analysis process as described in the subsequent sections.
Perform Architecture-Level Analysis
The first step of the analysis process is architecture-level analysis in which we pin-
point sensitive elements of the architecture with respect to reliability. Sensitive
elements are the elements, which are associated with the majority of the failure
scenarios. These include not only the failure scenarios caused by internal faults but
also the failure scenarios caused by other elements. In the architectural-level anal-
ysis step, we aim at identifying sensitive elements with respect to the most critical
failures from the user perception. Later, in the architectural element level analysis,
the fault, error types and the actual sources of their failures (internal or other ele-
ments) are identified. Sensitive elements provide a starting point for considering the
relevant fault tolerance techniques and the elements to improve with respect to fault
tolerance (see section 3.3.5). In this way the effort that is provided for reliability
analysis is scoped with respect to failures that are directly perceivable by the users.
As a primary and straightforward means of comparison, we consider the percentage
of failures (PF) that are associated with elements. For each element c the value for
PF is calculated as follows:
PF
c
=
# of failures associated with c
# of failures
×100 (3.6)
This means that simply all the number of failures related to an element are summed
and divided by the total number of failures (in this case 44). The results are shown
in Figure 3.10. A first analysis of this figure already shows that the Application
Manager (AMR) and Teletext (TXT) modules have to cope with a higher number
of failures than other modules.
This analysis treats all failures equally. To take the severity of failures into account
we define the Weighted Percentage of Failures (WPF) as given in Equation 3.7.
WPF
c
=

∀u∈Fs.t.
AEID(u)=c
s(u)

∀u∈F
s(u)
×100 (3.7)
For each element, we collect the set of failures associated with them and we add
up their severity values. After averaging this value with respect to all failures and
40 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
0,00
2,00
4,00
6,00
8,00
10,00
12,00
14,00
16,00
A
C
A
M
R
A
O
A
P
C
A
C
B
C
H
C
M
R
D
D
I
E
P
G G
G
C
P
I
P
M
L
S
M
T
T
X
T
V
O
V
C
V
P
Architectural Elements
P
e
r
c
e
n
t
a
g
e

o
f

F
a
i
l
u
r
e
s
Figure 3.10: Percentage of failure scenarios impacting architectural elements
calculating the percentage, we obtain the WPF value. The result of the analysis is
shown in Figure 3.11.
0,00
5,00
10,00
15,00
20,00
25,00
30,00
A
C
A
M
R
A
O
A
P
C
A
C
B
C
H
C
M
R
D
D
I
E
P
G G
G
C
P
I
P
M
L
S
M
T
T
X
T
V
O
V
C
V
P
Architectural Elements
W
e
i
g
h
t
e
d

P
e
r
c
e
n
t
a
g
e

o
f

F
a
i
l
u
r
e
s
Figure 3.11: Weighted percentage of failure scenarios impacting architectural ele-
ments
Although weighted percentage presents different results compared to the previous
one, the Application Manager (AMR) and Teletext(TXT) modules again appears to
be very critical.
From the project perspective it is not always possible to focus on the total set of
possible failures due to the related cost for fault tolerance. To optimize the cost
usually one would like to consider the failures that have the largest impact on the
system. For this, in SARAH the architectural elements are ordered in ascending
order with respect to their WPF values. The elements are then categorized based
on the proximity of their WPF values. Accordingly, elements of the same group
have WPF values that are close to each other. The results of this prioritization and
grouping are provided in Table 3.3, which also shows the sum of the WPF values
of elements for each group. Here we can see that, for example, group 4 consists of
two modules AMR and TXT. The reason for this is that their WPF values are the
highest and close to each other. The sum of their WPF values is 25 + 14 = 39%.
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 41
Table 3.3: Architectural elements grouped based on WPF values
Group # Modules WPF
1 AC, AO, AP, CA, G, GC, PI, LSM, T, VO, VC,
VP
28%
2 CB, CH, EPG, PM 18%
3 DDI, CMR 15%
4 AMR, TXT 39%
To highlight the difference in impact of the groups of architectural elements we
define a Pareto chart as presented in Figure 3.12.
0
20
40
60
80
100
group 1 group 2 group 3 group 4
p
e
r
c
e
n
t
a
g
e

o
f

e
l
e
m
e
n
t
s
0
19
38
W
P
F
percentage of architectural elements WPF
Figure 3.12: Pareto Chart showing the largest impact of the smallest set of archi-
tectural elements
In the Pareto chart, the groups of architectural elements shown in Table 3.3 are
ordered along the x-axis with respect to the number of elements they include. The
percentage of elements that each group includes is depicted with bars. The y-axis
on the left hand shows the percentage values from 0 to 100 and is used for scaling
the percentages of architectural elements whereas the y-axis on the right hand side
scales WPF values. The plotted line represents the WPF value for each group.
In the figure we can, for example, see that group 4 (consisting of two elements)
represents 10% of all the elements but has a WPF of 39%. The chart helps us in
this way to focus on the most important set of elements which are associated with
the majority of the user perceived failures.
Perform Architectural Element Level Analysis
The architectural level analysis provides only an analysis of the impact of failure
scenarios on the given system. However, for error handling and recovery it is also
necessary to define the type of failures that might affect the identified sensitive
42 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
elements. This is analyzed in the architectural element level analysis in which the
features of faults, errors and failures that impact an element are determined. For the
example case, in the architectural-level analysis it appeared that elements residing
in the 4
th
group (see Table 3.3) had to deal with largest set of failure scenarios.
Therefore, in architectural element level analysis, we will focus on members of this
group, namely Application Manager and Teletext modules.
Following the derivation of the set of failure scenarios impacting an element, we
group them in accordance with the features presented in Figure 3.6. This grouping
results in the distribution of fault, error and failure categories of failure scenarios
associated with the element.
For example, the results obtained for Application Manager and Teletext modules are
shown in Figure 3.13 and Figure 3.14, respectively. If we take a look at fault features
presented on those figures for instance, we see that most of the faults impacting
Application Manager Module are caused by other modules. On the other hand,
Teletext Module has internal faults as much as faults stemming from the other
modules. As such, distribution of features reveals characteristics of faults, errors and
failures associated with individual elements of the architecture. This information is
later utilized for architectural adjustment (See section 3.3.5).
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 43
0
10
20
30
40
50
60
70
80
90
100
i
n
t
e
r
n
a
l
o
t
h
e
r

c
o
m
p
o
n
e
n
t
s
e
x
t
e
r
n
a
l
s
o
f
t
w
a
r
e
h
a
r
d
w
a
r
e
t
r
a
n
s
i
e
n
t
p
e
r
m
a
n
e
n
t
p
e
r
c
e
n
t
a
g
e

o
f

f
a
u
l
t
s
0
10
20
30
40
50
60
70
80
90
100
d
a
t
a

c
o
r
r
u
p
t
io
n
d
e
a
d
l
o
c
k
w
r
o
n
g

v
a
l
u
e
w
r
o
n
g

e
x
e
c
u
t
i
o
n
.
.
.
o
u
t

o
f

r
e
s
o
u
r
c
e
s
t
o
o

e
a
r
ly
/
la
t
e
d
e
t
e
c
t
a
b
l
e
u
n
d
e
t
e
c
t
a
b
le
r
e
v
e
r
s
i
b
l
e
i
r
r
e
v
e
r
s
i
b
l
e
p
e
r
c
e
n
t
a
g
e

o
f

e
r
r
o
r
s
0
10
20
30
40
50
60
70
80
90
100
t
i
m
i
n
g
b
e
h
a
v
io
u
r
p
r
e
s
e
n
t
a
t
i
o
n

q
u
a
l
i
t
y
w
r
o
n
g

v
a
l
u
e
/
p
r
e
s
.
.
.
u
s
e
r
o
t
h
e
r

c
o
m
p
o
n
e
n
t
s
p
e
r
c
e
n
t
a
g
e

o
f

f
a
i
l
u
r
e
s
Figure 3.13: Fault, error and failure features associated with the Application Man-
ager Module
44 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
0
10
20
30
40
50
60
70
80
90
100
i
n
t
e
r
n
a
l
o
t
h
e
r

c
o
m
p
o
n
e
n
t
s
e
x
t
e
r
n
a
l
s
o
f
t
w
a
r
e
h
a
r
d
w
a
r
e
t
r
a
n
s
i
e
n
t
p
e
r
m
a
n
e
n
t
p
e
r
c
e
n
t
a
g
e

o
f

f
a
u
l
t
s
0
10
20
30
40
50
60
70
80
90
100
d
a
t
a

c
o
r
r
u
p
t
io
n
d
e
a
d
l
o
c
k
w
r
o
n
g

v
a
l
u
e
w
r
o
n
g

e
x
e
c
u
t
i
o
n
.
.
.
o
u
t

o
f

r
e
s
o
u
r
c
e
s
t
o
o

e
a
r
ly
/
la
t
e
d
e
t
e
c
t
a
b
l
e
u
n
d
e
t
e
c
t
a
b
le
r
e
v
e
r
s
i
b
l
e
i
r
r
e
v
e
r
s
i
b
l
e
p
e
r
c
e
n
t
a
g
e

o
f

e
r
r
o
r
s
0
10
20
30
40
50
60
70
80
90
100
t
i
m
i
n
g
b
e
h
a
v
io
u
r
p
r
e
s
e
n
t
a
t
i
o
n

q
u
a
l
i
t
y
w
r
o
n
g

v
a
l
u
e
/
p
r
e
s
.
.
.
u
s
e
r
o
t
h
e
r

c
o
m
p
o
n
e
n
t
s
p
e
r
c
e
n
t
a
g
e

o
f

f
a
i
l
u
r
e
s
Figure 3.14: Fault, error and failure features associated with Teletext Module
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 45
Provide Failure Analysis Report
SARAH defines a detailed description of the fault tree sets, the failure scenarios, the
architectural level analysis and the architectural element level analysis. These are
described in the failure analysis report that summarizes the previous analysis results
and provides hints for recovery. Sections comprised by the failure analysis report
are listed in Table 3.4, which are in accordance with the steps of SARAH. The first
section describes the project context, information sources and specific considerations
(e.g. cost-effectiveness). The second section describes the software architecture. The
third section presents the domain model of faults, errors and failures which include
features of interest to the project. The fourth section contains list of failure scenarios
annotated based on this domain model. The fifth section depicts the fault tree set
generated from the failure scenarios together with the severity values assigned to
each. The sixth and seventh sections include analysis results as presented in sections
3.3.4 and 3.3.4, respectively. Additionally, the sixth section includes the distribution
of fault, error and failure features for all failure scenarios as depicted in Figure 3.15.
Finally, the report includes a section on first hints for architectural recovery as titled
architectural tactics [4]. This is explained in the next section.
Table 3.4: Sections of the failure analysis report.
1. Introduction
2. Software Architecture
3. Failure Domain Model
4. Failure Scenarios
5. Fault Tree Set
6. Architecture Level Analysis
7. Architectural Element Level Analysis
8. Architectural Tactics
46 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
0
10
20
30
40
50
60
70
80
90
100
i
n
t
e
r
n
a
l
o
t
h
e
r

c
o
m
p
o
n
e
n
t
s
e
x
t
e
r
n
a
l
s
o
f
t
w
a
r
e
h
a
r
d
w
a
r
e
t
r
a
n
s
i
e
n
t
p
e
r
m
a
n
e
n
t
p
e
r
c
e
n
t
a
g
e

o
f

f
a
u
l
t
s
0
10
20
30
40
50
60
70
80
90
100
d
a
t
a

c
o
r
r
u
p
t
i
o
n
d
e
a
d
l
o
c
k
w
r
o
n
g

v
a
l
u
e
w
r
o
n
g

e
x
e
c
u
t
i
.
.
o
u
t

o
f

r
e
s
o
u
r
c
e
s
t
o
o

e
a
r
l
y
/
l
a
t
e
d
e
t
e
c
t
a
b
l
e
u
n
d
e
t
e
c
t
a
b
l
e
r
e
v
e
r
s
i
b
l
e
i
r
r
e
v
e
r
s
i
b
l
e
p
e
r
c
e
n
t
a
g
e

o
f

e
r
r
o
r
s
0
10
20
30
40
50
60
70
80
90
100
t
i
m
i
n
g
b
e
h
a
v
io
u
r
p
r
e
s
e
n
t
a
t
i
o
n

q
u
a
l
i
t
y
w
r
o
n
g

v
a
l
u
e
/
p
r
e
s
.
.
.
u
s
e
r
o
t
h
e
r

c
o
m
p
o
n
e
n
t
s
p
e
r
c
e
n
t
a
g
e

o
f

f
a
i
l
u
r
e
s
Figure 3.15: Feature distribution of fault, error and failures for all failure scenarios
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 47
3.3.5 Architectural Adjustment
The failure analysis report that is defined during the reliability analysis forms an
important input to the architectural adjustment. Hereby the architecture will be
enhanced to cope with the identified failures. This requires the following three steps:
i ) defining the elements to which the failure relates ii ) identifying the architectural
tactics and iii ) application of the architectural tactics.
Table 3.5: Architectural element spots.
AEID Architectural Element Spot
AMR AC, AO, CB, CH, CMR, DDI, EPG, GC, LSM,
PI, PM, TXT, VC, VO
AC AMR, AP, LSM
AP AC, DDI
AO AMR, CA
CA AMR, PM, T, AO, VO
CB AMR, CMR
CH AMR
CMR AMR, CB
DDI AMR, AP, CA, EPG, TXT, VP
EPG AMR, DDI
G GC
GC AMR, G
LSM AMR, AC, PM, VC
PI AMR, PM
PM AMR, CA, LSMR, T
T CA, PM
TXT AMR, DDI
VC AMR, LSM, VP
VO AMR, CA
VP DDI, VC
Define Architectural Element Spots
Architectural tactics have been introduced as a characterization of architectural de-
cisions that are used to support a desired quality attribute [4]. In SARAH we apply
the concept of architectural tactics to derive architectural design decisions for sup-
porting reliability. The previous steps in SARAH result in a prioritization of the
most sensitive elements in the architecture together with the corresponding failures
48 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
that might occur. Thus, SARAH prioritizes actually the design fragments [4] to
which the specific tactics will be applied and to improve dependability. In general,
for applying a recovery technique to an element we need also to consider the other
elements coupled with it. This is because local treatments of individual elements
might directly impact the dependent elements. For this reason, we define architec-
tural element spot for a given element as the set of elements with which it interacts.
This draws the boundaries of the design fragment [4] to which the design decisions
will be applied. The coupling can be statically defined by analyzing the relations
among the elements. Table 3.5 shows architectural element spots for each element
in the example case. For example, Table 3.5 shows the elements that are in the
architectural element spot of AMR as AC, AO, CB, CH, CMR, DDI, EPG, GC,
LSM, PI, PM, TXT, VC and VO. These elements should be considered while incor-
porating mechanisms to AMR or any refactoring action that includes alteration of
AMR and its interactions.
Identify Architectural Tactics
Once we have identified the sensitive elements, the element spots and the potential
set of failure scenarios, we proceed with identifying the architectural tactics for
enhancing dependability. As discussed in section 2.1.2, dependability means [3] are
categorized as fault forecasting, fault removal, fault prevention and fault tolerance
(See Figure 3.16). In SARAH, we particularly focus on fault tolerance techniques.
After identifying the sensitive elements we analyze the fault and error features re-
lated to these elements (Figures 3.13 and 3.14). Faults can be either internal or
external to the sensitive element. If the source of the fault is internal this means
that the fault tolerance techniques should be applied to the sensitive element itself.
However, if the source of the fault is external, this means that the failures are caused
by the other elements that are undependable. In that case, we might need to consider
applying fault tolerance techniques to these undependable elements instead. These
undependable elements can be traced with the help of the architectural element spot
(section 3.3.5) and FTS. Some elements can tolerate external faults originating from
other undependable elements. A broad set of fault tolerance techniques are available
for this purpose such as exception handling, restarting, check-pointing and roll-back
recovery [58]. On the other hand, if an element comprises faults that are internal
and permanent, we might consider applying design diversity (e.g. recovery blocks,
N-version programming [81]) on the element to tolerate such faults.
In Figure 3.16, some examples for fault tolerance techniques have been given. Each
technique provides a solution for a particular set of problems. To make this explicit,
each fault tolerance technique is tagged with error (and fault) types that it aims to
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 49
Dependability
Means
Fault Prevention Fault-Tolerance Fault Removal
Error Detection Recovery
KEY
feature
optional
sub-feature
mandatory
sub-feature
watchdog
on-line
monitoring
check-pointing
N-version
programming
Error.Type ÷ deadlock Error.Type ÷ wrong value
OR wrong execution path
Error.Type ÷ deadlock
OR out oI resources
Fault.Persistence ÷ transient
Fault Forecasting
note
applicable fault and
error properties
Error.Type ÷ wrong value
Fault.Source ÷ internal
Fault.Persistence ÷ permanent
data corruption
wrong value
deadlock
out of resources
too early/late
wrong exec. path
Error Type
Error Handling Fault Handling
compensation
forward
recovery
backward
recovery
Figure 3.16: Partial categorization of dependability means with corresponding fault
and error features
detect and/or recover. As an example, in Figure 3.16, watchdog [58] can be applied
to detect deadlock errors for tolerating faults resulting in such errors. On the other
hand, N-version programming [81] can be used for compensating wrong value errors
that are caused by internal and persistent faults. Figure 3.16 shows only a few
fault tolerance techniques as an example. Derivation of all fault tolerance tactics is
out-of-scope of this work.
Apply Architectural Tactics
As the last step of architectural adjustment, selected architectural tactics should be
applied to adjust the architecture if necessary.
The possible set of architectural tactics is determined as described in the previous
section. Additionally, other criteria need to be considered including cost and limi-
tations imposed by the system (resource constraints). For the given case, Table 3.6
shows, for example, the potential fault tolerance techniques that can be applied for
AMR and TXT modules. The last column of the table shows the selected candidate
50 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
Table 3.6: Treatment approaches for sensitive components
AEID Potential Means Selected Means
AMR on-line monitoring, watchdog,
checkpoint & restart, resource
checks, interface checks
on-line monitoring, interface
checks
TXT on-line monitoring, watch-
dog, n-version programming,
checkpoint & restart, resource
checks, interface checks
checkpoint & restart, resource
checks
techniques.
Each architectural tactic has a set of characteristics, requirements, optional and al-
ternative features and constraints. For instance, on-line monitoring is selected as
a candidate error detection technique to be applied to AMR module. Schroeder
defines in [108] basic concepts of on-line monitoring and explains its characteris-
tics. Accordingly, a monitoring system is an external observer that monitors a fully
functioning application. Based on the provided classification scheme [108] (purpose,
sensors, event interpretation, action), we can specify an error detection mechanism
as follows: The purpose of the mechanism is error detection. Sensors are the ob-
servers of a set of values represented in the element (i.e. AMR module) and they are
always enabled. The event interpretation specification is built into the monitoring
system. Sensing is based on sampling. Action specification is a recovery attempt,
which can be achieved by partially or fully restarting the observed system and it is
executed when a wrong value is detected.
The realization of an architectural tactic and measuring its properties (reengineering
and performance overhead, availability, performability) for a system requires dedi-
cated analysis and additional research. For example, in [74] software fault tolerance
techniques that are based on diversity (e.g. recovery blocks, N-version programming)
are evaluated with respect to dependability and cost. In chapter 5, we evaluate a
local recovery technique and its impact on software decomposition with respect to
performance and availability.
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 51
3.4 Discussion
The method and its application provide new insight in the scenario-based architec-
ture analysis methods. A number of lessons can be learned from this study.
Early reliability analysis
Similar to other software architecture analysis methods, based on our experiences
with SARAH we can state that early analysis of quality at the software architecture
design level has a practical and important benefit [31]. In our case the early analysis
relates to the analysis of the next release of a system (e.g. Digital TV) in a product
line. Although, we did had access to the next release implementation of the Digital
TV the reliability analysis with SARAH still provided useful insight in the critical
failures and elements of the current architecture. For example, we were able to
identify the critical modules Application Manager, Teletext and also got an insight
in the failures, their causes and their effects. Without such an analysis it would be
hard to denote the critical modules that have the risk to fail. For the next releases
of the product this information can be used in deciding for the dependability means
to be used in the architecture.
Utilizing a quality model for deriving scenarios
The use of an explicit domain model for failures has clearly several benefits. Actually,
in the initial stages of the project we first tried to directly collect failure scenarios
by interviewing several stakeholders. In our experience this has clear limitations
because i ) the set of failure scenarios for a given architecture is in theory too large
and ii ) even for the experts in the project it was hard to provide failure scenarios. To
provide a more systematic and stable approach we have done a thorough analysis on
failures and defined a fault domain model that represents essentially the reliability
quality attribute. This model does not only provide systematic means for deriving
failure scenarios but also defined the stopping criteria for defining failure scenarios.
Basically we have looked at all the elements of the architecture, checked the failure
domain and expressed the failure scenarios using the failure scenario template that
we have presented in section 3.3.3. During the whole process we were supported by
TV domain experts.
52 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
Impact of project requirements and constraints
From the industrial project perspective it was not sufficient to just define a failure
domain model and derive the scenarios from it. A key requirement of the indus-
trial case was to provide a reliability analysis that takes the user-perception as the
primary criteria. This requirement had a direct impact on the way how we pro-
ceeded with the reliability analysis. In principle, it meant that all the failures that
could be directly or indirectly perceived by the end-user had to be prioritized be-
fore the other failures. In our analysis this was realized by weighing the failures
based on their severities from the user-perspective. In fact, from a broader sense,
the focus on user-perceived failures could just be considered an example. SARAH
provides a framework for reliability analysis and the method could be easily adapted
to highlight other types of properties of failures.
Calculation of probability values of failures
One of the key issues in the reliability analysis is the definition of the probability
values of the individual failures. In section 3.3.3 we have described three well-
known strategies that can be adopted to define the probability values. In case more
knowledge on probabilities is known in the project the analysis will be more accurate
accordingly. As described in section 3.3.3 SARAH does not adopt a particular
strategy and can be used with any of these strategies. We have illustrated the
approach using fixed values.
Inherent dependence on domain knowledge
Obviously, the set of selected failure scenarios and values assigned to their attributes
(severity) directly affect the results of the analysis. As it is the case with other
scenario-based analysis methods both the failure scenario elicitation and the priori-
tization are dependent on subjective evaluations of the domain experts. To handle
this inherent problem in a satisfactory manner SARAH guides the scenario elicita-
tion by using the relevant failure domain model as described in section 3.3.3 and
the use of failure scenario template in section 3.3.3. The initial assignment of sever-
ity values for the user-perceived failures is defined by the domain experts, but the
independent calculation of the severities for intermediate failures is defined by the
method itself as defined in section 3.3.3.
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 53
Extending the Fault Tree concept
In the reliability engineering community, fault trees have been used for a long time
in order to calculate the probability that a system fails [34]. This information
is derived from the probabilities of fault occurrences by processing the tree in a
bottom-up manner. The system or system state can be represented by a single fault
tree where the root node of the tree represents the failure of the system. In this
context, failure means a crash-down in which no more functionality can be provided
further. The fault tree can be also processed in a top-down manner to find the root
causes of this failure. One of the contributions of this paper from the reliability
engineering perspective is the Fault Tree Set (FTS) concept, which is basically a set
of fault trees. The difference of FTS from a single fault tree is that an intermediate
failure can participate in multiple fault trees and there exists multiple root nodes
each of which represents a system failure perceived by the user. This refinement
enables us to discriminate different types of system failures (e.g. based on severity)
and infer what kind of system failures that an intermediate failure can lead to. Our
analysis based on FTS is also different from the traditional usage of FTA. In FTA,
the severity of the root node is not distributed to the intermediate nodes. Only the
probability values are propagated to calculate the risk associated with the root node.
In our analysis, we assign severity values to the intermediate nodes of FTS based
on to what extend they influence the occurrence of root nodes and the severity of
these nodes.
Architectural Tactics for fault tolerance
After the identification of the sensitive architectural elements, the related element
spots and the corresponding failures in SARAH architectural tactics are defined. In
the current literature on software architecture design this is not explicit and basically
we had to rely on the techniques that are defined by the reliability engineering
community. We have modeled a limited set of software fault tolerance techniques
and defined the relation of these techniques with the faults and errors in the failure
domain model that had been defined before. Bachman et al. derive architectural
tactics by first listing concrete scenarios, matching these with general scenarios
and deriving quality attribute frameworks [4]. In a sense this could be considered
as a bottom-up approach. In SARAH the approach is more top-down because it
essentially starts at the domain models of failures and fault tolerance techniques and
then derives the appropriate matching. The concept of architectural tactics as an
appropriate architectural design decision to enhance a particular quality attribute
of the architecture remains of course the same.
54 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
3.5 Related Work
FMEA has been widely used in various industries such as automotive and aerospace.
It has also been extended to make it applicable to the other domains or to achieve
more analysis results. For instance, Failure Modes, Effects and Criticality Analysis
(FMECA) is a well-known method that is built upon FMEA. As an extension to
FMEA, FMECA [30] incorporates severity and probability assessments for faults.
The probabilities of occurrence of faults (assumed to be known) are utilized together
with the fault tree in order to calculate the probability that the system will fail. On
the other hand, a severity is associated with every fault to distinguish them based
on the cost of repair. Note that the severity definition in our method is different
from the one that is employed by FMECA. We take a user-centric approach and
define the severities for failures based on their impact on the user.
Almost all reliability analysis techniques have primarily devised to analyze failures
in hardware components. These techniques have been extended and adapted to be
used for the analysis of software systems. Application of FMEA to software has a
long history [98]. Both FMEA and FTA have been employed for the analysis of soft-
ware systems and named as Software Failure Modes and Effects Analysis (SFMEA)
and Software Fault Tree Analysis (SFTA), respectively. In SFMEA, failure modes
for software components are identified such as computational, logic and data I/O.
This classification resembles the failure domain model of SARAH. However, SARAH
separates fault, error and failure concepts and provides a more detailed categoriza-
tion for each. Also, note that the failure domain model can vary depending on the
project requirements and the system. In [76], SFTA is used for the safety verifica-
tion of software. However, the analysis is applied at the source code level rather
than the software architecture design level. Hereby, a set of failure-mode templates
outlines the failure modes of programming language elements like assignments and
conditional statements. These templates are composed according to the control flow
of the program to derive a fault tree for the whole software.
In general, efforts for applying reliability analysis to software [62] mainly focus on
the safety-critical systems, whose failure may have very serious consequences such
as loss of human life and large-scale environmental damage. In our case, we focus
on consumer electronics domain, where the systems are usually not safety-critical.
FMEA has been used in other domains as well, where the methodology is adapted
and extended accordingly. To use FMEA for analyzing the dependability of Web
Services, new failure taxonomy, intrusions and various failure effects (data loss,
financial loss, denial of service, etc.) are taken into account in [47]. Utilization of
FMEA is also proposed in [128] for early robustness analysis of Web-based systems.
The method is applied together with the Jacobson’s method [100], which identifies
Chapter 3. Scenario-Based Software Architecture Reliability Analysis 55
three types of objects in a system: i ) boundary objects that communicate with actors
in a use-case, ii ) entity objects that are objects from the domain and iii ) control
objects that serve as a glue between boundary objects and entity objects. In our
method, we do not presume any specific decomposition of the software architecture
and we do not categorize objects or components. However, we categorize failure
scenarios based on the failure domain model and each failure scenario is associated
with a component.
Jacobson’s classification [100] is aligned with our failure domain model with respect
to the propagation of failures (fault.source, failure.target). The target feature of
failure, for instance, can be the user (i.e. actor) or the other components. In [124],
a modular representation called Fault Propagation and Transformation Calculus
(FPTC) is introduced. FPTC is used for specifying the failure behavior of each
component (i.e. how a component introduces or transforms a failure type). This
facilitates the automatic derivation of the propagation of the failures throughout the
system. In our method, we represent the propagation of failure scenarios with fault
trees. The semantics of the transformation is captured in the ”type” tags of failure
scenarios.
In this work, we made use of spreadsheets that define the failure scenarios and
automatically calculate the severity values in the fault trees (after initial assignment
of the user-perceived failure severities). This is a straightforward calculation and
as such we have not elaborated on this issue. On the other hand, there is a body
of work focusing on tool-support for FMEA and FTA. In [47], for example, FMEA
tables are being integrated with a web service deployment architecture so that they
can be dynamically updated by the system. In [90], fault trees are synthesized
automatically. Further, multiple failures are taken into account, where a failure
mode can contribute to more than one system failure. The result of the fault tree
synthesis is a network of interconnected fault trees, which is analogous to the fault
tree set in our method.
An Advanced Failure Modes and Effect Analysis (AFMEA) is introduced in [39],
which also focuses on the analysis at the early stages of design. However, the aim of
this work is to enhance FMEA in terms of the number and range of failure modes
captured. This is achieved by constructing a behavior model for the system.
56 Chapter 3. Scenario-Based Software Architecture Reliability Analysis
3.6 Conclusions
In this chapter, we have introduced the software architecture reliability analysis
method (SARAH). SARAH utilizes and integrates mature reliability engineering
techniques and scenario-based architectural analysis methods. The concept of fault,
error failure models, the failure scenarios, fault tree set and the severity calculations
are inspired from the reliability engineering domain [3]. The overall scenario-based
elicitation and prioritization is derived from the work on software architecture anal-
ysis methods [31]. SARAH focuses on the reliability quality attribute and uses
failure scenarios for analysis. Failure scenarios are prioritized based on the user per-
ception. We have illustrated the steps of SARAH for analyzing reliability of a DTV
software architecture. SARAH has helped us to identify the sensitivity points and
architectural tactics for the enhancement of the architecture with respect to fault
tolerance.
The basic limitation of the approach is the inherent subjectivity of the analysis re-
sults. SARAH provides systematic means to evaluate potential risks of the system
and identify possible enhancements. However, the analysis results at the end de-
pends on the failure scenarios that are defined. We use scenarios because they are
practical and integrate good with the current methods. However, the correctness
and completeness of these scenarios cannot be guaranteed in absolute sense because
scenarios are subjective. Another critical issue is the definition of the probabil-
ity values of failures in the analysis. However, just like existing reliability analysis
approaches in the literature the accuracy of the analysis depends on the available
knowledge in the project. SARAH does not adopt a fixed strategy but can be rather
considered as a general process in which different strategies for defining probability
values can be used.
Chapter 4
Software Architecture Recovery
Style
The software architecture of a system is usually documented with multiple views.
Each view helps to capture a particular aspect of the system and abstract away
from the others. The current practice for representing architectural views focus
mainly on functional properties of systems and they are limited to represent fault-
tolerant aspects of a system. In this chapter, we introduce the recovery style and
its specialization, the local recovery style, which are used for documenting a local
recovery design.
The chapter is organized as follows. In section 4.1, we provide background informa-
tion on documenting sofware architectures with multiple views. In section 4.2, we
define the problem statement. Section 4.3 illustrates the problem with a case study,
where we introduce local recovery to the open-source media player, MPlayer. This
case study will be used in the following two chapters as well. In sections 4.4 and
4.5, we introduce the recovery style and the local recovery style, respectively. In sec-
tion 4.6, we illustrate the usage of the recovery style for documenting recovery design
alternatives of MPlayer software. We conclude this chapter after discussing alterna-
tives, extensions and related work. The usage of the recovery style for supporting
analysis and implementation are discussed in the following two chapters.
57
58 Chapter 4. Software Architecture Recovery Style
4.1 Software Architecture Views and Models
Due to the complexity of current software systems, it is not possible to develop
a model of the architecture that captures all aspects of the system. A software
architecture cannot be described in a simple one-dimensional fashion [18]. For this
reason, as explained in section 2.2.1, architecture descriptions are organized around a
set of views. An architectural view is a representation of a set of system elements and
relations associated with them to support a particular concern [18]. Different views
of the architecture support different goals and uses, often for different stakeholders.
In the literature initially some authors have prescribed a fixed set of views to doc-
ument the architecture. For example, the Rational’s Unified Process [70], which is
based on Kruchten’s 4+1 view approach [69] introduces the logical view, develop-
ment view, process view and physical view. Another example is the Siemens Four
Views model [56] that uses conceptual view, module view, execution view and code
view to document the architecture.
Because of the different concerns that need to be addressed for different systems, the
current trend recognizes that the set of views should not be fixed but different sets
of views might be introduced instead. For this reason, the IEEE 1471 standard [78]
does not commit to any particular set of views although it takes a multi-view ap-
proach for architectural description. IEEE 1471 indicates in an abstract sense that
an architecture description consists of a set of views, each of which conforms to
a viewpoint [78] associated with the various concerns of the stakeholders. View-
points basically represent the conventions for constructing and using a view [78]
(See Figure 4.1).
Views and Beyond Approach ÌEEE 1471
Viewpoint
View
Viewtype
View
Style
conforms to
specializes / constraints
conforms to
Figure 4.1: Viewpoint, view, viewtype and style
The Views and Beyond (V&B) approach as proposed by Clements et al. is another
Chapter 4. Software Architecture Recovery Style 59
multi-view approach [18] that appears to be more specific with respect to the view-
points. To bring order to the proposed views in the literature, the V&B approach
holds that a system can be generally viewed from so-called three different viewtypes:
module viewtype, component & connector viewtype and allocation viewtype. Each
viewtype can be further specialized or constrained in so-called architectural styles
(See Figure 4.1). For example, layered style, pipe-and-filter style and deployment
style are specializations of the module viewtype, component & connector viewtype
and allocation viewtype, respectively [18]. The notion of styles makes the V&B
approach adaptable since the architect may in principle define any style needed.
4.2 The Need for a Quality-Based View
Certainly, existing multi-view approaches are important for representing the struc-
ture and functionality of the system and are necessary to document the architecture
systematically. Yet, an analysis of the existing multi-view approaches reveals that
they still appear to be incomplete when considering quality properties. The IEEE
1471 standard is intentionally not specific with respect to concerns that can be ad-
dressed by views. Thus, each quality property can be seen as a concern. In the
V&B approach quality concerns appear to be implicit in the different views but no
specific style has been proposed for this yet. One could argue that software archi-
tecture analysis approaches have been introduced for addressing quality properties.
The difficulty here is that these approaches usually apply a separate quality model,
such as queuing networks or process algebra, to analyze the quality properties. Al-
though these models are useful to obtain precise calculations, they do not depict
the decomposition of the architecture. They need to abstract away from the actual
decomposition of the system to support the required analysis. As a complementary
approach, preferably an architectural view is required to model the decomposition
of the architecture based on the required quality concern.
One of the key objectives of the TRADER project [120] is to develop techniques
for analyzing recovery at the architecture design level. Hereby, modules in a DTV
can be decomposed in various ways and each alternative decomposition might lead
to different recovery properties. To represent the functionality of the system we
have developed different architectural views including module view, component &
connector view and deployment view. None of these views however directly shows
the decomposition of the architecture based on recovery concern. On the other
hand, using separate quality models such as fault trees helps to provide a thorough
analysis but is separate from the architectural decomposition. A practical and easy-
to-use approach coherent with the existing multi-view approaches was required to
understand the system from a recovery point of view.
60 Chapter 4. Software Architecture Recovery Style
In this chapter, we introduce the recovery style as an explicit style for depicting
the structure of the architecture from a recovery point of view. The recovery style
is a specialization of the module viewtype in the V&B approach. Similar to the
module viewtype, component viewtype and allocation viewtype, which define units
of decomposition of the architecture, the recovery style also provides a view of the
architecture. Unlike quality models that are required by conventional analysis tech-
niques, recovery views directly represent the decomposition of the architecture and
as such help to understand the structure of the system related to the recovery con-
cern. The recovery style considers recoverable units as first class elements, which
represent the units of isolation, error containment and recovery control. Each re-
coverable unit comprises one or more modules. The dependency relation in module
viewtype is refined to represent basic relations for coordination and application of
recovery actions. As a further specialization of the recovery style, the local recovery
style is provided, which is used for documenting a local recovery design in more
detail. The usage of the recovery style is illustrated by introducing local recovery
to the open-source software, MPlayer [85].
4.3 Case Study: MPlayer
We have applied local recovery to an open-source software, MPlayer [85]. MPlayer
is a media player, which supports many input formats, codecs and output drivers.
It embodies approximately 700K lines of code and it is available under the GNU
General Public License. In our case study, we have used version v1.0rc1 of this
software that is compiled on Linux Platform (Ubuntu version 7.04). Figure 4.2
presents a simplified module view [18] of the MPlayer software architecture with
basic implementation units and direct dependencies among them. In the following,
we briefly explain the important modules that are shown in this view.
• Stream reads the input media by bytes, or blocks and provides buffering, seek
and skip functions.
• Demuxer demultiplexes (separates) the input to audio and video channels, and
reads them from buffered packages.
• Mplayer connects the other modules, and maintains the synchronization of
audio and video.
• Libmpcodecs embodies the set of available codecs.
• Libvo displays video frames.
Chapter 4. Software Architecture Recovery Style 61
«subsystem»
MpIayer
«subsystem» KEY Module
Dependency (uses)
«subsystem»
Gui
«subsystem»
Libvo
«subsystem»
Demuxer
«subsystem»
Stream
«subsystem»
Libmpcodecs
«subsystem»
Libao
Figure 4.2: A Simplified Module View of the MPlayer Software Architecture
• Libao controls the playing of audio.
• Gui provides the graphical user interface of MPlayer.
We have derived the main modules of MPlayer from the package structure of its
source code (e.g. Source files that are related to the graphical user interface are
collected under the folder ./Gui).
4.3.1 Refactoring MPlayer for Recovery
In principle, fault tolerance should be considered early in the software development
life cycle. Relevant fault classes, possible fault tolerance provisions should be care-
fully analyzed [95] and the software system must be structured accordingly [94].
However, there exist many software systems that have been already developed with-
out fault tolerance in mind. Some of these systems comprise millions lines of code
and it is not possible to develop these systems from scratch within time constraints.
We have to refactor and possibly restructure them to incorporate fault tolerance
mechanisms.
In general, software systems are composed of a set of modules. These modules
provide a set of functionalities each of which has a different importance from the
62 Chapter 4. Software Architecture Recovery Style
users’ point of view [28]. A goal of fault tolerance mechanisms should be to make
important functions available to the user as much as possible. For example, the
audio/video streaming in a DTV should preferably not be hampered by a fault in a
module that is not directly related to the streaming functionality.
Local recovery is an effective fault tolerance technique to attain high system avail-
ability, in which only the erroneous parts of the system are recovered. The other
parts can remain available during recovery. As one of the requirements of local re-
covery, the system has to be decomposed into several units that are isolated from
each other. Note that the system is already decomposed into a set of modules. De-
composition for local recovery is not necessarily, and in general will not be, aligned
one-to-one with the module decomposition of the system. Multiple modules can
be wrapped in a unit and recovered together. We have introduced a local recovery
mechanism to MPlayer to make it fault-tolerant against transient faults. We had to
decompose it into several units. The communication between multiple units had to
be controlled and recovery actions had to be coordinated so that the units can be
recovered independently and transparently, while the other units are operational.
4.3.2 Documenting the Recovery Design
Designing a system with recovery or refactoring it to introduce recovery requires
the consideration of several design choices related to: the types of errors that will
be recovered, error containment boundaries, the way that the information regarding
error detection and diagnosis is conveyed, mechanisms for error detection, diagnosis,
the control of the communication during recovery and the coordination of recovery
actions, architectural elements that embody these mechanisms and the way that
they interact with the rest of the system.
To be able to communicate these design choices, they should take part in the archi-
tecture description of the system. Effective communication is one of the fundamen-
tal purposes of an architectural description that comprise several views. However,
viewpoints that capture functional aspects of the system are limited for representing
recovery mechanisms and it is inappropriate for understandability to populate these
views with elements and complex relations related to recovery.
Architecture description should also enable analysis of recovery design alternatives.
For example, splitting the system for local recovery on one hand increases availabil-
ity of the system during recovery but on the other hand, it leads to a performance
overhead due to isolation. This leads to a trade-off analysis. Architecture descrip-
tion of the system should comprise information not only regarding the functional
decomposition but also decomposition for recovery to support such analysis.
Chapter 4. Software Architecture Recovery Style 63
Realizing a recovery mechanism usually requires dedicated architectural elements
and relations that have system-wide impact. Architectural description related to
recovery is also required for providing a roadmap for the implementation and sup-
porting the detailed design.
In short, we need a architectural view for recovery to i ) communicate recovery design
decisions ii ) analyze design alternatives iii ) support the detailed design. For this
purpose, we introduce the recovery and the local recovery styles. In this chapter, we
describe these styles and illustrate their usage for documenting the recovery design
alternatives of MPlayer. In Chapter 5, we present analysis techniques that can be
performed based on the recovery style. In Chapter 6, the style is used for supporting
the detailed design and realization.
4.4 Recovery Style
The key elements and relations of the recovery style are listed below. A visual
notation based on UML [104] is shown in Figure 4.3 that can be used for documenting
a recovery view.

recoverable unit
applies-recovery-
action-to
conveys-information-to

non-recoverable unit
elements relations
Figure 4.3: Recovery style notation
• Elements: recoverable unit (RU), non-recoverable unit (NRU)
• Relations: applies-recovery-action-to, conveys-information-to
• Properties of elements: Properties of RU: set of system modules together with
their criticality (i.e. functional importance) and reliability, types of errors
that can be detected, supported recovery actions, type of isolation (process,
exception handling, etc.). Properties of NRU: types of errors that can be
detected (i.e. monitoring capabilities).
64 Chapter 4. Software Architecture Recovery Style
• Properties of relations: Type of communication (synchronous / asynchronous)
and timing constraints if there are any.
• Topology: The target of a applies-recovery-action-to relation can only be a
RU.
The recovery style introduces two types of elements; RU and NRU. A RU is a unit
of recovery in the system, which wraps a set of modules and isolates them from the
rest of the system. It provides interfaces for conveying information about detected
errors and responding to triggered recovery actions. A RU has the ability to recover
independently from other RU and NRUs it is connected to. A NRU, on the other
hand, can not be recovered independent from other RU and NRUs. It can only be
recovered together with all other connected elements. This can happen in the case of
global recovery or a recovery mechanism at a higher level in the case of a hierarchic
decomposition.
The relation conveys-information-to is about communication between elements for
guiding and coordinating the recovery actions. Conveyed information can be related
to detected errors, diagnosis results or a chosen recovery strategy. The relation
applies-recovery-action-to is about the control imposed to a RU and it affects the
functioning of the target RU (suspend, kill, restart, roll-back, etc.)
4.5 Local Recovery Style
In the following, we present the local recovery style in which the elements and rela-
tions of the recovery style are further specialized. This specialization is introduced
in particular to document the application of our local recovery framework (FLORA)
to MPlayer as presented in Chapter 6. Figure 4.4 provides a notation for specialized
elements and relations based on UML.
• Elements: RU, recovery manager, communication manager
• Relations: restarts, kills, notifies-error-to, provides-diagnosis-to, sends-queued message-
to
• Properties of elements: Each RU has a name property in addition to those
defined in the recovery style.
• Properties of relations: notifies-error-to relation has a type property to indi-
cate the error type if multiple types of errors can be detected.
Chapter 4. Software Architecture Recovery Style 65
<< RU >>
RU name
<<NRU>> recovery mgr
<<NRU>> comm mgr
recoverable unit
recovery manager
communication
manager
elements relations
<<restart>>
<<queued>>
<<error [:type]>>
<<diagnosis>>
restarts
notifies-error-to
provides-diagnosis-to
sends-queued
message-to
<<kill>>
kills
Figure 4.4: Local recovery style notation
• Topology: restarts and kills relations can be only from a recovery manager to
a RU. sends-queued message-to can be between a RU and a communication
manager.
Communication manager connects RUs together, routes messages and informs con-
nected elements about the availability of another connected element. Recovery man-
ager applies recovery actions on RUs.
Figure 4.5 depicts a simple instance of the local recovery style with one communi-
cation manager, one recovery manager and two RUs, A and B. Errors are detected
by the communication manager and notified to the recovery manager. Recovery
manager restarts A and/or B. The messages that are sent from the communica-
tion manager to A are stored in a queue by the communication manager while A is
unavailable (i.e. being restarted) and sent (retried) after A becomes available.
66 Chapter 4. Software Architecture Recovery Style
<< RU >>
A
<<NRU>> recovery mgr <<NRU>> comm mgr
<< RU >>
B
<<restart>> <<restart>> <<queued>>
<<error>>
Figure 4.5: A simple recovery view based on the local recovery style (KEY: Fig-
ure 4.4)
4.6 Using the Recovery Style
Our aim is to use the recovery style for MPlayer (Section 4.3) to choose and realize
a recovery design among several design alternatives. One such alternative is to
introduce a local recovery mechanism comprising 3 RUs as listed below.
• RU AUDIO: provides the functionality of Libao
• RU GUI : encapsulates the Gui functionality
• RU MPCORE: comprises the rest of the system.
Figure 4.6 depicts the boundaries of these RUs, which are overlayed on the module
view of the MPlayer software architecture. In Figure 4.7(b) the recovery design
corresponding to this RU selection is shown
1
. Here, we can also see two new archi-
tectural elements that are not recoverable; a communication manager that mediates
and controls the communication between the RUs and a recovery manager that ap-
plies the recovery actions on RUs.
Note that Figure 4.7(b) shows just one possible design for the recovery mechanism
to be introduced to the MPlayer. We could consider many other design alternatives.
One alternative would be to have only 1 RU (See Figure 4.7(a)). This would ba-
sically lead to a global recovery since all the modules are encapsulated in a single
RU. We could also have more than 3 recoverable units as shown in Figure 4.7(c).
1
In this chapter, we use the recovery style notation to represent all the recovery design alter-
natives. We use the local recovery style in Chapter 6, to represent local recovery designs in more
detail.
Chapter 4. Software Architecture Recovery Style 67
«subsystem»
MpIayer
«subsystem» KEY
Module
Dependency (uses)
«subsystem»
Gui
«subsystem»
Libvo
«subsystem»
Demuxer
«subsystem»
Stream
«subsystem»
Libmpcodecs
«subsystem»
Libao
RU
MPCORE
RU
AUDIO
RU
GUI
Boundaries of RU
RU Recoverable Unit
Figure 4.6: The Module View of the MPlayer Software Architecture with the Bound-
aries of the Recoverable Units
Hereby, each module of the system is assigned to a separate RU
2
. We could consider
many such alternatives to decompose the system into a set of RUs
3
. On the other
hand, there could be more than one element that controls the recovery actions and
communication (e.g. distributed or hierarchical control). These elements could be
recoverable units as well. Recovery views of the system helps us to document and
communicate such design alternatives. The following chapters further elaborate on
the related analysis techniques and realization activities supported by the recovery
style.
2
The names of the RUs are not significant and they are assigned based on the comprised
modules.
3
Chapter 5 presents an analysis method together with a tool-set for optimizing the decomposi-
tion of software architecture into RUs.
68 Chapter 4. Software Architecture Recovery Style
<< RU >>
MPLAYER
<<NRU>> recovery mgr
(a) recovery design with 1 RU
<< RU >>
AUDIO
<< RU >>
MPCORE
<< RU >>
GUI
<<NRU>> recovery mgr <<NRU>> comm mgr
(b) recovery design with 3 RUs
<< RU >>
AUDIO
<< RU >>
MPCORE
<< RU >>
GUI
<<NRU>> recovery mgr <<NRU>> comm mgr
<< RU >>
DEMUX
<< RU >>
CODECS
<< RU >>
VIDEO
(c) recovery design with 7 RUs
Figure 4.7: Architectural alternatives for the recovery design of MPlayer software
(KEY: Figure 4.3)
Chapter 4. Software Architecture Recovery Style 69
4.7 Discussion
Style or viewpoint?
In this work we have extended the module viewtype to represent a recovery style. In
the V&B approach a distinction is made between viewtypes, styles and views [18].
Hereby, styles can be considered as specialized viewtypes, and views are considered
as instances of these styles. The IEEE 1471 standard [78], on the other hand,
describes viewpoints and views, and does not describe the concept of style. In the
IEEE 1471 standard, viewpoint represents the language to represent views. So, in
the parlance of this standard, the recovery style that we have introduced can be
considered as a viewpoint as well because it represents a language to define recovery
views. Yet, in this paper we have focused on representing recovery views for units
of implementation which is made explicit in the V&B approach through the module
viewtype. To clarify this focus we adopted the term style instead of the more general
term viewpoint.
Recovery Style based on other viewtypes
Although we have focused on defining a recovery view for implementation units,
we could also derive a recovery style for the component & connector viewtype or
the allocation viewtype [18] to represent recovery views of run-time or deployment
elements. This would provide a broader view on recovery and we consider this as a
possible extension of our work.
Utilization of Recovery Style in different contexts
There are several works that apply local recovery in different contexts. They can be
represented with the local recovery style as well. For example, in [53] device drivers
are executed on separate processes, where microreboot [16] is applied to increase the
failure resilience of operating systems. They provide a view of the architecture of the
operating system, where the architectural elements related to fault tolerance and cor-
responding relations (recovery actions and error notifications) are shown. According
to this view [53] Process Manager, which restarts the processes, is a non-recoverable
unit that coordinates the recovery actions. Reincarnation Server monitors the sys-
tem to facilitate error detection and diagnosis and it guides the recovery procedure.
Data Store, which is a name server for interconnected components, mediates and
controls communication.
70 Chapter 4. Software Architecture Recovery Style
4.8 Related Work
Perspectives [125] guide an architect to modify a set of existing views to document
and analyze quality properties. Our work addresses the same problem for recovery
by introducing the recovery style, which is instantiated as a view. One of the
motivations for using perspectives instead of creating a quality-based view is to avoid
the duplicate information related to architectural elements and relations. This was
less an issue in our project context, where we needed an explicit view for recovery to
depict the related decomposition, which includes, dedicated architectural elements
with complex interactions among them (e.g. Figure 6.3).
Architectural tactics [4] aim at identifying architectural decisions related to a quality
attribute requirement and composing these into an architecture design. The recovery
style is used for representing and analyzing an architectural tactic for recovery. It
provides a view of the system, by means of which the recovery design of a system
can be captured.
The idealized fault-tolerant COTS (commercial off-the-shelf) component is intro-
duced for encapsulating COTS components to add fault tolerance capabilities [25].
The approach is based on the idealized C2 component (iC2C), which is compliant
with the C2 architectural style [114]. The C2 architectural style is a component-
based style, in which components communicate only through asynchronous messages
mediated by connectors. Components are wrapped to cope with interface and ar-
chitectural mismatches [43]. iC2C specializes the elements and relations of the C2
architectural style to explicitly represent fault tolerance provisions. The iC2C dif-
ferentiates between service requests/responses and interface/failure exceptions as
subtypes of requests and notifications in the C2 architectural style. It also intro-
duces two internal components, a NormalActivity component and an AbnormalAc-
tivity component together with an additional connector between these. The iC2C
NormalActivity component implements the normal behavior and error detection,
whereas the iC2C AbnormalActivity component is responsible for error recovery.
The idealized fault-tolerant architectural element (iFTE) [27] is introduced to make
exception handling capabilities explicit at the architectural level. The iFTE adopts
the basic principles of the iC2C. However, it can be instantiated for both compo-
nents and connectors, named as iFTComponent and iFTConnector, respectively.
This approach enables the analysis of recovery properties of a system based on its
architecture design.
Multi-version connector [93] is a type of connector based on the C2 architectural
style. It relies on multiple versions of a component for different operations to improve
the dependability of a system. This approach can be used for representing N-version
Chapter 4. Software Architecture Recovery Style 71
programming in the software architecture design.
The notion of Web Service Composition Action (WSCA) [113] is introduced for
structuring the composition of Web services in terms of coordinated atomic actions
to enforce transaction processing. An XML-based specification defines the elements,
which participate to perform the corresponding action, together with the standard
behavior and their cooperative recovery actions in case of an exception.
An architectural style for tolerating faults in Web services is proposed in [91]. In
this style, a system is composed of multiple variants of components, which crash in
case of a failure. The system is then dynamically reconfigured to utilize alternative
variants of the crashed components.
In the architecture-based exception handling [63] approach, architecture configu-
ration exceptions and their handling are specified in the architecture description.
Configuration exceptions are defined as conditions with respect to possible con-
figurations of architectural elements and their interaction (communication patterns
within the current configuration). Exception handlers are defined as a set of required
changes to the architecture configuration (substitution of elements and alterations
of bindings). The approach is supported by a run-time environment, which recon-
figures the system dynamically based on the provided specification.
Another run-time architecture reconfiguration approach is presented in [24] for up-
dating a running system to tolerate faults. In this approach, the architecture de-
scription is expressed with xADL [23] and repair plans are specified as a set of
elements to be added to or removed from the architecture. In case of an error,
the associated repair plan is executed and the configuration of the architecture is
changed accordingly.
A generic run-time adaptation framework is presented in [44] together with its spe-
cialization and realization for performance adaptation. According to the concepts
in this framework, our monitoring mechanisms are error detectors. Diagnosis facil-
ities can be considered as gauges, which interpret monitored low-level events, and
recovery actions can be considered as architectural repairs. As a difference from the
approach presented in [44], we propose a style specific for fault tolerance.
As previously discussed in Chapter 3, several software architecture analysis ap-
proaches have been introduced for addressing quality properties. The goal of these
approaches is to assess whether or not a given architecture design satisfies desired
quality requirements. The main aim of the recovery style, on the other hand, is to
communicate and support the design of recovery. Analysis is a complementary work
to make trade-off decisions, tune and select recovery design alternatives.
72 Chapter 4. Software Architecture Recovery Style
4.9 Conclusions
In this chapter, we have introduced the recovery style to document and analyze
recovery properties of a software architecture. The recovery style is a specialization
of the module viewtype as defined in the V&B approach [18]. We have further
specialized the recovery style to define the local recovery style. The usage of these
styles have been illustrated for defining recovery views of MPlayer. These views
have formed the basis for analysis of design alternatives. They have also supported
the detailed design and implementation of local recovery for MPlayer. Chapter 5
explains the analysis of decomposition alternatives for local recovery, whereas Chap-
ter 6 explains the realization of recovery views.
Chapter 5
Quantitative Analysis and
Optimization of Software
Architecture Decomposition for
Recovery
Local recovery enables the recovery of the erroneous parts of a system while the
other parts of the system are operational. To prevent the propagation of errors,
the architecture needs to be decomposed into several parts that are isolated from
each other. There exist many alternatives to decompose a software architecture for
local recovery. It appears that each of these decomposition alternatives performs
differently with respect to availability and performance metrics. In this chapter, we
propose a systematic approach, which employs integrated set of analysis techniques
to optimize the decomposition of software architecture for local recovery.
The chapter is organized as follows. In the following two sections, we introduce
background information on program analysis and analytical models that are utilized
by the proposed approach. In section 5.3, we define the problem statement for
selecting feasible decomposition alternatives. In section 5.4, we present the top-
level process of our approach and the architecture of the tool that supports this
process. Sections 5.5 through 5.11 explain the steps of the top-level process and
illustrate them for the open-source MPlayer software. In section 5.12 we evaluate
the analysis results. We conclude the chapter after discussing limitations, extensions
and related work.
73
74 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
5.1 Source Code and Run-time Analysis
Program analysis techniques are used for automatically analyzing the structure
and/or behavior of software for various goals such as the ones listed below.
• Optimizing the generated code during compilation [1]
• Automatically pinpointing (potential) errors and as such reducing debugging
time [119]
• Automatically detecting vulnerabilities and security threats [92]
• Supporting the understanding and comprehension of large and complex pro-
grams [20]
• Reverse engineering legacy software systems [87]
Several types of program analysis techniques are used in practice as a part of ex-
isting software development tools and compilers, which are tailored for different
execution platforms and programming languages. There are mainly two approaches
for program analysis; static analysis and dynamic analysis.
Static analysis approaches inspect the source code of programs and perform the
analysis without actually executing these programs. The sophistication of the anal-
ysis varies depending on the granularity of the analysis (e.g. individual statements,
functions/methods, source files) and the purpose (e.g. spotting potential errors,
verifying specified properties) [9]. Although static analysis approaches are helpful
for several aforementioned purposes, they are limited due to the lack of run-time
information. For example, static analysis can reveal the dependency between func-
tion calls but not the frequency of calls and their execution times. Moreover, static
analysis is not always practical and scalable. Depending on the type of analysis (e.g.
data flow and especially pointer analysis [55]) and the size of the analyzed program,
static analysis can become highly resource consuming, and very soon intractable.
Dynamic analysis approaches make use of data gathered from a running program.
The program is executed on a real or virtual processor according to a certain sce-
nario. During the execution, various data are collected that are subject to analysis.
The advantage of dynamic analysis approaches is their ability to capture detailed
interactions at run-time (e.g. pointer operations, late binding). This leads to more
accurate information about the analyzed program compared to what might be ob-
tained with static analysis. On the other hand, dynamic analysis has also limitations
and drawbacks. First, the collected data for analysis depends on the program input.
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 75
Use of techniques for estimating code coverage [127] can help to ensure that a suf-
ficient subset of the program’s possible behaviors has taken into account. Second,
depending on the type and granularity of the analyzed information, the amount of
collected data may turn out to be huge and intractable. Third, instrumentation or
probing has an effect on the execution of the target program. As a potential risk,
this effect can influence the program behavior and analysis results.
In this chapter, we utilize dynamic analysis techniques for estimating the perfor-
mance overhead introduced by a recovery design. We utilize two different tools;
GNU gprof [40] and Valgrind [88]. We use GNU gprof tool to obtain the function
call graph of the system together with statistics about the frequency of performed
function calls and the execution time of functions (i.e. function call profile). We use
the Valgrind framework to obtain the data access profile of the system.
5.2 Analysis based on Analytical Models
The main goal of recovery is to achieve higher system availability to its users in
case of component failures. Thus, to select an optimal recovery design, we need to
quantitatively assess design alternatives with respect to availability. As defined in
section 2.1.1, availability is a measure of readiness for correct service and it can be
simply represented by the following formula.
Availability =
MTTF
MTTF + MTTR
(5.1)
Hereby, MTTF and MTTR stand for the mean time to failure and the mean time
to repair, respectively. These concepts are mainly inherited from hardware fault
tolerance and adapted for software fault tolerance. To estimate the availability and
related quality attributes for complex systems and fault-tolerant designs, several
formal models, techniques and tools have been devised [121]. These techniques are
mostly based on Markov models [51], queuing networks [22], fault trees [35], or a
combination of these. Quantitative availability analysis methods usually employ
so-called state-based models [61]. In the application of state-based models, it is in
general assumed that the behavior of modules has a Markov property, where the
future behavior is conditionally independent of the past behavior [48]. Usually dis-
crete time Markov chains (DTMC) or continuous time Markov chains (CTMC) are
used as the modeling formalism. DTMCs are used for representing applications
that operate on demand, whereas CTMC formalism is well-suited for continuously
operating software applications [48]. A CTMC is defined by a set of states and
76 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
transitions between these states. The transitions are labeled with rates λ indicating
that the transition is taken after a delay that is governed by an exponential distri-
bution with parameter λ. In availability modeling, the states represent the degree
of availability of the system [73]. Figure 5.1 depicts a CTMC as a simple availability
model.

Figure 5.1: An example CTMC as a simple availability model
Since the delay for taking a transition is governed by an exponential distribution
with a constant rate, the mean time to take a transition is simply the reciprocal of
the transition rate. Consequently, λ = 1/MTTF defines the failure rate, where a
transition is made from the “up” state to the “failed” state. Similarly, µ = 1/MTTR
defines the recovery rate, where a transition is made from an “failed” state to the
“up” state.
Markov model checker tools like CADP [42] can take a CTMC model as an input and
compute the probability that a particular state will be visited in the long run (i.e.
steady state). Since states represent availability degrees, this makes it possible to
calculate the total availability of the system in the long run. In the example model
shown in Figure 5.1, for instance, the availability degrees implied by the “up” and
“failed” states are 1 and 0, respectively. Thus, the probability that the “up” state
is visited in the long run determines the availability of the system.
A potential problem with such analytical models is the state space explosion problem.
Depending on the complexity of the system and the recovery design, the state space
can get too large. As a result, it can take too long time for the model checker
to calculate, if possible at all, the steady state probabilities. To overcome this
problem, Input/output interactive Markov chain (I/O-IMC) formalism has been
introduced [10] as explained in the following subsection.
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 77
5.2.1 The I/O-IMC formalism
I/O-IMCs [10, 11] are a combination of Input/Output automata (I/O automata) [86]
and interactive Markov chains (IMCs) [54]. Essentially, I/O-IMCsare extensions of
CTMCs [57] having 2 main types of transitions; interactive transitions and Marko-
vian transitions. Markovian transitions are labeled with the corresponding transition
rates just like in regular CTMCs. Interactive transitions are identified with 3 types
of actions;
• Input actions: This type of actions are controlled by the environment. A tran-
sition with an input action is taken if and only if another I/O-IMC performs
the corresponding output action (input and output actions are matched by
labels). I/O-IMCs are input-enabled meaning that each state is ready to re-
spond to any input action. Hence, every state has an outgoing transition (or
self-transition by default), for each possible input action.
• Output actions: These actions are controlled by the I/O-IMC itself. Transi-
tions labeled with output actions are taken immediately.
• Internal actions: These actions are not visible to the environment. They are
controlled by the I/O-IMC itself and the corresponding transitions are taken
immediately.
S1
S2
S3
S4 a?
a?
a?
a?
S5
b!
λ
µ
a?
Figure 5.2: An example I/O-IMC
An example I/O-IMC is depicted in Figure 5.2. Hereby, Markovian transitions are
depicted with dashed lines, and interactive transitions by solid lines. Labels of input
and output actions are appended with “?” and “!”, respectively. In this example,
there are 2 Markovian transitions: one from S1 to S2 with rate λ and another from
S3 to S4 with rate µ. The I/O-IMC has 1 input action, which is labeled as a?. Note
78 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
that, all the states have a transition for this input action to ensure that the I/O-IMC
is input-enabled. There is also only 1 output action specified in this example, which
is labeled as b!. S1 is the initial state of the I/O-IMC.
The behavior of an I/O-IMC depends on both its internally specified behavior and
the environment (i.e. behavior of the other I/O-IMCs). For example, the state S1 in
Figure 5.2 has 2 outgoing transitions: a Markovian transition to S2 with rate λ and
an interactive transition to S3 with input action a?. In S1, the I/O-IMC delays for
a time that is governed by an exponential distribution with parameter λ, and moves
to S2. If however, before that delay ends, an input a? arrives (another I/O-IMC
takes a transition with output action a!), then the I/O-IMC moves to S3. As soon
as the state S4 is reached, the I/O-IMC takes a transition to S5 with output action
b!. This transition might influence the behavior of other I/O-IMCs that have been
waiting for an input action b?.
The I/O-IMC formalism enables modularity in model building and analysis. Mul-
tiple I/O-IMC models can be composed together based on the common inputs they
consume and outputs they generate. Instead of building a model for the system as
a whole, the behavior of system elements can be modeled independently and then
combined with a compositional aggregation approach [10]. Compositional aggrega-
tion is a technique to build an I/O-IMC by composing, in successive iterations, a
number of elementary I/O-IMCs and reducing (i.e. aggregating) the state-space of
the generated I/O-IMC after each iteration. This helps in managing large state
spaces and model elements can be re-used in different configurations. Composition
and aggregation of I/O-IMCs results in a regular CTMC, which can be then ana-
lyzed using standard methods to compute different dependability and performance
measures.
5.3 Software Architecture Decomposition for Lo-
cal Recovery
An error occurring in one part of the system can propagate and lead to errors in
other parts. To prevent the propagation of errors and recover from them locally, the
system should be decomposed into a set of isolated recoverable units (RUs) ([59, 16]).
For example, recall the MPlayer case study presented in section 4.3. As re-depicted
in Figure 5.3, one possible decomposition for the MPlayer software is to partition the
system modules into 3 RUs; i ) RU AUDIO, which provides the functionality of Libao
ii ) RU GUI, which encapsulates the Gui functionality and iii ) RU MPCORE which
comprises the rest of the system. This is only one alternative decomposition for
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 79
isolating the modules within RUs and as such introducing local recovery. Obviously,
the partitioning can be done in many different ways.
«subsystem»
MpIayer
«subsystem» KEY
Module
Dependency (uses)
«subsystem»
Gui
«subsystem»
Libvo
«subsystem»
Demuxer
«subsystem»
Stream
«subsystem»
Libmpcodecs
«subsystem»
Libao
RU
MPCORE
RU
AUDIO
RU
GUI
Boundaries of RU
RU Recoverable Unit
Figure 5.3: An example decomposition of the MPlayer software into RUs
5.3.1 Design Space
To reason about the number of decomposition alternatives, we first need to model
the design space. In fact, the partitioning of architecture into a set of RUs can
be generalized to the well-known set partitioning problem [50]. Hereby, a partition
of a set S is a collection of disjoint subsets of S whose union is S. For exam-
ple, there exists 5 alternatives to partition the set {a, b, c}. These alternatives are:
{{a}, {b}, {c}}, {{a}, {b, c}}, {{b}, {a, c}}, {{c}, {a, b}}, {{a, b, c}}. The number of
ways to partition a set of n elements into k nonempty subsets is computed by the
so-called Stirling numbers of the second kind, S(n, k) [50]. It is calculated with the
recursive formula shown in Equation 5.2.

n
k

= k

n−1
k

+

n−1
k −1

, n ≥ 1 (5.2)
The total number of ways to partition a set of n elements into arbitrary number of
nonempty sets is counted by the n
th
Bell number as follows.
80 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
B
n
=
n

k=1
S(n, k) (5.3)
In theory, B
n
is the total number of partitions of a system with n modules. B
n
grows
exponentially with n. For example, B
1
= 1, B
3
= 5, B
4
= 15, B
5
= 52, B
7
= 877
(The MPlayer case), B
15
= 1.382.958.545. Therefore, searching for a feasible design
alternative becomes quickly problematic as n (i.e. the number of modules) grows.
5.3.2 Criteria for Selecting Decomposition Alternatives
Each decomposition alternative in the large design space may have both pros and
cons. In the following, we outline the basic criteria that we consider for choosing a
decomposition alternative.
• Availability: Local recovery aims at keeping the system available as much
as possible. The total availability of a system depends on the availability of
its individual recoverable units. To maximize the availability of the overall
system, MTTF of RUs must be kept high and MTTR of RUs must be kept
low (See Equation 5.1). The MTTF and MTTR of RUs on their turn depend
on the MTTF and MTTR values of the contained modules. As such, the
overall value of MTTF and MTTR properties of the system depend on the
decomposition alternative, that is, how we separate and isolate the modules
into a set of RUs.
• Performance: Logically, the optimal availability of the system is defined by
the decomposition alternative in which all the modules in the system are sep-
arately isolated into RUs. However, increasing the number of RUs leads to a
performance overhead due to the dependencies between the separated modules
in different RUs. We distinguish between two important types of dependen-
cies that cause a performance overhead; i ) function dependency and ii ) data
dependency.
The function dependency is the number of function calls between modules
across different RUs. For transparent recovery these function calls must be
redirected, which leads to an additional performance overhead. For this reason,
for selecting a decomposition alternative we should consider the number of
function calls among modules across different RUs.
The data dependencies are proportional to the size of the shared variables
among modules across different RUs. In some of the previous work on local
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 81
recovery ([16, 53]) it has been assumed that the RUs are stateless and as such
do not contain shared state variables. This assumption can hold, for example,
for stateless Internet service components [16] and stateless device drivers [53].
However, when an existing system is decomposed into RUs, there might be
shared state variables leading to data dependencies between RUs. In fact,
there have been also work focusing on saving component states in case of a
component failure [68] and determining a global system state based on possibly
interdependent component states [17]. Obviously, the size of data dependencies
complicate the recovery and create performance overhead because the shared
data need to be kept synchronized after recovery. This makes the amount of
data dependency between RUs an important criteria for selecting RUs.
It appears that there exists an inherent trade-off between the availability and per-
formance criteria. On the one hand increasing availability will require to increase
the number of RUs
1
. On the other hand increasing the number of RUs will intro-
duce additional performance overhead because the amount of function dependencies
and data dependencies will increase. Therefore, selecting decomposition alternatives
requires their evaluation with respect to these two criteria and making the desired
trade-off.
5.4 The Overall Process and the Analysis Tool
In this section we define the approach that we use for selecting a decomposition
alternative with respect to local recovery requirements. Figure 5.4 depicts the main
activities of the overall process in a UML [100] activity diagram. Hereby, the filled
circle and the filled circle with a border represent the starting point and the ending
point, respectively. The rounded rectangles represent activities and arrows (i.e.
flows) represent transitions between activities. The beginning of parallel activities
are denoted with a black bar with one flow going into it and several leaving it. The
end of parallel activities are denoted with a black bar with several flows entering
it and one leaving it. This means that all the entering flows must reach the bar
before the transition to the next activity is taken. Alternative activities are reached
through different flows that are stemming from a single activity. The activities of
the overall process are categorized into five groups as follows.
1
This is based on the assumption that the fault tolerance mechanism remains simple and fault-
free. By increasing the number of RUs, an error of a module will lead to a failure of a smaller
part of the system (the corresponding RU), while the other RUs can remain available. However,
depending on the fault assumptions and the system, increasing the number of RUs can increase
complexity and this, in turn can increase the probability of introducing additional faults.
82 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
• Architecture Definition: The activities of this group are about the definition of
the software architecture by specifying basic modules of the system. The given
architecture will be utilized to define the feasible decomposition alternatives
that show the isolated RUs consisting of the modules. To prepare the analysis
for finding the decomposition alternative each module in the architecture will
be annotated with several properties. These properties are utilized as an input
for the analytical models and heuristics during the Decomposition Selection
process.
• Constraint Definition: Using the initial architecture we can in principle define
the design space including the decomposition alternatives. As we have seen
in section 5.3.1 the design space can easily get very large and unmanageable.
In addition, not all theoretically possible decompositions are practically pos-
sible because modules in the initial architecture cannot be freely allocated to
RUs due to domain and stakeholder constraints. Therefore, before generating
the complete design space first the constraints will be specified. The type of
constraints that we consider can i) limit the number of RUs, ii) prevent the
allocation of some modules to same or separate RUs, iii) limit the perfor-
mance overhead that can be introduced by a decomposition alternative based
on provided measurements from the system. The activities of the Constraint
Definition group involve the specification of such domain and stakeholder con-
straints.
• Design Space Generation: After the domain constraints are specified the pos-
sible decomposition alternatives will be defined. Each decomposition alterna-
tive defines a possible partitioning of the initial architecture modules into RUs.
The number of these alternatives is computed based on the Stirling numbers
as explained in section 5.3.1.
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 83
Generate the
Design Space
Define the Feasible
Design Space
Measure
Data
Dependencies
Measure
Function
Dependencies
Annotate the Software Architecture
Generate Analytical Models
Select the Decomposition
with max. Availability
Utilize heuristics and
Optimization techniques
Constraint
Definition
Decomposition
Selection
Define
Domain
Constraints
Define
Performance
Feasibility
Constraints
Define
Deployment
Constraints
Measurement
Architecture
Definition
Define the Software Architecture
Design Space
Generation
Figure 5.4: The Overall Process
84 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
• Performance Measurement: Even though some alternatives are possible after
considering the domain constraints they might in the end still not be feasible
due to the performance overhead introduced by the particular allocation of
modules to RUs. To estimate the real performance overhead of decomposition
alternatives we need to know the amount of function and data dependencies
among the system modules. The activities of the Performance Measurement
group deal with performing related dynamic analysis and measurements on the
existing system. These measurements are utilized during the Decomposition
Selection process.
• Decomposition Selection: Based on the input of all the other groups of activ-
ities, the activities of this group select an alternative decomposition
2
based
on either of two approaches. The selected approach depends on the size of
the feasible design space derived from the architecture description, specified
constraints and measurements in the previous activities. If the size is smaller
than a maximum affordable amount, we generate analytical models for each
alternative in the design space to estimate the system availability. Using the
analytical models we select the RU decomposition alternative that is optimal
with respect to availability and performance. If the design space is too large
the generation of analytical models will be more time consuming. In that case
we use a set of optimization algorithms and related heuristics to find a feasible
decomposition alternative.
Figure 5.4 only shows a simple flow of the main activities in the overall process.
There can be many iterations of these activities at different steps, starting from
the architecture definition, constraint definition or decomposition selection. For
brevity, we do not elaborate on this and we do not populate the diagram with these
iterations.
We have developed an analysis tool, the Recovery Designer, that implements the pro-
vided approach. The tool is fully integrated in the Arch-Studio environment [23],
which includes a set of tools for specifying and analyzing software architectures.
Arch-Studio is an open-source software and systems architecture development envi-
ronment based on the Eclipse open development platform [6].
We have used the meta-modeling facilities of ArchStudio to extend the tool and
integrate the Recovery Designer with the provided default set of tools. A snapshot of
the extended Arch-Studio can be seen in Figure 5.5 in which the Recovery Designer
tool is activated. At the bottom of the tool a button Recovery Designer can be
2
Although we use the term ‘decomposition’ for this activity, please note that the system is
already decomposed into a set of modules. On top of this, we are introducing a decomposition by
actually ‘composing’ the system modules into RUs.
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 85
Figure 5.5: A Snapshot of the Arch-Studio Recovery Designer
seen, which opens the Recovery Designer tool consisting of a number of sub-tools.
The architecture of Recovery Designer is shown in Figure 5.6. The tool boundary
is defined by the large rectangle with dotted lines. The tool consist itself of six
sub-tools Constraint Evaluator, Design Alternative Generator, Function Dependency
Analyzer, Data Dependency Analyzer, Analytical Model Generator, and Optimizer.
Further, it uses five external tools Arch-Edit, GNU-gprof, Valgrind, gnuplot and
CADP model checker. In the subsequent sections, we will explain the activities of
the proposed approach in detail and we will refer to the related component(s) of the
tool.
86 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
KEY
data flow external tool
tool boundary
tool component
Optimizer
Function
Dependency
Analyzer
Data
Dependency
Analyzer
Arch-Studio
Recovery
Designer
Constraint
Evaluator
Design
Alternative
Generator
Analytical
Model
Generator
GNU
gprof
Valgrind
Model
Checker
optimal
design
Arch-Edit
gnuplot
Figure 5.6: Arch-Studio Recovery Designer Analysis Tool Architecture
5.5 Software Architecture Definition
The first step Architecture Definition in Figure 5.4 is composed of two activities; def-
inition of the software architecture and annotation of the software architecture. For
describing the software architecture, we use the module view of the system, which
includes basic implementation units and direct dependencies among them [18]. This
activity is carried out by using the Arch-Edit tool of the Arch-Studio [23]. Arch-
Studio specifies the software architecture with an XML-based architecture descrip-
tion language called xADL [23]. Arch-Edit is a front-end that provides a graphical
user interface for creating and modifying an architecture description. Both the XML
structure of the stored architectural description and the user interface of Arch-Edit
(i.e. the set of properties that can be specified per module) conforms to the xADL
schema and as such can be interchanged with other XML-based tools.
In the second activity of Architecture Definition a set of properties per module are
defined to prepare the subsequent analysis. These properties are MTTF, MTTR
and Criticality. Criticality property defines how critical the failure of a module is
with respect to the overall functioning of the system.
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 87
To be able to specify these properties in our tool Recovery Designer, we have ex-
tended the existing xADL schema with a new type of interface, i.e. reliability in-
terface, for modules. A part of the corresponding XML schema that introduces the
new properties is shown in Figure 5.7. The concept of an interface with analytical
parameters has been borrowed from [49], where Grassi et al. define an extension
to the xADL language to be able to specify parameters for performance analysis.
They introduce a new type of interface that comprises two type of parameters; con-
structive parameter and analytic parameter. The analytic parameters are used for
analysis purposes only. In our case, we have specified the reliability interface that
consists of the three parameters MTTF, MTTR and Criticality
3
. Using Arch-Edit
both the architecture and the properties per module can be defined. The modules
together with their parameters are provided as an input to analytical models and
heuristics that are used in the Decomposition Selection process.
1: <xsd:complexType name="ReliabilityInterface">
2: <xsd:complexContent>
3: <xsd:extension base="Interface">
4: <xsd:sequence>
5: <xsd:element name="MTTF"
6: type="reliability:mttf"
7: minOccurs="1" maxOccurs="1"/>
8: <xsd:element name="MTTR"
9: type="reliability:mttr"
10: minOccurs="1" maxOccurs="1"/>
11: <xsd:element name="Criticality"
12: type="reliability:criticality"
13: minOccurs="1" maxOccurs="1"/>
14: </xsd:sequence>
15: </xsd:extension>
16: </xsd:complexContent>
17: </xsd:complexType>
Figure 5.7: xADL schema extension for specifying analytical parameters related to
reliability
Note that the values for properties MTTF, MTTR and Criticality need to be defined
by the software architect. In principle there are three strategies that can be used to
determine these property values:
• Using Fixed values: It can be assumed that all modules have the same proper-
ties. Accordingly, all values can be fixed to a certain value just to investigate
the analysis results.
• What-if analysis: A range of values can be considered, where the values are
varied and their effect is observed.
3
Notation-wise, we have placed all the properties under the reliability interface. However, only
MTTF is actually related to reliability. In principle, a separate analytical interface could be defined
for each property.
88 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
• Measurement or estimation of actual values: Values can be specified based on
actual measurements from the existing system or estimation based on historical
data.
The selection of these strategies is dependent on the available knowledge about the
domain and the analyzed system. The measurement or estimation of actual values is
usually the most accurate way to define the properties. However, this might not be
possible due to lack of access to the existing system and historical data. In that case,
either fixed values and/or a what-if analysis can be used. The specification of values
for some module properties can also be supported by special analysis methods, tools
and techniques. For example, Criticality values for modules can be specified based
on the outcome of an analysis method like SARAH (Chapter 3).
In our MPlayer case study, we have used fixed MTTF and Criticality values but
specified MTTR values based on measurements from the existing system. For all
the modules, the values for MTTF and Criticality are defined as 0.5hrs and 1,
respectively. We have measured the MTTR values from the actual implementation
by calculating the mean time it takes to restart and initialize each module in the
MPlayer over 100 runs. Table 5.1 shows the measured MTTR values.
Table 5.1: Annotated MTTR values for the MPlayer case based on measurements
(MTTF=0.5hrs; Criticality=1).
Modules MTTR (ms)
Libao 480
Libmpcodecs 500
Demuxer 540
Mplayer 800
Libvo 400
Stream 400
Gui 600
5.6 Constraint Definition
After defining the software architecture, we can start searching for the possible de-
composition alternatives that define the structure of RUs. As stated before in sec-
tion 5.3.1, not all theoretically possible alternatives are practically possible. During
the Constraint Definition activity the constraints are defined and as such the design
space is reduced. We distinguish among the following three types of constraints:
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 89
• Deployment Constraints: An RU is a unit of recovery in the system and in-
cludes a set of modules. In general, the number of RUs that can be defined
will also be dependent on the context of the system that can limit the number
of RUs in the system. For example, if we employ multiple operating system
processes to isolate RUs from each other, the number of processes that can be
created can be limited due to operating system constraints. It might be the
case that the resources are insufficient for creating a separate process for each
RU, when there are many modules and all modules are separated from each
other.
• Domain Constraints: Some modules can be required to be in the same RU
while other modules must be separated. In the latter case, for example, an
untrusted 3
rd
party module can be required to be separated from the rest of
the system. Usually such constraints are specified with mutex and require
relations [66, 21].
• Performance Feasibility Constraints: As explained in section 5.3.2, each de-
composition alternative introduces a performance overhead due to function and
data dependencies. Usually, there are thresholds for the acceptable amounts of
these dependencies based on the available resources. The decomposition alter-
natives that exceed these thresholds are infeasible because of the performance
overhead they introduce and as such must be eliminated.
The Arch-Studio Recovery Designer includes a tool called the Constraint Evaluator,
where these constraints can be specified. Constraint Evaluator gets as input the
architecture description that is created with the Arch-Edit tool (See Figure 5.6).
We can see a snapshot of the Constraint Evaluator in Figure 5.8. The user in-
terface consists of three parts: Deployment Constraints, Domain Constraints and
Performance Feasibility Constraints, each corresponding to a type of constraint to
be specified.
In the Deployment Constraints part, we specify the limits for the number of RUs.
In Figure 5.8, the minimum and maximum number of RUs are specified as 1 and
7 (i.e. the total number of modules), respectively, which means that there is no
limitation to the number of RUs.
In the Domain Constraints part we specify requires/mutex relations. For each of
these relations, Constraint Evaluator provides two lists that include the name of the
modules of the system. When a module is selected from the first list, the second
list is updated, where the modules that are related are selected (initially, there are
no selected elements). The second list can be modified by multiple (de)selection
to change the relationships. For example, in Figure 5.8 it has been specified that
90 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
Demuxer must be in the same RU as Stream and Libmpcodecs, whereas Gui must
be in a separate RU than Mplayer, Libao and Libvo. Constraint Evaluator can
also automatically check if there are any conflicts between the specified requires and
mutex relations (respectively two modules must be kept together and separated at
the same time).
Figure 5.8: Specification of Design Constraints
In the Performance Feasibility Constraints part, we specify thresholds for the amount
of function and data dependencies between the separated modules. The specified
constraints are used for eliminating alternatives that exceed the given threshold
values. The dynamic analysis and measurements from the existing system that are
performed to evaluate the performance feasibility constraints will be explained in sec-
tion 5.8. However, if there is no existing system available and we can not perform
the necessary measurements, we can skip the analysis of the performance feasibility
constraints. Arch-Studio Recovery Designer provides an option to enable/disable the
performance feasibility analysis. In case this analysis is disabled, the design space
will be evaluated based on only the deployment constraints and domain constraints
so that it is still possible to generate the decomposition alternatives and depict the
reduced design space.
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 91
5.7 Design Alternative Generation
The first activity in Generate Design Alternatives is to compute the size of the
feasible design space based on the architecture description and the specified domain
constraints. Design Alternative Generator first groups the set of modules that are
required to be together based on the requires relations
4
defined among the domain
constraints. In the rest of the activity, Design Alternative Generator treats each
such group of modules as a single entity to be assigned to a RU as a whole. Then,
based on the number of resulting modules and the deployment constraints (i.e. the
number of RUs), it uses the Stirling numbers of the second kind (section 5.3.1) to
calculate the size of the reduced design space. Note that this information is provided
to the user already during the Constraint Definition process (See the GUI of the
Constraint Evaluator tool in Figure 5.8).
For the evaluation of the other constraints, the Design Alternative Generator uses
the restricted growth (RG) strings [106] to generate the design alternatives. A RG
string s of length n specifies the partitioning of n elements, where s[i] defines the
partition that i
th
element belongs to. For the MPlayer case, for instance, we can
represent all possible decompositions with a RG string of length 7. The RG string
0000000 refers to the decomposition where all modules are placed in a single RU.
Assume that the elements in the string correspond to the modules Mplayer, Libao,
Libmpcodecs, Demuxer, Stream, Libvo and Gui, then the RG string 010002 defines
the decomposition that is shown in Figure 5.3. A recursive lexicographic algorithm
generates all valid RG strings for a given number of elements and partitions [105].
During the generation process of decomposition alternatives, the Design Alternative
Generator eliminates the alternatives that violate the mutex relations defined among
the domain constraints. These are the alternatives, where two modules that must
be separated are placed in the same RU. For the MPlayer case there exists a total
of 877 decomposition alternatives. The deployment and domain constraints that we
have defined reduced the number of alternatives down to 20.
4
Note that the requires relation specifies a pair of modules that must be kept together in a RU.
It does not specify a pair of modules that require services from each other. Such a dependency is
not necessarily a reason for keeping modules together in a RU.
92 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
5.8 Performance Overhead Analysis
5.8.1 Function Dependency Analysis
Function calls among modules across different RUs impose an additional perfor-
mance overhead due to the redirection of calls by RUs. In function dependency
analysis, for each decomposition alternative, we analyze the frequency of function
calls between the modules in different RUs. The overhead of a decomposition is
calculated based on the ratio of the delayed calls to the total number of calls in the
system.
Function dependency analysis is performed with the Function Dependency Analyzer
tool based on the inputs from the Design Alternative Generator and GNU gprof
tools (See Figure 5.6). As shown in Figure 5.9, the Function Dependency Analyzer
tool itself is composed of three main components; i ) Module Function Dependency
Extractor (MFDE) ii ) Function Dependency Database and iii ) Function Dependency
Query Generator.
Function
Dependency
Database
GNU
gprof
function call
profile
KEY
association external tool
tool boundary tool component
Module Function
Dependency
Extractor
Function Dependency
Query Generator
Function
Dependency
Analyzer
RU
decomposition
function dependency
overhead
Design Alternative
Generator
uses
uses
uses
provides provides
provides
p
r
o
v
i
d
e
s
Figure 5.9: Function Dependency Analysis
MDFE uses the GNU gprof tool to obtain the function call graph of the system.
GNU gprof also collects statistics about the frequency of performed function calls
and the execution time of functions. Once the function call profile is available,
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 93
MFDE uses an additional GNU tool called GNU nm (provides the symbol table of a
C object file) to relate the function names to the C object files. As a result, MFDE
creates a so-called Module Dependency Graph (MDG) [83]. MDG is a graph, where
each node represents a C object file and edges represent the function dependen-
cies between these files. Figure 5.10, for example, shows a partial snapshot of the
generated MDG for MPlayer.
Figure 5.10: A Partial Snapshot of the Generated Module Dependency Graph of
MPlayer together with the boundaries of the Gui module with the modules Mplayer,
Libao and Libvo
To be able to query the function dependencies between the system modules, we need
to relate the set of nodes in the MDG to the system modules. MFDE uses the folder
(i.e. package) structure of the source code to identify the module that a C object
file belongs to. This is also reflected to the prefixes of the nodes of the MDG. For
example, the right hand side of Figure 5.10 shows some of the files that belong to
the Gui module each of which has “Gui ” as the prefix. MFDE exploits the full path
of the C object file, which reveals its prefix (e.g. “./Gui*”) and the corresponding
module. For the MPlayer case, the set of packages that corresponds to the provided
module view (Figure 5.3) were processed.
After the MDG is created, it is stored in the Function Dependency Database, which
is a relational database. Once the MDG is created and stored, Function Depen-
dency Analyzer becomes ready to calculate the function dependency overhead for a
particular selection of RUs defined by the Design Alternative Generator. For each al-
ternative, Function Dependency Query Generator accesses the Function Dependency
Database, and automatically creates and executes queries to estimate the function
dependency overhead. The function dependency overhead is calculated based on the
following equation.
94 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
fd =

RUx

RUy∧
(x=y)
calls(x → y) ×t
OV D

f
calls(f) ×time(f)
×100 (5.4)
In Equation 5.4, the denominator sums for all functions the number of times a
function is called times the time spent for that function. In the nominator, the
number of times a function is called among different RUs are summed. The end
result is multiplied by the overhead that is introduced per such function call (t
OV D
).
In our case study, where we isolate RUs with separate processes, the overhead is
related to the inter-process communication. For this reason, we use a measured
worst case overhead, 100 ms as t
OV D
.
Algorithm 1 Calculate the amount of function dependencies between the selected
RUs
1: sum ← 0
2: for all RU x do
3: for all Module m ∈ x do
4: for all RU y ∧ x = y do
5: for all Module k ∈ y do
6: sum ← sum + noOfCalls(m, k)
7: end for
8: end for
9: end for
10: end for
To calculate the function dependency overhead, we need to calculate the sum of
function dependencies among all the modules that are part of different RUs. Con-
sider, for example, the RU decomposition that is shown in Figure 5.3. Hereby, the
Gui and Libao modules are encapsulated in separate RUs and separated from all
the other modules of the system. The other modules in the system are allocated
to a third RU, RU MPCORE. For calculating the function dependency overhead of
this example decomposition, we need to calculate the sum of function dependencies
between the Gui module and all the other modules plus the function dependencies
between the Libao module and all the other modules. Algorithm 1 shows the pseudo
code for counting the number of function calls betwen different RUs.
The procedure noOfCalls(m, k) that is used in the algorithm creates and executes
an SQL [123] query to get the number of calls between the specified modules m and
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 95
k. An example query is shown in Figure 5.11, where the number of calls from the
module Gui to the module Libao is queried.
SELECT Sum(module_dependency.calls)
AS total FROM module_dependency
WHERE (module_dependency.src Like './Gui%'
And module_dependency.dest Like './libao%')
Figure 5.11: The generated SQL query for calculating the number of calls from the
module Gui to the module Libao
For the example decomposition shown in Figure 5.3, the function dependency over-
head was calculated as 5.4%. Note that this makes the decomposition a feasible
alternative with respect to the function dependency overhead constraints, where the
threshold was specified as 15% for the case study. It took the tool less than 100 ms.
to calculate the function dependency overhead for this example decomposition. In
general, the time it takes for the calculation depends on the number of modules and
the particular decomposition. The worst case asymptotic complexity of Algorithm 1
is O(n
2
) with respect to the number of modules.
5.8.2 Data Dependency Analysis
Data dependency analysis is performed with the Data Dependency Analyzer (See
Figure 5.6), which is composed of three main components; i ) Module Data Depen-
dency Extractor (MDDE) ii ) Data Dependency Database and iii ) Data Dependency
Query Generator (See Figure 5.12).
MDDE uses the Valgrind tool to obtain the data access profile of the system. Val-
grind [88] is a dynamic binary instrumentation framework, which enables the devel-
opment of dynamic binary analysis tools. Such tools perform analysis at run-time
at the level of machine code. Valgrind provides a core system that can instrument
and run the code, plus an environment for writing tools that plug into the core sys-
tem [88]. A Valgrind tool is basically composed of this core system plus the plug-in
tool that is incorporated to the core system. We use Valgrind to obtain the data
access profile of the system. To do this, we have written a plug-in tool for Valgrind,
namely the Data Access Instrumentor (DAI), which records the addresses and sizes
of the memory locations that are accessed by the system. A sample of this data
access profile output can be seen in Figure 5.13.
In Figure 5.13, we can see the addresses and sizes of memory locations that have
been accessed. In line 7 for instance, we can see that there was a memory access at
address bea1d4a8 of size 4 from the file “audio out.c”. DAI outputs the full paths
96 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
Data
Dependency
Database
data access
instrumentor
Valgrind
data access
profile
Module
Data Dependency
Extractor
Data Dependency
Query Generator
Data
Dependency
Analyzer
RU
decomposition
data dependency size
Design Alternative
Generator

uses
uses
uses
provides
provides provides
p
r
o
v
id
e
s
KEY
association external tool
tool boundary tool component
Figure 5.12: Data Dependency Analysis
of the files, which distinguishes to which module these files belong. The related
parts of the file paths are underlined in Figure 5.13. For example, in line 9 we can
see that the file “aclib template.c” belongs to the Libvo module. From this data
access profile we can also observe that the same memory locations can be accessed
by different modules. Based on this, we can identify the data dependencies. For
instance, Figure 5.13 shows a memory address bea97498 that is accessed by both
the Gui and the Mplayer modules as highlighted in lines 2 and 6, respectively. The
output of DAI is used by MDDE to search for all such memory addresses that are
shared by multiple modules. The output of the MDDE is the total number and
size of data dependencies between the modules of the system. Table 5.2 shows the
1: ... /MPlayer-1.0rc1/Gui/interface.c: bea1d298,4;
2: , 4; ... /MPlayer-1.0rc1/Gui/cfg.c:
3: bea1d2a0, 4; 0862b714, 4; bea1d29c, 4; bea1d2a8,
4: 4; bea1d2a4, 4; bea1d294, 4; bea1d288, 4; ...
5: /MPlayer-1.0rc1/mplayer.c: bea1d4e8, 4; bea1d4e4,
6: 4; bea1d4e0, 4; bea1d4dc, 4; , 4; ...
7: /MPlayer-1.0rc1/libao2/audio_out.c: bea1d4a8, 4;
8: bea1d4a4, 4; bea1d4a0, 4; bea1d49c, 4; bea1d498,
9: 4; ... /MPlayer-1.0rc1/libvo/aclib_template.c:
10: 086b4860, 16; 086b4870, 16; ...
Figure 5.13: A Sample Output of Valgrind + Data Access Instrumentor
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 97
actual output for the MPlayer case.
Table 5.2: Measured data dependencies between the modules of the MPlayer
Module 1 Module 2 count size (bytes)
Gui Libao 64 256
Gui Libmpcodecs 360 1329
Gui Libvo 591 2217
Gui Demuxer 47 188
Gui MPlayer 36 144
Gui Stream 242 914
Libao Libmpcodecs 78 328
Libao Libvo 63 268
Libao Demuxer 3 12
Libao Mplayer 43 172
Libao Stream 54 232
Libmpcodecs Libvo 332 1344
Libmpcodecs Demuxer 29 116
Libmpcodecs Mplayer 101 408
Libmpcodecs Stream 201 812
Libvo Demuxer 53 212
Libvo Mplayer 76 304
Libvo Stream 246 995
Demuxer Mplayer 0 0
Demuxer Stream 23 92
Mplayer Stream 28 116
In Table 5.2 we see per pair of modules, the number of common memory locations
accessed (i.e. count) and the total size of the shared memory. This table is stored
in the Data Dependency Database. Once the data dependencies are stored, Data
Dependency Analyzer becomes ready to calculate the data dependency size for a
particular selection of RUs defined by the Design Alternative Generator. Data De-
pendency Query Generator accesses the Data Dependency Database, creates and
executes queries for each RU decomposition alternative to estimate the data depen-
dency size. The size of the data dependency is calculated simply by summing up
the shared data size between the modules of the selected RUs. This calculation is
depicted in the following equation.
dd =

RUx

RUy∧
(x=y)

m∈x

k∈y
memsize(m, k) (5.5)
98 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
The querying process is very similar to the querying of function dependencies as
explained in section 5.8.1. The only difference is that the information being queried is
the data size instead of the number of function calls. For the example decomposition
shown in Figure 5.3 the data dependency size is approximately 5 KB. It took the
tool less than 100 ms. to calculate this. Note also that the decomposition turns out
to be a feasible alternative with respect to the data dependency size constraints,
where the threshold was specified as 10 KB for the case study.
5.8.3 Depicting Analysis Results
Using the deployment and domain constraints we have seen that for the MPlayer
case the total number of 877 decomposition alternatives was reduced to 20. For each
of these alternatives we can now follow the approach of the previous two subsections
to calculate the function dependency overhead and data dependency size values. The
Arch-Studio Recovery Designer generates a graph corresponding to the generated
design space and uses gnuplot to present it. The tool allows to depict the generated
design space at any time of the analysis process. In Figure 5.14 the total design
space with the data dependency size and function dependency overhead values been
plotted. The decomposition alternatives are listed along the x-axis. The y-axis on
the left hand side is used for showing the function dependency overheads of these
alternatives as calculated by the Function Dependency Analyzer. The y-axis on the
right hand side is used for showing the data dependency sizes of the decomposition
alternatives as calculated by the Data Dependency Analyzer. Figure 5.15 shows the
plot for the 20 decomposition alternatives that remain after applying the deployment
and domain constraints. Using these results the software architect can already have
a first view on the feasible decomposition alternatives. The final selection of the
alternatives will be explained in the next section.
5.9 Decomposition Alternative Selection
After the software architecture is described, design constraints are defined and the
necessary measurements are performed on the system, the final set of decompo-
sition alternatives can be selected as defined by the last group of activities (See
Figure 5.4). Using the domain constraints we have seen that for the MPlayer case
20 alternatives were possible. This set of alternatives is further evaluated with re-
spect to the performance feasibility constraints based on the defined thresholds and
the measurements performed on the existing system. For the MPlayer case we have
set the function dependency overhead threshold to 15% and the data dependency
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 99
0
5
10
15
20
25
0 100 200 300 400 500 600 700 800
0
2
4
6
8
10
12
14
F
u
n
c
t
i
o
n

D
e
p
e
n
d
e
n
c
y

O
v
e
r
h
e
a
d

(
%
)
D
a
t
a

D
e
p
e
n
d
e
n
c
y

S
i
z
e

(
K
B
)
RU Decomposition Alternatives
Analysis Results For the Decomposition Alternatives
Function Dependency Overhead (%)
Data Dependency Size (KB)
Figure 5.14: Function Dependency Overhead and Data Dependency Sizes for all
Decomposition Alternatives
size threshold to 10 KB. It appears that after the application of performance fea-
sibility constraints, only 6 alternatives are left in the feasible design space as listed
in Figure 5.16. Hereby, RUs are defined in the straight brackets ‘[’ and ‘]’. For
example, the alternative # 1 represents the alternative as defined in Figure 5.3.
Figure 5.17 shows the function dependency overhead and data dependency size for
the 6 feasible decomposition alternatives. Hereby, we can see that the alternative
# 4 has the highest value for the function dependency overhead. This is because,
this alternative corresponds to the decomposition, where the two highly coupled
modules Libvo and Libmpcodecs are separated from each other. We can see that
this alternative has also a distinctively high data dependency size. This is because,
this decomposition alternative separates the modules Gui, Libvo and Libmpcodecs
from each other. As can be seen in Table 5.2, the size of data that is shared among
these modules are the highest among all.
Since the main goal of local recovery is to maximize the system availability, we need
to evaluate and compare the feasible decomposition alternatives based on availability
as well. The approach adopted for the evaluation of availability and the selection
of a decomposition depends on the size of the feasible design space. If the size is
smaller than a maximum affordable amount depending on the available computation
resources, we generate analytical models for each alternative in the design space.
These analytical models are utilized for estimating the system availability. We
select a RU decomposition based on the estimated availability and performance
overhead of the feasible decomposition alternatives. If the design space is too large,
we use a set of optimization algorithms to select one of the feasible decomposition
100 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
0
5
10
15
20
25
0 5 10 15 20
4
5
6
7
8
9
10
F
u
n
c
t
i
o
n

D
e
p
e
n
d
e
n
c
y

O
v
e
r
h
e
a
d

(
%
)
D
a
t
a

D
e
p
e
n
d
e
n
c
y

S
i
z
e

(
K
B
)
RU Decomposition Alternatives
Analysis Results For the Decomposition Alternatives
Function Dependency Overhead (%)
Data Dependency Size (KB)
Figure 5.15: Function Dependency Overhead and Data Dependency Sizes for De-
composition Alternatives after domain constraints are applied
0: { Mplayer, Libao, Libmpcodecs, Demuxer, Stream, Libvo Gui }
1: { Mplayer, Libmpcodecs, Demuxer, Stream, Libvo Gui Libao}
2: { Mplayer GuiLibao, Libmpcodecs, Demuxer, Stream, Libvo }
3: { Mplayer, Libao GuiLibmpcodecs, Demuxer, Stream, Libvo }
4: { Mplayer, Libao, Libmpcodecs, Demuxer, Stream Gui Libvo }
5: { Mplayer GuiLibaoLibmpcodecs, Demuxer, Stream, Libvo }
Figure 5.16: The feasible decomposition alternatives with respect to the specified
constraints
alternative based on heuristics. These two approaches are explained in the following
subsections.
5.10 Availability Analysis
In Figure 5.17, we can see the function dependency overhead and data dependency
size for the 6 feasible decomposition alternatives. To select an optimal decomposi-
tion, we need to quantitatively assess these alternatives with respect to availability
as well.
We exploit the compositional semantics of the I/O-IMC formalism to build availabil-
ity models for alternative decompositions for local recovery. The Analytical Model
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 101
0
5
10
15
20
0 1 2 3 4 5 6
4
5
6
7
8
9
10
F
u
n
c
t
i
o
n

D
e
p
e
n
d
e
n
c
y

O
v
e
r
h
e
a
d

(
%
)
D
a
t
a

D
e
p
e
n
d
e
n
c
y

S
i
z
e

(
K
B
)
RU Decomposition Alternatives
Analysis Results For the Decomposition Alternatives
Function Dependency Overhead (%)
Data Dependency Size (KB)
Figure 5.17: Function Dependency Overhead and Data Dependency Sizes for 6
Decomposition Alternatives
Generator automatically generates I/O-IMC models for the behavior of each module
and RU of the system, and the coordination of recovery actions (See Appendix B).
It also generates a script for the composition and analysis of these models. The ex-
ecution of this script merges all the generated models in several iterations, pruning
the state space in each iteration. The result of this composition and aggregation is a
regular CTMC, which is analyzed to compute system availability. In the following,
we explain our modeling approach and the analysis results.
5.10.1 Modeling Approach
In addition to the decomposition of system modules into RUs, the realization of
local recovery (Chapter 6) requires i ) control of communication between RUs and
ii ) coordination of recovery actions. These features can be implemented in several
ways. In our modeling approach, we have introduced a model element, namely the
Recovery Manager (RM), to represent the coordination of recovery actions. The
communication control between RUs is considered to be a part of the RM and as
such is not modeled separately. In addition to the RM, we have introduced model
elements for representing the behavior of individual modules and RUs. We have
made the following assumptions in our modeling approach:
1. The failure of any module within an RU causes the failure of the entire RU.
2. The recovery of an RU entails the recovery (i.e. restart) of all of its modules
102 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
(even the ones that did not fail).
3. The failure of a module is governed by an exponential distribution (i.e. con-
stant failure rate).
4. The recovery of a module is governed by an exponential distribution
5
and the
error detection phase is ignored.
5. The RM does not fail.
6. As implied by the previous assumption, the RM always correctly detects a
failing RU.
7. Only one RU can be recovered at a time and RM recovers the RUs on a
first-come-first-served (FCFS) basis.
8. The recovery always succeeds.
9. The recovery of the modules inside a given RU is performed sequentially, one
module at a time (a particular recovery sequence has no significance).
Recovery
Manager
Failure
Ìnterface
Recovery
Ìnterface
Module x
RU 0
up_x, failed_x
Module k
up_k, failed_k
recovered_k,
recovering_0
recovered_x,
recovering_0
Failure
Ìnterface
Recovery
Ìnterface
RU n
up_0, failed_0 up_n, failed_n
start_recover_0 start_recover_n
KEY RU boundaries Ì/O-ÌMC models
Ì/O-ÌMC
input/output
actions
Figure 5.18: Modeling approach based on the I/O-IMC formalism
Figure 5.18 depicts our modeling approach. The RM only interfaces with RUs
and is unaware of the modules within RUs. Each RU exhibits two interfaces; A
5
This assumption was made for analytic tractability. An exponential distribution might not be,
in some cases, a realistic choice; however, it is also possible to use a phase-type distribution, which
approximates any distribution arbitrarily closely.
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 103
failure interface and a recovery interface. The failure interface essentially listens
to the failure of the modules within the RU and outputs a RU failure signal upon
the failure of a module. Correspondingly, the failure interface outputs an RU up
signal upon the successful recovery of the all the modules. The RM listens to the
failure and up signals emitted by the failure interfaces of the RUs. The recovery
interface is responsible for the recovery of the modules within the RU. Upon the
receipt of a start recover signal from the RM, it starts a sequential recovery of the
corresponding modules. Each module, recovery interface, failure interface, and the
RM correspond to an I/O-IMC model as depicted in Figure 5.18. The exchanged
signals are represented as input and output actions in these I/O-IMCs.
Analytical Model Generator first generates I/O-IMC models for each module. Based
on the decomposition alternative provided by the Design Alternative Generator, it
then generates the corresponding I/O-IMCs for failure/recovery interfaces and the
RM. An example I/O-IMC model generated for the RM that controls two RUs is
depicted in Figure 5.19. Appendix B contains detailed explanations of example
I/O-IMCs (including the one depicted in Figure 5.19). It also gives details on the
specification and generation process of I/O-IMC models.
In addition to generating all the necessary I/O-IMCs, a composition and analysis
script is also generated, which conforms to the CADP SVL scripting language [42].
The execution of this script within CADP composes/aggregates all the I/O-IMCs
based on the decomposition alternative, reduces the final I/O-IMC into a CTMC
(See Appendix B for details), and computes the steady state probabilities of being
at states, where the specified RUs are available. Based on these probabilities, the
availability of the system is calculated as a whole. Figure 5.20 shows the generated
CTMC for global recovery, where all modules are placed in a single RU.
In Figure 5.20, the initial state is state 0. It represents the system state, where the
only RU of the system is available. All the other states represent the recovery stages
of the RU. Since the RU includes 7 modules that are recovered (initialized) sequen-
tially, there exist 7 states corresponding to the recovery of the RU. For example,
state 7 represents the system state, where the RU is failed and non of its modules
are recovered yet. There is a transition from state 0 to state 7, in which the label of
the transition denotes the failure rate of this RU. Upon its failure, all the 7 modules
comprised by the RU must be recovered. That is why, there exist 7 transitions from
state 7 back to state 0, where the labels of these transitions specify the recovery
rates of the modules. The steady state probability of being at state 0 was computed
by CADP as 0.985880, which provides us the availability of the system as 98.5880%.
104 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
8
4
0
5
1
6
2
7
3
failed_1?
failed_1?
up_2?
up_2?
up_2?
failed_2?
failed_2?
start_recover_1!
failed_2?
up_1?
up_1?
up_1?
failed_1?
failed_1?
failed_1?
up_2?
start_recover_2!
failed_2?
up_2?
start_recover_1!
failed_2?
up_1?
up_1?
failed_1?
failed_1?
start_recover_2!
up_2?
up_2?
failed_2?
up_1?
failed_2?
up_1?
failed_1?
failed_1?
up_2?
up_2?
failed_2?
failed_2?
up_1?
up_1?
Figure 5.19: The generated I/O-IMC model for the Recovery Manager that controls
two RUs
5.10.2 Analysis Results
We have generated CTMC models and calculated the availability for all the feasible
decomposition alternatives that are listed in Figure 5.16. In this case study, we
assume that the availability of the system is determined by the availability of the
RU that comprises the Mplayer module. This is because, only the failure of this RU
leads to the failure of the whole system regardless of the availability of the other
RUs. The results are listed in Table 5.3 together with the corresponding function
dependency overhead and data dependency size values as calculated before. The
decomposition alternative # 1 represents the decomposition shown in Figure 5.3.
The CTMC generated for this decomposition alternative comprises 60 states 122
transitions.
Based on the results listed in Table 5.3, we can evaluate and compare the feasible
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 105
4
0 5
1 6
2 7
3
rate 250000
rate 385
rate 200000 rate 250000
rate 185185
rate 208333
rate 125000 rate 166666
Figure 5.20: CTMC generated for global recovery, where all modules are placed in
a single RU
Table 5.3: Estimated Availability, Function Dependency Overhead and Data De-
pendency Size for the Feasible Decomposition Alternatives
Function Data
# Estimated Dependency Dependency
Availability (%) Overhead (%) Size (bytes)
0 98, 9808 3, 4350 4860
1 99, 2791 5, 3907 5860
2 99, 9555 7, 7915 5860
3 99, 8589 5, 8358 6516
4 99, 2575 14, 9803 7771
5 99, 9557 7, 7915 6688
decomposition alternatives. The decomposition alternatives # 0 and # 1 have the
lowest function dependency overhead and data dependency size. That means that
they introduce relatively less performance overhead and it is easier to implement
them. However, the alternative # 0 has also the lowest availability among all the
alternatives. We can also notice from the results that the decomposition alternative
# 4 does not lead to a significantly higher availability as opposed to the high over-
head it introduces. As will be explained in section 5.12, we have introduced local
recovery to the MPlayer software for the decomposition alternatives # 0 and # 1.
106 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
5.11 Optimization
In our case study, we have evaluated a total of 6 feasible decomposition alternatives.
Depending on the specified constraints and the number of modules, there can be
too many feasible decomposition alternatives. In such cases it can take too much
time to evaluate all the alternatives with analytical models although the models
are generated automatically. Moreover, for some decomposition alternatives, the
number of RUs can be too large. As a result, the state space of the model can grow
too big despite of the reductions. Consequently, it might be infeasible for the model
checker to evaluate the availability in an acceptable period of time.
For such cases, we also provide an option to select a decomposition among the fea-
sible alternatives with heuristic rules and optimization techniques (See Figure 5.4).
This activity is carried out with the Optimizer tool of Arch-Studio Recovery De-
signer as shown in Figure 5.6. The Optimizer tool employs optimization techniques
to automatically select an alternative in the design space. The following objective
function is adopted by the optimization algorithms based on heuristics.
MTTR
RU
x
=

m∈RU
x
MTTR
m
(5.6)
1/MTTF
RU
x
=

m∈RU
x
1/MTTF
m
(5.7)
Criticality
RU
x
=

m∈RU
x
Criticality
m
(5.8)
obj.func. = min.

RURU
x
Criticality
RU
x
×
MTTR
RU
x
MTTF
RU
x
(5.9)
In equations 5.6 and 5.7, we calculate for each RU the MTTR and MTTF, respec-
tively. The calculation of MTTR for a RU is based on the fact that all the modules
comprised by a RU are recovered sequentially. That is why the MTTR for a RU is
simply the addition of MTTR values of the modules that are comprised by the cor-
responding RU. The calculation of MTTF for a RU is based on the assumption that
the failure probability follows an exponential distribution with rate λ = 1/MTTF.
If X
1
, X
2
, ... and X
n
are independent exponentially distributed random variables
with rates λ
1
, λ
2
, ... and λ
n
, respectively, than min(X
1
, X
2
, ..., X
n
) is also expo-
nentially distributed with rate λ
1
+ λ
2
+ ... + λ
n
[101]. As a result, the failure rate
of a RU (1/MTTF
RU
x
) is equal to the sum of failure rates of the modules that are
comprised by the corresponding RU. In Equation 5.8, we calculate the Criticality
of a RU by simply summing up the Criticality values of the modules that are com-
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 107
prised by the RU. Equation 5.9 shows the objective function that is utilized by the
optimization algorithms. As explained in section 5.3.2, to maximize the availability,
MTTR of the system must be kept as low as possible and MTTF of the system must
be as high as possible. As a heuristic based on this fact, the objective function is to
minimize the MTTR/MTTF ratio in total for all RUs, which are weighted based
on the Criticality values of RUs.
The Optimizer tool can use different optimization techniques to find the design
alternative that satisfies the objective function the most. Currently it supports two
optimization techniques; exhaustive search and hill-climbing algorithm. In this case
of exhaustive search, all the alternatives are evaluated and compared with each
other based on the objective function. If the design space is too large for exhaustive
search, hill-climbing algorithm can be utilized to search the design space faster but
ending up with possibly a sub-optimal result with respect to the objective function.
Hill-climbing algorithm starts with a random (potentially bad) solution to the prob-
lem. It sequentially makes small changes to the solution, each time improving it a
little bit. At some point the algorithm arrives at a point where it cannot see any
improvement anymore, at which point the algorithm terminates. The Optimizer uti-
lizes the hill-climbing algorithm in a similar way to the clustering algorithm in [83].
The Optimizer starts from the worst decomposition with respect to availability,
where all modules of the system are placed in a single RU. Then it systematically
generates a set of neighbor decomposition by moving modules between RUs (called
as a partition in [83]). A decomposition NP is defined as a neighbor decomposition
of P if NP is exactly the same as P except that a single module of a RU in P is in
a different RU in NP. During the generation process a new RU can be created by
moving a module to a new RU. It is also possible to remove a RU when its only mod-
ule is moved into another RU. The Optimizer generates neighbor decompositions of
a decomposition by systematically manipulating the corresponding RG string. For
example, consider the set of RG strings of length 7, where the elements in the string
correspond to the modules Mplayer, Libao, Libmpcodecs, Demuxer, Stream, Libvo
and Gui, respectively. Then the decomposition { [Libmpcodecs, Libvo] [Mplayer]
[Gui] [Demuxer] [Libao] [Stream] } is represented by the RG string 1240350. By
incrementing or decrementing one element in this string, we end up with the RG
strings 1242350, 1243350, 1244350, 1245350, 1246350, 1240340, 1240351, 1240352,
1240353, 1240354, 1240355 and 1240356, which correspond to the decompositions
shown in Figure 5.21, respectively.
For the MPlayer case, the optimal decomposition (ignoring the deployment and
domain constraints) based on the heuristic-based objective function (Equation 5.9)
is { [Mplayer] [Gui] [Libao] [Libmpcodecs, Libvo] [Demuxer] [Stream] }. It took 89
seconds for the Optimizer to find this decomposition with exhaustive search. The
108 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
1: { Libvo Mplayer, Libmpcodecs Gui Demuxer Libao Stream }
2: { Libvo Mplayer Gui, Libmpcodecs Demuxer Libao Stream }
3: { Libvo Mplayer Gui Libmpcodecs, Demuxer Libao Stream }
4: { Libvo Mplayer Gui Demuxer Libao, Libmpcodecs Stream }
5: { Libvo Mplayer Gui Demuxer Libao Libmpcodecs, Stream }
6: { Libvo Mplayer Gui Demuxer Libao Stream Libmpcodecs }
7: { Libmpcodecs, Libvo Mplayer Gui Demuxer Libao, Stream }
8: { Libmpcodecs Mplayer, Libvo Gui Demuxer Libao Stream }
9: { Libmpcodecs Mplayer Gui, Libvo Demuxer Libao Stream }
10: { Libmpcodecs Mplayer Gui Demuxer, Libvo Libao Stream }
11: { Libmpcodecs Mplayer Gui Demuxer Libao, Libvo Stream }
12: { Libmpcodecs Mplayer Gui Demuxer Libao Stream, Libvo }
13: { Libmpcodecs Mplayer Gui Demuxer LibaoStream Libvo }
Figure 5.21: The neighbor decompositions of the decomposition { [Libmpcodecs,
Libvo] [Mplayer] [Gui] [Demuxer] [Libao] [Stream] }
hill-climbing algorithm terminated on the same machine in 8 seconds with the same
result. A total of 76 design alternatives had to be evaluated and compared by the
hill-climbing algorithm, instead of 877 alternatives in the case of exhaustive search.
Selection of decomposition alternatives with heuristic rules and optimization tech-
niques is a viable approach for evaluating large design spaces although the evaluation
of alternatives might lead to different results than the evaluation based on analyt-
ical models. For example, under the consideration of the specified constraints in
the MPlayer case study, the decomposition alternative # 5 as listed in Figure 5.16
has been selected as the optimal decomposition. This is indeed one of the best
alternatives though the results in Table 5.3 show that there is a small difference
of availability especially compared to the alternative # 2. In this case, we have
rather selected the alternative #2 because of its relatively less function dependency
overhead and smaller data dependency size.
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 109
5.12 Evaluation
The utilization of analytical models and optimization techniques helps us to analyze,
compare and select decompositions among many alternatives. To assess the correct-
ness and accuracy of these techniques, we have performed real-time measurements
from systems that are decomposed for local recovery. In this section, we explain the
performed measurements and the results of our assessments.
To implement local recovery, we have used a framework FLORA (Chapter 6).
We have implemented local recovery for a total of 3 decomposition alternatives
of MPlayer. i ) Global recovery, where all the modules are placed in a single RU ({
[Mplayer, Libmpcodecs, Libvo, Demuxer, Stream, Gui, Libao] }) ii ) Local recovery
with two RUs, where the module Gui is isolated from the rest of the modules ({
[Mplayer, Libmpcodecs, Libvo, Demuxer, Stream, Libao] [Gui] }) iii ) Local recov-
ery with three RUs, where the module Gui, Libao and the rest of the modules are
isolated from each other ({ [Mplayer, Libmpcodecs, Libvo, Demuxer, Stream, Libao]
[Gui] }). Note that the 2
nd
and the 3
rd
implementations correspond to the decom-
position alternatives # 0 and # 1 in Figure 5.16, respectively. We have selected
these decomposition alternatives because they have the lowest function dependency
overhead and data dependency size.
To be able to measure the availability achieved with these three implementations, we
have modified each module so that they fail with a specified failure rate (assuming an
exponential distribution with mean MTTF). After a module is initialized, it creates
a thread that is periodically activated every second to inject errors. The operation
of the thread is shown in Algorithm 2.
Algorithm 2 Periodically Activated Thread for Error Injection
1: time init ← currentTime()
2: while TRUE do
3: time elapsed ← currentTime() −time init
4: p ← 1 −1/e
time elapsed/MTTF
5: r ← random()
6: if p ≥ r then
7: injectError()
8: break
9: end if
10: end while
The error injection thread first records the initialization time (Line 1). Then, each
time it is activated, the thread calculates the time elapsed since the initialization
(Line 3). The MTTF value of the corresponding module and the elapsed time is
110 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
used for calculating the probability of error occurrence (Line 4). In Line 5, ran-
dom() returns, from a uniform distribution, a sample value r ∈ [0, 1]. This value is
compared to the calculated probability to decide whether or not to inject an error
(Line 6). Possibly an error is injected by basically creating a fatal error with an
illegal memory operation. This error crashes the process, on which the module is
running (Line 7).
The Recovery Manager component of FLORA (Chapter 6) logs the initialization
and failure times of RUs to a file during the execution of the system. For each of
the implemented alternatives, we have let the system run for 5 hours. Then, we
have processed the log files to calculate the times, when the core system module,
Mplayer has been down. We have calculated the availability of the system based on
the total time that the system has been running ( 5 hours). For the error injection,
we have used the same MTTF values (0.5hrs) as utilized by the analytical models
and heuristics as explained in section 5.5. Table 5.4 lists the results for the three
decomposition alternatives.
Table 5.4: Comparison of the estimated availability with analytical models, the
heuristic outcome and the measured availability.
Decomposition Measured Estimated Heuristic
Alternative Availability Availability Objective Function
all modules in 1 RU 97.5732 98.5880 6.72
Gui, the rest 97.5834 98, 9808 9.38
Gui, Libao, the rest 97.7537 99, 2791 12.57
In Table 5.4, the first column shows the availability measured from the running
systems. The second column shows the availability estimated with the analytical
models for the corresponding decomposition alternatives. The third column shows
the objective function value calculated to compare these alternatives during the ap-
plication of optimization techniques. Here, we see that the measured availability
and the estimated availability are quite close to each other. In addition, the de-
sign alternatives are ordered correctly with respect to their estimated availabilities,
measured availabilities and the objective function used for optimization.
To amplify the difference between the decomposition alternatives with respect to
availability, and such evaluate the results based on analytical models better, we
have repeated our measurements and estimations for varying MTTF values. As
shown in Table 5.5, we have fixed MTTF to 1800s (= 0, 5hrs) for all the modules
but Libao and Gui. For these two modules, we have specified MTTF values as 60s
and 30s, respectively.
Table 5.6 shows the results, when we have used the MTTF values shown in Table 5.5
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 111
Table 5.5: MTTF values that are used for the repeated estimations and measure-
ments (MTTR and Criticality values are kept as same as before).
Modules MTTF (s)
Libao 60
Libmpcodecs 1800
Demuxer 1800
Mplayer 1800
Libvo 1800
Stream 1800
Gui 30
both for the analytical models and the error injection threads.
Table 5.6: Comparison of the estimated availability with analytical models, the
heuristic outcome and the measured availability in the case of varying MTTFs.
Decomposition Measured Estimated Heuristic
Alternative Availability Availability Objective Function
all modules in 1 RU 83.27 83.60 0.5
Gui, the rest 92.31 93.25 1.24
Gui, Libao, the rest 97.75 98.70 2.83
Again, in Table 5.6, we observe that the design alternatives are ordered correctly
with respect to their estimated availabilities, measured availabilities and the objec-
tive function used for optimization. We can also see that the measured availability
and the estimated availability values (in %) are quite close to each other. In gen-
eral, the measured availability is lower than the estimated availability. This is due
in part to the communication delays in the actual implementation which are not
accounted for in the analytical models. In fact, the various communications be-
tween the software modules, which are modeled using interactive transitions in the
I/O-IMC models, abstract away any communicaion time delay (i.e. the communi-
cation is instantaneous). However, in reality, the recovery time includes the time for
error detection, diagnosis and communication among multiple processes, which are
subject to delays due to process context switching and inter-process communication
overhead.
Based on these results, we can conclude that our approach can be used for accurately
analyzing, comparing and selecting decomposition alternatives for local recovery.
112 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
5.13 Discussion
Partial failures
We have calculated the availability of the system according to the proportion of time
that the core system module, Mplayer has been down. There are also partial failures,
where RU GUI and RU AUDIO are restarted. In these cases, GUI panel vanishes
and audio stops playing until the corresponding RUs are recovered. These are also
failures from the user’s perspective but we have neglected these failures to have a
common basis for comparison with global recovery. Depending on the application,
functional importance of the failed module and the recovery time, the user might get
annoyed differently from partial failures. Also different users might have different
perceptions. So, to take into account different type of partial failures, experiments
are needed to be conducted to determine their actual effect on users [28]. In any
case, the failure of the whole system would be the most annoying of all.
Frequency of data access
In data dependency analysis (Section 5.8.2), we have evaluated the size of shared
data among modules. The shared data size is important since such data should be
synchronized back after each recovery. The number of data access (column count in
Table 5.2), on the other hand, is important due to redirection of the access through
IPC. This is mostly covered by the function dependency analysis, where data is
accessed through function calls. The correlation between function dependency over-
head and data dependency size (Figure 5.15) shows this. In fact, our implementation
of local recovery (explained in Chapter 6) requires that all data access must be per-
formed through explicit function calls. For each direct data access without a function
call, if there are any, the corresponding parts of the system are refactored to full-
fill this requirement. After this refactoring, the function dependency analysis can
be repeated to get more accurate results with respect to the expected performance
overhead.
Specification of decomposition constraints
For defining the decomposition alternatives an important step is the specification of
constraints. The more and the better we can specify the corresponding constraints
the more we can reduce the design space of decomposition alternatives. In this work,
we have specified deployment constraints (number of possible recoverable units),
domain constraints and feasibility constraints (performance overhead thresholds).
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 113
The domain constraints are specified with requires and mutex relations. This has
shown to be practical for designers who are not experts in defining complicated
formal constraints. Nevertheless, a possible extension of the approach is to improve
the expressiveness power of the constraint specification. For example, using first-
order logic can be an option though in certain cases, the evaluation of the logical
expressions and checking for conflicts can become NP-complete (e.g. the satisfiability
problem). As such, workarounds and simplifications can be required to keep the
approach feasible and practical.
Impact of usage scenarios on profiling results
The frequency of calls, execution times of functions and the data access profile
can vary depending on the usage scenario and the inputs that are provided to the
system. In our case study, we have performed our measurements for the video-
playing scenario, which is a common usage scenario for a media player application.
In principle, it is possible to take different types of usage scenarios into account. The
results obtained from several system runs are statistically combined by the tools [40].
If there is a high variance among the execution of scenarios, where statistically
combining the results would be wrong, multiple scenarios can be repeated in a period
of time and the overhead can be calculated based on the profile information collected
during this time period. However, this would require selection and prioritization of
a representative set of scenarios with respect to their importance from the user point
of view (user profile) [28]. The analysis process will remain the same although the
input profile data can be different.
5.14 Related Work
For analyzing software architectures a broad number of architecture analysis ap-
proaches have been introduced in the last two decades ([19, 31, 46]). These ap-
proaches help to identify the risks, sensitivity points and trade-offs in the architec-
ture. These are general-purpose approaches, which are not dedicated to evaluating
the impact of software architecture decomposition on particular quality attributes.
In this work, we have proposed a dedicated analysis approach that is required for
optimizing the decomposition of architecture for local recovery in particular.
In [83] several optimization techniques are utilized to cluster the modules of a system.
The goal of the clustering is to reverse engineer complex software systems by creating
views of their structure. The part of our approach, where we utilize optimization
techniques as explained in section 5.11 is very similar to and inspired from [83].
114 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
Although the underlying problem (set partitioning) is similar, we focus on different
qualities. Our goal is to improve availability, whereas modularity is the main focus
of [83]. As a result, we utilize different objective functions for the optimization
algorithms.
There are several techniques for estimating reliability and availability based on for-
mal models [121]. For instance, [72] presents a Markov model to compute the avail-
ability of a redundant distributed hardware/software system comprised of N hosts;
[122] presents a 3-state semi-Markov model to analyze the impact of rejuvenation
6
on
software availability and in [79], the authors use stochastic Petri nets to model and
analyze fault-tolerant CORBA (FT-CORBA) [89] applications. In general however,
these models are specified manually for specific system designs and the methodol-
ogy lacks a comprehensive tool-support. Consequently, the process rather slow and
error-prone, making these models less practical to use. In our tool, we generate for-
mal models for different decomposition alternatives for local recovery automatically.
In this work, quantitative availability analysis is performed with Markov models,
which are generated based on an architectural model expressed in an extended
xADL. There are similar approaches that are based on different ADLs or, that use
different analytical models. For example, architectural models that are expressed in
AADL (Architecture Analysis and Design Language) has been used as a basis for
generating Generalized Stochastic Petri Nets (GSPN) to estimate the availability
of design alternatives [102]. UML based architectural models have also been used
for generating analytical models to estimate dependability attributes. For instance,
augmented UML diagrams are used in [80] for generating Timed Petri Nets (TPN).
A recoverable unit can be considered as a wrapper of a set of system modules to
increase system dependability. Formal consistency analysis of error containment
wrappers is performed in [65]. In this approach, system components and their
composition are expressed based on the concepts of category theory. Using this
specification, the propagation of data (wrong value) errors is analyzed and a globally
consistent set of wrappers are generated based on assertion checks. The utilization
of the iC2C (See section 4.8) is discussed in [25], where a method and a set of
guidelines are presented for wrapping COTS (commercial off-the-shelf) components
based on their undesired behavior to meet the dependability requirements of the
system.
As far as local-recovery strategies are concerned, the work in [15] is similar to ours in
the sense that several decomposition alternatives are evaluated for isolating system
components from each other. However, their evaluation techniques are different:
whereas we predict the availability at design time based on formal models, [15]
6
Proactively restarting a software component to mitigate its aging and thus its failure.
Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 115
uses heuristics to choose a decomposition alternative and evaluate it by running
experiments on the actual implementation.
5.15 Conclusions
Local recovery requires the decomposition of software architecture into a set of iso-
lated recoverable units. There exist many alternatives to select the set of recoverable
units and each alternative has an impact on availability and performance of the sys-
tem. In this chapter, we have proposed a systematic approach to evaluate such
decomposition alternatives and balance them with respect to availability and per-
formance. The approach is supported by an integrated set of tools. These tools in
general can be utilized to support the analysis of a legacy system that is subject to
major modification, porting or integration (e.g. with recovery mechanisms in this
case). We have illustrated our approach by analyzing decomposition alternatives of
MPlayer. We have seen that the performance overhead introduced by decomposi-
tion alternatives can be easily observed. We have utilized implementations of three
decomposition alternatives to validate our analysis approach and the measurements
performed on these systems have confirmed the validity of the analysis results.
The proposed approach requires the existence of the source code of the system for
performance overhead estimation. This estimation is subject to the basic limitations
of dynamic analysis approaches as described in section 5.1. The assumptions listed
in section 5.10.1 constitute another limitation of the approach concerning the model-
based availability analysis. The outcome of the analysis is also dependent on the
provided MTTF values for the modules. The estimation of actual failure rates is
usually the most accurate way to define the MTTF values. However, this requires
historical data (e.g. a problem database), which can be missing or not accessible.
In our case study, we have used fixed values and what-if analysis.
116 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...
Chapter 6
Realization of Software
Architecture Recovery Design
In the previous two chapters, we have explained how to document and analyze de-
sign alternatives for local recovery. After one of the design alternatives is selected,
the software architecture should be structured accordingly. In addition, new supple-
mentary architectural elements and relations should be implemented to enable local
recovery. As a result, introducing local recovery to a system leads to additional de-
velopment and maintenance effort. In this chapter, we introduce a framework called
FLORA to reduce this effort. FLORA supports the decomposition and implemen-
tation of software architecture for local recovery.
The chapter is organized as follows. In the following section, we first outline the
basic requirements for implementing local recovery. In seciton 6.2, we introduce
the framework. Section 6.3 illustrates the application of FLORA with the MPlayer
case study. We evaluate the application of FLORA in section 6.4. We conclude the
chapter after discussing possibilities, limitations and related work.
117
118 Chapter 6. Realization of Software Architecture Recovery Design
6.1 Requirements for Local Recovery
Introducing local recovery to a system imposes certain requirements to its architec-
ture. Based on the literature and our experiences in the TRADER [120] project we
have identified the following three basic requirements:
• Isolation: An error (e.g. illegal memory access) occurring in one part of the
system can easily lead to a system failure (e.g. crash). To prevent this failure
and support local recovery we need to be able to decompose the system into a
set of recoverable unit (RU) that can be isolated (e.g. isolation of the memory
space). Isolation is usually supported by either the operating system (e.g.
process isolation [59]) or a middleware (e.g. encapsulation of Enterprise Java
Bean objects) and the existing design (e.g. crash-only design) [16].
• Communication Control : Although a RU is unavailable during its recovery,
the other RUs might still need to access it in the mean time. Therefore, the
communication between RUs must be captured to deal with the unavailabil-
ity of RUs, for example, by queuing and retrying messages or by generating
exceptions. Similar to loosely coupled components of the C2 architectural
style [114], RUs should not directly communicate with each other. In [16], for
instance, the communication is mediated by an application server. In general,
various alternatives can be considered for realizing the communication control
like completely distributed, hierarchical or centralized approaches.
• System-Recovery Coordination: In case recovery actions need to take place
while the system is still operational, interference with the normal system func-
tions can occur. To prevent this interference, the required recovery actions
should be coordinated and the communication control should be leaded ac-
cording to the applied recovery actions. Similar to communication control,
coordination can also be realized in different ways ranging from completely
distributed to completely centralized solutions.
6.2 FLORA: A Framework for Local Recovery
To reduce the development and maintenance effort for introducing local recov-
ery while preserving the existing decomposition we have developed the framework
FLORA. The framework has been implemented in the C language on a Linux plat-
form (Ubuntu version 7.04) to provide a proof-of-concept for local recovery. FLORA
assigns RUs to separate operating system (OS) processes, which do not directly com-
municate to each other. It relies on the capabilities of the platform to be able to
Chapter 6. Realization of Software Architecture Recovery Design 119
+kill()
+restart()
+cleanUp()
-state variables
RecoverabIe Unit
+diagnose()()
+sendMsg()
+recvMsg()
+applyCommPolicy()
-comm policies
Comm Manager
-recover()
-diagnose()
-onErrorDetected()
-strategies
Recovery Manager
+execute()
-error type
-recovery state
Recovery Strategy
+execute()
Strategy Set
+execute()
+restartAll()
Strategy n
+apply()
-msg queue
Communication PoIicy
+apply()
+retry()
PoIicy 0
+apply()
+drop()
PoIicy n
+execute()
+restartRU()
Strategy 0
+apply()
PoIicy Set
<<uses>>
<<uses>>
<<uses>>
Figure 6.1: A Conceptual View of FLORA
control/monitor processes (kill, restart, detect a crash) and focuses only on transient
faults. We assume that a process crashes or hangs in case of a failure of the RU that
runs on it. FLORA restarts the corresponding RU in a new process and integrates
it back to the rest of the system, while the other RUs can remain available.
The framework includes a set of reusable abstractions for introducing error detection,
diagnosis, communication control between RUs. Figure 6.1 shows a conceptual
view of FLORA as a UML class diagram. The framework comprises three main
components: Recoverable Unit (RU), Communication Manager (CM) and Recovery
Manager (RM).
To communicate with the other RUs, each RU uses the CM, which mediates all
inter-RU communication and employs a set of communication policies (e.g. drop,
queue, retry messages). Note that a communication policy can be composed of a
combination of other primitive or composite policies. The RM uses the CM to ap-
ply these policies based on the executed recovery strategy. The RM also controls
RUs (e.g. kill, restart) in accordance with the recovery strategy being executed.
Recovery strategies can be composed of a combination of other primitive or com-
posite strategies and they can have multiple states. They are also coupled with error
types, from which they can recover. FLORA implements the detection of two type
of errors: deadlock and fatal errors (e.g. illegal instruction, invalid data access).
Each RU can detect deadlock errors via its wrapper, which detects if an expected
response to a message is not received within a configured timeout period. The CM
120 Chapter 6. Realization of Software Architecture Recovery Design
can discover the RU that fails to provide the expected message. Diagnosis infor-
mation is conveyed to the RM, which kills and restarts the corresponding RU. RM
can detect fatal errors. It is the parent process of all RUs and it receives a signal
from the OS when a child process is dead. By handling this signal, the RM can also
discover the failed RU based on the associated process identifier. Then it restarts
the corresponding RU in a new process. If either of the CM or the RM fails, FLORA
applies global recovery and restarts the whole system.
FLORA comprises Inter-Process Communication (IPC) utilities, message serializa-
tion / de-serialization primitives, error detection and diagnosis mechanisms, a RU
wrapper template, one central RM and one central CM that communicate with one
or more instances of RU. In the following section, we explain how FLORA can be
applied to adapt a given architecture for local recovery.
6.3 Application of FLORA
We have used FLORA to implement local recovery for 3 decomposition alternatives
of MPlayer (See Figure 6.2). The implementations of these alternatives have been
used in the evaluation of our analysis approach as presented in Chapter 5. The
fist decomposition alternative places all the modules in a single RU (Figure 6.2(a)).
In the second alternative, the module Gui and the rest of the modules are placed
in different RUs (Figure 6.2(b)). The third decomposition alternative comprises 3
RUs, where the module Gui, Libao and the rest of the modules are separated from
each other (Figure 6.2(c))
Figure 6.3 depicts the design of MPlayer for each of the decomposition alternatives
after local recovery is introduced. Note that the figures 6.3(a) and 6.3(c) actually
represent the same designs that were depicted before in the figures 4.7(a) and 4.7(b),
respectively. In Chapter 4, the recovery style was used for depicting these design
alternatives. In this chapter, we use the local recovery style to represent the same
design alternatives in more detail.
The first decomposition alternative shown in Figure 6.2(a) comprises only one RU.
Hence, the CM is not used in this case and it does not take place in the correspond-
ing recovery design shown in Figure 6.3(a). The third decomposition alternative was
previously discussed in the previous two chapters and it was presented in the fig-
ures 4.6 and 5.3. In the corresponding recovery design, in Figure 6.3(c), we can see
the 3 RUs, RU MPCORE, RU GUI and RU AUDIO. In addition, the components
CM and RM have been introduced by the framework.
Chapter 6. Realization of Software Architecture Recovery Design 121
«subsystem»
MpIayer
«subsystem»
Gui
«subsystem»
Libvo
«subsystem»
Demuxer
«subsystem»
Stream
«subsystem»
Libmpcodecs
«subsystem»
Libao

(a) all the modules are placed in one RU
«subsystem»
MpIayer
«subsystem»
Gui
«subsystem»
Libvo
«subsystem»
Demuxer
«subsystem»
Stream
«subsystem»
Libmpcodecs
«subsystem»
Libao

MPCORE
RU
GUI
(b) Gui and the rest of the modules are placed
in different RUs
«subsystem»
MpIayer
«subsystem»
Gui
«subsystem»
Libvo
«subsystem»
Demuxer
«subsystem»
Stream
«subsystem»
Libmpcodecs
«subsystem»
Libao
RU
MPCORE
RU

(c) Gui, Libao and the rest of the modules are
placed in different RUs
Figure 6.2: Realized decomposition alternatives of MPlayer (KEY: Figure 4.6)
122 Chapter 6. Realization of Software Architecture Recovery Design
<< RU >>
MPLAYER
<<NRU>> recovery mgr
<<restart>>
<<kill>>
(a) all the modules are placed in one RU
<< RU >>
MPCORE
<< RU >>
GUI
<<NRU>> recovery mgr <<NRU>> comm mgr
<<restart>>
<<restart>>
<<error :deadlock>>
<<error :deadlock>>
<<error :fatal>>
<<diagnosis>>
<<queued>>
<<queued>> <<kill>>
<<kill>>
(b) Gui and the rest of the modules are placed in dif-
ferent RUs
<< RU >>
AUDIO
<< RU >>
MPCORE
<< RU >>
GUI
<<NRU>> recovery mgr <<NRU>> comm mgr
<<restart>>
<<restart>>
<<restart>>
<<error :deadlock>>
<<error :deadlock>>
<<error :deadlock>>
<<error :fatal>>
<<diagnosis>>
<<queued>>
<<queued>>
<<queued>> <<kill>>
<<kill>>
<<kill>>
(c) Gui, Libao and the rest of the modules are placed
in different RUs
Figure 6.3: Recovery views of MPlayer based on the realized decomposition alter-
natives (KEY: Figure 4.4)
Chapter 6. Realization of Software Architecture Recovery Design 123
Each RU can detect deadlock errors. The RM can detect fatal errors. All error
notifications are sent to the CM, which can control the communication accordingly.
The CM conveys diagnosis information to the RM, which kills RUs and/or restarts
dead RUs. Messages that are sent from RUs to the CM are stored (i.e. queued) by
RUs in case the destination RU is not available and they are forwarded when the
RU becomes operational again. On the other hand, RUs can also use stable storage
utilities provided by FLORA to preserve some of their internal state after recov-
ery. So, FLORA enables a combination of check-pointing and log-based recovery to
preserve state.
To apply FLORA, each RU is wrapped using the RU wrapper template. Figure 6.4
shows a part of the RU wrapper that is used for RU GUI. It wraps and isolates
the Gui module of MPlayer. The wrapper includes the necessary set of utilities
for isolating and controlling a RU (lines 1-3). Hereby, “rugui.h” includes the list
of function signatures, whereas “util.h” and “recunit.h” includes a set of standard
utilities for RUs. Later, we will show and explain a part of the utilities provided
by “recunit.h” that is related to check-pointing. To make use of this utility a set
of state variables, their size, and format (integer, string, etc.) must be declared
in the wrapper (lines 5-7). If needed, cleanup specific to the RU (e.g. allocated
resources) can be specified (lines 10-12) as a preparation for recovery. Post-recovery
initialization (lines 14-20) by default includes maintaining the connection with the
CM and the RM (line 15), obtaining the check-pointed state variables (line 17) and
start processing incoming messages from other RUs (line 19). Additional RU-specific
initialization actions can also be specified here.
Each RU provides a set of interfaces, which are captured based on the specifica-
tion in the wrapper (lines 22-26). Each interface defines a set of functions that
are marshaled [32] and transferred through IPC. On reception of these calls, the
corresponding functions are called and then the results are returned (lines 28-31
and 34-37). In all other RUs where this function is declared, function calls are
redirected through IPC to the corresponding interface with C MACRO definitions.
In principle we could also use aspect-oriented programming techniques (AOP) [37]
for this, provided that an appropriate weaver for the C language is available. In
Figure 6.5 a code section is shown from one of the modules of RU MPCORE, where
all calls to the function guiInit are redirected to the function mpcore gui guiInit
(line 1), which activates the corresponding interface (INTERFACE GUI) instead
of performing the function call (lines 4-6).
124 Chapter 6. Realization of Software Architecture Recovery Design
1: #include "util.h"
2: #include "recunit.h"
3: #include "rugui.h"
4: ...
5: #define STATE_SIZE sizeof(int)
6: #define STATE_VARS guiIntfStruct.Playing
7: #define STATE_PARSE_FORMAT "d", &guiIntfStruct.Playing
8: ...
9:
10: void cleanUp() {
11: /* no RU specific cleanup */
12: }
13:
14: void __ruGui(struct recunit info) {
15: INIT_RU(SOCK_PATH_RECMGR, SOCK_PATH_CONN)
16: ...
17: PRESERVE_STATE
18: ...
19: processMsgs();
20: }
21:
22: void catchInterfaces() {
23: BEGIN
24: CATCH(INTERFACE_GUI, apOnMsgRcvd_gui)
25: END
26: }
27:
28: void OnMsgRcvd_gui_guiInit() {
29: guiInit();
30: RETURN(INTERFACE_GUI, msg_gui_guiInit)
31: }
32: ...
33:
34: void OnMsgRcvd_gui_guiGetFileName() {
35: RETURN_ARGS(INTERFACE_GUI, msg_gui_guiGetFileName,
36: sizeof(int) + (strlen(filename)+1),
37: "ds", (strlen(filename)+1), filename)
38: CHECKPOINT
39: }
40: ...
Figure 6.4: RU Wrapper code for RU GUI
1: #define guiInit() mpcore_gui_guiInit()
2: ...
3:
4: void mpcore_gui_guiInit() {
5: CALL(INTERFACE_GUI, msg_gui_guiInit)
6: }
7: ...
Figure 6.5: Function redirection through RU interfaces
Chapter 6. Realization of Software Architecture Recovery Design 125
Figure 6.6 shows a part of the utilities provided by “recunit.h” as a set of C MACRO
definitions. These utilities are imported by each RU wrapper by default (Line 2 in
Figure 6.4). The definitions that are shown in Figure 6.6 are related to the check-
pointing (Lines 3-4) and preservation of state (Lines 7-16). The CHECKPOINT
MACRO sends the data that is defined in the RU wrapper (Line 6 in Figure 6.4)
to the stable storage. Stable storage runs as a separate process in FLORA and it
accepts get/set messages for retaining/saving data. The check-pointing locations
are defined in the RU wrapper by the designer (Line 38 in Figure 6.4) based on the
message exchanges through the RU interface. The PRESERVE STATE MACRO
obtains the saved data from the stable storage. The internal state of the RU is
then updated based on the specified format (Line 7 in Figure 6.4), which defines a
mapping of the saved data to the internal state variables.
1: ...
2:
3: #define CHECKPOINT \
4: CALL_ARGS(INTERFACE_SS_CONTROL, ..., STATE_VARS)
5: ...
6:
7: #define PRESERVE_STATE \
8: iRcvdStateSize = 0; \
9: pcStateData = (char *)malloc(0); \
10: CALL(INTERFACE_SS_CONTROL, msg_ss_getData) \
11: parseMsgArgs(..., &iRcvdStateSize, &pcStateData); \
12: if(iRcvdStateSize > 0) \
13: { \
14: parseMsgArgs(pcStateData, STATE_PARSE_FORMAT); \
15: } \
16: free(pcStateData);
17: ...
Figure 6.6: Standard RU utilities defined in recunit.h concerning check-pointing and
state preservation
Besides the definition of RU wrappers, the designer should configure FLORA accord-
ing to the recovery design. For example, Figure 6.7 shows a part of the configuration
file that was used for the recovery design shown in Figure 6.3(c). Basically, the fol-
lowing information is included in the configuration file presented in Figure 6.7.
• the set of RUs (Lines 2-5)
• the set of interfaces (Lines 8-11)
• the identifiers and initialization functions of RUs (Lines 14-17)
• for each RU, the destination (the other RUs, RM, CM) of messages that will
be sent through the defined interfaces (Lines 22-27)
126 Chapter 6. Realization of Software Architecture Recovery Design
1: ...
2: /* recoverable units */
3: #define RUMPCORE 3
4: #define RUGUI 4
5: #define RUAUDIO 5
6: ...
7:
8: /* interfaces */
9: #define INTERFACE_ERROR 0
10: #define INTERFACE_COMM_CONTROL 1
11: #define INTERFACE_RECOVERY_CONTROL 2
12: ...
13:
14: /* initialization */
15: ...
16: arecunit[RUGUI].iRUid = RUGUI; \
17: arecunit[RUGUI].pfMain = __ruGui; \
18: ...
19:
20: /* binding */
21: ...
22: BEGINRU(RUGUI) \
23: BIND(INTERFACE_RECOVERY_CONTROL, RECMGR) \
24: BIND(INTERFACE_ERROR, CONNECTOR) \
25: BIND(INTERFACE_MP, RUMPCORE) \
26: ...
27: ENDRU \
28: ...
Figure 6.7: Configuration of FLORA
6.4 Evaluation
If all the function calls that pass the boundaries of RUs are defined, FLORA guaran-
tees the correct execution and recovery of these RUs. However, the specification of
the RU boundaries with the RU wrapper template requires an additional effort. The
main effort is spent due to the definition of the RU wrappers. For the decomposition
shown in Figure 4.6, we have measured this effort based on the lines of code (LOC)
written for RU wrappers and the actual size of the corresponding RUs
1
. Table 6.1
shows the LOC for each RU (LOC
RU
), LOC of its wrapper (LOC
RUwrapper
) and
their ratio ((LOC
RUwrapper
/LOC
RU
) ×100).
As we can see from Table 6.1, we had to write approximately 1K LOC to apply
FLORA for the third decomposition alternative. The LOC written for wrappers is
negligible compared to the corresponding system parts that are wrapped. The size
1
We have excluded the source code for the various supported codecs, which are encapsulated
mostly in Libmpcodecs.
Chapter 6. Realization of Software Architecture Recovery Design 127
Table 6.1: LOC for the selected RUs (as shown in Figure 4.6), LOC for the corre-
sponding wrappers and their ratio
LOC
RU
LOC
RUwrapper
ratio
RU MPCORE 214K 463 0,22%
RU GUI 20K 345 1,72%
RU AUDIO 8K 209 2,61%
TOTAL 242K 1017 0,42%
of the wrapper becomes even less significant for bigger system parts. In fact, the
wrapper size is independent of the size and internal complexity of the system part
that is wrapped. This is because the wrapper captures only the interaction of a RU
with the rest of the system.
In FLORA, we have considered only transient faults, which lead to a crash or hanging
of a OS process. In this scope, there is a standard implementation for the RM, the
CM, and the RU wrapper utilities regardless of the number of and type of RUs.
For this reason, LOC for these components remains the same when we increase the
number of RUs
2
. Hence, mainly the specification of RU wrappers causes the extra
LOC that should be written to apply FLORA. To be able to estimate the LOC to
be written for wrappers, we have used the following equation.
LOC
total
= 30 ×|RU| + 15 ×

r∈RU
calls(r →r) (6.1)
Equation 6.1 estimates the LOC needs to be written for wrappers based on the
following assumptions. There should be a wrapper for each RU with some default
settings (Figure 6.4). Therefore, the equation includes a fix amount of LOC (30)
times the number of RUs (|RU|). In addition, all function calls between RUs must
be defined in the corresponding wrappers. For each such function call we add a fix
amount of LOC (15) taking into account the code for redirection of the function,
capturing and processing its arguments and return values. To calculate Equation 6.1,
we have used MFDE (Section 5.8.1) for calculating the number of function calls
between the selected RU boundaries. We have generated all the set of possible
partitions for varying number of RUs. We have calculated Equation 6.1 for each
possible partition and we determined the minimum and maximum LOC estimations
with respect to the number of RUs. The results can be seen in Figure 6.8.
In Figure 6.8, we can see the range of LOC estimations with respect to the number
2
The performance overhead of these components at run-time, of course, increases.
128 Chapter 6. Realization of Software Architecture Recovery Design
1017
0
500
1000
1500
2000
2500
3000
3500
4000
1 2 3 4 5 6 7
Number of RUs
L
i
n
e
s

o
f

C
o
d
e
Figure 6.8: Estimated LOC to be written for wrappers with respect to the number
of RUs
of RUs. When the number of RUs is equal to 1, there exists logically only one possi-
ble partitioning of the system. When the number of RUs is equal to 7, which is the
number of modules in the system, there also exists only one possible partitioning
of the system (each RU comprises a module of the system). Therefore, the mini-
mum and the maximum values are equal for these cases. Figure 6.8 also marks the
LOC written in the actual implemented case with 3 RUs as presented in Table 6.1.
FLORA itself includes source code of size 1,5K LOC, which was reused as is. This
includes the source code of the CM, the RM and standard utilities of RU wrappers
(“util.h” and “recunit.h”). Thus, we can provide an approximate prediction of the
effort for using the framework before the actual implementation.
Depending on how homogeneous the coupling between system modules is, the anal-
ysis can point out exceptional decompositions. For instance, in the analysis of the
MPlayer case, we can see that several decompositions with 6 RUs require less over-
head compared to the maximum overhead caused by a decomposition with 2 RUs.
This means that there are certain modules that are exceptionally highly coupled
(e.g. Libvo and Libmpcodecs) compared to the other modules.
6.5 Discussion
Refactoring the software architecture
FLORA aims at introducing local recovery to a system, while preserving the ex-
isting decomposition. To decide on the partitioning of modules into RUs, we can
Chapter 6. Realization of Software Architecture Recovery Design 129
take coupling and shared data among the modules into account. The system can
be analyzed to select a partitioning that requires minimal effort to introduce local
recovery as discussed in Chapter 5 and in the previous section. On the other hand,
one can also consider to refactor the software architecture to remove existing depen-
dencies and shared data among some of the modules. The effort needed to refactor
the architecture depends very much on the nature and semantics of the architecture.
Restructuring for decoupling shared variables and removing function dependencies
can be considered in case the architecture is already being refactored to improve
certain qualities like reusability, maintainability or evolvability. In fact, this is a
viable approach if the architects/developers have very good insight in the system.
Otherwise, it would be better to treat modules as black boxes and wrap them with
FLORA.
State preservation
State variables that are critical and need to be saved can be declared in RU wrappers.
Declared state variables are check-pointed and automatically restored after recovery
by FLORA. For instance, in the wrapper of RU GUI (See Figure 6.4), the state
variable guiInfStruct.Playing is declared (Line 6). guiInfStruct is a C structure
(struct [67]) and Playing is a variable within this structure, which keeps the status
of the media player (e.g. playing, paused, stopped). The value of this variable
is restored after the Gui module is restarted. However, particular states that are
critical for modules are application dependent. Such states must be known and
explicitly declared in the corresponding wrappers by the developers.
6.6 Related Work
In [26], a survey of approaches for application-level fault tolerance is presented.
According to the categorization of this survey, FLORA falls into the category of
single-version software fault tolerance libraries (SV libraries). SV libraries are said
to be limited in terms of separation of concerns, syntactical adequacy and adapt-
ability [26]. On the other hand, they provide a good ratio of cost over improvement
of the dependability, where the designer can reuse existing, long-tested and sophis-
ticated pieces of software [26]. An example SV library is libft [58], which collects
reusable software components for backward recovery. However, likewise other SV
libraries [26] it does not support local recovery.
Candea et al. introduced the microreboot [16] approach, where local recovery is
applied to increase the availability of Java-based Internet systems. Microreboot
130 Chapter 6. Realization of Software Architecture Recovery Design
aims at recovering from errors by restarting a minimal subset of components of
the system. Progressively larger subsets of components are restarted as long as
the recovery is not successful. To employ microreboot, a system has to meet a set
of architectural requirements (i.e. crash-only design [16]), where components are
isolated from each other and their state information is kept in state repositories.
Designs of many existing systems do not have these properties and it might be too
costly to redesign and implement the whole system from the start. FLORA provides
a set of reusable abstractions and mechanisms to support the refactoring of existing
systems to introduce local recovery.
In [53], a micro-kernel architecture is introduced, where device drivers are executed
on separate processes at user space to increase the failure resilience of an operat-
ing system. In case of a driver failure, the corresponding process can be restarted
without affecting the kernel. The design of the operating system must support
isolation between the core operating system and its extensions to enable such a re-
covery [53]. Mach kernel [96] also provides a micro-kernel architecture and flexible
multiprocessing support, which can be exploited for failure resilience and isolation.
Singularity [59] proposes multiprocessing support in particular to improve depend-
ability and safety by introducing the concept of sealed process architecture, which
limits the scopes of processes and their capabilities with respect to memory alter-
ation for better isolation. To be able to exploit the multiprocessing support of the
operating system for isolation, the application software must be partitioned to be
run on multiple processes. FLORA supports the partitioning of an application soft-
ware and reduces the re-engineering effort, while making use of the multiprocessing
support of the Linux operating system.
Erlang/OTP (Open Telecoms Platform) [38] is a programming environment used
by Ericsson to achieve highly available network switches by enabling local recov-
ery. Erlang is a functional programming language and Erlang/OTP is used for
(re)structuring Erlang programs in so-called supervision trees. A supervision tree
assigns system modules in separate processes, called workers. These processes are
arranged in a tree structure together with hierarchical supervisor processes at dif-
ferent levels of the tree. Each supervisor monitors the workers and the other su-
pervisors among its children and restarts them when they crash. FLORA is used
for restructuring C programs and partitioning them into several processes with a
central supervisor (i.e. the RM).
We have implemented FLORA basically using macro-definitions in the C language.
We could also implement FLORA using techniques [37] in which usually a distinction
is made between base code on which additional so-called crosscutting concerns are
woven. In particular the function redirection calls to IPC could be automatically
woven using aspects. We consider this as a possible future work.
Chapter 6. Realization of Software Architecture Recovery Design 131
Some fault tolerance techniques for achieving high availability are provided by mid-
dlewares like FT CORBA [89]. However, it still requires additional development
and maintenance efforts to utilize the provided techniques. Aurora Management
Workbench (AMW) [13] uses model-centric development to integrate a software
system with an high availability middleware. AMW generates code based on the
desired fault tolerance behavior that is specified with a domain specific language.
By this way, it aims at reducing the amount of hand-written code and as such re-
ducing the developer effort to integrate fault tolerance mechanisms provided by the
middleware with the system being developed. Developers should manually write
code only for component-specific initialization, data maintenance and invocations of
check-pointing APIs similar to those specified within RU wrapper templates. AMW
allows software components (or servers) to be assigned to separate RUs (capsule in
AMW terminology) and restarted independently. However, AMW currently does
not support the restructuring and partitioning of legacy software to introduce local
recovery.
6.7 Conclusions
In this chapter, we have presented the framework FLORA, which supports the re-
alization of local recovery. We have used FLORA to implement local recovery for
3 decomposition alternatives of MPlayer. The realization effort for applying the
framework appears to be relatively negligible. FLORA provides reusable abstrac-
tions for introducing local recovery to software architectures, while preserving their
existing structure.
132 Chapter 6. Realization of Software Architecture Recovery Design
Chapter 7
Conclusion
Most home equipments today such as television sets are software-intensive embedded
systems. The size and complexity of software in such systems are increasing with
every new product generation due to the increasing number of new requirements,
constantly changing hardware technologies and integration with other systems in
networked environments. Together with time pressure, these trends make it harder,
if not impossible, to prevent or remove all possible faults in a system. To cope
with this challenge, fault tolerance techniques have been introduced, which enable
systems to recover from failures of its components and continue operation. Many
fault tolerance techniques are available but incorporating them in a system is not
always trivial. In the following sections, we summarize the problems that we have
addressed, the solutions that we have proposed and possible directions for future
work.
133
134 Chapter 7. Conclusion
7.1 Problems
In this thesis, we have addressed the following problems in designing a fault-tolerant
system.
• Early Reliability Analysis: To select and apply appropriate fault tolerance
techniques, we need reliability analysis methods to i ) analyze potential fail-
ures, ii ) prioritize them, and iii ) identify sensitivity points of a system ac-
cordingly. In the consumer electronics domain, the prioritization of failures
must be based on user perception instead of the traditional approach based on
safety. Moreover, because implementing the software architecture is a costly
process it is important to apply the reliability analysis as early as possible
before committing considerable amount of resources.
• Modeling Software Architecture for Recovery: It turns out that introducing re-
covery mechanisms to a system has a direct impact on its architecture design,
which results in several new elements, complex interactions and even a par-
ticular decomposition for error containment. Existing viewpoints [56, 69, 70]
mostly capture functional aspects of a system and they are limited for explic-
itly representing elements, interactions and design choices related to recovery.
• Optimization of Software Decomposition for Recovery: Local recovery is an
effective approach for recovering from errors. It enables the recovery of the er-
roneous parts of a system while the other parts of the system are operational.
One of the requirements for introducing local recovery to a system is isolation.
To prevent the failure of the whole system in case of a failure of one of its com-
ponents, the software architecture must be decomposed into a set of isolated
recoverable units. There exist many alternatives to decompose a software ar-
chitecture for local recovery. Each alternative has an impact on availability
and performance of the system. We need an adequate integrated set of analy-
sis techniques to optimize the decomposition of software architecture for local
recovery.
• Realization of Local Recovery: It appears that the optimal decomposition for
local recovery is usually not aligned with the decomposition based on func-
tional concerns and the realization of local recovery requires substantial devel-
opment and maintenance effort. This is because additional elements should
be introduced and the existing system must be refactored.
Chapter 7. Conclusion 135
7.2 Solutions
In this section, we explain how we have addressed the aforementioned problems. The
proposed techniques tackle different architecture design issues (modeling, analysis,
realization) at different stages of software development life cycle (early analysis,
maintenance) and as such are complementary.
7.2.1 Early Reliability Analysis
To select and apply appropriate fault tolerance techniques, in Chapter 3 we have
introduced the software architecture reliability analysis method (SARAH). SARAH
integrates the best practices of conventional reliability engineering techniques [34]
with current scenario-based architectural analysis methods [31]. Conventional relia-
bility engineering includes mature techniques for failure analysis and evaluation tech-
niques. Software architecture analysis methods provide useful techniques for early
analysis of the system at the architecture design phase. Despite most scenario-based
analysis methods which usually do not focus on specific quality factors, SARAH is
a specific purpose analysis method focusing on the reliability quality attribute. Fur-
ther, unlike conventional reliability analysis techniques which tend to focus on safety
requirements SARAH prioritizes and analyzes failures basically from the user per-
ception because of the requirements of consumer electronics domain. To provide a
reliability analysis based on user perception we have extended the notion of fault
trees and refined the fault tree analysis approach. SARAH has been illustrated for
identifying the sensitive modules for the DTV and it has provided an important
input for the enhancement of the architecture. Besides the outcome of the analy-
sis, the process of doing such an explicit analysis has provided better insight in the
potential risks of the system.
7.2.2 Modeling Software Architecture for Recovery
We have introduced the recovery style in Chapter 4 to document and analyze recov-
ery properties of a software architecture. We have also introduced a local recovery
style and illustrated its application on the open source media player, MPlayer. Re-
covery views of MPlayer based on the recovery style have been used within our
project to communicate the design of our local recovery framework (FLORA) and
its application to MPlayer. These views have formed the basis for analysis and
support for the detailed design of the recovery mechanisms.
136 Chapter 7. Conclusion
7.2.3 Optimization of Software Decomposition for Recovery
We have proposed a systematic approach in Chapter 5 for analyzing an existing
system to decompose its software architecture to introduce local recovery. The ap-
proach enables i ) depicting the alternative space of the possible decomposition alter-
natives, ii ) reducing the alternative space with respect to domain and stakeholder
constraints, and iii ) balancing the feasible alternatives with respect to availabil-
ity and performance. We have provided the complete integrated tool-set, which
supports the whole process of decomposition optimization. Each tool automates a
particular step of the process and it can be utilized for other types of analysis and
evaluation as well. We have illustrated our approach by analyzing decomposition
alternatives of MPlayer. We have seen that the impact of decomposition alterna-
tives can be easily observed, which are based on actual measurements regarding
the isolated modules and their interaction. We have implemented local recovery for
three decomposition alternatives to validate the results that we obtained based on
analytical models and optimization techniques. The measured results turn out to be
closely matching the estimated availability, while the alternatives were ordered cor-
rectly with respect to their estimated and measured availabilities and the objective
function used for optimization.
7.2.4 Realization of Local Recovery
In Chapter 6 we have presented the framework FLORA that provides reusable ab-
stractions to preserve the existing structure and support the realization of local
recovery. We have used FLORA to implement local recovery for the three decom-
position alternatives of MPlayer that are used for validating the analysis results
based on analytic models and optimization techniques. In total three recoverable
units (RUs) have been defined, which were overlaid on the existing structure without
adapting the individual modules. In addition, the realization effort for applying the
framework and introducing local recovery appears to be relatively negligible. The
application of the framework, as such, provides a reusable and practical approach
to introduce local recovery to software architectures.
Chapter 7. Conclusion 137
7.3 Future Work
In the following, we provide several future directions regarding each of the methods
and tools presented in this thesis.
The scenario derivation process in SARAH can be improved by utilizing a software
architecture model with more semantics to be able to check properties of derived
scenarios (e.g. whether they correctly capture the error propagation). The other
inputs of the analysis can also be improved like severity values for failures based on
user perception and fault occurrence probabilities. We have made use of spreadsheets
to perform the required calculations in SARAH. A better tool-support, possibly with
a visual editor, can be provided to reduce the analysis effort.
We have introduced the local recovery style as a specialization of the recovery style
to represent a local recovery design in more detail. Similarly, we can introduce
additional architectural styles to represent particular fault tolerance mechanisms.
We have introduced the recovery style based on the module viewtype. As discussed
in 4.7, we can also derive a recovery style for the component & connector viewtype
or the allocation viewtype to represent recovery views of run-time or deployment
elements.
Several points can be improved in the analysis and optimization of decomposition
alternatives for recovery. As discussed in 5.13, we can increase the expressiveness
power of the constraint specification. Another improvement can be about the esti-
mation of function and data dependencies, for which we have performed our mea-
surements during a single usage scenario. The analysis can take different types of
scenarios into account based on the usage profile of the system. As for future direc-
tions regarding model-based availability analysis, one can revisit some or all of the
assumptions made in section 5.10.1 and/or modify any of the four basic I/O-IMC
models (Appendix B.1) to more accurately represent the real system behavior. In
addition, new heuristics and alternative optimization algorithms can be incorporated
in the tool-set for a faster or better (i.e. closer to the global optimum) selection of
a decomposition alternative.
FLORA can be improved by reconsidering the fault assumptions and related fault
tolerance mechanisms to be incorporated. FLORA is basically a software library
and it still requires effort to utilize the library. This effort can be reduced by using
model-driven engineering and aspect-oriented software development techniques as
discussed in section 6.6.
As a future work concerning all the proposed techniques, further case studies or
experiments can be conducted to evaluate the methods and tools within the context
of various application domains.
138 Chapter 7. Conclusion
Appendix A
SARAH Fault Tree Set
Calculations
In the following, Figure A.1 depicts the Fault Tree Set that was used for the analysis
presented in Chapter 3. Failure scenarios are labeled with the IDs of the associated
architectural elements and the user perceived failures are annotated with sever-
ity values. Table A.1 shows the corresponding Architectural-Level analysis results.
Hereby, AEID, NF, WNF, PF and WPF stand for Architectural Element ID, Num-
ber of Failures, Weighted Number of Failures, Percentage of Failures and Weighted
Percentage of Failures, respectively.
139
140 Chapter A. SARAH Fault Tree Set Calculations
CMR CMR CH
CB
AMR
AMR
4 5
DDÌ VC
VP
4
CMR CH
AMR
AP
AC
5
4
AO
5
DDÌ AMR
CA
3
LSM
CB PÌ

VC PM TXT EPG
2
3 3 4 4 3
PM
T
5
G
GC
4
VO
5
AC
2
AMR TXT
EPG
5
TXT
5
TXT TXT
5
3 4
Figure A.1: Fault Tree Set
Table A.1: SARAH Architecture-Level Analysis results
AEID AC AMR AO AP CA CB CH CMR DDI EPG
NF 2 5 1 1 1 2 2 3 2 2
WNF 6 49 5 4 3 8 9 14 15 8
PF 5,41 13,51 2,70 2,70 2,70 5,41 5,41 8,11 5,41 5,41
WPF 3,09 25,26 2,58 2,06 1,55 4,12 4,64 7,22 7,73 4,12
AEID G GC PI PM LSM T TXT VO VC VP
NF 1 1 2 2 1 1 4 1 2 1
WNF 4 4 5 9 4 5 27 5 6 4
PF 2,70 2,70 5,41 5,41 2,70 2,70 10,81 2,70 5,41 2,70
WPF 2,06 2,06 2,58 4,64 2,06 2,58 13,92 2,58 3,09 2,06
Appendix B
I/O-IMC Model Specification and
Generation
B.1 Example I/O-IMC Models
In this section, we provide details on the 4 basic I/O-IMC models that we have used
for availability analysis of decomposition alternatives for local recovery. As briefly
explained in section 5.10.1, these are the module I/O-IMC, the failure interface I/O-
IMC, the recovery interface I/O-IMC, and the recovery manager (RM) I/O-IMC.
To explain these models, we use the running example with 3 modules and 2 RUs as
depicted in Figure B.1. This example consists of a RM and two recoverable units
(RUs); RU 1 has one module A and RU 2 has two modules B and C. By convention,
the starting state of any I/O-IMC is state 0 and the RUs are numbered starting from
1.
141
142 Chapter B. I/O-IMC Model Specification and Generation
Recovery
Manager
Failure
Ìnterface
Recovery
Ìnterface
Module B
RU 2
up_B, failed_B
Module C
up_C, failed_C
recovered_C,
recovering_2
recovered_B,
recovering_2
RU 1
up_1, failed_1 up_2, failed_2
start_recover_1 start_recover_2
KEY RU boundaries Ì/O-ÌMC models
Ì/O-ÌMC
input/output
actions
Module A
Failure
Ìnterface
Recovery
Ìnterface
up_A, failed_A
recovered_A,
recovering_1
Figure B.1: The running example with 3 modules and 2 RUs
B.1.1 The module I/O-IMC
Figure B.2 shows the I/O-IMC of module B. The module is initially operational in
state 0, and it can fail with rate 0.2 and move to state 2. In state 2, the module
notifies the failure interface of RU 2 about its failure (i.e. transition from state
2 to state 1). In state 1, the module awaits to be recovered (i.e. receiving signal
‘recovered B’ from the recovery interface), and once this happens it outputs an ‘up’
signal notifying the failure interface about its recovery (i.e. transition from state 3 to
state 0). Signal ‘recovering 2’ is received from the recovery interface indicating that
a recovery procedure of RU 2 has been initiated. The remaining input transitions
are necessary to make the I/O-IMC input-enabled.
B.1.2 The failure interface I/O-IMC
Figure B.3 shows the I/O-IMC model of RU 2 failure interface. The failure interface
simply listens to the failure signals of modules B and C, and outputs a ‘failure’ signal
for the RU (i.e. ‘failed 2’) upon the receipt of any of these two signals. In fact, this
interface behaves as an OR gate in a fault tree. So, the failure interface outputs an
‘up’ signal for the RU (i.e. ‘up 2’) when all the failed modules have output their
‘up’ signals. For instance, consider the following sequence of states: 0, 1, 4, 7, and
0; this corresponds to modules B and C being initially operational, then B fails,
Chapter B. I/O-IMC Model Specification and Generation 143
0
1
2
3
recovered_B?
recovering_2?
up_B!
recovering_2?
recovered_B?
failed_B!
recovered_B?
recovering_2?
rate 0.2
recovering_2?
recovered_B?
Figure B.2: The module I/O-IMC model. Input = {recovered B, recovering 2}, Output
= {failed B, up B}.
followed by RU 2 outputting its ‘failed’ signal, then signal ‘up B’ is received from
module B, and finally RU 2 outputs its own ‘up’ signal.
B.1.3 The recovery interface I/O-IMC
Figure B.4 shows the I/O-IMC model of RU 2 recovery interface. The recovery
interface receives a ‘start recover’ signal from the RM (transition from state 0 to
state 1), allowing it to start the RU’s recovery. A ‘recovering’ signal is then output
(transition from state 1 to state 2) notifying all the modules within the RU that a
recovery phase has started (essentially disallowing any remaining operational module
to fail). Then two sequential recoveries (i.e. of B and C) take place both with rate
1 (transitions from state 2 to state 3 and from state 3 to state 4), followed by two
sequential ‘recovered’ notifications (transitions from state 4 to state 5 and from state
5 to state 0).
144 Chapter B. I/O-IMC Model Specification and Generation
4
0
5
1
6
2
7
3
failed_B?
up_B?
up_C?
failed_B?
failed_C?
failed_C?
up_C?
up_B?
up_B?
failed_B?
failed_B?
failed_2!
up_C?
up_C?
failed_C?
failed_C?
up_B?
up_B?
failed_B?
failed_B?
up_C?
failed_2!
up_C?
failed_C?
failed_C?
failed_B?
up_B?
up_B?
up_2!
failed_B?
up_C?
failed_2!
failed_C?
up_C?
up_B?
failed_C?
Figure B.3: The failure interface I/O-IMC model. Input = {failed B, up B, failed C,
up C}, Output = {failed 2, up 2}.
Chapter B. I/O-IMC Model Specification and Generation 145
4
0
5
1
2
3
start_recover_2?
start_recover_2?
start_recover_2?
recovered_C!
rate 1.0
start_recover_2?
start_recover_2?
rate 1.0
recovered_B!
recovering_2!
start_recover_2?
Figure B.4: The recovery interface I/O-IMC model. Input ={start recover 2}, Output
= {recovering 2, recovered B, recovered C}.
B.1.4 The recovery manager I/O-IMC
Figure B.5 shows the I/O-IMC model of the RM. The RM monitors the failure of
RU 1 and RU 2, and when an RU failure is detected, the RM grants its recovery by
outputting a ‘start recover’ signal. The RM has internally a queue of failing RUs
that keeps track of the order in which the RUs have failed. The recovery policy
of RM is to grant a ‘start recover’ signal to the first failing RU. In queuing theory
literature, this is referred to as a first-come-first-served (FCFS) policy. For instance,
consider the following sequence of states: 0, 1, 4, 7, 2, 6, and 0; this corresponds
to both RUs being initially operational, then RU 1 fails, immediately followed by
an RU 2 failure. Since RU 1 failed first, it is granted the ‘start recover’ signal
(transition from state 4 to state 7).Then, the RM waits for the ‘up’ signal of RU 1,
and once received, RM grants the ‘start recover’ signal to RU 2 since RU 2 is still
in the queue of failing RUs (transition from state 2 to state 6). Finally, the RM
receives ‘up 2’ and both RUs are operational again.
Note that any of the 4 I/O-IMC models presented here can be, to a certain extent,
locally modified without affecting the remaining models. As an example, one might
modify the RM by implementing a different recovery policy.
146 Chapter B. I/O-IMC Model Specification and Generation
8
4
0
5
1
6
2
7
3
failed_1?
failed_1?
up_2?
up_2?
up_2?
failed_2?
failed_2?
start_recover_1!
failed_2?
up_1?
up_1?
up_1?
failed_1?
failed_1?
failed_1?
up_2?
start_recover_2!
failed_2?
up_2?
start_recover_1!
failed_2?
up_1?
up_1?
failed_1?
failed_1?
start_recover_2!
up_2?
up_2?
failed_2?
up_1?
failed_2?
up_1?
failed_1?
failed_1?
up_2?
up_2?
failed_2?
failed_2?
up_1?
up_1?
Figure B.5: The recovery manager I/O-IMC model. Input = {failed 1, up 1, failed 2,
up 2}, Output = {start recover 1, start recover 2.}
B.2 I/O-IMC Model Specification with MIOA
We use a formal language called MIOA [71] to describe I/O-IMC models. MIOA
is based on the IOA language defined by N. Lynch et al. in [45] and for which,
in addition to interactive transitions, Markovian transitions have been added. The
MIOA language is used for describing concisely and formally an I/O-IMC. It provides
programming language constructs, such as control structures and data types, to
describe complex system model behaviors. A MIOA specification is structured as
shown in Figure B.6 and it is divided into 3 sections: i ) signature ii ) states, and iii )
transitions. In the signature section, input actions, output actions, internal actions
Chapter B. I/O-IMC Model Specification and Generation 147
and Markovian rates are specified
1
. The set of data types specified in the states
section determines the states of the I/O-IMC. Possible transitions of the I/O-IMC
are specified in a precondition-effect style. In order for the transition to take place,
the precondition has to hold. A precondition can be any expression that can be
evaluated either to true or false. Since I/O-IMCs are input-enabled, there is no
precondition specified for input actions.
1: IOIMC: <name>
2: signature:
3: input: <action_1>? ... <action_m>?
4: output: <action_1>! ... <action_n>!
5: internal: <action_1> ... <action_l>
6: markovian: <rate_1> ... <rate_k>
7: states:
8: <state definitions>
9: ...
10: transitions:
11: input: <action_i>?
12: effect:
13: ...
14: output: <action_i>!
15: precondition: <condition>
16: effect:
17: ...
18: internal: <action_i>
19: precondition: <condition>
20: effect:
21: ...
22: markovian: <rate_i>
23: precondition: <condition>
24: effect:
25: ...
Figure B.6: Basic building blocks of a MIOA specification.
The MIOA specification of the failure interface I/O-IMC model is shown in Fig-
ure B.7 as an example. The signature consists of actions that correspond to the
failed/up signals of the n modules belonging to the RU (Line 3), one output signal
for the ‘failed’ signal of the RU and one output signal for the ‘up’ signal of the RU
(Line 4). The states of the failure interface I/O-IMC are defined (Lines 5-7) using
Set and Bool data types, where ‘set’ (of size n) holds the names of the modules that
have failed and ‘rufailed’ indicates if the RU has or has not failed. The initial state
is also defined in the states section; for instance, the failure interface initial state is
composed of ‘set’ being empty (Line 6) and ‘rufailed’ being false (Line 7). There
are four kind of possible transitions; for example, the last transition (Lines 21-24)
indicates that an RU ‘up’ signal (up RU!) is output if ‘set’ is empty (i.e. all modules
1
MIOA actions correspond to I/O-IMC signals.
148 Chapter B. I/O-IMC Model Specification and Generation
are operational) and the RU has indeed failed at some point (i.e. ‘rufailed’ = true),
and the effect of the transition is to set ‘rufailed’ to false.
1: IOIMC: failure interface
2: signature:
3: input: failed(n:Int)? up(n:Int)?
4: output: failed_RU! up_RU!
5: states:
6: set: Set[n:Int] := {}
7: rufailed: Bool := false
8: transitions:
9: input: failed(i)?
10: effect:
11: if i set
12: add(i, set)
13: input: up(i)?
14: effect:
15: if i set
16: remove(i, set)
17: output: failed_RU!
18: precondition: set.size() > 0 rufailed = false
19: effect:
20: rufailed := true
21: output: up_RU!
22: precondition: set.size() = 0 rufailed = true
23: effect:
24: rufailed := false
Figure B.7: MIOA specification of the failure interface I/O-IMC model.
Once a MIOA specification/description of an I/O-IMC model has been laid down,
an algorithm can explore the state-space and generate the corresponding I/O-IMC
model (see Section B.3 for details). In fact, automatically deriving the I/O-IMC
models becomes essential as the models grow in size. For instance, the RM I/O-IMC
that coordinates 7 RUs has 27, 399 states and 397, 285 transitions. In our modeling
approach, the RM I/O-IMC size depends on the number of RUs, the failure/recovery
interface I/O-IMC size depends on the number of modules within the RU, and the
module I/O-IMC size is constant.
Chapter B. I/O-IMC Model Specification and Generation 149
B.3 I/O-IMC Model Generation
We have implemented the I/O-IMC model generation based on the MIOA spec-
ifications. Algorithm 3 shows how the states and transitions of an I/O-IMC are
generated based on its MIOA specification.
Algorithm 3 State space exploration and I/O-IMC generation based on MIOA
specification
1: stateSet ← {}
2: statesToBeProcessed ← {}
3: transitionSet ← {}
4: s ← intialize mioa.states
5: stateSet.add(s)
6: statesToBeProcessed.add(s)
7: while statesToBeProcessed = {} do
8: s ← statesToBeProcessed.removeLast()
9: for each mioa.transition t do
10: if s.check(t.precondition) = TRUE then
11: snew ← s.apply(t.effect)
12: if snew / ∈ stateSet then
13: stateSet.add(snew)
14: statesToBeProcessed.add(snew)
15: end if
16: transitionSet.add(s.id, snew.id, t.signal)
17: else
18: if t.signal = input then
19: transitionSet.add(s.id, s.id, t.signal)
20: end if
21: end if
22: end for
23: end while
The algorithm keeps track of the set of states, the states that are yet to be evaluated
and the set of transitions generated, which are all initialized as empty sets (Lines
1-3). An initial state is created which comprises the initialized state variables as
defined in the MIOA specification (Line 4). The initial state is added to the state
set and the set of states to be processed (Lines 5-6). Then, the algorithm iterates
over the states in the set of states to be processed until there is no state left to
be processed (Lines 7-8). For each state, all the transitions that are specified in
the MIOA specification are evaluated (Line 9). If the precondition of the transition
holds, a new state is created, on which the effects of the transition are reflected
(Lines 10-11). If the resulting state does not already exist, it is added to the set of
1: "X.bcg" = branching stochastic reduction of(
2: (hide start_recover_1 in
3: (hide failed_B,up_B,failed_C,up_C in
4: "fint_1.aut"
5: |[failed_1,failed_B,up_B,failed_C,up_C]|
6: (hide recovered_B in "module_B.aut"
7: |[recovered_B]|
8: (hide recovered_C in "module_C.aut"
9: |[recovered_C]|
10: "rint_1.aut"))
11: )
12: |[start_recover_1,failed_1,up_1]|
13: (hide start_recover_2 in
14: (hide failed_A,up_A in "fint_2.aut"
15: |[failed_2,failed_A,up_A]|
16: (hide recovered_A in "module_A.aut"
17: |[recovered_A]|
18: "rint_2.aut")
19: )
20: |[start_recover_2,failed_2,up_2]|
21: "recmgr.aut")));
Figure B.8: SVL script that composes the I/O-IMC models according to the example
decomposition as shown in Figure B.1
states and the set of states to be processed (Lines 12-15). Also, a new transition
from the original state to the resulting state is added to the set of transitions (Line
16). If the precondition of the transition does not hold for an input signal, then a self
transition is added to the set of transitions to ensure that the generated I/O-IMC
is input-enabled (Lines 17-22).
B.4 Composition and Analysis Script Generation
All the generated I/O-IMC models are output in the Aldebaran .aut file format so
that they can be processed with the CADP tool-set [42]. In addition to all the neces-
sary I/O-IMCs, a composition and analysis script is also generated, which conforms
to the CADP SVL scripting language. Figure B.8 shows a part of the generated
SVL script, which composes the I/O-IMC models for the example decomposition
with 3 modules and 2 RUs (See Figure B.1).
The generated composition script first composes the recovery interface I/O-IMCs of
RUs with the module I/O-IMCs that are comprised by the corresponding RUs. For
instance, in Figure B.8, the I/O-IMCs of modules B and C are composed with the
recovery interface I/O-IMC of RU 1 (Lines 6-10). Similarly, the module A I/O-IMC
150
is composed with the recovery interface of RU 2 (Lines 16-18). The resulting I/O-
IMCs are then composed with the failure interface I/O-IMCs of the corresponding
RUs. Finally, all the resulting I/O-IMCs are composed with the RM I/O-IMC. At
each composition step, the common input/output actions that are only relevant for
the I/O-IMCs being composed are “hidden”. That is, these actions are used for the
composition of I/O-IMCs and eliminated in the resulting I/O-IMC. The execution of
the generated SVL script within CADP composes and aggregates all the I/O-IMCs
based on the modules decomposition, reduces the final I/O-IMC into a CTMC, and
computes the steady-state availability.
151
152
Bibliography
[1] A.V. Aho, R. Sethi, and J.D. Ullman. Compilers, Principles, Techniques, and
Tools. Addison-Wesley, 1986.
[2] G. Arrango. Domain analysis methods. In Schafer, R. Prieto-Diaz, and
M. Matsumoto, editors, Software Reusability, pages 17–49. Ellis Horwood,
1994.
[3] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. Basic concepts
and taxonomy of dependable and secure computing. IEEE Transactions on
Dependable and Secure Computing, 1(1):11–33, 2004.
[4] F. Bachman, L. Bass, and M. Klein. Deriving Architectural Tactics: A Step
Toward Methodical Architectural Design. Technical Report CMU/SEI-2003-
TR-004, SEI, Pittsburgh, PA, USA, 2003.
[5] L. Bas, P. Clements, and R. Kazman. Software Architecture in Practice.
Addison-Wesley, 1998.
[6] W. Beaton and J. des Rivieres. Eclipse platform technical overview. Technical
report, IBM, 2006. http://www.eclipse.org/articles/.
[7] P. Bengtsson, N. Lassing, J. Bosch, and H. van Vliet. Architecture-level mod-
ifiability analysis (ALMA). Journal of Systems Software, 69(1-2):129–147,
2004.
[8] P.O. Bengtsson and J. Bosch. Architecture level prediction of software mainte-
nance. In Proceedings of the Third European Conference Software Maintenance
and Reengineering (CSMR), pages 139–147, Amsterdam, The Netherlands,
1999.
[9] D. Binkley. Source code analysis: A road map. In Proceedings of the Confer-
ence on Future of Software Engineering (FOSE), pages 104–119, Washington,
DC, USA, 2007.
153
[10] F. Boudali, P. Crouzen, and M. Stoelinga. A compositional semantics for
dynamic fault trees in terms of interactive markov chains. In Proceedings of
the 5th International Symposium on Automated Technology for Verification
and Analysis (ATVA), volume 4762 of Lecture Notes in Computer Science,
pages 441–456, Tokyo, Japan, 2007. Springer-Verlag.
[11] H. Boudali, P. Crouzen, B. R. Haverkort, M. Kuntz, and M.I.A. Stoelinga.
Architectural dependability evaluation with arcade. In Proceedings of the 38th
IEEE/IFIP International Conference on Dependable Systems and Networks
(DSN), pages 512–521, Anchorage, Alaska, USA, 2008.
[12] F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad, and M. Stal. Pattern-
Oriented Software Architecture, A System of Patterns. Wiley, 1996.
[13] R. Buskens and O.J. Gonzalez. Model-centric development of highly available
software systems. In R. de Lemos, C. Gacek, and A. Romanovsky, editors, Ar-
chitecting Dependable Systems IV, volume 4615 of Lecture Notes in Computer
Science, pages 409–433. Springer-Verlag, 2007.
[14] D.G. Cacuci. Sensitivity and Uncertainty Analysis: Theory, volume 1. Chap-
man & Hall, 2003.
[15] G. Candea, J. Cutler, and A. Fox. Improving availability with recursive micro-
reboots: A soft-state system case study. Performance Evaluation, 56(1-4):213–
248, 2004.
[16] G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox. Microreboot: A
technique for cheap recovery. In Proceedings of the 6th Symposium on Operat-
ing Systems Design and Implementation (OSDI), pages 31–44, San Francisco,
CA, USA, 2004.
[17] K.M. Chandy and L. Lamport. Distributed snapshots: determining global
states of distributed systems. ACM Transactions on Computer Systems,
3(1):63–75, 1985.
[18] P. Clements, F. Bachmann, L. Bass, D. Garlan, J. Ivers, R. Little, R. Nord,
and J. Stafford. Documenting Software Architectures: Views and Beyond.
Addison-Wesley, 2002.
[19] P. Clements, R. Kazman, and M. Klein. Evaluating software architectures:
methods and case studies. Addison-Wesley, 2002.
[20] B. Cornelissen. Dynamic analysis techniques for the reconstruction of archi-
tectural views. In Proceedings of the 14th Working Conference on Reverse
154
Engineering (WCRE), pages 281–284, Vancouver, BC, Canada, 2007. IEEE
Computer Society.
[21] K. Czarnecki and U. Eisenecker. Generative Programming: Methods, Tools,
and Applications. Addison-Wesley, 2000.
[22] O. Das and C.M. Woodside. The fault-tolerant layered queueing network
model for performability of distributed systems. In Proceedings of the Inter-
national Performance and Dependability Symposium (IPDS), pages 132–141,
Durham, NC, USA, 1998.
[23] E.M. Dashofy, A. van der Hoek, and R.N. Taylor. An infrastructure for the
rapid development of XML-based architecture description languages. In Pro-
ceedings of the 22rd International Conference on Software Engineering (ICSE),
pages 266–276, Orlando, FL, USA, 2002. ACM.
[24] E.M. Dashofy, A. van der Hoek, and R.N. Taylor. Towards architecture-
based self-healing systems. In Proceedings of the first workshop on Self-healing
systems (WOSS), pages 21–26, Charleston, SC, USA, 2002.
[25] P.A. de Castro Guerra, C.M.F. Rubira, A. Romanovsky, and R. de Lemos.
A dependable architecture for cots-based software systems using protective
wrappers. In R. de Lemos, C. Gacek, and A. Romanovsky, editors, Architecting
Dependable Systems, volume 2667, pages 144–166. Springer-Verlag, 2003.
[26] V. de Florio and C. Blondia. A survey of linguistic structures for application-
level fault tolerance. ACM Computing Surveys, 40(2):1–37, 2008.
[27] R. de Lemos, P. Guerra, and C. Rubira. A fault-tolerant architectural approach
for dependable systems. IEEE Software, 23(2):80–87, 2004.
[28] I. de Visser. Analyzing User Perceived Failure Severity in Consumer Electron-
ics Products. PhD thesis, Technische Universiteit Eindhoven, Eindhoven, The
Netherlands, 2008.
[29] G. Deconinck, J. Vounckx, R. Lauwereins, and J.A. Peperstraete. Survey of
backward error recovery techniques for multicomputers based on checkpointing
and rollback. In Proceedings of the IASTED International Conference on
Modeling and Simulation, pages 262–265, Pittsburgh, PA, USA, 1993.
[30] Department of Defense. Military standard: Procedures for performing a failure
modes, effects and criticality analysis. Standard MIL-STD-1629A, Department
of Defense, Washington DC, USA, 1984.
155
[31] L. Dobrica and E. Niemela. A survey on software architecture analysis meth-
ods. IEEE Transactions on Software Engineering, 28(7):638–654, 2002.
[32] J. Dollimore, T. Kindberg, and G. Coulouris. Distributed Systems: Concepts
and Design. Addison Wesley, 2005.
[33] J.C. Duenas, W.L. de Oliveira, and J.A. de la Puente. A software architecture
evaluation model. In Proceedings of the Second International ESPRIT ARES
Workshop, pages 148–157, Las Palmas de Gran Canaria, Spain, 1998.
[34] J.B. Dugan. Software system analysis using fault trees. In M.R. Lyu, edi-
tor, Handbook of Software Reliability Engineering, chapter 15, pages 615–659.
McGraw-Hill, 1996.
[35] J.B. Dugan and M.R. Lyu. Dependability modeling for fault-tolerant software
and systems. In M. R. Lyu, editor, Software Fault Tolerance, chapter 5, pages
109–138. John Wiley & Sons, 1995.
[36] M. Elnozahy, L. Alvisi, Y. Wang, and D.B. Johnson. A survey of rollback-
recovery protocols in message passing systems. ACM Computing Surveys,
34(3):375–408, 2002.
[37] T. Elrad, R. Fillman, and A. Bader. Aspect-oriented programming. Commu-
nications of the ACM, 44(10):29–32, 2001.
[38] Erlang/OTP design principles, 2009. http://www.erlang.org/doc/.
[39] C.F. Eubanks, S. Kmenta, and K. Ishil. Advanced failure modes and effects
analysis using behavior modeling. In Proceedings of the ASME Design Theory
and Methodology Conference (DETC), number DTM-02 in 97-DETC, Sacra-
mento, CA, USA, 1997.
[40] J. Fenlason and R. Stallman. GNU gprof: the GNU profiler. Free Software
Foundation, 2000. http://www.gnu.org/.
[41] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements
of Reusable Object-Oriented Design. Addison-Wesley, 1995.
[42] H. Garavel, F. Lang, R. Mateescu, and W. Serwe. CADP 2006: A toolbox for
the construction and analysis of distributed processes. In Proceedings of the
19th International Conference on Computer Aided Verification (CAV), volume
4590 of Lecture Notes in Computer Science, pages 158–163, Berlin, Germany,
2007. Springer-Verlag.
156
[43] D. Garlan, R. Allen, and J. OckerBloom. Architectural mismatch: why reuse
is so hard. IEEE Software, 12:17–26, 1995.
[44] D. Garlan, S. Cheng, and B. Schmerl. Increasing system dependability through
architecture-based self-repair. In R. de Lemos, C. Gacek, and A. Romanovsky,
editors, Architecting Dependable Systems, volume 2677 of Lecture Notes in
Computer Science, pages 61–89. Springer-Verlag, 2003.
[45] S. Garland, N. Lynch, J. Tauber, and M. Vaziri. IOA user guide and ref-
erence manual. Technical Report MIT-LCS-TR-961, MIT CSAI Laboratory,
Cambridge, MA, USA, 2004.
[46] S. Gokhale. Architecture-based software reliability analysis: Overview and
limitations. IEEE Transactions on Dependable and Secure Computing,
4(1):32–40, 2007.
[47] A. Gorbenko, V. Kharchenko, and O. Tarasyuk. FMEA- technique of web
services analysis and dependability ensuring. In M. Butler, C. Jones, A. Ro-
manovsky, and E. Troubitsyna, editors, Rigorous Development of Complex
Fault-Tolerant Systems, volume 4157 of Lecture Notes in Computer Science,
pages 153–167. Springer-Verlag, 2006.
[48] K. Goseva-Popstojanova and K.S. Trivedi. Architecture-based approach to re-
liability assessment of software systems. Performance Evaluation, 45(2-3):179–
204, 2001.
[49] V. Grassi, R. Mirandola, and A. Sabetta. An XML-based language to support
performance and reliability modeling and analysis in software architectures.
In R. Reussner, J. Mayer, J.A. Stafford, S. Overhage, S. Becker, and P.J.
Schroeder, editors, QoSA/SOQUA, volume 3712 of Lecture Notes in Computer
Science, pages 71–87. Springer-Verlag, 2005.
[50] J.M. Harris, J.L. Hirst, and M.J. Mossinghoff. Combinatorics and Graph
Theory. Springer-Verlag, 2000.
[51] B.R. Haverkort and K.S. Trivedi. Specification techniques for markov reward
models. Discrete Event Dynamic Systems, 3(2-3):219–247, 2006.
[52] A. Heddaya and A. Helal. Reliability, availability, dependability and per-
formability: A user-centered view. Technical Report BU-CS-97-011, Boston
University, 1997.
[53] J.N. Herder, H. Bos, B. Gras, P. Homburg, and A.S. Tanenbaum. Failure
resilience for device drivers. In Proceedings of the 37th Annual IEEE/IFIP
157
International Conference on Dependable Systems and Networks (DSN), pages
41–50, Edinburgh, UK, 2007.
[54] H. Hermanns. Interactive Markov Chains, volume 2428 of Lecture Notes in
Computer Science. Springer-Verlag, 2002.
[55] M. Hind. Pointer analysis havent we solved this problem yet? In Proceedings
of the ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software
Tools and Engineering (PASTE), pages 54–61, Snowbird, UT, USA, 2001.
[56] C. Hofmeister, R. Nord, and D. Soni. Applied Software Architecture. Addison-
Wesley, 1999.
[57] R.A. Howard. Dynamic probability systems. Volume 1: Markov models. John
Wiley & Sons, 1971.
[58] Y. Huang and C. Kintala. Software fault tolerance in the application layer. In
M. R. Lyu, editor, Software Fault Tolerance, chapter 10, pages 231–248. John
Wiley & Sons, 1995.
[59] G.C. Hunt, M. Aiken, M. Fhndrich, C. Hawblitzel, O. Hodson, J. Larus,
S. Levi, B. Steensgaard, D. Tarditi, and T. Wobber. Sealing OS processes
to improve dependability and safety. SIGOPS Operating Systems Review,
41(3):341–354, 2007.
[60] IEEE. IEEE standard glossary of software engineering terminology. Standard
610.12-1990, IEEE, 1990.
[61] A. Immonen and E. Niemel¨a. Survey of reliability and availability prediction
methods from the viewpoint of software architecture. Software and System
Modeling, 7(1):49–65, 2008.
[62] U. Isaksen, J.P. Bowen, and N. Nissanke. System and software safety in critical
systems. Technical Report RUCS/97/TR/062/A, The University of Reading,
UK, 1997.
[63] V. Issarny and J. Banatre. Architecture-based exception handling. In Proceed-
ings of the 34th Annual Hawaii International Conference on System Sciences
(HICSS), page 9058, Maui, Hawaii, USA, 2001.
[64] I. Jacobson, G. Booch, and J. Rumbaugh. The Unified Software Development
Process. Addison-Wesley, 1999.
158
[65] A. Jhumka, M. Hiller, and N. Suri. Component-based synthesis of depend-
able embedded software. In Proceedings of the 7th International Symposium
on Formal Techniques in Real-Time and Fault-Tolerant Systems (FTRTFT),
pages 111–128, Oldenburg, Germany, 2002.
[66] K.C. Kang, S. Cohen, J. Hess, W. Novak, and A. Peterson. Feature-oriented
domain analysis (FODA) feasibility study. Technical Report CMU/SEI-90-
TR-21, SEI, 1990.
[67] B. Kernighan and D. Ritchie. The C Programming Language. Prentice Hall,
1988.
[68] S. Krishnan and D. Gannon. Checkpoint and restart for distributed compo-
nents. In Proceedings of the 5th IEEE/ACM International Workshop on Grid
Computing (GRID), pages 281–288, Washington, DC, USA, 2004.
[69] P. Kruchten. The 4+1 view model of architecture. IEEE Software, 12(6):42–
50, 1995.
[70] P. Kruchten. The Rational Unified Process: An Introduction, Second Edition.
Addison-Wesley, 2000.
[71] G.W.M. Kuntz and B.R.H.M. Haverkort. Formal dependability engineering
with MIOA. Technical Report TR-CTIT-08-39, University of Twente, 2008.
[72] C.D. Lai, M. Xie, K.L. Poh, Y.S. Dai, and P. Yang. A model for availability
analysis of distributed software/hardware systems. Information and Software
Technology, 44(6):343–350, 2002.
[73] M. Lanus, L. Yin, and K.S. Trivedi. Hierarchical composition and aggregation
of state-based availability and performability models. IEEE Transactions on
Reliability, 52(1):44–52, 2003.
[74] J.-C. Laprie, J. Arlat, C. Beounes, and K. Kanoun. Architectural issues in
software fault tolerance. In M. R. Lyu, editor, Software Fault Tolerance, chap-
ter 3, pages 47–80. John Wiley & Sons, 1995.
[75] N. Lassing, D. Rijsenbrij, and H. van Vliet. On software architecture analysis
of flexibility, complexity of changes: Size isn’t everything. In Proceedings of
the Second Nordic Software Architecture Workshop (NOSA), pages 1103–1581,
Ronneby, Sweden, 1999.
[76] N.G. Leveson, S.S. Cha, and T.J. Shimeall. Safety verification of ada programs
using software fault trees. IEEE Software, 8(4):48–59, 1991.
159
[77] C. Lung, S. Bot, K. Kalaichelvan, and R. Kazman. An approach to soft-
ware architecture analysis for evolution and reusability. In Proceedings of
the Conference of the Center for Advanced Studies on Collaborative Research
(CASCON), pages 144–154, Toronto, Canada, 1997.
[78] M.W. Maier, D. Emery, and R. Hilliard. Software architecture: Introducing
IEEE Standard 1471. IEEE Computer, 34(4):107–109, 2001.
[79] I. Majzik and G. Huszerl. Towards dependability modeling of FT-CORBA
architectures. In Proceedings of the 4th European Dependable Computing Con-
ference, volume 2485 of Lecture Notes in Computer Science, pages 121–139.
Springer-Verlag, 2002.
[80] I. Majzik, A. Pataricza, and A. Bondavalli. Stochastic dependability analysis
of system architecture based on UML models. In R. de Lemos, C. Gacek,
and A. Romanovsky, editors, Architecting Dependable Systems, volume 2667
of Lecture Notes in Computer Science, pages 219–244. Springer-Verlag, 2003.
[81] D.F. McAllister and M.A. Vouk. Fault-tolerant software reliability engineering.
In M.R. Lyu, editor, Handbook of Software Reliability Engineering, chapter 14,
pages 567–613. McGraw-Hill, 1996.
[82] N. Medvidovic and R. N. Taylor. A classification and comparison framework
for software architecture description languages. IEEE Transactions on Soft-
ware Engineering, 26(1):70–93, 2000.
[83] B. S. Mitchell and S. Mancoridis. On the automatic modularization of software
systems using the bunch tool. IEEE Transactions on Software Eng., 32(3):193–
208, 2006.
[84] G. Molter. Integrating saam in domain-centric and reuse-based development
processes. In Proceedings of the Second Nordic Workshop Software Architecture
(NOSA), pages 1103–1581, Ronneby, Sweden, 1999.
[85] MPlayer official website, 2009. http://www.mplayerhq.hu/.
[86] N. Lynch and M. Tuttle. An Introduction to Input/output Automata. CWI
Quarterly, 2(3):219–246, 1989.
[87] M.L. Nelson. A survey of reverse engineering and program compre-
hension. Computing Research Repository (CoRR), abs/cs/0503068, 2005.
http://arxiv.org/abs/cs/0503068.
[88] N. Nethercote and J. Seward. Valgrind: a framework for heavyweight dynamic
binary instrumentation. SIGPLAN Notices, 42(6):89–100, 2007.
160
[89] Object Management Group. Fault tolerant CORBA. Technical Report OMG
Document formal/2001-09-29, Object Management Group, 2001.
[90] Y. Papadopoulos, D. Parker, and C. Grante. Automating the failure modes
and effects analysis of safety critical systems. In Proceedings of the 8th IEEE
International Symposium on High Assurance Systems Engineering (HASE),
pages 310–311, Tampa, FL, USA, 2004.
[91] E. Parchas and R. de Lemos. An architectural approach for improving avail-
ability in web services. In Proceedings of the Workshop on Architecting De-
pendable Systems (WADS), pages 37–41, Edinburgh, UK, 2004.
[92] M. Pistoia, S. Chandra, S.J. Fink, and E. Yahav. A survey of static analy-
sis methods for identifying security vulnerabilities in software systems. IBM
Systems Journal, 46(2):265–288, 2007.
[93] M. Rakic and N. Medvidovic. Increasing the confidence in off-the-shelf com-
ponents: a software connector-based approach. ACM SIGSOFT Software En-
gineering Notes, 26(3):11–18, 2001.
[94] B. Randell. System structure for software fault tolerance. In Proceedings of
the international conference on Reliable software, pages 437–449, Los Angeles,
CA, USA, 1975.
[95] B. Randell. Turing memorial lecture: Facing up to faults. Computer Journal,
43(2):95–106, 2000.
[96] R. Rashid, D. Julin, D. Orr, R. Sanzi, R. Baron, A. Forin, D. Colub, and
M. Jones. Mach: A system software kernel. In Proceedings of the 34th
Computer Society International Conference (COMPCON), pages 176–178, San
Francisco, CA, USA, 1989.
[97] F. Redmill. Exploring subjectivity in hazard analysis. IEE Engineering Man-
agement Journal, 12(3):139–144, 2002.
[98] D.J. Reifer. Software failure modes and effects analysis. IEEE Transactions
on Reliability, R-28(3):247–249, 1979.
[99] E. Roland and B. Moriarty. Failure mode and effects analysis. In System
Safety Engineering and Management. John Wiley & Sons, 1990.
[100] D. Rosenberg and K. Scott. Use Case Driven Object Modeling with UML: A
Practical Approach. Addison-Wesley, 1999.
[101] S.M. Ross. Introduction to Probability Models. Elsevier Inc., 2007.
161
[102] A.-E. Rugina, K. Kanoun, and M. Kaniche. A system dependability mod-
eling framework using AADL and GSPNs. In R. de Lemos, C. Gacek, and
A. Romanovsky, editors, Architecting Dependable Systems IV, volume 4615 of
Lecture Notes in Computer Science, pages 14–38. Springer-Verlag, 2006.
[103] J. Rumbaugh, I. Jacobson, and G. Booch. The Unified Modeling Language
Reference Manual. Addison-Wesley, 1998.
[104] J. Rumbaugh, I. Jacobson, and G. Booch, editors. The Unified Modeling
Language reference manual. Addison-Wesley, 1999.
[105] F. Ruskey. Simple combinatorial gray codes constructed by reversing sublists.
In Proceedings of the 4th International Symposium on Algorithms and Com-
putation (ISAAC), volume 762 of Lecture Notes in Computer Science, pages
201–208. Springer-Verlag, 1993.
[106] F. Ruskey. Combinatorial Generation. University of Victoria, Victoria, BC,
Canada, 2003. Manuscript CSC-425/520.
[107] T. Saridakis. Design patterns for graceful degradation. In Proceedings of the
10th European Conference on Pattern Languages and Programs (EuroPloP),
pages E7/1–18, Irsee, Germany, 2005.
[108] B. A. Schroeder. On-line monitoring: A tutorial. IEEE Computer, 28:72–78,
1995.
[109] M. Shaw and D. Garlan. Software Architecture: Perspectives on an Emerging
Discipline. Prentice Hall, 1996.
[110] H. Sozer and B. Tekinerdogan. Introducing recovery style for modeling and
analyzing system recovery. In Proceedings of the 7th Working IEEE/IFIP
Conference on Software Architecture (WICSA), pages 167–176, Vancouver,
BC, Canada, 2008.
[111] H. Sozer, B. Tekinerdogan, and M. Aksit. Extending failure modes and effects
analysis approach for reliability analysis at the software architecture design
level. In R. de Lemos, C. Gacek, and A. Romanovsky, editors, Architecting
Dependable Systems IV, volume 4615 of Lecture Notes in Computer Science,
pages 409–433. Springer-Verlag, 2007.
[112] H. Sozer, B. Tekinerdogan, and M. Aksit. FLORA: A framework for decom-
posing software architecture to introduce local recovery. Software: Practice
and Experience, 2009. Accepted.
162
[113] F. Tartanoglu, V. Issarny, A. Romanovsky, and N. Levy. Dependability in the
web services architectures. In R. de Lemos, C. Gacek, and A. Romanovsky,
editors, Architecting Dependable Systems, volume 2677 of Lecture Notes in
Computer Science, pages 90–109. Springer-Verlag, 2003.
[114] R.N. Taylor, N. Medvidovic, K.M. Anderson, Jr. E.J. Whitehead, J.E. Rob-
bins, K.A. Nies, P. Oreizy, and D.L. Dubrow. A component- and message-
based architectural style for GUI software. IEEE Transactions on Software
Engineering, 22(6):390–406, 1996.
[115] B. Tekinerdogan. Synthesis-Based Software Architecture Design. PhD thesis,
University of Twente, Enschede, The Netherlands, 2000.
[116] B. Tekinerdogan. Asaam: Aspectual software architecture analysis method.
In Proceedings of the 4th Working IEEE/IFIP Conference on Software Archi-
tecture (WICSA), pages 5–14, Oslo, Norway, 2004.
[117] B. Tekinerdogan, H. Sozer, and M. Aksit. Software architecture reliability
analysis using failure scenarios. In Proceedings of the 5th Working IEEE/IFIP
Conference on Software Architecture (WICSA), pages 203–204, Pittsburgh,
PA, USA, 2005.
[118] B. Tekinerdogan, H. Sozer, and M. Aksit. Software architecture reliability
analysis using failure scenarios. Journal of Systems and Software, 81(4):558–
575, 2008.
[119] R. Tischler, R. Schaufler, and C. Payne. Static analysis of programs as an aid
to debugging. SIGPLAN Notices, 18(8):155–158, 1983.
[120] Trader project, ESI, 2009. http://www.esi.nl.
[121] K.S. Trivedi and M. Malhotra. Reliability and performability techniques and
tools: A survey. In In Proceedings of the 7th Conference on Measurement,
Modeling, and Evaluation of Computer and Communication Systems (MMB),
pages 27–48, Aachen, Germany, 1993.
[122] K. Vaidyanathan and K.S. Trivedi. A comprehensive model for software reju-
venation. IEEE Transactions on Dependable and Secure Computing, 2(2):124–
137, 2005.
[123] R.F. van der Lans. The SQL Standard: a Complete Guide Reference. Prentice
Hall, 1989.
163
[124] M. Wallace. Modular architectural representation and analysis of fault propa-
gation and transformation. In Proceedings of the Second International Work-
shop on Formal Foundations of Embedded Software and Component-based Soft-
ware Architectures (FESCA), pages 53–71, Edinburgh, UK, 2005.
[125] E. Woods and N. Rozanski. Using architectural perspectives. In Proceedings of
the 5th Working IEEE/IFIP Conference on Software Architecture (WICSA),
pages 25–35, Pittsburgh, PA, USA, 2005.
[126] S. Yacoub, B. Cukic, and H.H. Ammar. A scenario-based reliability analysis
approach for component-based software. IEEE Transactions on Reliability,
53(14):465–480, 2004.
[127] Q. Yang, J.J. Li, and D.M. Weiss. A survey of coverage based testing tools.
In Proceedings of the International Workshop on Automation of Software Test
(AST), pages 99–103, Shanghai, China, 2006.
[128] J. Zhou and T. Stalhane. Using FMEA for early robustness analysis of web-
based systems. In Proceedings of the 28th International Computer Software
and Applications Conference (COMPSAC), pages 28–29, Hong Kong, China,
2004.
164
Samenvatting
De toenemende grootte en de complexiteit van softwaresystemen vermoeilijken dat
alle mogelijke fouten kunnen worden verhinderd dan wel tijdig worden verwijderd.
De fouten die in het systeem blijven bestaan kunnen er uiteindelijk toe leiden dat
het gehele systeem faalt. Teneinde hiermee om te kunnen gaan zijn er in de lit-
eratuur verscheidene zogenaamde fouten-tolerantietechnieken ge¨ıntroduceerd. Deze
technieken gaan ervan uit dat niet alle fouten verhinderd kunnen worden, en staan
daarom toleranter tegenover mogelijke fouten in het systeem. Daarbij zorgen deze
technieken ervoor dat het systeem zich kan herstellen indien fouten in het systeem
zijn gedetecteerd. Hoewel deze fouten-tolerantietechnieken zeer bruikbaar kunnen
zijn is het echter niet altijd eenvoudig ze in het systeem te integreren. In dit proef-
schrift richten we onze aandacht op de volgende problemen bij het ontwerpen van
fouten-tolerante systemen.
Ten eerste behandelen de huidige fouten-tolerantietechnieken niet expliciet fouten
die door de eindgebruiker waarneembaar zijn. Ten tweede zijn bestaande archi-
tectuur stijlen niet geschikt voor het specificeren, communiceren en analyseren van
ontwerpbeslissingen die direct zijn gerelateerd aan fouten tolerantie. Ten derde zijn
er geen adequate analyse technieken die de impact van de introductie van fouten-
tolerantie technieken op de functionele decompositie van de software architectuur
kunnen meten. Ten slotte, vereist het grote inspanning om een fouten-tolerante
ontwerp te realiseren en te onderhouden.
Om het eerste probleem aan te pakken introduceren we de methode SARAH die
gebaseerd is op scenario-gebaseerde analyse technieken en de bestaande betrouw-
baarheidstechnieken (FMEA en FTA) ten einde een software architectuur te anal-
yseren met betrekking tot betrouwbaarheid. Naast het integreren van architectuur
analyse methoden met bestaande betrouwbaarheidstechnieken, richt SARAH zich
uitdrukkelijk op het identificeren van mogelijke fouten die waarneembaar zijn door
de eindgebruiker, en hiermee het identificeren van gevoelige punten, zonder hierbij
een implementatie van het systeem te vereisen.
165
Voor het specificeren van fouten-tolerantie aspecten van de software architectuur
introduceren wij de zogenaamde Recovery Style. Deze wordt gebruikt voor het
communiceren en analyseren van architectuurontwerpbeslissingen en voor het on-
dersteunen van gedetailleerd ontwerp met betrekking tot het herstellen van het sys-
teem.
Voor het oplossen van het derde probleem introduceren we een systematische meth-
ode voor het optimaliseren van de software architectuur voor lokale herstelling van
fouten (local recovery), opdat de beschikbaarheid van het systeem wordt verruimd.
Om de methode te ondersteunen, hebben wij een gentegreerd reeks van gereed-
schappen ontwikkeld die gebruik maken van optimalisatietechnieken, toestands-
gebaseerde analytische modellen (CTMCs) en dynamische analyse van het systeem.
De methode ondersteunt de volgende aspecten: modelleren van de ontwerpruimte
van de mogelijke architectuur decompositie alternatieven, het reduceren van de on-
twerpruimte op basis van de domein en stakeholder constraints, en het balanceren
en selectie van de alternatieven met betrekking tot de beschikbaarheid en prestatie
(performance) metrieken.
Om de inspanningen voor de ontwikkeling en onderhoud van fouten-tolerante as-
pecten te reduceren is er een framework, FLORA, ontwikkeld die de decompositie en
de implementatie van softwarearchitectuur voor lokale herstelling ondersteunt. Het
framework verschaft herbruikbare abstracties voor het defini¨eren van herstelbare
eenheden en voor het integreren van de noodzakelijke co¨ordinatie en communicatie
protocollen die vereist zijn voor herstelling van fouten.
166
Index
AADL, 114
AOP, 123, 130
Arch-Studio, 84, 86
Arch-Edit, 85, 86
xADL, 86, 87
analytic parameter, 87
constructive parameter, 87
reliability interface, 87
architectural tactics, 5
availability, 1, 4, 10, 75, 80
available, 1
backward recovery, 13
check-pointing, 13, 123
log-based recovery, 13, 123
recovery blocks, 13
stable storage, 13
CADP, 76, 85
SVL scripting language, 103, 150
code coverage, 75
Compensation, 13
Criticality, 86–88
CTMC, 75, 101
data dependency, 80
dependability, 9
DTMC, 75
DTV, 2, 24
dynamic analysis, 74
error, 1, 9
failure, 1, 9
failure interface, 102, 141
fault, 1, 9
fault forecasting, 12
fault prevention, 11
fault removal, 11
fault tolerance, 11
error detection, 11
error recovery, 11, 12
error handling, 12
fault handling, 12
fault tree, 23, 33, 142
fault-tolerant design, 1
FCFS, 102, 145
feature diagram, 11
FLORA, 6, 109, 110, 118
Communication Manager, 119
Recoverable Unit, 119
Recovery Manager, 110, 119
stable storage, 125
FMEA, 5, 22
failure cause, 22
failure effect, 22
failure mode, 22
FMECA, 22, 54
severity, 22
forward recovery, 13
exception handling, 13
graceful degradation, 14
FT-CORBA, 114, 131
FTA, 23
function dependency, 80
global recovery, 14, 109
GNU gprof, 75, 85, 92
GNU nm, 93
167
gnuplot, 85, 98
I/O-IMC, 76, 77, 100
compositional aggregation, 78
input-enabled, 77, 142, 147, 150
interactive transitions, 77
input actions, 77
internal actions, 77
output actions, 77
Markovian transitions, 77
integrity, 10
IPC, 120, 123
local recovery, 4, 14, 62, 109
local recovery style, 6, 57
maintainability, 10
MIOA, 146, 149
input action, 146
input signal, 150
internal action, 146
Markovian rate, 147
output action, 146
signature, 146
states, 146
transitions, 146
module, 141
Module Dependency Graph, 93
MPlayer, 60, 66, 78, 88, 109
Demuxer, 60
Gui, 61
Libao, 61
Libmpcodecs, 60
Libvo, 60
Mplayer, 60
RU AUDIO, 66
RU GUI, 66
RU MPCORE, 66
Stream, 60
MTTF, 75, 86–88
MTTR, 75, 86–88
N-version programming, 13
performability, 11
program analysis, 74
recoverable unit, 4, 5, 118
Recovery Designer, 84, 98
Analytical Model Generator, 85, 100
Constraint Evaluator, 85, 89, 90
Deployment Constraints, 89
Domain Constraints, 89
Perf. Feasibility Constraints, 89,
90
Data Dependency Analyzer, 85, 95,
97, 98
DAI, 95
data access profile, 75, 95
MDDE, 95
Design Alternative Gen., 85, 91, 92,
97, 103
Function Dependency Analyzer, 85,
92, 98
func. depend. overhead, 93–95, 98,
99, 109
function call profile, 75, 92
MFDE, 92, 127
Optimizer, 85, 106
exhaustive search, 107
hill-climbing algorithm, 107
neighbor decompositions, 107
recovery interface, 103, 141
recovery manager, 101, 141
recovery style, 5, 57, 60, 69
applies-recovery-action-to, 63
conveys-information-to, 63
local recovery style, 60, 64
communication manager, 64–66
kills, 64
notifies-error-to, 64
provides-diagnosis-to, 64
recovery manager, 64–66
restarts, 64
sends-queued message-to, 64
168
non-recoverable unit, 63, 64
recoverable unit, 60, 63, 64
recovery view, 63, 66
reliability, 1, 10
reliable, 1
safety, 2, 10
safety-critical systems, 2
SARAH, 5, 24, 27
failure severity, 35
architectural element level analysis,
42
architectural element spot, 48
architecture-level analysis, 39
failure analysis report, 45
failure domain model, 29
failure scenario, 24, 28
fault tree set, 5, 24, 33
sensitive elements, 39
sensitivity analysis, 37
WPF, 39
set partitioning problem, 79
Bell number, 79
lexicographic algorithm, 91
partition, 79
restricted growth strings, 91
Stirling numbers of the second kind,
79
software architecture, 2, 14
architectural pattern, 18
architectural style, 18
architectural tactics, 17, 70
architecture description, 14
ADL, 16
IEEE 1471, 14, 58, 59, 69
Views and Beyond approach, 58, 69
concern, 15
perspectives, 70
stakeholder, 15
style, 59
view, 15, 58
viewpoint, 15, 58
viewtype, 59
allocation viewtype, 59
component & connector viewtype,
59
module viewtype, 59, 69
software architecture analysis
direct scenarios, 20
indirect scenarios, 20
measuring techniques, 17
questioning techniques, 17
scenario-based analysis, 20
SQL, 94
state space explosion, 76
state-based models, 75
static analysis, 74
TRADER, 2
UML, 14, 16
aggregation, 15
association, 15
cardinality, 15
role, 15
Valgrind, 75, 85, 95
169
Titles in the IPA Dissertation Series since 2002
M.C. van Wezel. Neural Networks
for Intelligent Data Analysis: theoret-
ical and experimental aspects. Faculty
of Mathematics and Natural Sciences,
UL. 2002-01
V. Bos and J.J.T. Kleijn. Formal
Specification and Analysis of Industrial
Systems. Faculty of Mathematics and
Computer Science and Faculty of Me-
chanical Engineering, TU/e. 2002-02
T. Kuipers. Techniques for Under-
standing Legacy Software Systems. Fac-
ulty of Natural Sciences, Mathematics
and Computer Science, UvA. 2002-03
S.P. Luttik. Choice Quantification in
Process Algebra. Faculty of Natural Sci-
ences, Mathematics, and Computer Sci-
ence, UvA. 2002-04
R.J. Willemen. School Timetable
Construction: Algorithms and Com-
plexity. Faculty of Mathematics and
Computer Science, TU/e. 2002-05
M.I.A. Stoelinga. Alea Jacta Est:
Verification of Probabilistic, Real-time
and Parametric Systems. Faculty of
Science, Mathematics and Computer
Science, KUN. 2002-06
N. van Vugt. Models of Molecular
Computing. Faculty of Mathematics
and Natural Sciences, UL. 2002-07
A. Fehnker. Citius, Vilius, Melius:
Guiding and Cost-Optimality in Model
Checking of Timed and Hybrid Systems.
Faculty of Science, Mathematics and
Computer Science, KUN. 2002-08
R. van Stee. On-line Scheduling and
Bin Packing. Faculty of Mathematics
and Natural Sciences, UL. 2002-09
D. Tauritz. Adaptive Information Fil-
tering: Concepts and Algorithms. Fac-
ulty of Mathematics and Natural Sci-
ences, UL. 2002-10
M.B. van der Zwaag. Models and
Logics for Process Algebra. Faculty
of Natural Sciences, Mathematics, and
Computer Science, UvA. 2002-11
J.I. den Hartog. Probabilistic Exten-
sions of Semantical Models. Faculty of
Sciences, Division of Mathematics and
Computer Science, VUA. 2002-12
L. Moonen. Exploring Software Sys-
tems. Faculty of Natural Sciences,
Mathematics, and Computer Science,
UvA. 2002-13
J.I. van Hemert. Applying Evolution-
ary Computation to Constraint Satis-
faction and Data Mining. Faculty of
Mathematics and Natural Sciences, UL.
2002-14
S. Andova. Probabilistic Process Alge-
bra. Faculty of Mathematics and Com-
puter Science, TU/e. 2002-15
Y.S. Usenko. Linearization in µCRL.
Faculty of Mathematics and Computer
Science, TU/e. 2002-16
J.J.D. Aerts. Random Redundant
Storage for Video on Demand. Faculty
of Mathematics and Computer Science,
TU/e. 2003-01
M. de Jonge. To Reuse or To
Be Reused: Techniques for component
composition and construction. Faculty
of Natural Sciences, Mathematics, and
Computer Science, UvA. 2003-02
J.M.W. Visser. Generic Traversal
over Typed Source Code Representa-
tions. Faculty of Natural Sciences,
Mathematics, and Computer Science,
UvA. 2003-03
S.M. Bohte. Spiking Neural Networks.
Faculty of Mathematics and Natural
Sciences, UL. 2003-04
T.A.C. Willemse. Semantics and
Verification in Process Algebras with
Data and Timing. Faculty of Math-
ematics and Computer Science, TU/e.
2003-05
S.V. Nedea. Analysis and Simula-
tions of Catalytic Reactions. Faculty
of Mathematics and Computer Science,
TU/e. 2003-06
M.E.M. Lijding. Real-time Schedul-
ing of Tertiary Storage. Faculty of
Electrical Engineering, Mathematics &
Computer Science, UT. 2003-07
H.P. Benz. Casual Multimedia Pro-
cess Annotation – CoMPAs. Faculty of
Electrical Engineering, Mathematics &
Computer Science, UT. 2003-08
D. Distefano. On Modelchecking the
Dynamics of Object-based Software: a
Foundational Approach. Faculty of
Electrical Engineering, Mathematics &
Computer Science, UT. 2003-09
M.H. ter Beek. Team Automata –
A Formal Approach to the Modeling of
Collaboration Between System Compo-
nents. Faculty of Mathematics and
Natural Sciences, UL. 2003-10
D.J.P. Leijen. The λ Abroad – A
Functional Approach to Software Com-
ponents. Faculty of Mathematics and
Computer Science, UU. 2003-11
W.P.A.J. Michiels. Performance Ra-
tios for the Differencing Method. Fac-
ulty of Mathematics and Computer Sci-
ence, TU/e. 2004-01
G.I. Jojgov. Incomplete Proofs and
Terms and Their Use in Interactive
Theorem Proving. Faculty of Mathe-
matics and Computer Science, TU/e.
2004-02
P. Frisco. Theory of Molecular Com-
puting – Splicing and Membrane sys-
tems. Faculty of Mathematics and Nat-
ural Sciences, UL. 2004-03
S. Maneth. Models of Tree Transla-
tion. Faculty of Mathematics and Nat-
ural Sciences, UL. 2004-04
Y. Qian. Data Synchronization and
Browsing for Home Environments. Fac-
ulty of Mathematics and Computer Sci-
ence and Faculty of Industrial Design,
TU/e. 2004-05
F. Bartels. On Generalised Coinduc-
tion and Probabilistic Specification For-
mats. Faculty of Sciences, Division
of Mathematics and Computer Science,
VUA. 2004-06
L. Cruz-Filipe. Constructive Real
Analysis: a Type-Theoretical Formal-
ization and Applications. Faculty of
Science, Mathematics and Computer
Science, KUN. 2004-07
E.H. Gerding. Autonomous Agents
in Bargaining Games: An Evolutionary
Investigation of Fundamentals, Strate-
gies, and Business Applications. Fac-
ulty of Technology Management, TU/e.
2004-08
N. Goga. Control and Selection Tech-
niques for the Automated Testing of Re-
active Systems. Faculty of Mathematics
and Computer Science, TU/e. 2004-09
M. Niqui. Formalising Exact Arith-
metic: Representations, Algorithms and
Proofs. Faculty of Science, Mathemat-
ics and Computer Science, RU. 2004-10
A. L¨oh. Exploring Generic Haskell.
Faculty of Mathematics and Computer
Science, UU. 2004-11
I.C.M. Flinsenberg. Route Planning
Algorithms for Car Navigation. Faculty
of Mathematics and Computer Science,
TU/e. 2004-12
R.J. Bril. Real-time Scheduling for
Media Processing Using Conditionally
Guaranteed Budgets. Faculty of Math-
ematics and Computer Science, TU/e.
2004-13
J. Pang. Formal Verification of Dis-
tributed Systems. Faculty of Sciences,
Division of Mathematics and Computer
Science, VUA. 2004-14
F. Alkemade. Evolutionary Agent-
Based Economics. Faculty of Technol-
ogy Management, TU/e. 2004-15
E.O. Dijk. Indoor Ultrasonic Position
Estimation Using a Single Base Station.
Faculty of Mathematics and Computer
Science, TU/e. 2004-16
S.M. Orzan. On Distributed Verifi-
cation and Verified Distribution. Fac-
ulty of Sciences, Division of Mathemat-
ics and Computer Science, VUA. 2004-
17
M.M. Schrage. Proxima - A
Presentation-oriented Editor for Struc-
tured Documents. Faculty of Math-
ematics and Computer Science, UU.
2004-18
E. Eskenazi and A. Fyukov. Quan-
titative Prediction of Quality Attributes
for Component-Based Software Archi-
tectures. Faculty of Mathematics and
Computer Science, TU/e. 2004-19
P.J.L. Cuijpers. Hybrid Process Alge-
bra. Faculty of Mathematics and Com-
puter Science, TU/e. 2004-20
N.J.M. van den Nieuwelaar. Super-
visory Machine Control by Predictive-
Reactive Scheduling. Faculty of Me-
chanical Engineering, TU/e. 2004-21
E.
´
Abrah´am. An Assertional Proof
System for Multithreaded Java -Theory
and Tool Support- . Faculty of Mathe-
matics and Natural Sciences, UL. 2005-
01
R. Ruimerman. Modeling and Re-
modeling in Bone Tissue. Faculty of
Biomedical Engineering, TU/e. 2005-
02
C.N. Chong. Experiments in Rights
Control - Expression and Enforce-
ment. Faculty of Electrical Engineering,
Mathematics & Computer Science, UT.
2005-03
H. Gao. Design and Verification of
Lock-free Parallel Algorithms. Faculty
of Mathematics and Computing Sci-
ences, RUG. 2005-04
H.M.A. van Beek. Specification and
Analysis of Internet Applications. Fac-
ulty of Mathematics and Computer Sci-
ence, TU/e. 2005-05
M.T. Ionita. Scenario-Based System
Architecting - A Systematic Approach
to Developing Future-Proof System Ar-
chitectures. Faculty of Mathematics
and Computing Sciences, TU/e. 2005-
06
G. Lenzini. Integration of Analy-
sis Techniques in Security and Fault-
Tolerance. Faculty of Electrical Engi-
neering, Mathematics & Computer Sci-
ence, UT. 2005-07
I. Kurtev. Adaptability of Model
Transformations. Faculty of Electrical
Engineering, Mathematics & Computer
Science, UT. 2005-08
T. Wolle. Computational Aspects of
Treewidth - Lower Bounds and Network
Reliability. Faculty of Science, UU.
2005-09
O. Tveretina. Decision Procedures
for Equality Logic with Uninterpreted
Functions. Faculty of Mathematics and
Computer Science, TU/e. 2005-10
A.M.L. Liekens. Evolution of Fi-
nite Populations in Dynamic Environ-
ments. Faculty of Biomedical Engineer-
ing, TU/e. 2005-11
J. Eggermont. Data Mining using
Genetic Programming: Classification
and Symbolic Regression. Faculty of
Mathematics and Natural Sciences, UL.
2005-12
B.J. Heeren. Top Quality Type Er-
ror Messages. Faculty of Science, UU.
2005-13
G.F. Frehse. Compositional Verifi-
cation of Hybrid Systems using Simu-
lation Relations. Faculty of Science,
Mathematics and Computer Science,
RU. 2005-14
M.R. Mousavi. Structuring Struc-
tural Operational Semantics. Faculty
of Mathematics and Computer Science,
TU/e. 2005-15
A. Sokolova. Coalgebraic Analysis of
Probabilistic Systems. Faculty of Math-
ematics and Computer Science, TU/e.
2005-16
T. Gelsema. Effective Models for the
Structure of pi-Calculus Processes with
Replication. Faculty of Mathematics
and Natural Sciences, UL. 2005-17
P. Zoeteweij. Composing Constraint
Solvers. Faculty of Natural Sciences,
Mathematics, and Computer Science,
UvA. 2005-18
J.J. Vinju. Analysis and Transfor-
mation of Source Code by Parsing and
Rewriting. Faculty of Natural Sciences,
Mathematics, and Computer Science,
UvA. 2005-19
M.Valero Espada. Modal Abstraction
and Replication of Processes with Data.
Faculty of Sciences, Division of Math-
ematics and Computer Science, VUA.
2005-20
A. Dijkstra. Stepping through Haskell.
Faculty of Science, UU. 2005-21
Y.W. Law. Key management and
link-layer security of wireless sensor
networks: energy-efficient attack and
defense. Faculty of Electrical Engineer-
ing, Mathematics & Computer Science,
UT. 2005-22
E. Dolstra. The Purely Functional
Software Deployment Model. Faculty of
Science, UU. 2006-01
R.J. Corin. Analysis Models for Se-
curity Protocols. Faculty of Electrical
Engineering, Mathematics & Computer
Science, UT. 2006-02
P.R.A. Verbaan. The Computational
Complexity of Evolving Systems. Fac-
ulty of Science, UU. 2006-03
K.L. Man and R.R.H. Schiffelers.
Formal Specification and Analysis of
Hybrid Systems. Faculty of Mathe-
matics and Computer Science and Fac-
ulty of Mechanical Engineering, TU/e.
2006-04
M. Kyas. Verifying OCL Specifica-
tions of UML Models: Tool Support and
Compositionality. Faculty of Mathe-
matics and Natural Sciences, UL. 2006-
05
M. Hendriks. Model Checking Timed
Automata - Techniques and Applica-
tions. Faculty of Science, Mathematics
and Computer Science, RU. 2006-06
J. Ketema. B¨ ohm-Like Trees for
Rewriting. Faculty of Sciences, VUA.
2006-07
C.-B. Breunesse. On JML: topics
in tool-assisted verification of JML pro-
grams. Faculty of Science, Mathematics
and Computer Science, RU. 2006-08
B. Markvoort. Towards Hybrid
Molecular Simulations. Faculty of
Biomedical Engineering, TU/e. 2006-
09
S.G.R. Nijssen. Mining Structured
Data. Faculty of Mathematics and Nat-
ural Sciences, UL. 2006-10
G. Russello. Separation and Adap-
tation of Concerns in a Shared Data
Space. Faculty of Mathematics and
Computer Science, TU/e. 2006-11
L. Cheung. Reconciling Nondetermin-
istic and Probabilistic Choices. Faculty
of Science, Mathematics and Computer
Science, RU. 2006-12
B. Badban. Verification techniques for
Extensions of Equality Logic. Faculty of
Sciences, Division of Mathematics and
Computer Science, VUA. 2006-13
A.J. Mooij. Constructive formal
methods and protocol standardization.
Faculty of Mathematics and Computer
Science, TU/e. 2006-14
T. Krilavicius. Hybrid Techniques for
Hybrid Systems. Faculty of Electrical
Engineering, Mathematics & Computer
Science, UT. 2006-15
M.E. Warnier. Language Based Secu-
rity for Java and JML. Faculty of Sci-
ence, Mathematics and Computer Sci-
ence, RU. 2006-16
V. Sundramoorthy. At Home In Ser-
vice Discovery. Faculty of Electrical
Engineering, Mathematics & Computer
Science, UT. 2006-17
B. Gebremichael. Expressivity of
Timed Automata Models. Faculty of
Science, Mathematics and Computer
Science, RU. 2006-18
L.C.M. van Gool. Formalising In-
terface Specifications. Faculty of Math-
ematics and Computer Science, TU/e.
2006-19
C.J.F. Cremers. Scyther - Semantics
and Verification of Security Protocols.
Faculty of Mathematics and Computer
Science, TU/e. 2006-20
J.V. Guillen Scholten. Mobile Chan-
nels for Exogenous Coordination of Dis-
tributed Systems: Semantics, Imple-
mentation and Composition. Faculty of
Mathematics and Natural Sciences, UL.
2006-21
H.A. de Jong. Flexible Heterogeneous
Software Systems. Faculty of Natural
Sciences, Mathematics, and Computer
Science, UvA. 2007-01
N.K. Kavaldjiev. A run-time recon-
figurable Network-on-Chip for stream-
ing DSP applications. Faculty of
Electrical Engineering, Mathematics &
Computer Science, UT. 2007-02
M. van Veelen. Considerations on
Modeling for Early Detection of Ab-
normalities in Locally Autonomous Dis-
tributed Systems. Faculty of Mathe-
matics and Computing Sciences, RUG.
2007-03
T.D. Vu. Semantics and Applications
of Process and Program Algebra. Fac-
ulty of Natural Sciences, Mathematics,
and Computer Science, UvA. 2007-04
L. Brand´an Briones. Theories for
Model-based Testing: Real-time and
Coverage. Faculty of Electrical Engi-
neering, Mathematics & Computer Sci-
ence, UT. 2007-05
I. Loeb. Natural Deduction: Shar-
ing by Presentation. Faculty of Sci-
ence, Mathematics and Computer Sci-
ence, RU. 2007-06
M.W.A. Streppel. Multifunctional
Geometric Data Structures. Faculty
of Mathematics and Computer Science,
TU/e. 2007-07
N. Trˇcka. Silent Steps in Transition
Systems and Markov Chains. Faculty
of Mathematics and Computer Science,
TU/e. 2007-08
R. Brinkman. Searching in encrypted
data. Faculty of Electrical Engineering,
Mathematics & Computer Science, UT.
2007-09
A. van Weelden. Putting types to
good use. Faculty of Science, Math-
ematics and Computer Science, RU.
2007-10
J.A.R. Noppen. Imperfect Infor-
mation in Software Development Pro-
cesses. Faculty of Electrical Engineer-
ing, Mathematics & Computer Science,
UT. 2007-11
R. Boumen. Integration and Test
plans for Complex Manufacturing Sys-
tems. Faculty of Mechanical Engineer-
ing, TU/e. 2007-12
A.J. Wijs. What to do Next?:
Analysing and Optimising System Be-
haviour in Time. Faculty of Sciences,
Division of Mathematics and Computer
Science, VUA. 2007-13
C.F.J. Lange. Assessing and Improv-
ing the Quality of Modeling: A Series of
Empirical Studies about the UML. Fac-
ulty of Mathematics and Computer Sci-
ence, TU/e. 2007-14
T. van der Storm. Component-based
Configuration, Integration and Deliv-
ery. Faculty of Natural Sciences, Math-
ematics, and Computer Science,UvA.
2007-15
B.S. Graaf. Model-Driven Evolu-
tion of Software Architectures. Faculty
of Electrical Engineering, Mathematics,
and Computer Science, TUD. 2007-16
A.H.J. Mathijssen. Logical Calculi
for Reasoning with Binding. Faculty
of Mathematics and Computer Science,
TU/e. 2007-17
D. Jarnikov. QoS framework for
Video Streaming in Home Networks.
Faculty of Mathematics and Computer
Science, TU/e. 2007-18
M. A. Abam. New Data Structures
and Algorithms for Mobile Data. Fac-
ulty of Mathematics and Computer Sci-
ence, TU/e. 2007-19
W. Pieters. La Volont´e Machinale:
Understanding the Electronic Voting
Controversy. Faculty of Science, Math-
ematics and Computer Science, RU.
2008-01
A.L. de Groot. Practical Automa-
ton Proofs in PVS. Faculty of Science,
Mathematics and Computer Science,
RU. 2008-02
M. Bruntink. Renovation of Id-
iomatic Crosscutting Concerns in Em-
bedded Systems. Faculty of Electrical
Engineering, Mathematics, and Com-
puter Science, TUD. 2008-03
A.M. Marin. An Integrated System
to Manage Crosscutting Concerns in
Source Code. Faculty of Electrical En-
gineering, Mathematics, and Computer
Science, TUD. 2008-04
N.C.W.M. Braspenning. Model-
based Integration and Testing of High-
tech Multi-disciplinary Systems. Fac-
ulty of Mechanical Engineering, TU/e.
2008-05
M. Bravenboer. Exercises in Free
Syntax: Syntax Definition, Parsing,
and Assimilation of Language Conglom-
erates. Faculty of Science, UU. 2008-06
M. Torabi Dashti. Keeping Fairness
Alive: Design and Formal Verification
of Optimistic Fair Exchange Protocols.
Faculty of Sciences, Division of Math-
ematics and Computer Science, VUA.
2008-07
I.S.M. de Jong. Integration and Test
Strategies for Complex Manufacturing
Machines. Faculty of Mechanical En-
gineering, TU/e. 2008-08
I. Hasuo. Tracing Anonymity with
Coalgebras. Faculty of Science, Math-
ematics and Computer Science, RU.
2008-09
L.G.W.A. Cleophas. Tree Algo-
rithms: Two Taxonomies and a Toolkit.
Faculty of Mathematics and Computer
Science, TU/e. 2008-10
I.S. Zapreev. Model Checking Markov
Chains: Techniques and Tools. Faculty
of Electrical Engineering, Mathematics
& Computer Science, UT. 2008-11
M. Farshi. A Theoretical and Exper-
imental Study of Geometric Networks.
Faculty of Mathematics and Computer
Science, TU/e. 2008-12
G. Gulesir. Evolvable Behavior Speci-
fications Using Context-Sensitive Wild-
cards. Faculty of Electrical Engineer-
ing, Mathematics & Computer Science,
UT. 2008-13
F.D. Garcia. Formal and Computa-
tional Cryptography: Protocols, Hashes
and Commitments. Faculty of Sci-
ence, Mathematics and Computer Sci-
ence, RU. 2008-14
P. E. A. D¨ urr. Resource-based Veri-
fication for Robust Composition of As-
pects. Faculty of Electrical Engineering,
Mathematics & Computer Science, UT.
2008-15
E.M. Bortnik. Formal Methods in
Support of SMC Design. Faculty of Me-
chanical Engineering, TU/e. 2008-16
R.H. Mak. Design and Performance
Analysis of Data-Independent Stream
Processing Systems. Faculty of Math-
ematics and Computer Science, TU/e.
2008-17
M. van der Horst. Scalable Block
Processing Algorithms. Faculty of
Mathematics and Computer Science,
TU/e. 2008-18
C.M. Gray. Algorithms for Fat Ob-
jects: Decompositions and Applications.
Faculty of Mathematics and Computer
Science, TU/e. 2008-19
J.R. Calam´e. Testing Reactive Sys-
tems with Data - Enumerative Meth-
ods and Constraint Solving. Faculty of
Electrical Engineering, Mathematics &
Computer Science, UT. 2008-20
E. Mumford. Drawing Graphs for
Cartographic Applications. Faculty of
Mathematics and Computer Science,
TU/e. 2008-21
E.H. de Graaf. Mining Semi-
structured Data, Theoretical and Ex-
perimental Aspects of Pattern Evalua-
tion. Faculty of Mathematics and Nat-
ural Sciences, UL. 2008-22
R. Brijder. Models of Natural Compu-
tation: Gene Assembly and Membrane
Systems. Faculty of Mathematics and
Natural Sciences, UL. 2008-23
A. Koprowski. Termination of
Rewriting and Its Certification. Faculty
of Mathematics and Computer Science,
TU/e. 2008-24
U. Khadim. Process Algebras for Hy-
brid Systems: Comparison and Devel-
opment. Faculty of Mathematics and
Computer Science, TU/e. 2008-25
J. Markovski. Real and Stochastic
Time in Process Algebras for Perfor-
mance Evaluation. Faculty of Mathe-
matics and Computer Science, TU/e.
2008-26
H. Kastenberg. Graph-Based Soft-
ware Specification and Verification.
Faculty of Electrical Engineering,
Mathematics & Computer Science, UT.
2008-27
I.R. Buhan. Cryptographic Keys
from Noisy Data Theory and Applica-
tions. Faculty of Electrical Engineering,
Mathematics & Computer Science, UT.
2008-28
R.S. Marin-Perianu. Wireless Sen-
sor Networks in Motion: Clustering Al-
gorithms for Service Discovery and Pro-
visioning. Faculty of Electrical Engi-
neering, Mathematics & Computer Sci-
ence, UT. 2008-29
M.H.G. Verhoef. Modeling and Vali-
dating Distributed Embedded Real-Time
Control Systems. Faculty of Science,
Mathematics and Computer Science,
RU. 2009-01
M. de Mol. Reasoning about Func-
tional Programs: Sparkle, a proof as-
sistant for Clean. Faculty of Science,
Mathematics and Computer Science,
RU. 2009-02
M. Lormans. Managing Requirements
Evolution. Faculty of Electrical Engi-
neering, Mathematics, and Computer
Science, TUD. 2009-03
M.P.W.J. van Osch. Automated
Model-based Testing of Hybrid Systems.
Faculty of Mathematics and Computer
Science, TU/e. 2009-04
H. Sozer. Architecting Fault-Tolerant
Software Systems. Faculty of Electrical
Engineering, Mathematics & Computer
Science, UT. 2009-05

Architecting Fault-Tolerant Software Systems

Hasan S¨zer o

Architecting Fault-Tolerant Software Systems

Hasan S¨zer o

Centre for Telematics and Information Technology (CTIT). Ir. The Netherlands CTIT Ph. This work has been carried out as part of the Trader project under the responsibility of the Embedded Systems Institute. Ak¸it. Ir. The work in this thesis has been carried out under the auspices of the research school IPA (Institute for Programming research and Algorithmics). Delft University of Technology. Smit. 09-135) IPA Dissertation Series 2009-05 Cover design by Hasan S¨zer o Printed by PrintPartners Ipskamp. P. This project is partially supported by the Dutch Government under the Bsik program. Enschede. Dr. The Netherlands Prof. University of Twente. Box 217 . R. B. di ssertation committee: Chairman and secretary: Prof. A. University of Twente.7500 AE Enschede. Ir. University of Twente. G.O. The Netherlands s Assistant promoter : Dr. Ir. Mouthaan. University of Kent.Ph. The Netherlands Dr. van Gemund. United Kingdom Prof. United Kingdom Prof. The Netherlands Copyright c 2009. University of Twente. de Lemos. The Netherlands Promoter : Prof. Ir. The Netherlands. Bilkent University. A. Broenink. Tekinerdo˘an. Turkey g Members: Dr. Dr. Newcastle University. Hasan S¨zer. thesis series no. A. Enschede. M. Dr.D. ISBN 978-90-365-2788-0 ISSN 1381-36-17 (CTIT Ph. Dr. Dr. Ir. 09-135.D. thesis series no. J.J.D. The Netherlands o . Romanovsky.

45 by Hasan S¨zer o born on the 21st of August 1980 in Bursa. Prof.Architecting Fault-Tolerant Software Systems DISSERTATION to obtain the degree of doctor at the University of Twente. on the authority of the rector magnificus. on account of the decision of the graduation committee. Turkey . H. Brinksma. Dr. to be publicly defended on Thursday the 29th of January 2009 at 16.

Dr. Ak¸it (promoter) s Dr. Bedir Tekinerdo˘an (assistant promoter) g . M.This dissertation is approved by Prof. Ir. Ir.

the more you realize you know nothing.” .Socrates .“The more you know.

.

He was a visiting assistant professor there at that time. During my studies.Acknowledgements When I was a M. Alexander Romanovsky. I have witnessed his ability to foresee pitfalls and I have been convinced about the accuracy of his predictions in research. committee: Jan Broenink.10) presents the results of our collaboration. I would like to thank him for his contributions to my intellectual growth and for his continuous encouragement. vii . I would like to thank him for the faith he had in me. Towards the end g of my M. and Gerard Smit for e spending their valuable time and energy to evaluate my work. I would like to thank him for his reliable guidance. Their useful comments enabled me to dramatically improve this thesis. I have sometimes been exposed to challenging critics but always with a positive. position at the University of Twente.Sc. First of all. I have also had the opportunity to work together with Hichem Boudali and Mari¨lle Stoelinga from the formal methods group. Arjan van Gemund. Over the years. I have met with Bedir Tekinerdo˘an.D. In these meetings. I would like to thank to the members of my Ph. student at Bilkent University.D.D.Sc. He has also recommended me for this position. optimistic attitude and encouragement. I would like to thank them for their contribution. he has notified me about the vacancy for a Ph. I have carried out my Ph. I have always been impressed by his ability to abstract away key points out of details and his writing/presentation skills based on a true empathy towards the intended audience. he became my daily supervisor and we have been working very closely thereafter. studies at the software engineering group lead by Mehmet Ak¸it. Rog´rio de Lemos. Following my admission to this position. studies. I have learned a lot e from them and an important part of this thesis (Section 5. which has been an important source of motivation for me. We have had regular meetings with him to discuss my progress s and future research directions.

Ben Pronk brought up the research direction on local recovery. He has also read and corrected my official Dutch letters. The set of strategies has been later extended for surviving at mountains and at the military service as well. but almost healthy at the same time). Joke Lammerink. I had the chance to meet with several other good friends while I was living at a student house (the infamous 399) in the campus. In particular. Hilda Ferweda and Nathalie van Zetten for their invaluable administrative support. who have provided me such an environment during the last four years. To be able to finish this work. Christian and Somayeh have always been open to give their opinion about any issue I may bring up and help me if necessary. Elvira Dijkhuis. Janson and Pierre van de Laar on failure scenarios. In addition to the Dutch courses provided by the university.I would like to thank to the members of the Trader project for their useful feedback during our regular project meetings. In the following. Christian and Somayeh. I would like to thank Ellen Roberts-Tieke. Espen is very effective in killing stress and boosting courage in any circumstance (almost like alcohol. When I first arrived in Enschede. He has helped me a lot to get acquainted with the new environment. Previously we had several discussions with Iulian Nitescu. In particular. first of all I had to feel secure and comfortable in my social environment. Jozef Hooman and Teun Hendriks have reviewed my work closely. I would like to extend my gratitude to people. Joost has given me a ‘complementary’ course on Dutch language and Dutch culture. I thank my roommates over the years: Joost. David Watts. He has provided me a useful set of survival strategies to deal with never-ending official procedures. The life is a lot easier if you always have a reliable friend around to talk to. I have been sharing an apartment with Espen during the last three years. I would like to thank them also for the comfortable working environment I have had. Besides Espen. The members of the software engineering group have provided me useful feedback during our regular seminars. G¨rcan was one of the few people I knew at the u university. which later happened to be the main focus of my work. I am sure that we will keep in touch in the future. in one way or another. fault/error/failure classes and recovery strategies. which would have caused quite some trouble if they were not corrected. He has also spent his valuable time to provide us TV domain knowledge together with Rob Golsteijn. Paul L. . These discussions have also directly or indirectly contributed to this thesis.

In addition. I have had endless love and support from my family throughout my life. I would like to thank all the people who have contributed to Tusat. I would like to thank all my friends who accompany me in recreational trips and various other activities during the last four years. Similarly. I am grateful to people who have volunteered to work for making our life more social and enjoyable in the university. Selim and o Emre are also in this group and I especially thank them for “supporting” me during my defense. . for example. I thank foremost my parents for always standing by me regardless of the geographic distance between us. members of Esn Twente over the years. I have also had the chance to meet with many new Turkish friends during my studies. They have became very close friends of mine and they have helped me not to feel so much homesick. Stichting Kleurrijke Dans has also made my life more colorful lately. There are many people to count in this group and I will shortly refer to them as ‘¨ztwenteliler’.Although I have been living abroad.

.

xi . FMEA. we propose a systematic method for optimizing the decomposition of software architecture for local recovery. realizing a fault-tolerant design usually requires a substantial development and maintenance effort. which is an effective fault tolerance technique to attain high system availability. As a new architectural style. we propose a scenario-based software architecture reliability analysis method. state-based analytical models (i. there are no adequate analysis techniques that evaluate the impact of fault tolerance techniques on the functional decomposition of software architecture. Fourth. Faults that remain in the system can eventually lead to a system failure. Fault tolerance techniques are introduced for enabling systems to recover and continue operation when they are subject to faults. First. CTMCs) and dynamic analysis on the system. Many fault tolerance techniques are available but incorporating them in a system is not always trivial. called SARAH that benefits from mature reliability engineering techniques (i. To support the method. FTA) to provide an early reliability analysis of the software architecture design. We consider the following problems in designing a fault-tolerant system. To tackle the first problem.Abstract The increasing size and complexity of software systems makes it hard to prevent or remove all possible faults. we have developed an integrated set of tools that employ optimization techniques. we introduce Recovery Style for specifying fault-tolerant aspects of software architecture. Second. existing architecture styles are not well-suited for specifying.e. SARAH evaluates potential failures from the end-user perspective to identify sensitive points of a system without requiring an implementation. communicating and analyzing design decisions that are particularly related to the fault-tolerant aspects of a system. As a solution for the third problem. Recovery Style is used for communicating and analyzing architectural design decisions and for supporting detailed design with respect to recovery. existing reliability analysis techniques generally do not prioritize potential failures from the end-user perspective and accordingly do not identify sensitivity points of a system.e. Third.

ii ) reducing the design space with respect to domain and stakeholder constraints and iii ) making the desired trade-off between availability and performance metrics. The framework provides reusable abstractions for defining recoverable units and for incorporating the necessary coordination and communication protocols for recovery.The method enables the following: i ) modeling the design space of the possible decomposition alternatives. . we propose a framework. FLORA that supports the decomposition and implementation of software architecture for local recovery. To reduce the development and maintenance effort.

3 2. . . 1. . . . .3. . . Framework for the realization of software architecture recovery design . . . . . . . . . .1. . . . . . . . . . . . Motivation .4 Software architecture reliability analysis using failure scenarios Architectural style for recovery . . .2 Software Architecture Descriptions . . . . . . . . . . . . . . . . . . . . . 14 2. . 12 Software Architecture Design and Analysis . . . . . . . . . . . . . . .4 1. . . . . . . . . . . . . . . . . . .3 Thesis Scope . . . . . 10 Dependability Means . . . . Quantitative analysis and optimization of software architecture decomposition for recovery . . . . . . . . . . The Approach . . . . . . . . . . . . 16 xiii . . . . . . . . . . . . . . . . . . . . . . . . . . .2 1. . . . .1 1. . . .3. .1 2. 2 Background and Definitions 2.1. . . . . . . . . . . . . . .Contents 1 Introduction 1. . . . . . . . . . .2. . . . . . .3 1. . . . . . . . .1. . . . . . . . . . . . . . .1 1. . . . . . . . . . . . . . 14 Software Architecture Analysis .3. . . . 2. . . . . .1 2. . .2 2. . . . . . . . 11 Fault Tolerance and Error Handling .1 Dependability and Fault Tolerance . . . . . . . . . .2 1. . . .2 Dependability and Related Quality Attributes . . . . . . . .2. . . . . . . . . . .3. . . . . . . . . . . . . . . . 1 2 3 4 5 5 6 6 7 9 9 Thesis Overview . . . . . . . . . . . . . . . . . . . . .

. 28 Software Architecture Reliability Analysis . . . . . . . . . . . . . . . . . . .2. . . .6 4. 47 3. . . 61 Documenting the Recovery Design . . . 59 Case Study: MPlayer . . 60 4. . . . . . . . . . . .2 3. .2 3. . . . . . . . . . . . . . . . 70 . . . 51 Related Work . .2 Refactoring MPlayer for Recovery . . . . . . . . . . . 66 Discussion . . . . . . . . . . . . . . . . . . . .1 4. . . .3. . . . . . . . . .3. . . . . . . . . 17 19 3 Scenario-Based Software Architecture Reliability Analysis 3. . . . . . . . . . . . . . . . . . . . . . . .3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 3. . . . . . . . .4 3. . . .3. . .5 4. . . . . . . Patterns and Styles . . .4 3.3 Software Architecture Views and Models . . . . . .4 4. . . . . . . . . . .7 4. . . . . . . .3 Scenario-Based Software Architecture Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3. 54 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3. . . .3. . . 20 FMEA and FTA . . . . . . . . . . . 22 SARAH .6 Discussion . . .2. . . . . . . . . . . . . . . 64 Using the Recovery Style . . . . . 24 3. . . 38 Architectural Adjustment . . . . . . . . . . . . . . . . . . . . . . . 26 Software Architecture and Failure Scenario Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 57 4 Software Architecture Recovery Style 4. . . . . . . .2 4. . . . . . . . . . . . . . . . . . . . . . 63 4. . . . . . . . . . . . . 24 The Top-Level Process . .8 Recovery Style Local Recovery Style . . . . . . . . . . . 58 The Need for a Quality-Based View . . . . . . . .3 Architectural Tactics.5 3. . . . . . . 69 Related Work . . . .5 Case Study: Digital TV . . .3 3. . . .1 4. . . . . . . . . . . . . . . . . . . . . . . . .1 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 . .

. . . . . . . . . . . . . . . . . . . . . . . 101 5. .1 Modeling Approach . . . . . . . . 91 Performance Overhead Analysis . . . . . . . . . . . . . . . . . .1 5. . . . . . .10. . .9 Decomposition Alternative Selection . . . . . . . . . .5 5. . . . . . . . . . . .6 5. . . . . . . . . . . . . . .2 Source Code and Run-time Analysis . . . . . . . . . . . . . . . . . . . . . . . . .8. . . . . . . . . . . . . . . . .8. . . . . . . . . 74 Analysis based on Analytical Models . .3. . . . . . . . . . . . . . . . . .2. . . . . . . . . . . . . . . 77 Software Architecture Decomposition for Local Recovery . . . . . 104 5. . . . . . .3. . .3 Function Dependency Analysis . . . . . . .15 Conclusions . . . 98 5.14 Related Work . . . . . . 88 Design Alternative Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4. . . . . . .13 Discussion . . . . . . . . 109 5. . . . . . . . . 113 5. . . . . . . . . . .11 Optimization . . . . . . . . . 78 5. . .4 5. . . 100 5. . . .1 5. . . . . . . . . .3 The I/O-IMC formalism . . . . . . . . . .2 5. . . . . . . . . . . . . . . . 72 5 Quantitative Analysis and Optimization of Software Architecture Decomposition for Recovery 73 5. . . . . . 81 Software Architecture Definition . . . . . . . . . 86 Constraint Definition . .12 Evaluation . . . . . . . . . . . . . . . . . 98 5. . . . . . . . . . .8. . . . 92 Data Dependency Analysis . . . . . . . . . . 106 5. . . . . . . 75 5. . . . . . . . . . .2 Analysis Results . . . . . . . . . . . . . . . . . . . . . 115 .10 Availability Analysis . . . .1 5. . . . 79 Criteria for Selecting Decomposition Alternatives . . . . . . . . . . . .8 The Overall Process and the Analysis Tool . . . . . . . . . .1 5. . . . . . 80 5. . . . . .7 5. . . . . . . . 112 5. . . . . . . . . . . . . . . 92 5. . . . . . . . . . .9 Conclusions . . . . . . . . . . . .2 Design Space . . . . . .10. . . . . . . . . . . . . . . . . . . . . . . . . 95 Depicting Analysis Results . . . . . . . . . .

. . . . . . . 137 139 141 A SARAH Fault Tree Set Calculations B I/O-IMC Model Specification and Generation B. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4 Early Reliability Analysis . . . .3 The recovery interface I/O-IMC . . . . . . 142 B. . . . . . . . . . . . 118 FLORA: A Framework for Local Recovery . .5 6. . . . . . . . .2. . . . .1 6. .4 The recovery manager I/O-IMC . .2. . . . . . . . . . . . . . . . . . 120 Evaluation . . . . . . 143 B. . . . . 135 Optimization of Software Decomposition for Recovery . . . . . . . . 135 7. 135 Modeling Software Architecture for Recovery .3 6. . . . . 131 133 7 Conclusion 7. . .4 6. .2 I/O-IMC Model Specification with MIOA . . . . .1. . . . . . . . . . . . . . . . . . . . . . 146 . . . . . . . . . . . .1 The module I/O-IMC . .1. . . .6 Realization of Software Architecture Recovery Design 6. . . . . . . . . . . . . . . . . . . . . . . . .2. . . . . . . . . . . . . . . . . . . . . 136 Realization of Local Recovery . .1 7. . . . . . . . . . . . . . . . . .1. 126 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 The failure interface I/O-IMC . . . . . . . . . .2. . . . . . . . . . . . . . . . . . . . . . . . . 128 Related Work . . . . . . . . . . . . . . . . . . . . . .1 Example I/O-IMC Models . . . . . . . . . . . . . 136 7. . . . . . . . . . .1 7. 129 Conclusions .2 6. . . . . . . . .1. . . . . . . . . . . . . . .3 Future Work . 141 B. . . . . . . .6 6. . . . . . . . 145 B. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 7.2 Problems . . . . . . . . . . . . . . . . 118 Application of FLORA . . .7 117 Requirements for Local Recovery . . . . 134 Solutions . . . . . .2 7. . . . . 142 B. . . . . . . .

. . .4 Composition and Analysis Script Generation . . . . . . . . .B.3 I/O-IMC Model Generation . . . . . . 150 Bibliography Samenvatting Index 151 164 166 . . . . 149 B. . . . . . . . . . . . . . . . .

As a consequence.Chapter 1 Introduction A system is said to be reliable [3] if it can continue to provide the correct service. 1 . Potential failures can be prevented by designing the system to be recoverable from errors regardless of the faults that cause them. the system can remain available to its users. The system state that leads to a failure is defined as an error and the cause of an error is called a fault [3]. By this way. the system needs to be able to tolerate faults to increase its reliability. It may therefore not be economically and/or technically feasible to implement a fault-free system. It becomes harder to prevent or remove all possible faults in a system as the size and complexity of software increases. which implements the required system function. A failure occurs when the delivered service deviates from the correct service. which aims at enabling a system to continue operation in case of an error. possibly with reduced functionality and performance rather than failing completely. This is the motivation for faulttolerant design. we introduce methods and tools for the application of fault tolerance techniques to increase the reliability and availability of software systems. In this thesis. Moreover. the behavior of current systems are affected by an increasing number of external factors since they are generally integrated in networked environments. interacting with many systems and users.

By this way. the quality of the system can be assessed before realizing the fault-tolerant design. In addition. Software architecture represents the grosslevel structure of the system that directly influences the subsequent analysis. design and implementation. it is important to evaluate the impact of fault tolerance techniques on the software architecture. For a long period. A number of important trends can be observed in the development of embedded systems like DTVs. the cost and the perception of the user turn out to be the primary concerns. Hence.2 Chapter 1. embedded systems are more and more integrated in networked environments that affect these systems in ways that might not have been foreseen during their construction. the implementation of functionality is shifting from hardware to software. in particular. Altogether. whereas failures that can be directly 1 TRADER stands for Television Related Architecture and Design to Enhance Reliability. instead. Third. . First. we have focused on the consumer electronics domain. Finally. DTV systems. For such systems. these trends increase the size and complexity of software in embedded systems and as such make software faults a primary threat for reliability. Introduction 1. products are not solely developed by just one manufacturer only but it is host to multiple parties. Second. due to the high industrial competition and the advances in hardware and software technology. Different type of fault tolerance techniques are employed in different application domains depending on their requirements. which are probably less subjected to catastrophic failures. These systems are very cost-sensitive and failures that are not directly perceived by the user can be accepted to some extent. However. the additional cost of the fault-tolerant design due to the required hardware/software resources is a minor issue. there is a continuous demand for products with more functionality. In the TRADER project [120]. It must be ensured that the design of the software architecture supports the application of necessary fault tolerance techniques. For safety-critical systems. safety is the primary concern and any failure that can cause harm to people and environment must be prevented. in face of the current trends it has now been recognized that reliability analysis should focus more on software components. reliability and fault-tolerant aspects of embedded systems have been basically addressed at hardware level or source code. incorporation of some fault tolerance techniques should be considered at a higher abstraction level than source code. In that context.1 Thesis Scope The work presented in this thesis has been carried out as a part of the TRADER1 [120] project. The objective of the project is to develop methods and tools for ensuring reliability of digital television (DTV) sets. This is essential to identify potential risks and avoid costly redesigns and reimplementations. such as the ones used in nuclear power plants and airplanes.

Introduction 3 observed by the user require a special attention. understanding. it is not feasible to design a system that can tolerate all the faults of each of its elements. Accordingly. Existing reliability analysis techniques do not generally prioritize potential failures from the end-user perspective and they do not identify sensitive points of a system accordingly. These faults are mostly activated by timing issues and peak conditions in workload that could not have been anticipated before. Many such fault tolerance techniques are available but developing a fault-tolerant system is not always trivial. Each view supports the modeling. the fault tolerance techniques to be applied must be evaluated with respect to their additional costs and their effectiveness based on the user perception. the erroneous system state that is caused by such faults has also been assumed to be permanent. Thus. In this context. The set of fault tolerance techniques should be then selected based on the type of faults activated by the identified sensitive elements. After we analyze the system and select the set of fault tolerance techniques accordingly. So. software systems are also exposed to so-called transient faults. we should adapt the software architecture description to specify the faulttolerant design. whose failures might cause the most critical system failures. As a result. Errors that are caused by such faults are likely to be resolved when the software is re-executed after a clean-up and initialization [58]. we should identify sensitive elements of the system. This is because current software systems are too complex to represent all the concerns in one model.Chapter 1. 1.2 Motivation Traditional software fault tolerance techniques are mostly based on design diversity and replication because software failures are generally caused by design and coding faults. which are permanent [58]. communication and analysis of the software architecture for different concerns. An analysis of the current practice for representing architectural views reveals that they focus mainly on functional concerns and are not well-suited for communicating and analyzing design decisions that are particularly related to . First of all. On the other hand. we need to analyze potential failures and prioritize them based on user perception. we need to know which fault tolerance techniques to select and where in the system to apply the selected set of techniques. it is possible to design a system that can recover from a significant fraction of errors [16] without replication and design diversity and as such without requiring substantial hardware/software resources [58]. The software architecture of a system is usually described using more than one architectural view. Due to the cost-sensitivity of consumer electronics products like DTVs.

this thesis provides software architecture modeling. 1. However. construction and analysis of quality models. but will result in a lower availability since the failure of one module will affect the others as well. for selecting a decomposition alternative we have to cope with a trade-off between availability and performance. Accordingly. Usually there are many different alternative ways to decompose the system for local recovery. . and as such attaining high system availability [3]. this will also introduce an additional performance overhead since more modules will be isolated from each other. Fault tolerance techniques may influence the decomposition of software architecture. The overall goal is to employ fault tolerance techniques that can mainly tolerate transient faults and as such to improve the reliability and availability of cost-sensitive systems from the user point of view. Developers need to be supported for the implementation of the selected recovery design. New architectural styles are required to be able to document the software architecture from the fault tolerance point of view. Introduction the fault-tolerant aspects of a system. As a result. As a result. is not trivial and requires a substantial development and maintenance effort [26]. There are no adequate integrated set of analysis techniques to directly support this trade-off analysis. recoverable units) that can be recovered in isolation.e. which aims at making the system ready for correct service as much as possible. while preserving the existing decomposition.4 Chapter 1. we summarize the approaches that we have taken for supporting the design and implementation of fault-tolerant software systems. On the other hand. For achieving local recovery the architecture needs to be decomposed into separate units (i. We need the utilization and integration of several analysis techniques to optimize the decomposition of software architecture for recovery. keeping the modules together in one recoverable unit will increase the performance. the realization of local recovery.3 The Approach In the following subsections. The optimal decomposition for recovery is usually not aligned with the existing decomposition of the system. which requires optimization techniques. Local recovery is such a fault tolerance technique. analysis and realization techniques to improve the reliability and availability of software systems by introducing fault tolerance techniques. Increasing the number of recoverable units can provide higher availability. and analysis of the existing code base to automatically derive dependencies between modules of the system.

Similarly. it is of importance to analyze the hazards that can lead to failures and to analyze their impact on the reliability of the system before we select and implement fault tolerance techniques. before committing enormous organizational resources [31]. which represent the units of isolation. we introduce the recovery style for modeling the structure of the system related to the recovery concern. they should be incorporated into the existing software architecture. For this purpose. 1. Our second approach aims at modeling the resulting decomposition explicitly by providing a practical and easy-to-use method to document the software architecture from a recovery point of view. it is important to predict the quality of the system and identify potential risks. .3. FTS is used for providing a severity analysis for the overall software architecture and the individual architectural elements. Since implementing the software architecture is a costly process. The analysis results can be used for identifying so-called architectural tactics [5] to improve the reliability.3.2 Architectural style for recovery Once we have selected the appropriate fault tolerance techniques and related architectural tactics. As a further specialization of the recovery style. The style defines basic relations for coordination and application of recovery actions. Hereby. we introduce a software architecture reliability analysis approach ( SARAH) that benefits from mature reliability engineering techniques and scenariobased software architecture analysis to provide an early software reliability analysis. Introduction 5 1. For this purpose. SARAH defines the notion of failure scenario model that is based on the Failure Modes and Effects Analysis method (FMEA) in the reliability engineering domain. Despite conventional reliability analysis techniques which prioritize failures based on criteria such as safety concerns. It is used for communicating and analyzing architectural design decisions and supporting detailed design with respect to recovery.1 Software architecture reliability analysis using failure scenarios Our first approach aims at analyzing the potential failures and the sensitivity points at the software architecture design phase before the fault-tolerant design is implemented. error containment and recovery control. The recovery style considers recoverable units as first class architectural elements.Chapter 1. in SARAH failure scenarios are prioritized based on severity from the end-user perspective. architectural tactics form building blocks of design patterns for fault tolerance. Introduction of fault tolerance mechanisms usually requires dedicated architectural elements and relations that impact the software architecture decomposition. The failure scenario model is applied to represent so-called failure scenarios that define a Fault Tree Set (FTS).

the software architecture should be partitioned accordingly.3. new supplementary architectural elements and relations should be implemented to enable local recovery. We have then . we have introduced local recovery to the open-source media player called MPlayer for several decomposition alternatives.4 Framework for the realization of software architecture recovery design After the optimal decomposition for recovery is selected. the designer can systematically evaluate and compare decomposition alternatives. Introduction the local recovery style is provided. we have developed an integrated set of tools that employ i ) dynamic program analysis to estimate the performance overhead introduced by different decomposition alternatives. which is used for documenting a local recovery design including the decomposition of software architecture into recoverable units and the way that these units are controlled. The framework provides reusable abstractions for defining recoverable units and the necessary coordination and communication protocols for recovery. To support the approach. Using our framework. CTMCs) to estimate the availability achieved by different decomposition alternatives. FLORA that supports the decomposition and implementation of software architecture for local recovery. To reduce the resulting development and maintenance efforts we introduce a framework.3. and select an optimal decomposition. • modeling the design space of the possible decomposition alternatives • reducing the design space with respect to domain and stakeholder constraints • making the desired trade-off between availability and performance metrics With this approach. first we need to select a decomposition among many alternatives.3 Quantitative analysis and optimization of software architecture decomposition for recovery To introduce local recovery to the system. In addition.e. ii ) state-based analytical models (i.6 Chapter 1. and iii ) optimization techniques for automatic evaluation of decomposition alternatives with respect to performance and availability metrics. We propose a systematic approach dedicated to optimizing the decomposition of software architecture for local recovery. 1. 1. The approach enables the following.

The evaluations. Chapter 6 presents the framework FLORA that supports the decomposition and implementation of software architecture for local recovery.4 Thesis Overview The thesis is organized as follows. [117]. These are all integrated to support the approach. Chapter 4 introduces a new architectural style. Chapter 7 provides our conclusions. called Recovery Style for modeling the structure of the software architecture that is related to the fault tolerance properties of a system. fault tolerance and software architectures. This chapter is a revised version of the work described in [111]. 1. The output of SARAH can be utilized as an input by the techniques that are introduced in Chapter 4 and Chapter 5. SARAH is a scenario-based analysis method. This chapter is a revised and extended version of the work described in [112]. It also supports the understanding of Chapter 5. which aims at providing an early evaluation and feedback at the architecture design phase. discussions and related work for the particular contributions are provided in the corresponding chapters. It introduces the basic concepts of reliability. . In this chapter.Chapter 1. Chapter 2 provides background information and a set of definitions that is used throughout this thesis. Introduction 7 performed measurements on these implementations to validate the results of our analysis approaches. The framework provides reusable abstractions for defining recoverable units and the necessary coordination and communication protocols for recovery. The style is used for communicating and analyzing architectural design decisions and supporting detailed design with respect to recovery. It utilizes mature reliability engineering techniques to prioritize failure scenarios from the user perspective and identifying sensitive elements of the architecture accordingly. analytical models and optimization techniques. Chapter 5 proposes a systematic approach dedicated to optimizing the decomposition of software architecture for local recovery. Chapter 3 presents the software architecture reliability analysis method (SARAH). This style is used in Chapter 6 to represent the designs to be realized. we explain several analysis techniques and tools that employ dynamic analysis. and [118]. This chapter is a revised version of the work described in [110].

8 Chapter 1. the rounded rectangles represent the chapters of the thesis.1. ! " ! " Figure 1. The solid arrows represent the recommended reading order of these chapters. Chapter 4 should be read before Chapter 6. Introduction An overview of the main chapters is depicted in Figure 1. the reader can immediately start reading Chapter 3. After reading Chapter 2. Hereby. All the chapters provide complementary work and the works that are presented through chapters 4 to 6 are directly related.1: The main chapters of the thesis and the recommended reading order. 4 or 5. All the other chapters are self-contained. .

This is the fault. This is the failure. As a result. 2.1 depicts the fundamental chain of these concepts that leads to a failure. An error is defined as the system state that is liable to lead to a failure and the cause of an error is called a fault [3].1: The fundamental chain of dependability threats leading to a failure 9 . A system is considered to be dependable if it can avoid failures (service failures) that are more frequent or more severe than is acceptable [3]. As an example. Figure 2. It is an integrative concept that encompasses several quality attributes including reliability. the operating system kills the corresponding process and the user observes that the software crashes. A failure occurs when the delivered service of a system deviates from the required system function [3].Chapter 2 Background and Definitions In our work. In this chapter.1 Dependability and Fault Tolerance Dependability is the ability of a system to deliver service that can justifiably be trusted [3]. assume that a software developer allocates an insufficient amount of memory for an input buffer. Figure 2. integrity and maintainability. we utilize concepts and techniques from both the areas of dependability and software architectures. the size of the incoming data overflows this buffer. availability. we provide background information on these two areas and we introduce a set of definitions that will be used throughout the thesis. This is the error. At some point during the execution of the software. safety.

where a system is in a functioning condition. Reliability is the ability of a system to perform its required functions under stated conditions for a specified period of time. reliability). Ideally. for instance. For this reason. Even if there are no faults activated. a fault-tolerant system must become operational again as soon as possible to increase its availability. such as speed and accuracy [60]. In this thesis. whereas safety.1 Dependability and Related Quality Attributes Dependability encompasses the following quality attributes [3]: • reliability: continuity of correct service. a fault-tolerant system can recover from errors before any failure is observed by the user (i. The overhead can be caused. different emphasis might be put on different attributes.e. That is the ability of a system to function without a failure.10 Chapter 2. fault tolerance techniques introduce a performance overhead during the operational time of the system. integrity and maintainability are out of the scope of our work. . by monitoring of the system for error detection.1. Reliability and availability are very important quality attributes in the context of fault-tolerant systems and they are closely related. collecting and logging system traces for diagnosis. in addition to reliability and availability. 2. where an error propagates to other errors and finally leads to a failure [3]. saving data for recovery and wrapping system elements for isolation. there are multiple errors involved in the chain. Usually. Performance is defined as the degree to which a system accomplishes its designated functions within given constraints. However. we consider performance as a relevant quality attribute in this work although it is not a dependability attribute. • integrity: absence of improper system alterations. this is practically not always possible and the system can be unavailable during its recovery. • availability: readiness for correct service. After a fault is activated. we have considered reliability and availability. • maintainability: ability to undergo modifications and repairs. Background and Definitions Figure 2. Availability is the proportion of time. • safety: absence of catastrophic consequences on the user(s) and the environment. Depending on the application domain.1 shows the simplest possible chain of dependability threats.

.2: Basic features of dependability means and fault tolerance A feature diagram is a tree. Figure 2. availability and performance quality attributes.2 depicts the dependability means and the features of fault tolerance as a simple feature diagram.1 must be broken. ii ) removing existing faults. some fault-tolerant systems can also be partially available (also known as performance degradation). we accept that faults may occur but we deal with their consequences before they lead to a failure. This is possible through i ) preventing occurrence of faults. In addition. Error detection is the first necessary step for fault tolerance. The quantification of performability rewards the system for the performance that is delivered not only during its normal operational time but also during partial failures and their recovery [52]. a system is assumed to be either up and running.Chapter 2. or it is not. However. Based on [3]. which combines reliability. the chain of dependability threats as shown in Figure 2. This fact is taken into account by performability [52].1. 2. Background and Definitions 11 In traditional reliability and availability analysis. if possible. detected errors must be recovered. In the last approach. Essentially. or iii ) tolerating faults.2 Dependability Means To prevent a failure. it measures how well the system performs in the presence of failures over a specified period of time. where the root element represents the domain or concept. Figure 2.

2 also shows the two mandatory features of fault tolerance: error detection and recovery. exactly one of them is required if their parent feature is selected). cost-effectiveness. The solid circles indicate mandatory features (i. high performance. The domain of recovery is quite broad due to different type of faults (e. Fault forecasting aims at evaluating the system behavior with respect to fault occurrence or activation [3]. Error recovery is generally defined as the action. the feature is required if its parent feature is selected).g. fault forecasting is also included in Figure 2.2 as a dependability means. Figure 2. The fault handling feature prevents faults from being activated again. safety-critical systems. fault tolerance techniques provide the necessary mechanisms to detect and recover from errors. 58] . Diagnosis can also enable a more effective error handling.3 Fault Tolerance and Error Handling When faults manifest themselves during system operations. In addition to fault prevention. which eliminates errors from the system state [3]. before they propagate and cause a system failure.3) are alternatives to each other (i. transient.e.g. 36. Recovery has one mandatory feature.g. 2. if possible. Mandatory features that are connected through an arc decorated edge (See Figure 2. 29. error handling.e. We have derived the features of the recovery domain through a domain analysis based on the corresponding literature [3. This requires further features such as diagnosis. high availability) imposed by different type of systems (e. consumer electronics). with which the system is set to a correct state from an erroneous state [3]. Background and Definitions The other nodes represent its features. fault removal and fault tolerance. permanent) to be tolerated and different requirements (e.3 shows a partial view of error handling features for fault tolerance. the recovery procedure can take actions concerning the associated components without impacting the other parts of the system and related system functionality. The empty circles indicate optional features. which reveals and localizes the cause(s) of error(s) [3]. If the cause of the error is localized. Figure 2.12 Chapter 2.1.

. log-based recovery). Background and Definitions 13 Figure 2. In either case.e. Backward recovery (i. This requires replication of system functionality and data. compensation.e. or best output. After the execution of the first version. N-version programming is a compensation technique. check-pointing) or for saving messages and events to replay them after recovery (i. backward recovery and forward recovery [3].e. Restarting a system is also an example for backward recovery. If the output is not acceptable.3: A partial view of error handling features for fault tolerance As shown in Figure 2. Recovery blocks approach uses multiple versions of a software for backward recovery.). several versions are executed and tested sequentially until the output is acceptable [81]. error handling can be organized into three categories. Backward recovery can employ different features for saving data to be restored after recovery (i. The system fails if no acceptable output is obtained after all the versions are tried. Similarly. rollback) puts the system in a previous state. the output is tested. All the outputs of these versions are compared to determine the correct. a stable storage is required to store recovery-related information (data. Compensation means that the system continues to operate without any loss of function or data in case of an error. which was known to be error free. where N independently developed functionally equivalent versions of a software are executed in parallel. Forward recovery (i. the state of the system is rolled back to the state before the first version is executed. in which the system is put back to its initial state.3. rollforward) puts the system in a new state to recover from an error.Chapter 2.e. if one exists [81]. messages etc.

Background and Definitions Exception handling is an example forward recovery technique. It introduces a set of concepts and relations among them as depicted in Figure 2. system engineers and anybody who has an interest in the construction of the software system. Software architecture represents a common abstraction of a system [5] and as such it forms a basis for mutual understanding and communication among architects. the externally visible properties of those elements.1 Software Architecture Descriptions An architecture description is a collection of documents to describe a system’s architecture [78]. Software architecture also promotes large-scale reuse by transferring architectural models across systems that exhibit common quality attributes and functional requirements [5].2. In the following subsections. In the case of local recovery. The granularity of the error handling in recovery can differ as well. Different error handling features can be utilized based on the fault assumptions and system characteristics. 2. As one of the earliest artifact of the software development life cycle.4 with a UML (The Unified Modeling .g. 2. which impacts the system’s detailed design. deployment and maintenance. In the case of global recovery. The IEEE 1471 standard [78] is a recommended practice for architectural description of software-intensive systems. a system with local recovery can provide a higher system availability to its users in case of component failures. the recovery mechanism can take actions on the system as a whole (e. Hence. software architecture embodies early design decisions. developers. where the execution is transfered to the corresponding handler when an exception occurs. we introduce basic techniques and concepts that are used for i ) describing architectures ii ) analyzing quality properties of an architecture and iii ) achieving or supporting qualities in the architecture design. and the relationships among them [5]. erroneous parts can be isolated and recovered while the rest of the system is available.14 Chapter 2. which comprise elements. Thus. it must be carefully documented and analyzed.2 Software Architecture Design and Analysis A software architecture for a program or computing system consists of the structure or structures of that system. implementation. restart the whole system). Graceful degradation [107] is a forward recovery approach that puts the system in a state with reduced functionality and performance.

Stakeholders may have different. end users. Aggregations are identified with a diamond shape at the end of an edge and they represent part-whole relationships. for instance. developers. concerns that they wish the system to provide or optimize. Figure 2. reliability and evolvability. Figure 2. certain run-time behavior. Background and Definitions 15 Language) [100] diagram. For example. for instance. These might include. its operation or any other aspects that are critical or otherwise important to one or more stakeholders. A viewpoint is a specification of . For example.4: Basic concepts of architecture description (IEEE 1471 [78]) The people or organizations that are interested in the construction of the software system are called stakeholders.Chapter 2. Hereby. Associations are labeled with a role and cardinality. the key concepts are marked with bold rectangles and two types of relations are defined: association and aggregation. which pertain to the systems development. architects. Figure 2. A concern is an interest.4 shows that a concern is important to (the role) 1 or more (the cardinality) stakeholders and a stakeholder has 1 or more concerns. These might include. performance.4 shows that a view is a part of an architectural description. system engineers and maintainers. A view is a representation of the whole system from the perspective of a related set of concerns. possibly conflicting.

Similarly. and possibly graphical syntax. UML [103] is an example standard notation for modeling object-oriented designs. model checkers. This conceptual framework provides a set of definitions for key terms and outlines the content requirements for describing a software architecture. understandable. compilers and code synthesis tools [82]. Software architecture analysis helps to predict the risks and the quality of a system before it is built.16 Chapter 2. Software architecture design processes and methods are out of cope of this thesis. parsers. Another issue that is not standardized by IEEE 1471 [78] is the notation or format that is used for describing architectures. • A view conforms to a viewpoint and it consists of a set of models that represent one aspect of an entire system. it is important that the architecture design supports the required qualities of a software system. but not necessarily formally defined semantics. which can be utilized for describing architectures as well. To summarize the key concepts as described in the IEEE 1471 framework: • A system has an architecture. Some ADLs are designed to have a simple. Attribute-Driven Design [5] and Synthesis-Based Software Architecture Design [115]. it does not standardize or put restrictions to how an architecture is designed and how its description is produced. usually it is also necessary to evaluate the architecture of a legacy system if . Some other ADLs encompass formal syntax and semantics. • An architecture is described by one or more architectural descriptions.2. Background and Definitions the conventions for constructing and using a view. There are several software design processes and architecture design methods proposed in the literature. There have been both general-purpose and domain-specific ADLs proposed. the Rational Unified Process [64].2 Software Architecture Analysis Software architecture forms one of the key artifacts in software development life cycle since it embodies early design decisions. several architecture description languages (ADLs) have been introduced as modeling notations to support architecture-based development. such as. 2. Accordingly. On the other hand. However. supported by powerful analysis tools. thereby reducing unnecessary maintenance costs. • A viewpoint covers one or more concerns of stakeholders. • An architectural description selects one or more viewpoints.

we present an analysis approach based on measuring techniques. a model with enough semantics) for measurement. In Chapter 3. there are two complementary software architecture analysis techniques: i ) questioning techniques and ii ) measuring techniques [5]. The type of analysis depends on the underlying semantic model of the ADL. . Depending on the application domain. 2. which enables to suppress the unnecessary details and focus only on the relevant aspects for analysis. Such architectural decisions are characterized as architectural tactics [4]. Measuring techniques use architectural metrics. Architectural tactics are viewed as basic design decisions and building blocks of patterns and styles [5]. Patterns and Styles A software architect makes a wide range of design decisions.g. On the other hand. questioning techniques can be applied on hypothetical architectures much earlier in the life cycle [5]. On the other hand.g. Questioning techniques use scenarios. to use redundancy for providing fault tolerance and in turn to increase system dependability). many of these design decisions are made to provide required functionalities. However. Software architecture constitutes an abstraction of the system.2. In Chapter 5. there are also several design decisions made for supporting a desired quality attribute (e. Basically. As a drawback. In general. (possibly quantitative) results of questioning techniques are inherently subjective. measuring techniques provide more objective results compared to questioning techniques. which is an analysis method based on questioning techniques. we explore both analysis approaches. while designing the software architecture of a system.3 Architectural Tactics. a prototype implementation. In this thesis. we present SARAH. questionnaires and check-lists to review how the architecture responds to various situations [19]. they require the presence of a working artifact (e. such as queuing networks [22]. where usually a quality model is applied.Chapter 2. simulations and static analysis of formal architectural models [82] to provide quantitative measures of qualities such as performance and availability. porting or integration with other systems [5]. Most of the software architecture analysis methods that are based on questioning techniques use scenarios for evaluating architectures [31]. These methods take as input the architecture design and estimate the impact of predefined scenarios on it to identify the potential risks and the sensitivity points of the architecture. Questioning techniques sometimes employ measurements as well but these are mostly intuitive estimations relying on hypothetical models without formal and detailed semantics. Background and Definitions 17 it is subject to major modification.

) [12]. clients and servers. which constrain the design of the system to support certain qualities [109]. They capture recurring idioms. architectural patterns/styles provide a common design vocabulary (e. . Background and Definitions An architectural pattern is a description of element and relation types together with a set of constraints on how they may be used [5]. Many styles are also equipped with semantic models.18 Chapter 2. Similar to Object-Oriented design patterns [41]. analysis tools and methods that enable style-specific analysis and property checks. pipes and filters.g. etc. The term architectural style is also used for describing the same concept.

3. we present SARAH and illustrate it for analyzing reliability of the software architecture of the next release of a Digital TV.Chapter 3 Scenario-Based Software Architecture Reliability Analysis To select and apply appropriate fault tolerance techniques. FMEA and FTA. In section 3. We conclude the chapter after discussing lessons learned and related work in sections 3. In this chapter. 19 .5. which prioritizes failure scenarios based on user perception and provides an early software reliability analysis of the architecture design. we need to analyze potential system failures and identify architectural elements that cause system failures.4 and 3. we propose the Software Architecture Reliability Analysis Approach (SARAH). The chapter is organized as follows. It is a scenariobased software architecture analysis method that benefits from mature reliability engineering techniques. we introduce background information on scenario-based software architecture analysis. respectively. FMEA and FTA. In the following two sections.

• Individually evaluate indirect scenarios: Scenarios that can be directly supported by the architecture are called direct scenarios. Scenarios that require the redesign of the architecture are called indirect scenarios. The basic activities of SAAM are illustrated with a UML [100] activity diagram in Figure 3. • Define scenarios: Scenarios are developed for stakeholders to illustrate the kinds of activities the system must support (usage scenarios) and the anticipated changes that will be made to the system over time (change scenarios).1. • Describe architectures: The candidate architecture designs are described. practical and a mature method.1 are explained. The required changes for the architecture in case of indirect scenarios are attributed to the fact that the architecture has not been appropriately designed to meet the given requirements. Most of the other scenario-based analysis methods are proposed as extensions to SAAM or in some way they adopt the concepts used in this method [31]. • Classify/Prioritize scenarios: Scenarios are prioritized according to their importance as defined by the stakeholders. which include the systems’ computation/data components and their relationships.g.20 Chapter 3. change scenarios [7]) depending on the quality attributes that they focus on. .e. Software Architecture Analysis Method (SAAM) can be considered as the first scenario-based architecture analysis method. flows) represent transitions between activities. Scenario-Based Software Architecture Reliability Analysis 3. The filled circle is the starting point and the filled circle with a border is the ending point. usage scenarios [19]. which has been validated in various cases studies [19]. Some methods define scenarios just as brief descriptions. It is simple. Different analysis methods use different type of scenarios (e. SAAM activities as depicted in Figure 3. The beginning of parallel activities are denoted with a black bar with one flow going into it and several leaving it.1 Scenario-Based Software Architecture Analysis Scenario-based software architecture analysis methods take as input a model of the architecture design and measure the impact of predefined scenarios on it to identify the potential risks and the sensitivity points of the architecture [31]. while some other methods define them in a more structured way with annotations [19]. For each indirect scenario the required changes to the architecture are listed and the cost for performing these changes is estimated. The rounded rectangles represent activities and arrows (i. In the following.

Semantically close scenarios should interact at the same component. SAAMER [77] on reusability and evolution. it is implicitly assumed that scenarios correspond to the particular quality attributes that need to be analyzed. Scenario interaction measures the extent to which the architecture supports an appropriate separation of concerns. For example.1: SAAM activities [19] • Assess scenario interaction: Determining scenario interaction is a process of identifying scenarios that affect a common set of components. ALMA [7] on modifiability. • Create overall evaluation: Each scenario and the scenario interactions are weighted in terms of their relative importance and this weighting determines an overall ranking. Some methods such as SAEM [33] and ATAM [19] have considered the need for a specific quality model for deriving the . Later. SAAM was originally developed to analyze the modifiability of an architecture [19]. and ASAAM [116] on identifying aspects for increasing maintainability. numerous scenario-based architecture analysis methods have been developed each focusing on a particular quality attribute or attributes [31]. SAAMCS [75] has focused on analyzing complexity of an architecture.Chapter 3. Scenario-Based Software Architecture Reliability Analysis 21 Figure 3. Semantically distinct scenarios that interact point out a wrong decomposition. Hereby. ESAAMI [84] on reuse from existing component libraries. ALPSM [8] on maintainability.

The analysis results are organized by means of a worksheet. its local and global effects (concerning other parts of the product and the environment) and the associated component. System: Date: Compiled by: Approved by: ID 1 2 Car Engine 10-10-2000 J. failure cause.22 Chapter 3. Some components failure . its causes. Green Item ID CE5 … Failure Severity Failure Causes Failure Effects Mode Class fails to Motor shorted Motor overheats V operate and burns … … … … Figure 3. ATAM has also addressed the interactions among multiple quality attributes and trade-off issues emerging from these interactions. Effects and Criticality Analysis (FMECA) extends FMEA with severity and probability assessments of failure occurrence. which uses failure scenarios for analysis.2. In this chapter. The basic operations of the method are i ) to question the ways that each component fails (failure modes) and ii ) to consider the reaction of the system to these failures (effects). the analysis is subjective [97]. A failure effect is the (undesirable) consequence of a failure mode. Failure Modes. its evaluation and improvement. failure mode. 3. Smith D. We define a failure scenario model that is based on the established Failure Modes and Effects Analysis method (FMEA) in the reliability engineering domain as explained in the next subsection. related component. A failure mode is defined as the manner in which the element fails. A failure cause is the possible cause of a failure mode. FMEA and FMECA can be employed for risk assessment and for discovering potential single-point failures. failure id.2 FMEA and FTA Failure Modes and Effects Analysis method (FMEA) [99] is a well-known and mature reliability analysis method for eliciting and evaluating potential risks of a system systematically. A simplified FMECA worksheet template is presented in Figure 3. Scenario-Based Software Architecture Reliability Analysis corresponding scenarios. which comprises information about each failure mode. At the downside. 6 attributes of a failure scenario are identified. we propose a reliability analysis method.2: An example FMECA worksheet based on MIL-STD-1629A [30] In FMECA. Severity is associated with the cost of repair. Systematic analysis increases the insight in the system and the analysis results can be used for guiding the design. failure effect and severity.

3: An example fault tree The top node (i. which defines causal relationships between faults. Scenario-Based Software Architecture Reliability Analysis 23 modes can be overlooked and some information (e. severity) regarding the failure modes can be incorrectly estimated at early design phases.g. . FMEA is usually applied together with Fault Tree Analysis (FTA) [34]. Faults. Once the fault tree is constructed. the tree can be processed in a top-down manner for diagnosis to determine the potential faults that may cause the failure.Chapter 3. OR gates) that infer propagation and contribution of faults to the failure. failure probability. FTA is based on a graphical model. combined effects and coordination failures can also be missed. root) of the fault tree represents the system failure and the leaf nodes represent faults. The nodes of the fault tree are interconnected with logical connectors (e. An example fault tree can be seen in Figure 3. fault tree. Since these techniques focus on individual components at a time. Figure 3. the analysis is effort and time consuming.3. are defined as undesirable system states or events that can lead to a system failure.g. it can be processed in a bottom-up manner to calculate the probability that a failure would take place. This calculation is done based on the probabilities of fault occurrences and interconnections between the faults and the failure [34].e. AND. In addition. Additionally. which are assumed to be provided.

. In SARAH. FTS shows the causal and logical connections among the failure scenarios.24 Chapter 3. This example will be used throughout the remainder of the section. we present this industrial case. 3.3. utilities and modules that control the streaming process. the modules for decoding and processing audio/video signals are not explained here.1 Case Study: Digital TV A conceptual architecture of Digital TV (DTV) is depicted in Figure 3. the prioritization and analysis of failure scenarios are based on user perception [28]. in which a Digital TV architecture is introduced. As such while explaining the approach we also discuss our experience and obstacles in applying the approach. The bottom layer. For brevity. The approach is illustrated using an industrial case for analyzing user-perceived reliability of future releases of Digital TVs. namely the streaming layer. Conventional reliability analysis techniques prioritize failures according to how serious their consequences are with respect to safety. In the following. involves modules taking part in streaming of audio/video information. which will be referred throughout the section. Failure scenarios define potential component failures in the software system and they are used for deriving a fault tree set (FTS). we briefly explain some of the important modules that are part of the architecture. The upper layer consists of applications. SARAH defines the notion of failure scenario model that is inspired from FMEA. In the following subsection. The reliability analysis forms the key input to identify architectural tactics ([4]) for adjusting the architecture and improving its dependability. SARAH results in a failure analysis report that defines the sensitive elements of the architecture and provides information on the type of failures that might frequently happen. To a large extent SARAH integrates the best practices of the conventional and stable reliability analysis techniques with the scenario-based software architecture analysis approaches. The design mainly comprises two layers. SARAH provides another distinguishing property by focusing on user perceived reliability. which forms the last phase in SARAH. Similar to a fault tree in FTA. where the activities of SARAH are explained and illustrated.3 SARAH We propose the Software Architecture Reliability Analysis (SARAH) approach that benefits from both reliability analysis and scenario-based software architecture analysis to provide an early reliability analysis of next product releases. The structure of FTS and related analysis techniques are also adapted accordingly. Scenario-Based Software Architecture Reliability Analysis 3.4. Besides this.

• Audio Controller (AC). Scenario-Based Software Architecture Reliability Analysis 25 # # ! ! " $ % & ' ( ) Figure 3. located at the top middle of the figure. It keeps track of application states. located at the bottom right of the figure.4: Conceptual Architecture of DTV • Application Manager (AMR). user modes and redirects commands/information to specific applications or controllers accordingly.Chapter 3. initiates and controls execution of both resident and downloaded applications in the system. controls .

through keypad or remote control) and sends corresponding commands to AMR. • Communication Manager (CMR). tunes to a specific program based on commands received from AMR. presents and provides navigation of electronic program guide regarding a channel. located at the middle of the figure.2 The Top-Level Process For understanding and predicting quality requirements of the architectural design [4]. located at the bottom right of the figure. • Content Browser (CB). located at the middle of the figure. presents and provides navigation of content residing in a connected external device. identify four important requirements: i ) provide a specification . located at the middle left of the figure. • Last State Manager (LSM). • Electronic Program Guide (EPG). located at the middle of the figure. located at the middle of the figure. controls video features like scaling of the video frames based on commands received from AMR. 3.e. • Conditional Access (CA). located at the bottom left of the figure. located at the middle right of the figure. located at the bottom middle of the figure. • Video Controller (VC). • Teletext (TXT). • Program Manager (PM). • Command Handler (CH). • Graphics Controller (GC). Bachman et al. employs protocols for providing communication with external devices. authorizes information that is presented to the user. bass and treble based on commands received from AMR. is responsible for generation of graphical images corresponding to user interface elements. • Program Installer (PI). located at the top left of the figure. frequency). located at the top left of the figure. searches and registers programs together with channel information (i. keeps track of last state of user preferences such as volume level and selected program. Scenario-Based Software Architecture Reliability Analysis audio features like volume level. handles acquisition. interprets externally received signals (i.e.3.26 Chapter 3. interpretation and presentation of teletext pages.

The steps of SARAH are presented as a UML activity diagram in Figure 3. in the analysis process. The approach consists of three basic processes: i ) Definition ii ) Analysis and iii ) Adjustment. ! Figure 3.5: Activity Diagram of the Software Architecture Analysis Method .Chapter 3. the failure domain model. The focus in SARAH is the specification of the reliability quality attribute. SARAH is in alignment with these key assumptions. the fault trees and the severity values for failures are defined. an architectural level analysis and an architectural element level analysis are performed. Based on this input. In the definition process the architecture. the failure scenarios. The failure analysis report is used in the adjustment process to identify the architectural tactics and adapt the software architecture. ii ) enumerate the architectural decisions to achieve the quality requirements. In the following subsections the main steps of the method will be explained in detail using the industrial case study.5. iii ) couple the architectural decisions to the quality attribute requirements. Scenario-Based Software Architecture Reliability Analysis 27 of the quality attribute requirements. the analysis of the architecture based on this specification and the identification of architectural tactics to adjust the architecture. and iv ) provide the means to compose the architectural decisions into a design. The results are presented in the failure analysis report.

failure mode and failure effect. First the relevant failure domain model is defined. failure scenarios are derived in two steps.3 Software Architecture and Failure Scenario Definition Describe the Architecture Similar to existing software architecture analysis methods SARAH starts with describing the software architecture.28 Chapter 3. error and failure) for a component of the system. . In SARAH. its features.1: Template for Defining Failure Scenarios FID AEID Fault Error Failure A numerical value to identify the failures (i. A failure scenario defines a chain of dependability threats (i. Failure ID) An acronym defining the architectural element for which the failure scenario applies (i.4. The architecture that we analyzed is depicted in Figure 3. Develop Failure Scenarios SARAH is a scenario-based architecture analysis method.e. Architectural Element ID) The cause of the failure defining both the description of the cause and its features Description of the state of the element that leads to the failure together with its features The description of the failure.e. error and failure are used instead of the concepts failure cause. For clarity in SARAH fault. fault. To specify the failure scenarios in a uniform and consistent manner a failure scenario template. then failure scenarios are derived from this failure domain. user/element(s) that are affected by the failure The template is inspired from FMEA [99].1 is adopted for specifying failure scenarios. SARAH defines the concept of failure scenario to analyze the architecture with respect to reliability. Currently. respectively. Table 3.e. the method itself does not presume a particular architectural view [18] to be provided but in our project we have basically applied it to the module view. as defined in Table 3. that is. Scenario-Based Software Architecture Reliability Analysis 3. The following subsections describe these steps in detail. scenarios are the basic means to analyze the architecture. The description includes the architectural elements and their relationships.3.

Figure 3. . To define the space of relevant failures SARAH defines relevant domain model for faults. In fact. Avizienis et al. several researchers have already focused on modeling and classifying failures.. which includes the features type and target. For that reason. errors and failures using a systematic domain analysis process [2]. Hence. the given domain is further scoped by focusing only on the faults. but as we will show in the next sections this does not impact the steps in the analysis method itself. error types or failure modes. The target of a failure defines what is/are affected by the failure.Chapter 3. dimension and persistence. In Figure 3. ii ) caused by other element(s) of the system that interact(s) with the element in consideration or iii ) caused by external entities with respect to the system. failure scenarios are defined per architectural element. provide a nice overview of this related work and provide a comprehensive classification of faults. The failure domain model of Figure 3. it is not trivial to define fault classes.6 has been derived after a thorough domain analysis and in cooperation with the domain experts in the project. The provided domain classification by Avizienis et al. in which their properties are represented by domain models that provide the scope for the project requirements.6(b). which comprise the type of error together with its detectability and reversibility properties.. there is a high risk that several potential and relevant failure scenarios are missed or that other irrelevant failure scenarios are included. Scenario-Based Software Architecture Reliability Analysis Define Relevant Failure Domain Model 29 The failure scenario template can be adopted to derive scenarios in an ad hoc manner using free brainstorming sessions. Figure 3. the relevant features of an error are shown. In Figure 3.6. errors and failures [3]. In SARAH. the target can be the user or other element(s) of the system. defines the derived domain model that is considered relevant for our project. is rather broad. and be transient or persistent. for example.6(c) defines the features for failures. the source of the fault can be either i ) internal to the element in consideration. a feature diagram is presented. These domain models provide a first scoping of the potential scenarios. Therefore. however. The key issue here is that failure scenarios are defined based on the FMEA model. where faults are identified according to their source. Faults could be caused by software or hardware. errors and failures that are considered relevant for the actual project. and one can assume that for a given reliability analysis project not all the potential failures in this overall domain are relevant. In principle. for example. However. In this case.6(a). for different project requirements one may come up with a slightly different domain model.

Scenario-Based Software Architecture Reliability Analysis (a) Feature Diagram of Fault (b) Feature Diagram of Error (c) Feature Diagram of Failure Figure 3.6: Failure Domain Model .30 Chapter 3.

Chapter 3. for example. If the failure type is “presentation quality” then the failure target can only be “user”.reversibility.7 the five elements FID.presentation quality requires Failure. In the fault domain model (feature diagram in Figure 3. Figure 3. Since a scenario is a composition of selection of features from the failure domain model. For example. . 12 × 8 × 24 = 2304 failure scenarios can be defined.type.6 usually mutex and requires predicates are used [66]. which defines the scope of the relevant scenarios.reversible Failure.6 in theory. AEID. a list of nine selected failure scenarios that have been derived for the reliability analysis of the DTV. for example. fault. the features Dimension and Persistence 2 values. error and failure are represented as columns headings.type.7 provides. for example. and 4 × 2 = 8 different failures are captured by the failure domain model. specified as follows: Error. we can state that for the given failure domain model in Figure 3. Similarly. namely Source. In Figure 3. Scenario-Based Software Architecture Reliability Analysis Define Failure Scenarios 31 The domain model defines a system-independent specification of the space of failures that could occur. However.user Once the failure domain model together with the necessary constraints have been specified we can start defining failure scenarios for the architectural elements. from the error domain model we can derive 6 × 2 × 2 = 24 different errors. This means that the fault model captures 3 × 2 × 2 = 12 different faults.too late mutex Error.target.type. The feature Source can have 3 different values.6(a)). in case the error type is “too late” then the error cannot be “reversible”. The given two example constraints are. For each architectural element we check the possible failure scenarios and define an additional description of the specific fault. we can define faults based on three features. To specify such kind of constraints for the given model in Figure 3. not all the combinations instantiated from the failure domain are possible in practice. Failure scenarios are represented in rows. Dimension and Persistence. The number and type of failure scenarios are implicitly defined by the failure domain model. error or failure.

type: wrong value/presentation target: AMR(F1) description: Can not provide data. VC persistence: permanent type: wrong value detectability: detectable reversibility: reversible Failure description: Switching to an undesired mode. type: timing target: CB(F3) F3 description: Can not present content of the connected device. connected device.7: Selected failure scenarios for the analysis of the DTV architecture . CMR source: external dimension: software type: too early/late persistence: permanent detectability: detectable reversibility: irreversible description: Software fault. Scenario-Based Software Architecture Reliability Analysis Fault description: Reception of irrelevant signals. type: wrong path detectability: undetectable reversibility: reversible description: Information can not be acquired from the connected device. not be scaled appropriately. not be presented due to lack source: AMR(F2) of information. type: presentation quality target: user F8 F9 description: Provide inaccurate information. description: Correct scaling source: internal ratio can not be calculated dimension: software from the video signal.32 FID Chapter 3. source: CH(F4) OR AMR CMR(F6) dimension: software persistence: permanent description: Can not acquire information. type: wrong value CMR dimension: software persistence: permanent detectability: undetectable reversibility: reversible description: Reception of description: Scaling out-of-spec signals. type: wrong value/presentation target: AMR(F1) description: Can not provide information. type: timing target: AMR(F2) F4 F5 F6 F7 description: Provide irrelevant information. CH dimension: software type: wrong value persistence: permanent detectability: undetectable reversibility: reversible description: Communication description: Protocol can not be sustained with the mismatch. CB dimension: software type: too early/late persistence: permanent detectability: detectable reversibility: irreversible description: Software fault. description: Signals are source: internal interpreted in a wrong way. type: behavior target: user F1 F2 description: Can not provide information. information can not be source: external interpreted from meta-data. description: Signals are source: internal interpreted in a wrong way. AMR source: CMR(F5) dimension: software persistence: permanent AEID Error description: Working mode is changed when it is not desired. type: too early/late detectability: detectable reversibility: irreversible description: Can not description: Information can acquire information. type: wrong value/presentation target: VP(F8) Figure 3. source: DDI(F7) AND type: wrong value VP VC(F9) detectability: undetectable dimension: software reversibility: reversible persistence: permanent description: Software fault. type: wrong value/presentation target: VP(F8) description: Provide distorted video image. DDI dimension: software type: wrong value persistence: permanent detectability: detectable reversibility: reversible description: Inaccurate description: Video image can scaling ratio information. type: behavior target: user description: Provide irrelevant information.

for instance. in failure scenario F2. the source of F1 is defined as CH(F4) OR CMR(F6) indicating that F1 occurs due to a failure in CH as defined in F4 or due to a failure in CMR as defined in F6.6(a). which is used to explain the domain specific details of the failure scenarios. The column AEID includes acronyms of names of architectural elements to which the identifiers refer. modules). The columns Fault. Error and Failure describe the specific faults. To make all these connections explicit. the architectural elements that will be analyzed will be the nodes. the fault source is defined as CMR(F5).e. where the architectural elements are the implementation units (i. Note that the separate features of the corresponding domain models are represented as keywords in the cells. Derive Fault Tree Set from Failure Scenarios A close analysis of the failure scenarios shows that they are connected to each other. when an architectural element is represented as a first-class entity in the model and it affects the failure behavior. For example. In principle. each column also includes the keyword description. Scenario-Based Software Architecture Reliability Analysis 33 The failure ids represented by FID do not have a specific ordering but are only used to identify the failure scenarios. The type of the architectural elements that are analyzed can vary depending on the architectural view utilized [18]. . errors and failures. For example. FTS is a graph G(V. In this paper. the failure scenario F1 is caused by the failure scenario F4 or F6. Typically. In case of a deployment view. Besides the different features. The source of the fault can be caused by a combination of failures. Fault includes the features source. This is expressed by logical connectives. which are together defined as a Fault Tree Set (FTS). Typically these descriptions are obtained from domain experts. the architectural elements will be components and connectors. For example. in SARAH fault trees are defined. G has the following properties. and likewise these are represented as keywords. we focused on module view of the architecture. E) consisting of the set of fault trees. a given set of failure scenarios leads to a set of fault trees. For component and connector view. then it can be used in SARAH for analysis. For example. The columns Fault and Failure include the fields source and target respectively. indicating that the fault in F2 occurs due to a failure in CMR as defined in F5. dimension and persistence as defined in Figure 3. Both of these refer to failure scenarios of other architectural elements.Chapter 3.

7 we can derive the FTS as depicted in Figure 3. v ∈ V . Finally.8. ∀g ∈ A. A = AAN D ∪ AOR such that. 5. system failures). Fu ⊆ F is the set of failure scenarios comprising failures that are perceived by the user (i. the FTS consists of three fault trees. Scenario-Based Software Architecture Reliability Analysis 1. F7 F9 F2 F4 F6 F5 Figure 3. For the example case.8: Fault trees derived from failure scenarios in Figure 3.7 Here.e. v) where u. 3. based on the failure scenarios in Figure 3. E is the set of directed edges (u. F is the set of failure scenarios each of which is associated with an architectural element. The middle column indicates that failure scenario F3 is caused by F2 which is on its turn caused by F5. outdegree(g) = 1∧ indegree(g) ≥ 1 6. Vertices residing in this set constitute root nodes of fault trees. On the left the fault tree shows that F1 is caused by F4 or F6. .34 Chapter 3. in the last column the fault tree shows that F8 is caused by both F7 and F9. F1 F3 F8 Fu: set of failure scenarios comprising failures that are perceived by the user. V = F ∪ A 2. A is the set of gates representing logical connectives. AAN D is the set of AND gates AOR is the set of OR gates 7. 4.

2.2. Scenario-Based Software Architecture Reliability Analysis Define Severity Values in Fault Tree Set 35 As we have indicated before we are focused on failure analysis in the context of consumer electronics domain. To define user perceived failure severities. Basic user-perceived functions fail.9. As such we need to analyze in particular those failures which have the largest impact on the user perception.Chapter 3. 5. an elaborate user model can be developed based on experiments with subjects from different age groups and education levels [28]. Annoying performance degradation is perceived. We consider the user model as an external input and any such model can be incorporated to the method. In this paper. the severity values for failures F1. Here we can see that F5 has been assigned the severity value 5 indicating that it has a very high impact on the user perception.2: Prioritization of severity categories Severity 1. User perceives serious loss of functions. These different user perceptions on the failures have a clear impact on the way how we process the fault trees. we first assign severity values to the root failures based on their impact on the user. Before computing the failure probabilities of the individual leaf nodes. 3. A failure is perceived but not really annoying. In our case the severity values are based on the prioritization of the failure scenarios as defined in Table 3. The severity degrees range from 1 to 5 and are provided by domain experts. a complete black screen will have a larger impact on the user than a temporary distortion in the image. F3 and F8 are depicted in Figure 3. For example. System locks up and does not respond. In fact. the impact of a failure can differ from user to user. we consider the average user type. Type very low low moderate high very high Annoyance Description User hardly perceives failure. Table 3. 4. . For example.

9 for example. If we take the AND gate at the right hand side of Figure 3. P (v|f ) = P (v ∩ f )/P (f ). Here. This makes sense because P (v|f ) denotes the probability that v will occur. s(F 7) = s(F 8) × P (F 8|F 7) = 4 × P (F 9) and s(F 9) = s(F 8) × P (F 8|F 9) = 4 × P (F 7).(f.9 for example. Equation 3.36 Chapter 3. According to the probability theory. So. .t.2) s(v) × P (v|f ) Equation 3. In this case. P (v ∩ f ) = P (f ). We multiply this value with the severity of v to calculate the severity of f . P (v|F 6) = 1. These values are calculated based on the following equations: ∀u ∈ Fu . s(u) = su ∀f ∈ F ∧ f ∈ Fu . provided that f occurs. s(f ) = / ∀vs. the situation is different. If we take the OR gate at the left hand side of Figure 3. P (v|f ) = P (f )/P (f ) = 1. If we know that F6 occurs. v will definitely occur (the probability is equal to 1).v)∈E (3. to calculate P (v|f ) we need to know P (x) for all x except f .1 defines the assignment of severity values to the root nodes. Scenario-Based Software Architecture Reliability Analysis Figure 3. P (v|F 7) = P (F 7) × P (F 9)/P (F 7) = P (F 9) and similarly P (v|F 7) = P (F 9). including f . P (v|f ) denotes the probability that the occurrence of f will cause the occurrence of v [34].2 defines the calculation of the severity values for the other nodes.1) (3. If v is an OR gate. This is also the case for F8. then the output of v will always occur whenever f occurs. So. s(F 6) = s(F 8) = s(F 1) = 4. Since P (v|f ) = P (v ∩ f )/P (f ). As a result. P (v ∩ f ) = ΠP (x) for all vertices x that is connected to v. If v is an AND gate. That is.9: Fault trees with severity values The severity values of the root nodes are used for determining the severity values of the other nodes of the FTS.

If probabilities are not equal or have different impacts on the severity values then either the second or third strategy can be used. In SARAH all these three strategies can be used depending on the available knowledge on probabilities. 1 ∂ ( P (root))dp sensitivity(node) = 0 ∂p (3. where fault probabilities are varied and their effect on the probability of user perceived failures are observed. • What-if analysis: A range of probability values can be considered. In case the probabilities are not known or they do not have a relevant impact on the final outcome then fixed probability values can be adopted. p). ∀n ∈ F ∧ n = node ∧ n = root. P (root) = f (p . An example is to assume that each failure will have equal weight and likewise the probabilities of individual failures are basically defined by the AND and OR gates. In the second strategy.Chapter 3. In our case we propose the following model (based on [14] and [126]) for estimating the sensitivity with respect to a fault probability even if the probability values are not known. In principle there are three strategies that can be used to determine these required probability values: • Using Fixed values: All probability values can be fixed to a certain value.3) (3.4) . sensitivity analysis is based on cut-sets in a fault tree and the probability values of fault occurrences [34]. node ∈ F. the what-if analysis can be useful if no information is available on an existing system. P (n) = p P (node) = p . and deriving execution probabilities is cumbersome. the historical data can be missing or not accessible. However. Scenario-Based Software Architecture Reliability Analysis 37 To define the final severity values for all the nodes obviously we need to know the probability of each failure. To identify the appropriate strategy a preliminary sensitivity analysis can be performed. this analysis leads to complex formulas and it requires that the probability values are known a priori. • Measurement of actual failure rate: The actual failure rate can be measured based on historical data or execution probabilities of elements that are obtained from scenario-based test runs [126]. The measurement of actual failure rates is usually the most accurate way to define the severity values. However. In conventional FTA. root ∈ Fu .

In this model we use the basic sensitivity analysis approach. On the other hand. if we are interested in the sensitivity of P (F 8) to P (F 9). Additionally. (u. We completed the definition process by deriving the corresponding fault tree set and calculating severity values as explained in section 3. (∂/∂p P (F 8))dp = pdp = p2 /2. P (F 8) = P (F 7) × P (F 9) = p × p. Finally. we skip the sensitivity analysis and assume equal probabilities (i. Based on this assumption. For example. which simplifies the severity assignment formula for AND gates as s(v) × P (v|f ) = s(v) × pindegree(v)−1 . the severity value is 4/2 because F8 has the severity value of 4 and it is connected to an AND gate with indegree = 2. For the example case.5.4 Software Architecture Reliability Analysis In our example case. For the analysis presented here. 3. F1 has the assigned severity value of 4 and this is also assigned to the failures F6 or F8 that are connected to F1 through an OR gate. Scenario-Based Software Architecture Reliability Analysis Equations 3. in which the severity value is derived as the sum of the severity values calculated for all these connections. This gives the rate of change of P (root) with respect to p (P (F 9)).5) (v∈F ∨v∈AOR ) In our analysis.v)∈E∧ s(v) × pindegree(v)−1 v∈AAN D (3.7.4 show the calculation of the sensitivity of probability of a user perceived failure (P (root)) to the probability of a node (P (node)) of the fault tree.t.3. where we change a parameter one at a time and fix the others ([126]). s(f ) = ∀vs.3. we assign p to P (node) and fix the probability values of all the other nodes to p. In case of F7 and F9. A failure scenario can be connected to multiple gates and other failures. P (x) = p for all x). this will yield to ∂/∂p P (F 8) = p. (u. Then. The resulting formulas are simple enough to be computed with spreadsheets. the result of the partial derivation is integrated with respect to p for all possible probability values (range from 0 to 1) to calculate the overall impact. Thus P (root) turns out to be a function of p and p .e. Here. we fixed the value of p to be 0. the result of the integration from 0 to 1 will be 0.5. we use derivation to calculate the rate of change ([14]) and we use integration to take the range of probability values into account. for instance. So.v)∈E∧ s(v) + ∀vs. we defined a total of 37 failure scenarios including the scenarios presented in Figure 3.3 and 3. we take a partial derivative of P (root) with respect to p .3 .38 Chapter 3. the severity calculation for intermediate failures is as shown in the following equation. For our example.t.

Sensitive elements are the elements. error types and the actual sources of their failures (internal or other elements) are identified. To take the severity of failures into account we define the Weighted Percentage of Failures (WPF) as given in Equation 3. The results that were obtained during the definition process are utilized by the analysis process as described in the subsequent sections. we collect the set of failures associated with them and we add up their severity values. Scenario-Based Software Architecture Reliability Analysis 39 (See Appendix A). As a primary and straightforward means of comparison. Perform Architecture-Level Analysis The first step of the analysis process is architecture-level analysis in which we pinpoint sensitive elements of the architecture with respect to reliability. For each element c the value for P F is calculated as follows: # of f ailures associated with c × 100 # of f ailures P Fc = (3. These include not only the failure scenarios caused by internal faults but also the failure scenarios caused by other elements. in the architectural element level analysis. the fault.7) ∀u∈F s. A first analysis of this figure already shows that the Application Manager (AMR) and Teletext (TXT) modules have to cope with a higher number of failures than other modules.10. In the architectural-level analysis step. we consider the percentage of failures (P F ) that are associated with elements.5).6) This means that simply all the number of failures related to an element are summed and divided by the total number of failures (in this case 44). W P Fc = AEID(u)=c ∀u∈F s(u) For each element. Later. we aim at identifying sensitive elements with respect to the most critical failures from the user perception.Chapter 3. In this way the effort that is provided for reliability analysis is scoped with respect to failures that are directly perceivable by the users.3.t. which are associated with the majority of the failure scenarios.7. After averaging this value with respect to all failures and . s(u) × 100 (3. This analysis treats all failures equally. The results are shown in Figure 3. Sensitive elements provide a starting point for considering the relevant fault tolerance techniques and the elements to improve with respect to fault tolerance (see section 3.

. From the project perspective it is not always possible to focus on the total set of possible failures due to the related cost for fault tolerance.00 4.3. Scenario-Based Software Architecture Reliability Analysis 16.11: Weighted percentage of failure scenarios impacting architectural elements Although weighted percentage presents different results compared to the previous one. The elements are then categorized based on the proximity of their WPF values. Weighted Percentage of Failures 30. in SARAH the architectural elements are ordered in ascending order with respect to their WPF values. elements of the same group have WPF values that are close to each other. Here we can see that. The result of the analysis is shown in Figure 3.00 25.10: Percentage of failure scenarios impacting architectural elements calculating the percentage.00 5.00 12.00 20.00 Percentage of Failures 14.00 10.00 6. For this.00 0.00 8.00 15. Accordingly.00 10. To optimize the cost usually one would like to consider the failures that have the largest impact on the system.00 0. group 4 consists of two modules AMR and TXT.40 Chapter 3.00 R DD I EP G TX T G C O AC AM R P I PM C A C M LS VO CH G AP A CB M T Architectural Elements Figure 3. The results of this prioritization and grouping are provided in Table 3. for example. The reason for this is that their WPF values are the highest and close to each other. The sum of their WPF values is 25 + 14 = 39%.11.00 R DD I EP G T TX T G C O AC AM R VO CH G P I PM C A VC VC M AP A CB C M LS V V P P Architectural Elements Figure 3. the Application Manager (AMR) and Teletext(TXT) modules again appears to be very critical. we obtain the WPF value. which also shows the sum of the WPF values of elements for each group.00 2.

CMR AMR. Perform Architectural Element Level Analysis The architectural level analysis provides only an analysis of the impact of failure scenarios on the given system. AO.3 are ordered along the x-axis with respect to the number of elements they include. PM DDI. 100 percentage of elements 80 60 19 40 20 0 group 1 group 2 group 3 WPF group 4 0 38 percentage of architectural elements Figure 3. VC. VP CB. The plotted line represents the WPF value for each group. GC.12: Pareto Chart showing the largest impact of the smallest set of architectural elements In the Pareto chart. T. However. for example. EPG. The chart helps us in this way to focus on the most important set of elements which are associated with the majority of the user perceived failures. CH.Chapter 3. see that group 4 (consisting of two elements) represents 10% of all the elements but has a WPF of 39%. LSM. The y-axis on the left hand shows the percentage values from 0 to 100 and is used for scaling the percentages of architectural elements whereas the y-axis on the right hand side scales WPF values. CA. AP. Scenario-Based Software Architecture Reliability Analysis Table 3. PI.12. VO. for error handling and recovery it is also necessary to define the type of failures that might affect the identified sensitive WPF . the groups of architectural elements shown in Table 3. TXT WPF 28% 18% 15% 39% 41 To highlight the difference in impact of the groups of architectural elements we define a Pareto chart as presented in Figure 3. The percentage of elements that each group includes is depicted with bars. In the figure we can. G.3: Architectural elements grouped based on WPF values Group # 1 2 3 4 Modules AC.

As such.3) had to deal with largest set of failure scenarios. error and failure categories of failure scenarios associated with the element. we group them in accordance with the features presented in Figure 3.3. Scenario-Based Software Architecture Reliability Analysis elements. This is analyzed in the architectural element level analysis in which the features of faults. . Teletext Module has internal faults as much as faults stemming from the other modules.5). respectively. the results obtained for Application Manager and Teletext modules are shown in Figure 3.42 Chapter 3.14. distribution of features reveals characteristics of faults.13 and Figure 3. in the architectural-level analysis it appeared that elements residing in the 4th group (see Table 3. errors and failures that impact an element are determined. errors and failures associated with individual elements of the architecture. Following the derivation of the set of failure scenarios impacting an element. This information is later utilized for architectural adjustment (See section 3. For example.6. in architectural element level analysis. For the example case. Therefore. we see that most of the faults impacting Application Manager Module are caused by other modules. we will focus on members of this group. If we take a look at fault features presented on those figures for instance. This grouping results in the distribution of fault. On the other hand. namely Application Manager and Teletext modules.

error and failure features associated with the Application Manager Module co rru pt io n de ad lo ck wr wr on on g va g ex lu e ec ut ou io n.. qu al ity re ot he rc om po ne nt s us er en t an en t Chapter 3. fr es ou rc to es o ea r ly /la te de te ct ab un le de te ct ab le re ve rs ib le irr ev er sib le 43 . to . Scenario-Based Software Architecture Reliability Analysis Figure 3..13: Fault..percentage of failures 100 90 80 70 60 50 40 30 20 10 0 percentage of errors percentage of faults 100 90 80 70 60 50 40 30 20 10 0 da ta 100 90 80 70 60 50 40 30 20 10 0 tim in g ot he rc om be ha vio ur in te rn al po ne nt s ex te rn al so ftw ar e ha rd wa tra ns i pe rm pr es en ta tio n wr on g va lu e/ pr es .

44 percentage of failures 100 90 80 70 60 50 40 30 20 10 0 100 90 80 70 60 50 40 30 20 10 0 percentage of errors percentage of faults 100 90 80 70 60 50 40 30 20 10 0 da ta tim in g ot he rc om in te rn al po ne nt s ex te rn al so ftw ar e ha rd wa tra ns i pe rm pr es en ta tio n ur wr on g va lu e/ pr es .. qu al ity be ha vio re ot he rc om po ne nt s us er en t an en t Figure 3. to .14: Fault.. Scenario-Based Software Architecture Reliability Analysis . error and failure features associated with Teletext Module co rru pt io n de ad lo ck wr wr on on g va g ex lu e ec ut ou io n. fr es ou rc to es o ea r ly /la te de te ct ab un le de te ct ab le re ve rs ib le irr ev er sib le Chapter 3..

errors and failures which include features of interest to the project. 2. 3. respectively.4 and 3. error and failure features for all failure scenarios as depicted in Figure 3. This is explained in the next section. Scenario-Based Software Architecture Reliability Analysis Provide Failure Analysis Report 45 SARAH defines a detailed description of the fault tree sets. 5. 4. which are in accordance with the steps of SARAH. Table 3.4. The fifth section depicts the fault tree set generated from the failure scenarios together with the severity values assigned to each. 7. The fourth section contains list of failure scenarios annotated based on this domain model.15. The third section presents the domain model of faults.3. The first section describes the project context. Finally.4: Sections of the failure analysis report. Additionally.Chapter 3. the failure scenarios.4. 6. The sixth and seventh sections include analysis results as presented in sections 3. Introduction Software Architecture Failure Domain Model Failure Scenarios Fault Tree Set Architecture Level Analysis Architectural Element Level Analysis Architectural Tactics . cost-effectiveness). information sources and specific considerations (e. the sixth section includes the distribution of fault.3. 8. These are described in the failure analysis report that summarizes the previous analysis results and provides hints for recovery. Sections comprised by the failure analysis report are listed in Table 3. the report includes a section on first hints for architectural recovery as titled architectural tactics [4]. the architectural level analysis and the architectural element level analysis.g. 1. The second section describes the software architecture.

error and failures for all failure scenarios /la te te ct ab un le de te ct ab le re ve rs ib irr le ev er si bl e Chapter 3. ce s ur es o ea rly lu fr va ou to to o de pr es en ta tio n wr on g va lu e/ pr es .15: Feature distribution of fault. qu al ity w ro ng ad lo c ex te rn al so ftw ar e ha rd wa tra ns i pe rm re ot he rc om po ne nt s us er en t an en t Figure 3.46 percentage of failures 100 90 80 70 60 50 40 30 20 10 0 percentage of errors percentage of faults 100 90 80 70 60 50 40 30 20 10 0 da ta co rru 100 90 80 70 60 50 40 30 20 10 0 tim in g de ur pt io n k e ot he rc om in te rn al po ne nt s be ha vio w ro ng ex ec ut i .. Scenario-Based Software Architecture Reliability Analysis ...

AC. PM AMR. GC.5 Architectural Adjustment The failure analysis report that is defined during the reliability analysis forms an important input to the architectural adjustment. CB. TXT. AP. LSM AC. AO. Table 3. PM. DDI AMR. VO AMR. This requires the following three steps: i ) defining the elements to which the failure relates ii ) identifying the architectural tactics and iii ) application of the architectural tactics. AP. The previous steps in SARAH result in a prioritization of the most sensitive elements in the architecture together with the corresponding failures . PM. VP AMR. CH. PI. T. G AMR. CA DDI. DDI. CMR. TXT. AO. PM AMR. VC. Hereby the architecture will be enhanced to cope with the identified failures. CA. LSM. CA. CMR AMR AMR. DDI AMR. DDI GC AMR. In SARAH we apply the concept of architectural tactics to derive architectural design decisions for supporting reliability. CB AMR. T CA. VC Define Architectural Element Spots Architectural tactics have been introduced as a characterization of architectural decisions that are used to support a desired quality attribute [4]. LSM. CA AMR. EPG. EPG. AEID AMR AC AP AO CA CB CH CMR DDI EPG G GC LSM PI PM T TXT VC VO VP Architectural Element Spot AC. VC AMR.3. LSMR. Scenario-Based Software Architecture Reliability Analysis 47 3.5: Architectural element spots. VO AMR. VP AMR.Chapter 3. PM.

g. we proceed with identifying the architectural tactics for enhancing dependability. DDI. each fault tolerance technique is tagged with error (and fault) types that it aims to . EPG.2. As discussed in section 2. This is because local treatments of individual elements might directly impact the dependent elements. Table 3. Each technique provides a solution for a particular set of problems.16. for applying a recovery technique to an element we need also to consider the other elements coupled with it. fault prevention and fault tolerance (See Figure 3.5) and FTS. These undependable elements can be traced with the help of the architectural element spot (section 3. we might need to consider applying fault tolerance techniques to these undependable elements instead. the element spots and the potential set of failure scenarios. For this reason. N-version programming [81]) on the element to tolerate such faults. CH. AO. we particularly focus on fault tolerance techniques. In that case. fault removal. dependability means [3] are categorized as fault forecasting. LSM. The coupling can be statically defined by analyzing the relations among the elements. Thus. CB. In Figure 3. this means that the failures are caused by the other elements that are undependable. TXT. if an element comprises faults that are internal and permanent. check-pointing and roll-back recovery [58]. A broad set of fault tolerance techniques are available for this purpose such as exception handling. if the source of the fault is external.5 shows architectural element spots for each element in the example case.48 Chapter 3. To make this explicit.1. CMR. On the other hand. PM.3. VC and VO. This draws the boundaries of the design fragment [4] to which the design decisions will be applied. If the source of the fault is internal this means that the fault tolerance techniques should be applied to the sensitive element itself. After identifying the sensitive elements we analyze the fault and error features related to these elements (Figures 3. recovery blocks. These elements should be considered while incorporating mechanisms to AMR or any refactoring action that includes alteration of AMR and its interactions.5 shows the elements that are in the architectural element spot of AMR as AC. Faults can be either internal or external to the sensitive element. we might consider applying design diversity (e.16). we define architectural element spot for a given element as the set of elements with which it interacts.13 and 3. In general. GC. restarting.14). In SARAH. Identify Architectural Tactics Once we have identified the sensitive elements. PI. SARAH prioritizes actually the design fragments [4] to which the specific tactics will be applied and to improve dependability. some examples for fault tolerance techniques have been given. Scenario-Based Software Architecture Reliability Analysis that might occur. However. Some elements can tolerate external faults originating from other undependable elements. Table 3. For example.

On the other hand. Additionally. The possible set of architectural tactics is determined as described in the previous section. other criteria need to be considered including cost and limitations imposed by the system (resource constraints).6 shows. selected architectural tactics should be applied to adjust the architecture if necessary. in Figure 3. As an example. the potential fault tolerance techniques that can be applied for AMR and TXT modules.16: Partial categorization of dependability means with corresponding fault and error features detect and/or recover.16 shows only a few fault tolerance techniques as an example. Derivation of all fault tolerance tactics is out-of-scope of this work. Scenario-Based Software Architecture Reliability Analysis 49 & & ! ! " $ % ! # Figure 3.Chapter 3. For the given case. N-version programming [81] can be used for compensating wrong value errors that are caused by internal and persistent faults. watchdog [58] can be applied to detect deadlock errors for tolerating faults resulting in such errors.16. Apply Architectural Tactics As the last step of architectural adjustment. Figure 3. Table 3. for example. The last column of the table shows the selected candidate .

resource checks TXT techniques. we evaluate a local recovery technique and its impact on software decomposition with respect to performance and availability. The event interpretation specification is built into the monitoring system. resource checks.6: Treatment approaches for sensitive components AEID AMR Potential Means on-line monitoring. interface checks on-line monitoring. which can be achieved by partially or fully restarting the observed system and it is executed when a wrong value is detected. interface checks Selected Means on-line monitoring. In chapter 5. checkpoint & restart.50 Chapter 3. Schroeder defines in [108] basic concepts of on-line monitoring and explains its characteristics. in [74] software fault tolerance techniques that are based on diversity (e. AMR module) and they are always enabled.g. For instance. watchdog. resource checks. recovery blocks. n-version programming. action). N-version programming) are evaluated with respect to dependability and cost.e. a monitoring system is an external observer that monitors a fully functioning application. Action specification is a recovery attempt. Scenario-Based Software Architecture Reliability Analysis Table 3. Based on the provided classification scheme [108] (purpose. on-line monitoring is selected as a candidate error detection technique to be applied to AMR module. Sensors are the observers of a set of values represented in the element (i. Accordingly. availability. For example. The realization of an architectural tactic and measuring its properties (reengineering and performance overhead. we can specify an error detection mechanism as follows: The purpose of the mechanism is error detection. watchdog. sensors. interface checks checkpoint & restart. . requirements. event interpretation. performability) for a system requires dedicated analysis and additional research. Sensing is based on sampling. Each architectural tactic has a set of characteristics. optional and alternative features and constraints. checkpoint & restart.

Basically we have looked at all the elements of the architecture. For the next releases of the product this information can be used in deciding for the dependability means to be used in the architecture. Actually. .3. Digital TV) in a product line. we did had access to the next release implementation of the Digital TV the reliability analysis with SARAH still provided useful insight in the critical failures and elements of the current architecture. To provide a more systematic and stable approach we have done a thorough analysis on failures and defined a fault domain model that represents essentially the reliability quality attribute.Chapter 3. During the whole process we were supported by TV domain experts. Early reliability analysis Similar to other software architecture analysis methods. Without such an analysis it would be hard to denote the critical modules that have the risk to fail. Scenario-Based Software Architecture Reliability Analysis 51 3. A number of lessons can be learned from this study. we were able to identify the critical modules Application Manager. in the initial stages of the project we first tried to directly collect failure scenarios by interviewing several stakeholders. Utilizing a quality model for deriving scenarios The use of an explicit domain model for failures has clearly several benefits.3. based on our experiences with SARAH we can state that early analysis of quality at the software architecture design level has a practical and important benefit [31].4 Discussion The method and its application provide new insight in the scenario-based architecture analysis methods. This model does not only provide systematic means for deriving failure scenarios but also defined the stopping criteria for defining failure scenarios. For example. checked the failure domain and expressed the failure scenarios using the failure scenario template that we have presented in section 3. their causes and their effects. Teletext and also got an insight in the failures.g. In our case the early analysis relates to the analysis of the next release of a system (e. Although. In our experience this has clear limitations because i ) the set of failure scenarios for a given architecture is in theory too large and ii ) even for the experts in the project it was hard to provide failure scenarios.

To handle this inherent problem in a satisfactory manner SARAH guides the scenario elicitation by using the relevant failure domain model as described in section 3.3. In case more knowledge on probabilities is known in the project the analysis will be more accurate accordingly.3. Calculation of probability values of failures One of the key issues in the reliability analysis is the definition of the probability values of the individual failures.3 we have described three wellknown strategies that can be adopted to define the probability values.3.3.3 and the use of failure scenario template in section 3.3. This requirement had a direct impact on the way how we proceeded with the reliability analysis.3 SARAH does not adopt a particular strategy and can be used with any of these strategies.52 Chapter 3. In principle. but the independent calculation of the severities for intermediate failures is defined by the method itself as defined in section 3. We have illustrated the approach using fixed values. In section 3. the set of selected failure scenarios and values assigned to their attributes (severity) directly affect the results of the analysis. A key requirement of the industrial case was to provide a reliability analysis that takes the user-perception as the primary criteria. it meant that all the failures that could be directly or indirectly perceived by the end-user had to be prioritized before the other failures.3. SARAH provides a framework for reliability analysis and the method could be easily adapted to highlight other types of properties of failures. from a broader sense. In fact. Inherent dependence on domain knowledge Obviously. In our analysis this was realized by weighing the failures based on their severities from the user-perspective. The initial assignment of severity values for the user-perceived failures is defined by the domain experts. . Scenario-Based Software Architecture Reliability Analysis Impact of project requirements and constraints From the industrial project perspective it was not sufficient to just define a failure domain model and derive the scenarios from it. As described in section 3. As it is the case with other scenario-based analysis methods both the failure scenario elicitation and the prioritization are dependent on subjective evaluations of the domain experts. the focus on user-perceived failures could just be considered an example.3.

We have modeled a limited set of software fault tolerance techniques and defined the relation of these techniques with the faults and errors in the failure domain model that had been defined before. The concept of architectural tactics as an appropriate architectural design decision to enhance a particular quality attribute of the architecture remains of course the same. The difference of FTS from a single fault tree is that an intermediate failure can participate in multiple fault trees and there exists multiple root nodes each of which represents a system failure perceived by the user. matching these with general scenarios and deriving quality attribute frameworks [4]. derive architectural tactics by first listing concrete scenarios. Only the probability values are propagated to calculate the risk associated with the root node. . In our analysis. which is basically a set of fault trees. The fault tree can be also processed in a top-down manner to find the root causes of this failure. Bachman et al. The system or system state can be represented by a single fault tree where the root node of the tree represents the failure of the system. the severity of the root node is not distributed to the intermediate nodes. fault trees have been used for a long time in order to calculate the probability that a system fails [34]. Scenario-Based Software Architecture Reliability Analysis Extending the Fault Tree concept 53 In the reliability engineering community. In this context. One of the contributions of this paper from the reliability engineering perspective is the Fault Tree Set (FTS) concept. This information is derived from the probabilities of fault occurrences by processing the tree in a bottom-up manner. Architectural Tactics for fault tolerance After the identification of the sensitive architectural elements. Our analysis based on FTS is also different from the traditional usage of FTA.g. In a sense this could be considered as a bottom-up approach. the related element spots and the corresponding failures in SARAH architectural tactics are defined. This refinement enables us to discriminate different types of system failures (e. failure means a crash-down in which no more functionality can be provided further. In the current literature on software architecture design this is not explicit and basically we had to rely on the techniques that are defined by the reliability engineering community. In SARAH the approach is more top-down because it essentially starts at the domain models of failures and fault tolerance techniques and then derives the appropriate matching. based on severity) and infer what kind of system failures that an intermediate failure can lead to. we assign severity values to the intermediate nodes of FTS based on to what extend they influence the occurrence of root nodes and the severity of these nodes.Chapter 3. In FTA.

Also. The method is applied together with the Jacobson’s method [100]. efforts for applying reliability analysis to software [62] mainly focus on the safety-critical systems. where the systems are usually not safety-critical. These templates are composed according to the control flow of the program to derive a fault tree for the whole software.5 Related Work FMEA has been widely used in various industries such as automotive and aerospace. where the methodology is adapted and extended accordingly. Scenario-Based Software Architecture Reliability Analysis 3. In SFMEA. the analysis is applied at the source code level rather than the software architecture design level. In [76]. new failure taxonomy. In our case. SARAH separates fault. FMECA [30] incorporates severity and probability assessments for faults. whose failure may have very serious consequences such as loss of human life and large-scale environmental damage. failure modes for software components are identified such as computational. Note that the severity definition in our method is different from the one that is employed by FMECA. Both FMEA and FTA have been employed for the analysis of software systems and named as Software Failure Modes and Effects Analysis (SFMEA) and Software Fault Tree Analysis (SFTA). Almost all reliability analysis techniques have primarily devised to analyze failures in hardware components. We take a user-centric approach and define the severities for failures based on their impact on the user. denial of service. This classification resembles the failure domain model of SARAH. Hereby. respectively. which identifies .54 Chapter 3. a severity is associated with every fault to distinguish them based on the cost of repair. For instance. However. logic and data I/O. we focus on consumer electronics domain. error and failure concepts and provides a more detailed categorization for each. Application of FMEA to software has a long history [98]. financial loss. intrusions and various failure effects (data loss. In general. Failure Modes. note that the failure domain model can vary depending on the project requirements and the system. Effects and Criticality Analysis (FMECA) is a well-known method that is built upon FMEA. However. The probabilities of occurrence of faults (assumed to be known) are utilized together with the fault tree in order to calculate the probability that the system will fail. SFTA is used for the safety verification of software. etc. As an extension to FMEA. FMEA has been used in other domains as well. On the other hand. Utilization of FMEA is also proposed in [128] for early robustness analysis of Web-based systems. These techniques have been extended and adapted to be used for the analysis of software systems. It has also been extended to make it applicable to the other domains or to achieve more analysis results. To use FMEA for analyzing the dependability of Web Services.) are taken into account in [47]. a set of failure-mode templates outlines the failure modes of programming language elements like assignments and conditional statements.

The semantics of the transformation is captured in the ”type” tags of failure scenarios.source. This facilitates the automatic derivation of the propagation of the failures throughout the system. can be the user (i. In [47]. ii ) entity objects that are objects from the domain and iii ) control objects that serve as a glue between boundary objects and entity objects. . actor) or the other components.Chapter 3.target). FPTC is used for specifying the failure behavior of each component (i. which also focuses on the analysis at the early stages of design. Further. how a component introduces or transforms a failure type). In this work. FMEA tables are being integrated with a web service deployment architecture so that they can be dynamically updated by the system. In our method. On the other hand. An Advanced Failure Modes and Effect Analysis (AFMEA) is introduced in [39]. there is a body of work focusing on tool-support for FMEA and FTA. we represent the propagation of failure scenarios with fault trees. In our method.e. The result of the fault tree synthesis is a network of interconnected fault trees. where a failure mode can contribute to more than one system failure. In [90]. which is analogous to the fault tree set in our method. we do not presume any specific decomposition of the software architecture and we do not categorize objects or components. fault trees are synthesized automatically. a modular representation called Fault Propagation and Transformation Calculus (FPTC) is introduced.e. However. for example. for instance. In [124]. Scenario-Based Software Architecture Reliability Analysis 55 three types of objects in a system: i ) boundary objects that communicate with actors in a use-case. we made use of spreadsheets that define the failure scenarios and automatically calculate the severity values in the fault trees (after initial assignment of the user-perceived failure severities). the aim of this work is to enhance FMEA in terms of the number and range of failure modes captured. we categorize failure scenarios based on the failure domain model and each failure scenario is associated with a component. Jacobson’s classification [100] is aligned with our failure domain model with respect to the propagation of failures (fault. This is achieved by constructing a behavior model for the system. The target feature of failure. multiple failures are taken into account. failure. This is a straightforward calculation and as such we have not elaborated on this issue. However.

56

Chapter 3. Scenario-Based Software Architecture Reliability Analysis

3.6

Conclusions

In this chapter, we have introduced the software architecture reliability analysis method (SARAH). SARAH utilizes and integrates mature reliability engineering techniques and scenario-based architectural analysis methods. The concept of fault, error failure models, the failure scenarios, fault tree set and the severity calculations are inspired from the reliability engineering domain [3]. The overall scenario-based elicitation and prioritization is derived from the work on software architecture analysis methods [31]. SARAH focuses on the reliability quality attribute and uses failure scenarios for analysis. Failure scenarios are prioritized based on the user perception. We have illustrated the steps of SARAH for analyzing reliability of a DTV software architecture. SARAH has helped us to identify the sensitivity points and architectural tactics for the enhancement of the architecture with respect to fault tolerance. The basic limitation of the approach is the inherent subjectivity of the analysis results. SARAH provides systematic means to evaluate potential risks of the system and identify possible enhancements. However, the analysis results at the end depends on the failure scenarios that are defined. We use scenarios because they are practical and integrate good with the current methods. However, the correctness and completeness of these scenarios cannot be guaranteed in absolute sense because scenarios are subjective. Another critical issue is the definition of the probability values of failures in the analysis. However, just like existing reliability analysis approaches in the literature the accuracy of the analysis depends on the available knowledge in the project. SARAH does not adopt a fixed strategy but can be rather considered as a general process in which different strategies for defining probability values can be used.

Chapter 4 Software Architecture Recovery Style
The software architecture of a system is usually documented with multiple views. Each view helps to capture a particular aspect of the system and abstract away from the others. The current practice for representing architectural views focus mainly on functional properties of systems and they are limited to represent faulttolerant aspects of a system. In this chapter, we introduce the recovery style and its specialization, the local recovery style, which are used for documenting a local recovery design. The chapter is organized as follows. In section 4.1, we provide background information on documenting sofware architectures with multiple views. In section 4.2, we define the problem statement. Section 4.3 illustrates the problem with a case study, where we introduce local recovery to the open-source media player, MPlayer. This case study will be used in the following two chapters as well. In sections 4.4 and 4.5, we introduce the recovery style and the local recovery style, respectively. In section 4.6, we illustrate the usage of the recovery style for documenting recovery design alternatives of MPlayer software. We conclude this chapter after discussing alternatives, extensions and related work. The usage of the recovery style for supporting analysis and implementation are discussed in the following two chapters.

57

58

Chapter 4. Software Architecture Recovery Style

4.1

Software Architecture Views and Models

Due to the complexity of current software systems, it is not possible to develop a model of the architecture that captures all aspects of the system. A software architecture cannot be described in a simple one-dimensional fashion [18]. For this reason, as explained in section 2.2.1, architecture descriptions are organized around a set of views. An architectural view is a representation of a set of system elements and relations associated with them to support a particular concern [18]. Different views of the architecture support different goals and uses, often for different stakeholders. In the literature initially some authors have prescribed a fixed set of views to document the architecture. For example, the Rational’s Unified Process [70], which is based on Kruchten’s 4+1 view approach [69] introduces the logical view, development view, process view and physical view. Another example is the Siemens Four Views model [56] that uses conceptual view, module view, execution view and code view to document the architecture. Because of the different concerns that need to be addressed for different systems, the current trend recognizes that the set of views should not be fixed but different sets of views might be introduced instead. For this reason, the IEEE 1471 standard [78] does not commit to any particular set of views although it takes a multi-view approach for architectural description. IEEE 1471 indicates in an abstract sense that an architecture description consists of a set of views, each of which conforms to a viewpoint [78] associated with the various concerns of the stakeholders. Viewpoints basically represent the conventions for constructing and using a view [78] (See Figure 4.1).

Figure 4.1: Viewpoint, view, viewtype and style The Views and Beyond (V&B) approach as proposed by Clements et al. is another

Chapter 4. Software Architecture Recovery Style

59

multi-view approach [18] that appears to be more specific with respect to the viewpoints. To bring order to the proposed views in the literature, the V&B approach holds that a system can be generally viewed from so-called three different viewtypes: module viewtype, component & connector viewtype and allocation viewtype. Each viewtype can be further specialized or constrained in so-called architectural styles (See Figure 4.1). For example, layered style, pipe-and-filter style and deployment style are specializations of the module viewtype, component & connector viewtype and allocation viewtype, respectively [18]. The notion of styles makes the V&B approach adaptable since the architect may in principle define any style needed.

4.2

The Need for a Quality-Based View

Certainly, existing multi-view approaches are important for representing the structure and functionality of the system and are necessary to document the architecture systematically. Yet, an analysis of the existing multi-view approaches reveals that they still appear to be incomplete when considering quality properties. The IEEE 1471 standard is intentionally not specific with respect to concerns that can be addressed by views. Thus, each quality property can be seen as a concern. In the V&B approach quality concerns appear to be implicit in the different views but no specific style has been proposed for this yet. One could argue that software architecture analysis approaches have been introduced for addressing quality properties. The difficulty here is that these approaches usually apply a separate quality model, such as queuing networks or process algebra, to analyze the quality properties. Although these models are useful to obtain precise calculations, they do not depict the decomposition of the architecture. They need to abstract away from the actual decomposition of the system to support the required analysis. As a complementary approach, preferably an architectural view is required to model the decomposition of the architecture based on the required quality concern. One of the key objectives of the TRADER project [120] is to develop techniques for analyzing recovery at the architecture design level. Hereby, modules in a DTV can be decomposed in various ways and each alternative decomposition might lead to different recovery properties. To represent the functionality of the system we have developed different architectural views including module view, component & connector view and deployment view. None of these views however directly shows the decomposition of the architecture based on recovery concern. On the other hand, using separate quality models such as fault trees helps to provide a thorough analysis but is separate from the architectural decomposition. A practical and easyto-use approach coherent with the existing multi-view approaches was required to understand the system from a recovery point of view.

60

Chapter 4. Software Architecture Recovery Style

In this chapter, we introduce the recovery style as an explicit style for depicting the structure of the architecture from a recovery point of view. The recovery style is a specialization of the module viewtype in the V&B approach. Similar to the module viewtype, component viewtype and allocation viewtype, which define units of decomposition of the architecture, the recovery style also provides a view of the architecture. Unlike quality models that are required by conventional analysis techniques, recovery views directly represent the decomposition of the architecture and as such help to understand the structure of the system related to the recovery concern. The recovery style considers recoverable units as first class elements, which represent the units of isolation, error containment and recovery control. Each recoverable unit comprises one or more modules. The dependency relation in module viewtype is refined to represent basic relations for coordination and application of recovery actions. As a further specialization of the recovery style, the local recovery style is provided, which is used for documenting a local recovery design in more detail. The usage of the recovery style is illustrated by introducing local recovery to the open-source software, MPlayer [85].

4.3

Case Study: MPlayer

We have applied local recovery to an open-source software, MPlayer [85]. MPlayer is a media player, which supports many input formats, codecs and output drivers. It embodies approximately 700K lines of code and it is available under the GNU General Public License. In our case study, we have used version v1.0rc1 of this software that is compiled on Linux Platform (Ubuntu version 7.04). Figure 4.2 presents a simplified module view [18] of the MPlayer software architecture with basic implementation units and direct dependencies among them. In the following, we briefly explain the important modules that are shown in this view. • Stream reads the input media by bytes, or blocks and provides buffering, seek and skip functions. • Demuxer demultiplexes (separates) the input to audio and video channels, and reads them from buffered packages. • Mplayer connects the other modules, and maintains the synchronization of audio and video. • Libmpcodecs embodies the set of available codecs. • Libvo displays video frames.

Chapter 4. Software Architecture Recovery Style

61

Figure 4.2: A Simplified Module View of the MPlayer Software Architecture • Libao controls the playing of audio. • Gui provides the graphical user interface of MPlayer. We have derived the main modules of MPlayer from the package structure of its source code (e.g. Source files that are related to the graphical user interface are collected under the folder ./Gui).

4.3.1

Refactoring MPlayer for Recovery

In principle, fault tolerance should be considered early in the software development life cycle. Relevant fault classes, possible fault tolerance provisions should be carefully analyzed [95] and the software system must be structured accordingly [94]. However, there exist many software systems that have been already developed without fault tolerance in mind. Some of these systems comprise millions lines of code and it is not possible to develop these systems from scratch within time constraints. We have to refactor and possibly restructure them to incorporate fault tolerance mechanisms. In general, software systems are composed of a set of modules. These modules provide a set of functionalities each of which has a different importance from the

and in general will not be. the control of the communication during recovery and the coordination of recovery actions. aligned one-to-one with the module decomposition of the system. Software Architecture Recovery Style users’ point of view [28]. they should take part in the architecture description of the system. in which only the erroneous parts of the system are recovered. while the other units are operational. Note that the system is already decomposed into a set of modules. We have introduced a local recovery mechanism to MPlayer to make it fault-tolerant against transient faults. Effective communication is one of the fundamental purposes of an architectural description that comprise several views. Architecture description of the system should comprise information not only regarding the functional decomposition but also decomposition for recovery to support such analysis. Local recovery is an effective fault tolerance technique to attain high system availability. it leads to a performance overhead due to isolation. the way that the information regarding error detection and diagnosis is conveyed.2 Documenting the Recovery Design Designing a system with recovery or refactoring it to introduce recovery requires the consideration of several design choices related to: the types of errors that will be recovered. For example. the system has to be decomposed into several units that are isolated from each other. error containment boundaries. architectural elements that embody these mechanisms and the way that they interact with the rest of the system. A goal of fault tolerance mechanisms should be to make important functions available to the user as much as possible.3. To be able to communicate these design choices. Decomposition for local recovery is not necessarily. As one of the requirements of local recovery. diagnosis. For example. However. splitting the system for local recovery on one hand increases availability of the system during recovery but on the other hand.62 Chapter 4. . Architecture description should also enable analysis of recovery design alternatives. the audio/video streaming in a DTV should preferably not be hampered by a fault in a module that is not directly related to the streaming functionality. Multiple modules can be wrapped in a unit and recovered together. viewpoints that capture functional aspects of the system are limited for representing recovery mechanisms and it is inappropriate for understandability to populate these views with elements and complex relations related to recovery. We had to decompose it into several units. mechanisms for error detection. The communication between multiple units had to be controlled and recovery actions had to be coordinated so that the units can be recovered independently and transparently. This leads to a trade-off analysis. 4. The other parts can remain available during recovery.

In Chapter 6.e.e. functional importance) and reliability. types of errors that can be detected. Properties of NRU: types of errors that can be detected (i.Chapter 4. we introduce the recovery and the local recovery styles. Architectural description related to recovery is also required for providing a roadmap for the implementation and supporting the detailed design. . In this chapter.4 Recovery Style The key elements and relations of the recovery style are listed below. we present analysis techniques that can be performed based on the recovery style. exception handling. monitoring capabilities). In Chapter 5. In short.3 that can be used for documenting a recovery view. non-recoverable unit (NRU) • Relations: applies-recovery-action-to. we need a architectural view for recovery to i ) communicate recovery design decisions ii ) analyze design alternatives iii ) support the detailed design. the style is used for supporting the detailed design and realization. For this purpose.). 4. Figure 4.3: Recovery style notation • Elements: recoverable unit (RU). we describe these styles and illustrate their usage for documenting the recovery design alternatives of MPlayer. Software Architecture Recovery Style 63 Realizing a recovery mechanism usually requires dedicated architectural elements and relations that have system-wide impact. supported recovery actions. etc. type of isolation (process. A visual notation based on UML [104] is shown in Figure 4. conveys-information-to • Properties of elements: Properties of RU: set of system modules together with their criticality (i.

etc. A RU is a unit of recovery in the system. diagnosis results or a chosen recovery strategy. A RU has the ability to recover independently from other RU and NRUs it is connected to. Software Architecture Recovery Style • Properties of relations: Type of communication (synchronous / asynchronous) and timing constraints if there are any. Conveyed information can be related to detected errors. Figure 4. kills. . we present the local recovery style in which the elements and relations of the recovery style are further specialized. roll-back. communication manager • Relations: restarts. kill. RU and NRU. The relation conveys-information-to is about communication between elements for guiding and coordinating the recovery actions. It provides interfaces for conveying information about detected errors and responding to triggered recovery actions. The relation applies-recovery-action-to is about the control imposed to a RU and it affects the functioning of the target RU (suspend. This specialization is introduced in particular to document the application of our local recovery framework (FLORA) to MPlayer as presented in Chapter 6.4 provides a notation for specialized elements and relations based on UML.) 4. recovery manager. • Properties of relations: notifies-error-to relation has a type property to indicate the error type if multiple types of errors can be detected. sends-queued messageto • Properties of elements: Each RU has a name property in addition to those defined in the recovery style. on the other hand. It can only be recovered together with all other connected elements. The recovery style introduces two types of elements.64 Chapter 4. restart. • Topology: The target of a applies-recovery-action-to relation can only be a RU. notifies-error-to. which wraps a set of modules and isolates them from the rest of the system. can not be recovered independent from other RU and NRUs. A NRU.5 Local Recovery Style In the following. • Elements: RU. This can happen in the case of global recovery or a recovery mechanism at a higher level in the case of a hierarchic decomposition. provides-diagnosis-to.

. Software Architecture Recovery Style 65 Figure 4.Chapter 4. Figure 4.4: Local recovery style notation • Topology: restarts and kills relations can be only from a recovery manager to a RU.5 depicts a simple instance of the local recovery style with one communication manager. Recovery manager restarts A and/or B. being restarted) and sent (retried) after A becomes available. Errors are detected by the communication manager and notified to the recovery manager. Communication manager connects RUs together. The messages that are sent from the communication manager to A are stored in a queue by the communication manager while A is unavailable (i. one recovery manager and two RUs.e. A and B. routes messages and informs connected elements about the availability of another connected element. sends-queued message-to can be between a RU and a communication manager. Recovery manager applies recovery actions on RUs.

This would basically lead to a global recovery since all the modules are encapsulated in a single RU. we can also see two new architectural elements that are not recoverable. a communication manager that mediates and controls the communication between the RUs and a recovery manager that applies the recovery actions on RUs. We use the local recovery style in Chapter 6. Figure 4. One such alternative is to introduce a local recovery mechanism comprising 3 RUs as listed below. We could consider many other design alternatives.7(c). One alternative would be to have only 1 RU (See Figure 4.66 Chapter 4. we use the recovery style notation to represent all the recovery design alternatives.6 depicts the boundaries of these RUs. We could also have more than 3 recoverable units as shown in Figure 4. Note that Figure 4.4) 4.7(a)). In this chapter. Here.7(b) shows just one possible design for the recovery mechanism to be introduced to the MPlayer.5: A simple recovery view based on the local recovery style (KEY: Figure 4.3) to choose and realize a recovery design among several design alternatives.7(b) the recovery design corresponding to this RU selection is shown1 . • RU AUDIO: provides the functionality of Libao • RU GUI : encapsulates the Gui functionality • RU MPCORE : comprises the rest of the system.6 Using the Recovery Style Our aim is to use the recovery style for MPlayer (Section 4. to represent local recovery designs in more detail. Software Architecture Recovery Style Figure 4. 1 . which are overlayed on the module view of the MPlayer software architecture. In Figure 4.

3 Chapter 5 presents an analysis method together with a tool-set for optimizing the decomposition of software architecture into RUs. The names of the RUs are not significant and they are assigned based on the comprised modules. These elements could be recoverable units as well. 2 .Chapter 4. distributed or hierarchical control). each module of the system is assigned to a separate RU2 .g. On the other hand. Recovery views of the system helps us to document and communicate such design alternatives. The following chapters further elaborate on the related analysis techniques and realization activities supported by the recovery style.6: The Module View of the MPlayer Software Architecture with the Boundaries of the Recoverable Units Hereby. Software Architecture Recovery Style 67 Figure 4. there could be more than one element that controls the recovery actions and communication (e. We could consider many such alternatives to decompose the system into a set of RUs3 .

7: Architectural alternatives for the recovery design of MPlayer software (KEY: Figure 4.3) .68 Chapter 4. Software Architecture Recovery Style (a) recovery design with 1 RU (b) recovery design with 3 RUs (c) recovery design with 7 RUs Figure 4.

we could also derive a recovery style for the component & connector viewtype or the allocation viewtype [18] to represent recovery views of run-time or deployment elements. where the architectural elements related to fault tolerance and corresponding relations (recovery actions and error notifications) are shown. . which restarts the processes. on the other hand. Software Architecture Recovery Style 69 4. So. This would provide a broader view on recovery and we consider this as a possible extension of our work. Utilization of Recovery Style in different contexts There are several works that apply local recovery in different contexts. Yet.Chapter 4. They provide a view of the architecture of the operating system. styles and views [18]. The IEEE 1471 standard [78]. where microreboot [16] is applied to increase the failure resilience of operating systems. To clarify this focus we adopted the term style instead of the more general term viewpoint. and views are considered as instances of these styles. mediates and controls communication. For example.7 Discussion Style or viewpoint? In this work we have extended the module viewtype to represent a recovery style. Hereby. In the IEEE 1471 standard. describes viewpoints and views. In the V&B approach a distinction is made between viewtypes. in [53] device drivers are executed on separate processes. styles can be considered as specialized viewtypes. Data Store. in the parlance of this standard. Reincarnation Server monitors the system to facilitate error detection and diagnosis and it guides the recovery procedure. and does not describe the concept of style. the recovery style that we have introduced can be considered as a viewpoint as well because it represents a language to define recovery views. is a non-recoverable unit that coordinates the recovery actions. in this paper we have focused on representing recovery views for units of implementation which is made explicit in the V&B approach through the module viewtype. Recovery Style based on other viewtypes Although we have focused on defining a recovery view for implementation units. viewpoint represents the language to represent views. According to this view [53] Process Manager. which is a name server for interconnected components. They can be represented with the local recovery style as well.

The iC2C differentiates between service requests/responses and interface/failure exceptions as subtypes of requests and notifications in the C2 architectural style. Components are wrapped to cope with interface and architectural mismatches [43]. iC2C specializes the elements and relations of the C2 architectural style to explicitly represent fault tolerance provisions. The recovery style is used for representing and analyzing an architectural tactic for recovery. by means of which the recovery design of a system can be captured.8 Related Work Perspectives [125] guide an architect to modify a set of existing views to document and analyze quality properties. it can be instantiated for both components and connectors. The approach is based on the idealized C2 component (iC2C). The idealized fault-tolerant COTS (commercial off-the-shelf) component is introduced for encapsulating COTS components to add fault tolerance capabilities [25]. where we needed an explicit view for recovery to depict the related decomposition.3). whereas the iC2C AbnormalActivity component is responsible for error recovery. The idealized fault-tolerant architectural element (iFTE) [27] is introduced to make exception handling capabilities explicit at the architectural level. in which components communicate only through asynchronous messages mediated by connectors. which is instantiated as a view. Architectural tactics [4] aim at identifying architectural decisions related to a quality attribute requirement and composing these into an architecture design. Figure 6. It relies on multiple versions of a component for different operations to improve the dependability of a system. However. a NormalActivity component and an AbnormalActivity component together with an additional connector between these. This approach enables the analysis of recovery properties of a system based on its architecture design. This was less an issue in our project context. The C2 architectural style is a componentbased style. which is compliant with the C2 architectural style [114]. It also introduces two internal components. which includes. Software Architecture Recovery Style 4. One of the motivations for using perspectives instead of creating a quality-based view is to avoid the duplicate information related to architectural elements and relations.70 Chapter 4. Multi-version connector [93] is a type of connector based on the C2 architectural style. It provides a view of the system.g. The iFTE adopts the basic principles of the iC2C. dedicated architectural elements with complex interactions among them (e. named as iFTComponent and iFTConnector. Our work addresses the same problem for recovery by introducing the recovery style. respectively. The iC2C NormalActivity component implements the normal behavior and error detection. This approach can be used for representing N-version .

In this style. 71 The notion of Web Service Composition Action (WSCA) [113] is introduced for structuring the composition of Web services in terms of coordinated atomic actions to enforce transaction processing. The approach is supported by a run-time environment. The main aim of the recovery style. architecture configuration exceptions and their handling are specified in the architecture description. Configuration exceptions are defined as conditions with respect to possible configurations of architectural elements and their interaction (communication patterns within the current configuration). which participate to perform the corresponding action. tune and select recovery design alternatives. A generic run-time adaptation framework is presented in [44] together with its specialization and realization for performance adaptation. Exception handlers are defined as a set of required changes to the architecture configuration (substitution of elements and alterations of bindings). Diagnosis facilities can be considered as gauges. The system is then dynamically reconfigured to utilize alternative variants of the crashed components. a system is composed of multiple variants of components. together with the standard behavior and their cooperative recovery actions in case of an exception. In the architecture-based exception handling [63] approach. Analysis is a complementary work to make trade-off decisions. As previously discussed in Chapter 3. According to the concepts in this framework. several software architecture analysis approaches have been introduced for addressing quality properties. In case of an error. which reconfigures the system dynamically based on the provided specification. we propose a style specific for fault tolerance. Another run-time architecture reconfiguration approach is presented in [24] for updating a running system to tolerate faults. The goal of these approaches is to assess whether or not a given architecture design satisfies desired quality requirements. which crash in case of a failure. which interpret monitored low-level events. the architecture description is expressed with xADL [23] and repair plans are specified as a set of elements to be added to or removed from the architecture. on the other hand. the associated repair plan is executed and the configuration of the architecture is changed accordingly. An XML-based specification defines the elements. In this approach.Chapter 4. Software Architecture Recovery Style programming in the software architecture design. An architectural style for tolerating faults in Web services is proposed in [91]. our monitoring mechanisms are error detectors. is to communicate and support the design of recovery. and recovery actions can be considered as architectural repairs. As a difference from the approach presented in [44]. .

The usage of these styles have been illustrated for defining recovery views of MPlayer. Chapter 5 explains the analysis of decomposition alternatives for local recovery. we have introduced the recovery style to document and analyze recovery properties of a software architecture.72 Chapter 4. They have also supported the detailed design and implementation of local recovery for MPlayer. whereas Chapter 6 explains the realization of recovery views.9 Conclusions In this chapter. Software Architecture Recovery Style 4. . The recovery style is a specialization of the module viewtype as defined in the V&B approach [18]. We have further specialized the recovery style to define the local recovery style. These views have formed the basis for analysis of design alternatives.

11 explain the steps of the top-level process and illustrate them for the open-source MPlayer software.5 through 5. To prevent the propagation of errors.Chapter 5 Quantitative Analysis and Optimization of Software Architecture Decomposition for Recovery Local recovery enables the recovery of the erroneous parts of a system while the other parts of the system are operational. we define the problem statement for selecting feasible decomposition alternatives. The chapter is organized as follows. In this chapter. the architecture needs to be decomposed into several parts that are isolated from each other. extensions and related work. It appears that each of these decomposition alternatives performs differently with respect to availability and performance metrics.3. we present the toplevel process of our approach and the architecture of the tool that supports this process. we introduce background information on program analysis and analytical models that are utilized by the proposed approach.4. In the following two sections. Sections 5. In section 5. 73 . which employs integrated set of analysis techniques to optimize the decomposition of software architecture for local recovery.12 we evaluate the analysis results. We conclude the chapter after discussing limitations. In section 5. There exist many alternatives to decompose a software architecture for local recovery. we propose a systematic approach. In section 5.

74 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...

5.1

Source Code and Run-time Analysis

Program analysis techniques are used for automatically analyzing the structure and/or behavior of software for various goals such as the ones listed below. • Optimizing the generated code during compilation [1] • Automatically pinpointing (potential) errors and as such reducing debugging time [119] • Automatically detecting vulnerabilities and security threats [92] • Supporting the understanding and comprehension of large and complex programs [20] • Reverse engineering legacy software systems [87] Several types of program analysis techniques are used in practice as a part of existing software development tools and compilers, which are tailored for different execution platforms and programming languages. There are mainly two approaches for program analysis; static analysis and dynamic analysis. Static analysis approaches inspect the source code of programs and perform the analysis without actually executing these programs. The sophistication of the analysis varies depending on the granularity of the analysis (e.g. individual statements, functions/methods, source files) and the purpose (e.g. spotting potential errors, verifying specified properties) [9]. Although static analysis approaches are helpful for several aforementioned purposes, they are limited due to the lack of run-time information. For example, static analysis can reveal the dependency between function calls but not the frequency of calls and their execution times. Moreover, static analysis is not always practical and scalable. Depending on the type of analysis (e.g. data flow and especially pointer analysis [55]) and the size of the analyzed program, static analysis can become highly resource consuming, and very soon intractable. Dynamic analysis approaches make use of data gathered from a running program. The program is executed on a real or virtual processor according to a certain scenario. During the execution, various data are collected that are subject to analysis. The advantage of dynamic analysis approaches is their ability to capture detailed interactions at run-time (e.g. pointer operations, late binding). This leads to more accurate information about the analyzed program compared to what might be obtained with static analysis. On the other hand, dynamic analysis has also limitations and drawbacks. First, the collected data for analysis depends on the program input.

Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 75 Use of techniques for estimating code coverage [127] can help to ensure that a sufficient subset of the program’s possible behaviors has taken into account. Second, depending on the type and granularity of the analyzed information, the amount of collected data may turn out to be huge and intractable. Third, instrumentation or probing has an effect on the execution of the target program. As a potential risk, this effect can influence the program behavior and analysis results. In this chapter, we utilize dynamic analysis techniques for estimating the performance overhead introduced by a recovery design. We utilize two different tools; GNU gprof [40] and Valgrind [88]. We use GNU gprof tool to obtain the function call graph of the system together with statistics about the frequency of performed function calls and the execution time of functions (i.e. function call profile). We use the Valgrind framework to obtain the data access profile of the system.

5.2

Analysis based on Analytical Models

The main goal of recovery is to achieve higher system availability to its users in case of component failures. Thus, to select an optimal recovery design, we need to quantitatively assess design alternatives with respect to availability. As defined in section 2.1.1, availability is a measure of readiness for correct service and it can be simply represented by the following formula.

Availability =

MT T F MT T F + MT T R

(5.1)

Hereby, MTTF and MTTR stand for the mean time to failure and the mean time to repair, respectively. These concepts are mainly inherited from hardware fault tolerance and adapted for software fault tolerance. To estimate the availability and related quality attributes for complex systems and fault-tolerant designs, several formal models, techniques and tools have been devised [121]. These techniques are mostly based on Markov models [51], queuing networks [22], fault trees [35], or a combination of these. Quantitative availability analysis methods usually employ so-called state-based models [61]. In the application of state-based models, it is in general assumed that the behavior of modules has a Markov property, where the future behavior is conditionally independent of the past behavior [48]. Usually discrete time Markov chains (DTMC) or continuous time Markov chains (CTMC) are used as the modeling formalism. DTMCs are used for representing applications that operate on demand, whereas CTMC formalism is well-suited for continuously operating software applications [48]. A CTMC is defined by a set of states and

76 Chapter 5. Quantitative Analysis and Optimization of Software Architecture... transitions between these states. The transitions are labeled with rates λ indicating that the transition is taken after a delay that is governed by an exponential distribution with parameter λ. In availability modeling, the states represent the degree of availability of the system [73]. Figure 5.1 depicts a CTMC as a simple availability model.

Figure 5.1: An example CTMC as a simple availability model Since the delay for taking a transition is governed by an exponential distribution with a constant rate, the mean time to take a transition is simply the reciprocal of the transition rate. Consequently, λ = 1/M T T F defines the failure rate, where a transition is made from the “up” state to the “failed” state. Similarly, µ = 1/M T T R defines the recovery rate, where a transition is made from an “failed” state to the “up” state. Markov model checker tools like CADP [42] can take a CTMC model as an input and compute the probability that a particular state will be visited in the long run (i.e. steady state). Since states represent availability degrees, this makes it possible to calculate the total availability of the system in the long run. In the example model shown in Figure 5.1, for instance, the availability degrees implied by the “up” and “failed” states are 1 and 0, respectively. Thus, the probability that the “up” state is visited in the long run determines the availability of the system. A potential problem with such analytical models is the state space explosion problem. Depending on the complexity of the system and the recovery design, the state space can get too large. As a result, it can take too long time for the model checker to calculate, if possible at all, the steady state probabilities. To overcome this problem, Input/output interactive Markov chain (I/O-IMC) formalism has been introduced [10] as explained in the following subsection.

there are 2 Markovian transitions: one from S1 to S2 with rate λ and another from S3 to S4 with rate µ. The I/O-IMC has 1 input action. Interactive transitions are identified with 3 types of actions. • Internal actions: These actions are not visible to the environment. Labels of input and output actions are appended with “?” and “!”. 77 5. which is labeled as a?. Markovian transitions are labeled with the corresponding transition rates just like in regular CTMCs. A transition with an input action is taken if and only if another I/O-IMC performs the corresponding output action (input and output actions are matched by labels). I/O-IMCsare extensions of CTMCs [57] having 2 main types of transitions. respectively.2: An example I/O-IMC An example I/O-IMC is depicted in Figure 5. Hence.. Hereby. interactive transitions and Markovian transitions. for each possible input action. Note .. Transitions labeled with output actions are taken immediately. Markovian transitions are depicted with dashed lines. Quantitative Analysis and Optimization of Software Architecture. λ S2 a? S1 a? S4 b! S5 a? a? S3 a? µ Figure 5.2. I/O-IMCs are input-enabled meaning that each state is ready to respond to any input action. every state has an outgoing transition (or self-transition by default). • Input actions: This type of actions are controlled by the environment. Essentially. • Output actions: These actions are controlled by the I/O-IMC itself. 11] are a combination of Input/Output automata (I/O automata) [86] and interactive Markov chains (IMCs) [54].1 The I/O-IMC formalism I/O-IMCs [10. They are controlled by the I/O-IMC itself and the corresponding transitions are taken immediately. In this example.2. and interactive transitions by solid lines.Chapter 5.

the behavior of system elements can be modeled independently and then combined with a compositional aggregation approach [10]. the system should be decomposed into a set of isolated recoverable units (RUs) ([59. This helps in managing large state spaces and model elements can be re-used in different configurations. Compositional aggregation is a technique to build an I/O-IMC by composing. To prevent the propagation of errors and recover from them locally. This transition might influence the behavior of other I/O-IMCs that have been waiting for an input action b?. In S1. which is labeled as b!. the I/O-IMC delays for a time that is governed by an exponential distribution with parameter λ. There is also only 1 output action specified in this example.. which can be then analyzed using standard methods to compute different dependability and performance measures. all the states have a transition for this input action to ensure that the I/O-IMC is input-enabled. This is only one alternative decomposition for .3 Software Architecture Decomposition for Local Recovery An error occurring in one part of the system can propagate and lead to errors in other parts. The behavior of an I/O-IMC depends on both its internally specified behavior and the environment (i.e. in successive iterations. recall the MPlayer case study presented in section 4. For example.e. The I/O-IMC formalism enables modularity in model building and analysis.3.. the state S1 in Figure 5. For example. which provides the functionality of Libao ii ) RU GUI. Multiple I/O-IMC models can be composed together based on the common inputs they consume and outputs they generate. 5. an input a? arrives (another I/O-IMC takes a transition with output action a!). one possible decomposition for the MPlayer software is to partition the system modules into 3 RUs. i ) RU AUDIO. If however. Quantitative Analysis and Optimization of Software Architecture. Instead of building a model for the system as a whole. S1 is the initial state of the I/O-IMC.2 has 2 outgoing transitions: a Markovian transition to S2 with rate λ and an interactive transition to S3 with input action a?. As soon as the state S4 is reached. the I/O-IMC takes a transition to S5 with output action b!. 16]). then the I/O-IMC moves to S3. that. a number of elementary I/O-IMCs and reducing (i.3.78 Chapter 5. behavior of the other I/O-IMCs). Composition and aggregation of I/O-IMCs results in a regular CTMC. which encapsulates the Gui functionality and iii ) RU MPCORE which comprises the rest of the system. and moves to S2. before that delay ends. As re-depicted in Figure 5. aggregating) the state-space of the generated I/O-IMC after each iteration.

. S(n. {c}}. b.3. {{a}.Chapter 5. The number of ways to partition a set of n elements into k nonempty subsets is computed by the so-called Stirling numbers of the second kind. . the partitioning can be done in many different ways. c}}. we first need to model the design space. In fact. 79 isolating the modules within RUs and as such introducing local recovery. the partitioning of architecture into a set of RUs can be generalized to the well-known set partitioning problem [50]. a partition of a set S is a collection of disjoint subsets of S whose union is S. For example. These alternatives are: {{a}.1 Design Space To reason about the number of decomposition alternatives. {{b}. b}}. {a.2) The total number of ways to partition a set of n elements into arbitrary number of nonempty sets is counted by the nth Bell number as follows. Quantitative Analysis and Optimization of Software Architecture. c}}. c}. c}}. It is calculated with the recursive formula shown in Equation 5.2. {b}. k) [50]. there exists 5 alternatives to partition the set {a. {{a. Obviously. {{c}.n ≥ 1 n−1 (5.. {a. Hereby. k =k n n−1 k + k − 1 . Figure 5. b.3: An example decomposition of the MPlayer software into RUs 5. {b.

which leads to an additional performance overhead. k) (5. Quantitative Analysis and Optimization of Software Architecture.1). B5 = 52.80 Chapter 5.382.545. As such. B15 = 1. Therefore.. MTTF of RUs must be kept high and MTTR of RUs must be kept low (See Equation 5.e. n Bn = k=1 S(n. To maximize the availability of the overall system. The function dependency is the number of function calls between modules across different RUs.. we outline the basic criteria that we consider for choosing a decomposition alternative. Bn is the total number of partitions of a system with n modules. The data dependencies are proportional to the size of the shared variables among modules across different RUs. 5. For example. searching for a feasible design alternative becomes quickly problematic as n (i. However. for selecting a decomposition alternative we should consider the number of function calls among modules across different RUs. • Availability: Local recovery aims at keeping the system available as much as possible. In the following. • Performance: Logically. B3 = 5. the number of modules) grows. B4 = 15. the optimal availability of the system is defined by the decomposition alternative in which all the modules in the system are separately isolated into RUs. B7 = 877 (The MPlayer case). In some of the previous work on local . For this reason.958. We distinguish between two important types of dependencies that cause a performance overhead. that is.2 Criteria for Selecting Decomposition Alternatives Each decomposition alternative in the large design space may have both pros and cons. The MTTF and MTTR of RUs on their turn depend on the MTTF and MTTR values of the contained modules.3) In theory. B1 = 1. i ) function dependency and ii ) data dependency.3. The total availability of a system depends on the availability of its individual recoverable units. increasing the number of RUs leads to a performance overhead due to the dependencies between the separated modules in different RUs. how we separate and isolate the modules into a set of RUs. For transparent recovery these function calls must be redirected. the overall value of MTTF and MTTR properties of the system depend on the decomposition alternative. Bn grows exponentially with n.

The activities of the overall process are categorized into five groups as follows. By increasing the number of RUs.. increasing the number of RUs can increase complexity and this. It appears that there exists an inherent trade-off between the availability and performance criteria. Obviously. On the one hand increasing availability will require to increase the number of RUs1 . flows) represent transitions between activities. Quantitative Analysis and Optimization of Software Architecture. The end of parallel activities are denoted with a black bar with several flows entering it and one leaving it. the filled circle and the filled circle with a border represent the starting point and the ending point. 5. The beginning of parallel activities are denoted with a black bar with one flow going into it and several leaving it.e. The rounded rectangles represent activities and arrows (i. However. the size of data dependencies complicate the recovery and create performance overhead because the shared data need to be kept synchronized after recovery.. This is based on the assumption that the fault tolerance mechanism remains simple and faultfree.4 depicts the main activities of the overall process in a UML [100] activity diagram. there have been also work focusing on saving component states in case of a component failure [68] and determining a global system state based on possibly interdependent component states [17].4 The Overall Process and the Analysis Tool In this section we define the approach that we use for selecting a decomposition alternative with respect to local recovery requirements. respectively. when an existing system is decomposed into RUs. Therefore. On the other hand increasing the number of RUs will introduce additional performance overhead because the amount of function dependencies and data dependencies will increase. while the other RUs can remain available. 81 recovery ([16. there might be shared state variables leading to data dependencies between RUs. However. an error of a module will lead to a failure of a smaller part of the system (the corresponding RU). Alternative activities are reached through different flows that are stemming from a single activity. This means that all the entering flows must reach the bar before the transition to the next activity is taken. This makes the amount of data dependency between RUs an important criteria for selecting RUs. depending on the fault assumptions and the system. selecting decomposition alternatives requires their evaluation with respect to these two criteria and making the desired trade-off. for example. In fact. Hereby. 1 . for stateless Internet service components [16] and stateless device drivers [53].Chapter 5. in turn can increase the probability of introducing additional faults. Figure 5. This assumption can hold. 53]) it has been assumed that the RUs are stateless and as such do not contain shared state variables.

Each decomposition alternative defines a possible partitioning of the initial architecture modules into RUs. In addition. . not all theoretically possible decompositions are practically possible because modules in the initial architecture cannot be freely allocated to RUs due to domain and stakeholder constraints. The activities of the Constraint Definition group involve the specification of such domain and stakeholder constraints. ii) prevent the allocation of some modules to same or separate RUs. These properties are utilized as an input for the analytical models and heuristics during the Decomposition Selection process. To prepare the analysis for finding the decomposition alternative each module in the architecture will be annotated with several properties. • Design Space Generation: After the domain constraints are specified the possible decomposition alternatives will be defined.82 Chapter 5.3. Therefore.. The type of constraints that we consider can i) limit the number of RUs. Quantitative Analysis and Optimization of Software Architecture. As we have seen in section 5.1.. iii) limit the performance overhead that can be introduced by a decomposition alternative based on provided measurements from the system. before generating the complete design space first the constraints will be specified.3.1 the design space can easily get very large and unmanageable. • Architecture Definition: The activities of this group are about the definition of the software architecture by specifying basic modules of the system. The given architecture will be utilized to define the feasible decomposition alternatives that show the isolated RUs consisting of the modules. The number of these alternatives is computed based on the Stirling numbers as explained in section 5. • Constraint Definition: Using the initial architecture we can in principle define the design space including the decomposition alternatives.

.Chapter 5.. 83 $ # # # ! " Figure 5.4: The Overall Process . Quantitative Analysis and Optimization of Software Architecture.

Quantitative Analysis and Optimization of Software Architecture. If the size is smaller than a maximum affordable amount. We have developed an analysis tool. In that case we use a set of optimization algorithms and related heuristics to find a feasible decomposition alternative.5 in which the Recovery Designer tool is activated. the Recovery Designer. At the bottom of the tool a button Recovery Designer can be Although we use the term ‘decomposition’ for this activity.84 Chapter 5. If the design space is too large the generation of analytical models will be more time consuming. A snapshot of the extended Arch-Studio can be seen in Figure 5. we do not elaborate on this and we do not populate the diagram with these iterations. the activities of this group select an alternative decomposition2 based on either of two approaches.4 only shows a simple flow of the main activities in the overall process. please note that the system is already decomposed into a set of modules. starting from the architecture definition. which includes a set of tools for specifying and analyzing software architectures. On top of this. we are introducing a decomposition by actually ‘composing’ the system modules into RUs. The activities of the Performance Measurement group deal with performing related dynamic analysis and measurements on the existing system. • Decomposition Selection: Based on the input of all the other groups of activities. constraint definition or decomposition selection. • Performance Measurement: Even though some alternatives are possible after considering the domain constraints they might in the end still not be feasible due to the performance overhead introduced by the particular allocation of modules to RUs. The tool is fully integrated in the Arch-Studio environment [23]. We have used the meta-modeling facilities of ArchStudio to extend the tool and integrate the Recovery Designer with the provided default set of tools. specified constraints and measurements in the previous activities. The selected approach depends on the size of the feasible design space derived from the architecture description. that implements the provided approach.. To estimate the real performance overhead of decomposition alternatives we need to know the amount of function and data dependencies among the system modules. Figure 5. 2 . Using the analytical models we select the RU decomposition alternative that is optimal with respect to availability and performance. There can be many iterations of these activities at different steps.. For brevity. we generate analytical models for each alternative in the design space to estimate the system availability. These measurements are utilized during the Decomposition Selection process. Arch-Studio is an open-source software and systems architecture development environment based on the Eclipse open development platform [6].

85 Figure 5. Valgrind. which opens the Recovery Designer tool consisting of a number of sub-tools. The architecture of Recovery Designer is shown in Figure 5.Chapter 5.. The tool boundary is defined by the large rectangle with dotted lines. Quantitative Analysis and Optimization of Software Architecture. The tool consist itself of six sub-tools Constraint Evaluator. we will explain the activities of the proposed approach in detail and we will refer to the related component(s) of the tool.6. Analytical Model Generator. Design Alternative Generator. In the subsequent sections. Data Dependency Analyzer.5: A Snapshot of the Arch-Studio Recovery Designer seen. gnuplot and CADP model checker. Function Dependency Analyzer. it uses five external tools Arch-Edit. and Optimizer. . Further. GNU-gprof..

definition of the software architecture and annotation of the software architecture. These properties are MTTF. MTTR and Criticality. Quantitative Analysis and Optimization of Software Architecture.e. the set of properties that can be specified per module) conforms to the xADL schema and as such can be interchanged with other XML-based tools. In the second activity of Architecture Definition a set of properties per module are defined to prepare the subsequent analysis. which includes basic implementation units and direct dependencies among them [18].6: Arch-Studio Recovery Designer Analysis Tool Architecture 5. This activity is carried out by using the Arch-Edit tool of the Arch-Studio [23].. Criticality property defines how critical the failure of a module is with respect to the overall functioning of the system. we use the module view of the system. . %' "# $ ! ! % & Figure 5.. Both the XML structure of the stored architectural description and the user interface of Arch-Edit (i.5 Software Architecture Definition The first step Architecture Definition in Figure 5.4 is composed of two activities. For describing the software architecture. Arch-Edit is a front-end that provides a graphical user interface for creating and modifying an architecture description.86 Chapter 5. ArchStudio specifies the software architecture with an XML-based architecture description language called xADL [23].

In our case. MTTR and Criticality need to be defined by the software architect.7: xADL schema extension for specifying analytical parameters related to reliability Note that the values for properties MTTF. The concept of an interface with analytical parameters has been borrowed from [49]. all values can be fixed to a certain value just to investigate the analysis results. a separate analytical interface could be defined for each property.7.e. where the values are varied and their effect is observed. In principle. Using Arch-Edit both the architecture and the properties per module can be defined.. ! " % & ) * + # ' ! # ! $ ' ! ( ' ' ! ( ' ( " % & ( ( ( ! ! ' ! ( Figure 5. reliability interface. Accordingly. In principle there are three strategies that can be used to determine these property values: • Using Fixed values: It can be assumed that all modules have the same properties. we have placed all the properties under the reliability interface. A part of the corresponding XML schema that introduces the new properties is shown in Figure 5. MTTR and Criticality 3 . The modules together with their parameters are provided as an input to analytical models and heuristics that are used in the Decomposition Selection process. • What-if analysis: A range of values can be considered. The analytic parameters are used for analysis purposes only. Notation-wise.. constructive parameter and analytic parameter. 3 . define an extension to the xADL language to be able to specify parameters for performance analysis. only MTTF is actually related to reliability. for modules. i. Quantitative Analysis and Optimization of Software Architecture. we have specified the reliability interface that consists of the three parameters MTTF. we have extended the existing xADL schema with a new type of interface.Chapter 5. where Grassi et al. However. 87 To be able to specify these properties in our tool Recovery Designer. They introduce a new type of interface that comprises two type of parameters.

Table 5. During the Constraint Definition activity the constraints are defined and as such the design space is reduced. For all the modules. either fixed values and/or a what-if analysis can be used. We have measured the MTTR values from the actual implementation by calculating the mean time it takes to restart and initialize each module in the MPlayer over 100 runs.1 shows the measured MTTR values.1.88 Chapter 5. The specification of values for some module properties can also be supported by special analysis methods. Criticality=1). In our MPlayer case study.6 Constraint Definition After defining the software architecture. we have used fixed MTTF and Criticality values but specified MTTR values based on measurements from the existing system.5hrs. respectively. • Measurement or estimation of actual values: Values can be specified based on actual measurements from the existing system or estimation based on historical data. Modules MTTR (ms) Libao 480 Libmpcodecs 500 Demuxer 540 Mplayer 800 Libvo 400 Stream 400 Gui 600 5. We distinguish among the following three types of constraints: . The selection of these strategies is dependent on the available knowledge about the domain and the analyzed system. Quantitative Analysis and Optimization of Software Architecture. the values for MTTF and Criticality are defined as 0.. this might not be possible due to lack of access to the existing system and historical data.5hrs and 1. For example. In that case. we can start searching for the possible decomposition alternatives that define the structure of RUs. However.1: Annotated MTTR values for the MPlayer case based on measurements (MTTF =0.. not all theoretically possible alternatives are practically possible. Table 5. tools and techniques. The measurement or estimation of actual values is usually the most accurate way to define the properties.3. As stated before in section 5. Criticality values for modules can be specified based on the outcome of an analysis method like SARAH (Chapter 3).

In the latter case. In the Deployment Constraints part. • Domain Constraints: Some modules can be required to be in the same RU while other modules must be separated. The decomposition alternatives that exceed these thresholds are infeasible because of the performance overhead they introduce and as such must be eliminated. In general. We can see a snapshot of the Constraint Evaluator in Figure 5. the total number of modules). Domain Constraints and Performance Feasibility Constraints. each decomposition alternative introduces a performance overhead due to function and data dependencies. the number of processes that can be created can be limited due to operating system constraints. The second list can be modified by multiple (de)selection to change the relationships. In the Domain Constraints part we specify requires/mutex relations. When a module is selected from the first list. which means that there is no limitation to the number of RUs. the number of RUs that can be defined will also be dependent on the context of the system that can limit the number of RUs in the system. For each of these relations. when there are many modules and all modules are separated from each other. respectively. the minimum and maximum number of RUs are specified as 1 and 7 (i. Usually such constraints are specified with mutex and require relations [66. For example. where these constraints can be specified. For example.. Quantitative Analysis and Optimization of Software Architecture. Constraint Evaluator gets as input the architecture description that is created with the Arch-Edit tool (See Figure 5.8.8.3. Usually. In Figure 5.e. in Figure 5. • Performance Feasibility Constraints: As explained in section 5..2. there are no selected elements). each corresponding to a type of constraint to be specified. for example. It might be the case that the resources are insufficient for creating a separate process for each RU. an untrusted 3rd party module can be required to be separated from the rest of the system. where the modules that are related are selected (initially. The user interface consists of three parts: Deployment Constraints. 89 • Deployment Constraints: An RU is a unit of recovery in the system and includes a set of modules. 21].Chapter 5. the second list is updated. The Arch-Studio Recovery Designer includes a tool called the Constraint Evaluator. if we employ multiple operating system processes to isolate RUs from each other. there are thresholds for the acceptable amounts of these dependencies based on the available resources.6). we specify the limits for the number of RUs. Constraint Evaluator provides two lists that include the name of the modules of the system.8 it has been specified that .

90 Chapter 5. Quantitative Analysis and Optimization of Software Architecture... Demuxer must be in the same RU as Stream and Libmpcodecs, whereas Gui must be in a separate RU than Mplayer, Libao and Libvo. Constraint Evaluator can also automatically check if there are any conflicts between the specified requires and mutex relations (respectively two modules must be kept together and separated at the same time).

Figure 5.8: Specification of Design Constraints In the Performance Feasibility Constraints part, we specify thresholds for the amount of function and data dependencies between the separated modules. The specified constraints are used for eliminating alternatives that exceed the given threshold values. The dynamic analysis and measurements from the existing system that are performed to evaluate the performance feasibility constraints will be explained in section 5.8. However, if there is no existing system available and we can not perform the necessary measurements, we can skip the analysis of the performance feasibility constraints. Arch-Studio Recovery Designer provides an option to enable/disable the performance feasibility analysis. In case this analysis is disabled, the design space will be evaluated based on only the deployment constraints and domain constraints so that it is still possible to generate the decomposition alternatives and depict the reduced design space.

Chapter 5. Quantitative Analysis and Optimization of Software Architecture... 91

5.7

Design Alternative Generation

The first activity in Generate Design Alternatives is to compute the size of the feasible design space based on the architecture description and the specified domain constraints. Design Alternative Generator first groups the set of modules that are required to be together based on the requires relations4 defined among the domain constraints. In the rest of the activity, Design Alternative Generator treats each such group of modules as a single entity to be assigned to a RU as a whole. Then, based on the number of resulting modules and the deployment constraints (i.e. the number of RUs), it uses the Stirling numbers of the second kind (section 5.3.1) to calculate the size of the reduced design space. Note that this information is provided to the user already during the Constraint Definition process (See the GUI of the Constraint Evaluator tool in Figure 5.8). For the evaluation of the other constraints, the Design Alternative Generator uses the restricted growth (RG) strings [106] to generate the design alternatives. A RG string s of length n specifies the partitioning of n elements, where s[i] defines the partition that ith element belongs to. For the MPlayer case, for instance, we can represent all possible decompositions with a RG string of length 7. The RG string 0000000 refers to the decomposition where all modules are placed in a single RU. Assume that the elements in the string correspond to the modules Mplayer, Libao, Libmpcodecs, Demuxer, Stream, Libvo and Gui, then the RG string 010002 defines the decomposition that is shown in Figure 5.3. A recursive lexicographic algorithm generates all valid RG strings for a given number of elements and partitions [105]. During the generation process of decomposition alternatives, the Design Alternative Generator eliminates the alternatives that violate the mutex relations defined among the domain constraints. These are the alternatives, where two modules that must be separated are placed in the same RU. For the MPlayer case there exists a total of 877 decomposition alternatives. The deployment and domain constraints that we have defined reduced the number of alternatives down to 20.

Note that the requires relation specifies a pair of modules that must be kept together in a RU. It does not specify a pair of modules that require services from each other. Such a dependency is not necessarily a reason for keeping modules together in a RU.

4

92 Chapter 5. Quantitative Analysis and Optimization of Software Architecture...

5.8
5.8.1

Performance Overhead Analysis
Function Dependency Analysis

Function calls among modules across different RUs impose an additional performance overhead due to the redirection of calls by RUs. In function dependency analysis, for each decomposition alternative, we analyze the frequency of function calls between the modules in different RUs. The overhead of a decomposition is calculated based on the ratio of the delayed calls to the total number of calls in the system. Function dependency analysis is performed with the Function Dependency Analyzer tool based on the inputs from the Design Alternative Generator and GNU gprof tools (See Figure 5.6). As shown in Figure 5.9, the Function Dependency Analyzer tool itself is composed of three main components; i ) Module Function Dependency Extractor (MFDE) ii ) Function Dependency Database and iii ) Function Dependency Query Generator.

Figure 5.9: Function Dependency Analysis MDFE uses the GNU gprof tool to obtain the function call graph of the system. GNU gprof also collects statistics about the frequency of performed function calls and the execution time of functions. Once the function call profile is available,

Function Dependency Analyzer becomes ready to calculate the function dependency overhead for a particular selection of RUs defined by the Design Alternative Generator. MFDE exploits the full path of the C object file. This is also reflected to the prefixes of the nodes of the MDG. which reveals its prefix (e. MFDE creates a so-called Module Dependency Graph (MDG) [83]. 93 MFDE uses an additional GNU tool called GNU nm (provides the symbol table of a C object file) to relate the function names to the C object files.10: A Partial Snapshot of the Generated Module Dependency Graph of MPlayer together with the boundaries of the Gui module with the modules Mplayer. As a result.. for example. Once the MDG is created and stored. package) structure of the source code to identify the module that a C object file belongs to. After the MDG is created. Figure 5.g.3) were processed.10. it is stored in the Function Dependency Database. we need to relate the set of nodes in the MDG to the system modules. MDG is a graph.10 shows some of the files that belong to the Gui module each of which has “Gui ” as the prefix. Function Dependency Query Generator accesses the Function Dependency Database. Libao and Libvo To be able to query the function dependencies between the system modules. where each node represents a C object file and edges represent the function dependencies between these files. . For the MPlayer case. For each alternative. which is a relational database.Chapter 5. the set of packages that corresponds to the provided module view (Figure 5. MFDE uses the folder (i. shows a partial snapshot of the generated MDG for MPlayer. and automatically creates and executes queries to estimate the function dependency overhead. Figure 5. “. the right hand side of Figure 5. The function dependency overhead is calculated based on the following equation./Gui*”) and the corresponding module.. Quantitative Analysis and Optimization of Software Architecture. For example.e.

we need to calculate the sum of function dependencies among all the modules that are part of different RUs. Quantitative Analysis and Optimization of Software Architecture. The end result is multiplied by the overhead that is introduced per such function call (tOV D ). k) end for 7: 8: end for end for 9: 10: end for To calculate the function dependency overhead. the number of times a function is called among different RUs are summed. calls(x → y) × tOV D RU x RU y∧ fd = (x=y) calls(f ) × time(f ) f × 100 (5. the Gui and Libao modules are encapsulated in separate RUs and separated from all the other modules of the system. Algorithm 1 shows the pseudo code for counting the number of function calls betwen different RUs.94 Chapter 5. Hereby.. For this reason. we use a measured worst case overhead. Consider. RU MPCORE. we need to calculate the sum of function dependencies between the Gui module and all the other modules plus the function dependencies between the Libao module and all the other modules. For calculating the function dependency overhead of this example decomposition. the overhead is related to the inter-process communication. where we isolate RUs with separate processes. In our case study. k) that is used in the algorithm creates and executes an SQL [123] query to get the number of calls between the specified modules m and . The procedure noOf Calls(m. for example..3. the denominator sums for all functions the number of times a function is called times the time spent for that function.4) In Equation 5. 100 ms as tOV D . the RU decomposition that is shown in Figure 5. Algorithm 1 Calculate the amount of function dependencies between the selected RUs 1: sum ← 0 2: for all RU x do 3: for all Module m ∈ x do 4: for all RU y ∧ x = y do 5: for all Module k ∈ y do 6: sum ← sum + noOf Calls(m. The other modules in the system are allocated to a third RU.4. In the nominator.

The worst case asymptotic complexity of Algorithm 1 is O(n2 ) with respect to the number of modules. A sample of this data access profile output can be seen in Figure 5.. i ) Module Data Dependency Extractor (MDDE) ii ) Data Dependency Database and iii ) Data Dependency Query Generator (See Figure 5.. To do this. MDDE uses the Valgrind tool to obtain the data access profile of the system. We use Valgrind to obtain the data access profile of the system.c”.4%.8. An example query is shown in Figure 5. which enables the development of dynamic binary analysis tools. In line 7 for instance. the time it takes for the calculation depends on the number of modules and the particular decomposition. where the number of calls from the module Gui to the module Libao is queried. Quantitative Analysis and Optimization of Software Architecture. where the threshold was specified as 15% for the case study.6). plus an environment for writing tools that plug into the core system [88]. Valgrind [88] is a dynamic binary instrumentation framework. It took the tool less than 100 ms. 95 k.13. Such tools perform analysis at run-time at the level of machine code. we can see that there was a memory access at address bea1d4a8 of size 4 from the file “audio out. to calculate the function dependency overhead for this example decomposition.11: The generated SQL query for calculating the number of calls from the module Gui to the module Libao For the example decomposition shown in Figure 5. 5.13. Note that this makes the decomposition a feasible alternative with respect to the function dependency overhead constraints. ! ! " #$ " # & %" %" Figure 5.3. which is composed of three main components.Chapter 5. which records the addresses and sizes of the memory locations that are accessed by the system. Valgrind provides a core system that can instrument and run the code. In Figure 5.11. DAI outputs the full paths .2 Data Dependency Analysis Data dependency analysis is performed with the Data Dependency Analyzer (See Figure 5. namely the Data Access Instrumentor (DAI). the function dependency overhead was calculated as 5. A Valgrind tool is basically composed of this core system plus the plug-in tool that is incorporated to the core system.12). In general. we can see the addresses and sizes of memory locations that have been accessed. we have written a plug-in tool for Valgrind.

13: A Sample Output of Valgrind + Data Access Instrumentor . we can identify the data dependencies. From this data access profile we can also observe that the same memory locations can be accessed by different modules.96 Chapter 5. The output of the MDDE is the total number and size of data dependencies between the modules of the system... respectively.2 shows the ! " ! #$ % %&% '% ! & #$ Figure 5. The related parts of the file paths are underlined in Figure 5. Figure 5.12: Data Dependency Analysis of the files.13. Quantitative Analysis and Optimization of Software Architecture. which distinguishes to which module these files belong. in line 9 we can see that the file “aclib template. For instance. Table 5.c” belongs to the Libvo module.13 shows a memory address bea97498 that is accessed by both the Gui and the Mplayer modules as highlighted in lines 2 and 6. The output of DAI is used by MDDE to search for all such memory addresses that are shared by multiple modules. Figure 5. For example. Based on this.

the number of common memory locations accessed (i. Data Dependency Analyzer becomes ready to calculate the data dependency size for a particular selection of RUs defined by the Design Alternative Generator. This table is stored in the Data Dependency Database.. Once the data dependencies are stored. Table 5.e.5) . Data Dependency Query Generator accesses the Data Dependency Database. count) and the total size of the shared memory.Chapter 5. creates and executes queries for each RU decomposition alternative to estimate the data dependency size.2 we see per pair of modules. 97 actual output for the MPlayer case.2: Measured data dependencies between the modules of the MPlayer M odule 1 M odule 2 count size (bytes) Gui Libao 64 256 Gui Libmpcodecs 360 1329 Gui Libvo 591 2217 Gui Demuxer 47 188 Gui MPlayer 36 144 Gui Stream 242 914 Libao Libmpcodecs 78 328 Libao Libvo 63 268 Libao Demuxer 3 12 Libao Mplayer 43 172 Libao Stream 54 232 Libmpcodecs Libvo 332 1344 Libmpcodecs Demuxer 29 116 Libmpcodecs Mplayer 101 408 Libmpcodecs Stream 201 812 Libvo Demuxer 53 212 Libvo Mplayer 76 304 Libvo Stream 246 995 Demuxer Mplayer 0 0 Demuxer Stream 23 92 Mplayer Stream 28 116 In Table 5. Quantitative Analysis and Optimization of Software Architecture. dd = RU x RU y∧ memsize(m. The size of the data dependency is calculated simply by summing up the shared data size between the modules of the selected RUs.. This calculation is depicted in the following equation. k) (x=y) m∈x k∈y (5.

Using the domain constraints we have seen that for the MPlayer case 20 alternatives were possible.9 Decomposition Alternative Selection After the software architecture is described. The final selection of the alternatives will be explained in the next section. The y-axis on the right hand side is used for showing the data dependency sizes of the decomposition alternatives as calculated by the Data Dependency Analyzer..14 the total design space with the data dependency size and function dependency overhead values been plotted. The tool allows to depict the generated design space at any time of the analysis process. to calculate this. The only difference is that the information being queried is the data size instead of the number of function calls.8. Note also that the decomposition turns out to be a feasible alternative with respect to the data dependency size constraints. Quantitative Analysis and Optimization of Software Architecture. In Figure 5.98 Chapter 5.4). The y-axis on the left hand side is used for showing the function dependency overheads of these alternatives as calculated by the Function Dependency Analyzer. design constraints are defined and the necessary measurements are performed on the system. the final set of decomposition alternatives can be selected as defined by the last group of activities (See Figure 5. For the MPlayer case we have set the function dependency overhead threshold to 15% and the data dependency .3 the data dependency size is approximately 5 KB. where the threshold was specified as 10 KB for the case study..15 shows the plot for the 20 decomposition alternatives that remain after applying the deployment and domain constraints. The decomposition alternatives are listed along the x-axis. The Arch-Studio Recovery Designer generates a graph corresponding to the generated design space and uses gnuplot to present it.8. Figure 5. For each of these alternatives we can now follow the approach of the previous two subsections to calculate the function dependency overhead and data dependency size values. The querying process is very similar to the querying of function dependencies as explained in section 5. 5. It took the tool less than 100 ms. 5. This set of alternatives is further evaluated with respect to the performance feasibility constraints based on the defined thresholds and the measurements performed on the existing system. Using these results the software architect can already have a first view on the feasible decomposition alternatives.3 Depicting Analysis Results Using the deployment and domain constraints we have seen that for the MPlayer case the total number of 877 decomposition alternatives was reduced to 20. For the example decomposition shown in Figure 5.1.

As can be seen in Table 5. the size of data that is shared among these modules are the highest among all. we generate analytical models for each alternative in the design space. where the two highly coupled modules Libvo and Libmpcodecs are separated from each other. The approach adopted for the evaluation of availability and the selection of a decomposition depends on the size of the feasible design space. this alternative corresponds to the decomposition. we use a set of optimization algorithms to select one of the feasible decomposition . this decomposition alternative separates the modules Gui.16. Hereby. Hereby. we need to evaluate and compare the feasible decomposition alternatives based on availability as well. Quantitative Analysis and Optimization of Software Architecture.14: Function Dependency Overhead and Data Dependency Sizes for all Decomposition Alternatives size threshold to 10 KB. If the size is smaller than a maximum affordable amount depending on the available computation resources.3. Libvo and Libmpcodecs from each other. Since the main goal of local recovery is to maximize the system availability..2. This is because. 99 Analysis Results For the Decomposition Alternatives 25 Function Dependency Overhead (%) Data Dependency Size (KB) 12 20 Function Dependency Overhead (%) Data Dependency Size (KB) 10 15 14 8 10 6 4 5 2 0 0 100 200 300 400 500 600 RU Decomposition Alternatives 700 800 0 Figure 5.. RUs are defined in the straight brackets ‘[’ and ‘]’. the alternative # 1 represents the alternative as defined in Figure 5. We can see that this alternative has also a distinctively high data dependency size. If the design space is too large.17 shows the function dependency overhead and data dependency size for the 6 feasible decomposition alternatives. These analytical models are utilized for estimating the system availability.Chapter 5. This is because. only 6 alternatives are left in the feasible design space as listed in Figure 5. It appears that after the application of performance feasibility constraints. For example. Figure 5. We select a RU decomposition based on the estimated availability and performance overhead of the feasible decomposition alternatives. we can see that the alternative # 4 has the highest value for the function dependency overhead.

100 Chapter 5. 5. we need to quantitatively assess these alternatives with respect to availability as well.17. To select an optimal decomposition. we can see the function dependency overhead and data dependency size for the 6 feasible decomposition alternatives.16: The feasible decomposition alternatives with respect to the specified constraints alternative based on heuristics.. The Analytical Model . Quantitative Analysis and Optimization of Software Architecture. Analysis Results For the Decomposition Alternatives 25 Function Dependency Overhead (%) Data Dependency Size (KB) 9 20 Function Dependency Overhead (%) Data Dependency Size (KB) 10 8 15 7 10 6 5 5 0 0 5 10 RU Decomposition Alternatives 15 20 4 Figure 5. These two approaches are explained in the following subsections.10 Availability Analysis In Figure 5.. We exploit the compositional semantics of the I/O-IMC formalism to build availability models for alternative decompositions for local recovery.15: Function Dependency Overhead and Data Dependency Sizes for Decomposition Alternatives after domain constraints are applied ! Figure 5.

In addition to the RM. and the coordination of recovery actions (See Appendix B). 101 Analysis Results For the Decomposition Alternatives 20 Function Dependency Overhead (%) Data Dependency Size (KB) 9 Function Dependency Overhead (%) 15 8 Data Dependency Size (KB) 10 10 7 6 5 5 0 0 1 2 3 4 RU Decomposition Alternatives 5 6 4 Figure 5. we explain our modeling approach and the analysis results. 2. namely the Recovery Manager (RM). The failure of any module within an RU causes the failure of the entire RU. These features can be implemented in several ways. pruning the state space in each iteration.17: Function Dependency Overhead and Data Dependency Sizes for 6 Decomposition Alternatives Generator automatically generates I/O-IMC models for the behavior of each module and RU of the system.10..Chapter 5. we have introduced a model element.. The communication control between RUs is considered to be a part of the RM and as such is not modeled separately. Quantitative Analysis and Optimization of Software Architecture. the realization of local recovery (Chapter 6) requires i ) control of communication between RUs and ii ) coordination of recovery actions. It also generates a script for the composition and analysis of these models. The recovery of an RU entails the recovery (i. In our modeling approach.e. The execution of this script merges all the generated models in several iterations. In the following. The result of this composition and aggregation is a regular CTMC. we have introduced model elements for representing the behavior of individual modules and RUs.1 Modeling Approach In addition to the decomposition of system modules into RUs. which is analyzed to compute system availability. 5. restart) of all of its modules . to represent the coordination of recovery actions. We have made the following assumptions in our modeling approach: 1.

5. Quantitative Analysis and Optimization of Software Architecture. 9. which approximates any distribution arbitrarily closely. A This assumption was made for analytic tractability. the RM always correctly detects a failing RU. As implied by the previous assumption.. The recovery always succeeds.18: Modeling approach based on the I/O-IMC formalism Figure 5. (even the ones that did not fail). 4.e. The failure of a module is governed by an exponential distribution (i. 7. The recovery of the modules inside a given RU is performed sequentially. An exponential distribution might not be. one module at a time (a particular recovery sequence has no significance). 8. !"# $% !"# ! $ Figure 5. in some cases. The RM does not fail.. Only one RU can be recovered at a time and RM recovers the RUs on a first-come-first-served (FCFS) basis. 5 . constant failure rate).18 depicts our modeling approach. however.102 Chapter 5. The RM only interfaces with RUs and is unaware of the modules within RUs. a realistic choice. it is also possible to use a phase-type distribution. 6. Each RU exhibits two interfaces. 3. The recovery of a module is governed by an exponential distribution5 and the error detection phase is ignored.

Upon the receipt of a start recover signal from the RM.19). The steady state probability of being at state 0 was computed by CADP as 0.. Figure 5. it then generates the corresponding I/O-IMCs for failure/recovery interfaces and the RM. An example I/O-IMC model generated for the RM that controls two RUs is depicted in Figure 5. That is why. Appendix B contains detailed explanations of example I/O-IMCs (including the one depicted in Figure 5.Chapter 5. all the 7 modules comprised by the RU must be recovered. All the other states represent the recovery stages of the RU. In Figure 5. and the RM correspond to an I/O-IMC model as depicted in Figure 5. reduces the final I/O-IMC into a CTMC (See Appendix B for details). Quantitative Analysis and Optimization of Software Architecture. and computes the steady state probabilities of being at states. where the labels of these transitions specify the recovery rates of the modules. Based on these probabilities. which conforms to the CADP SVL scripting language [42]. where the only RU of the system is available. The RM listens to the failure and up signals emitted by the failure interfaces of the RUs. 103 failure interface and a recovery interface. where all modules are placed in a single RU. in which the label of the transition denotes the failure rate of this RU. state 7 represents the system state. the initial state is state 0. there exist 7 states corresponding to the recovery of the RU. the failure interface outputs an RU up signal upon the successful recovery of the all the modules. recovery interface.985880.5880%. Each module. It represents the system state.20 shows the generated CTMC for global recovery. a composition and analysis script is also generated. There is a transition from state 0 to state 7. failure interface. Correspondingly. Analytical Model Generator first generates I/O-IMC models for each module. it starts a sequential recovery of the corresponding modules.19. there exist 7 transitions from state 7 back to state 0. which provides us the availability of the system as 98.. The recovery interface is responsible for the recovery of the modules within the RU. Upon its failure. Since the RU includes 7 modules that are recovered (initialized) sequentially. where the specified RUs are available. The failure interface essentially listens to the failure of the modules within the RU and outputs a RU failure signal upon the failure of a module. The exchanged signals are represented as input and output actions in these I/O-IMCs. . Based on the decomposition alternative provided by the Design Alternative Generator. where the RU is failed and non of its modules are recovered yet. The execution of this script within CADP composes/aggregates all the I/O-IMCs based on the decomposition alternative.20. It also gives details on the specification and generation process of I/O-IMC models. the availability of the system is calculated as a whole. In addition to generating all the necessary I/O-IMCs. For example.18.

Quantitative Analysis and Optimization of Software Architecture.2 Analysis Results We have generated CTMC models and calculated the availability for all the feasible decomposition alternatives that are listed in Figure 5.3. we assume that the availability of the system is determined by the availability of the RU that comprises the Mplayer module. In this case study..16.. Based on the results listed in Table 5. The results are listed in Table 5.104 Chapter 5.3.10. only the failure of this RU leads to the failure of the whole system regardless of the availability of the other RUs.19: The generated I/O-IMC model for the Recovery Manager that controls two RUs 5. we can evaluate and compare the feasible . failed_1? 8 7 failed_2? failed_1? failed_2? up_1? failed_1? failed_2? start_recover_2! failed_1? up_2? up_1? start_recover_1! up_2? failed_1? up_1? 6 failed_2? 5 4 failed_2? 3 up_2? failed_1? start_recover_2! up_1? up_1? failed_2? 2 failed_2? up_2? failed_2? up_2? up_1? 0 up_2? up_1? up_2? failed_2? start_recover_1! up_2? failed_1? up_1? 1 up_2? failed_1? up_1? failed_1? Figure 5.3 together with the corresponding function dependency overhead and data dependency size values as calculated before. The CTMC generated for this decomposition alternative comprises 60 states 122 transitions. The decomposition alternative # 1 represents the decomposition shown in Figure 5. This is because.

7915 6688 decomposition alternatives. As will be explained in section 5. 8589 5. 7915 5860 3 99. However. we have introduced local recovery to the MPlayer software for the decomposition alternatives # 0 and # 1.Chapter 5. 8358 6516 4 99. the alternative # 0 has also the lowest availability among all the alternatives. 2575 14..12. 2791 5. We can also notice from the results that the decomposition alternative # 4 does not lead to a significantly higher availability as opposed to the high overhead it introduces. 105 1 rate 250000 rate 250000 4 0 3 rate 385 rate 125000 7 5 6 rate 185185 rate 200000 rate 208333 rate 166666 2 Figure 5. 9555 7. 4350 4860 1 99.20: CTMC generated for global recovery. 9808 3. 9557 7.. where all modules are placed in a single RU Table 5. Function Dependency Overhead and Data Dependency Size for the Feasible Decomposition Alternatives F unction Data # Estimated Dependency Dependency Availability (%) Overhead (%) Size (bytes) 0 98. . 3907 5860 2 99. The decomposition alternatives # 0 and # 1 have the lowest function dependency overhead and data dependency size.3: Estimated Availability. Quantitative Analysis and Optimization of Software Architecture. 9803 7771 5 99. That means that they introduce relatively less performance overhead and it is easier to implement them.

106 Chapter 5. it might be infeasible for the model checker to evaluate the availability in an acceptable period of time. for some decomposition alternatives. For such cases.6. In such cases it can take too much time to evaluate all the alternatives with analytical models although the models are generated automatically.. we calculate for each RU the MTTR and MTTF. respectively. Quantitative Analysis and Optimization of Software Architecture. respectively.. there can be too many feasible decomposition alternatives.11 Optimization In our case study. we also provide an option to select a decomposition among the feasible alternatives with heuristic rules and optimization techniques (See Figure 5. As a result. + λn [101]. . the state space of the model can grow too big despite of the reductions. we have evaluated a total of 6 feasible decomposition alternatives. RU RUx CriticalityRUx × In equations 5.f unc. The following objective function is adopted by the optimization algorithms based on heuristics.. = min. Xn ) is also exponentially distributed with rate λ1 + λ2 + .7) (5.. Consequently. Depending on the specified constraints and the number of modules.6) (5. λ2 . the failure rate of a RU (1/M T T FRUx ) is equal to the sum of failure rates of the modules that are comprised by the corresponding RU. . Moreover... In Equation 5. we calculate the Criticality of a RU by simply summing up the Criticality values of the modules that are com- .9) 1/M T T FRUx = m∈RUx CriticalityRUx = m∈RUx Criticalitym M T T RRUx M T T FRUx obj. M T T RRUx = m∈RUx M T T Rm 1/M T T Fm (5. X2 . The Optimizer tool employs optimization techniques to automatically select an alternative in the design space.. This activity is carried out with the Optimizer tool of Arch-Studio Recovery Designer as shown in Figure 5.8) (5..4).. and λn . .. As a result. the number of RUs can be too large.7.. If X1 . X2 . That is why the MTTR for a RU is simply the addition of MTTR values of the modules that are comprised by the corresponding RU. The calculation of MTTF for a RU is based on the assumption that the failure probability follows an exponential distribution with rate λ = 1/M T T F . 5.6 and 5. than min(X1 . and Xn are independent exponentially distributed random variables with rates λ1 .8. The calculation of MTTR for a RU is based on the fact that all the modules comprised by a RU are recovered sequentially.

1243350. The Optimizer starts from the worst decomposition with respect to availability.. Hill-climbing algorithm starts with a random (potentially bad) solution to the problem. which correspond to the decompositions shown in Figure 5. exhaustive search and hill-climbing algorithm. By incrementing or decrementing one element in this string. Libvo] [Demuxer] [Stream] }. It took 89 seconds for the Optimizer to find this decomposition with exhaustive search.2.9 shows the objective function that is utilized by the optimization algorithms. the optimal decomposition (ignoring the deployment and domain constraints) based on the heuristic-based objective function (Equation 5. Demuxer. which are weighted based on the Criticality values of RUs. As explained in section 5. During the generation process a new RU can be created by moving a module to a new RU. It sequentially makes small changes to the solution. we end up with the RG strings 1242350. 1240352. The . Libmpcodecs. Then it systematically generates a set of neighbor decomposition by moving modules between RUs (called as a partition in [83]). respectively. 1240340. where the elements in the string correspond to the modules Mplayer. Libao.9) is { [Mplayer] [Gui] [Libao] [Libmpcodecs. 1245350. all the alternatives are evaluated and compared with each other based on the objective function. the objective function is to minimize the M T T R/M T T F ratio in total for all RUs. each time improving it a little bit. respectively. 107 prised by the RU.3. For the MPlayer case. MTTR of the system must be kept as low as possible and MTTF of the system must be as high as possible. where all modules of the system are placed in a single RU. Equation 5. Libvo and Gui. Currently it supports two optimization techniques. Then the decomposition { [Libmpcodecs. at which point the algorithm terminates. 1240353. 1240351. The Optimizer utilizes the hill-climbing algorithm in a similar way to the clustering algorithm in [83]. consider the set of RG strings of length 7. At some point the algorithm arrives at a point where it cannot see any improvement anymore. The Optimizer generates neighbor decompositions of a decomposition by systematically manipulating the corresponding RG string. The Optimizer tool can use different optimization techniques to find the design alternative that satisfies the objective function the most. 1240354.21. hill-climbing algorithm can be utilized to search the design space faster but ending up with possibly a sub-optimal result with respect to the objective function. For example. As a heuristic based on this fact. If the design space is too large for exhaustive search. Libvo] [Mplayer] [Gui] [Demuxer] [Libao] [Stream] } is represented by the RG string 1240350. 1246350. Stream. 1244350. In this case of exhaustive search. A decomposition N P is defined as a neighbor decomposition of P if N P is exactly the same as P except that a single module of a RU in P is in a different RU in N P ..Chapter 5. to maximize the availability. It is also possible to remove a RU when its only module is moved into another RU. Quantitative Analysis and Optimization of Software Architecture. 1240355 and 1240356.

For example. In this case. A total of 76 design alternatives had to be evaluated and compared by the hill-climbing algorithm.. This is indeed one of the best alternatives though the results in Table 5.16 has been selected as the optimal decomposition. ! " # $ % Figure 5.3 show that there is a small difference of availability especially compared to the alternative # 2. we have rather selected the alternative #2 because of its relatively less function dependency overhead and smaller data dependency size.21: The neighbor decompositions of the decomposition { [Libmpcodecs.108 Chapter 5. Libvo] [Mplayer] [Gui] [Demuxer] [Libao] [Stream] } hill-climbing algorithm terminated on the same machine in 8 seconds with the same result. instead of 877 alternatives in the case of exhaustive search. Selection of decomposition alternatives with heuristic rules and optimization techniques is a viable approach for evaluating large design spaces although the evaluation of alternatives might lead to different results than the evaluation based on analytical models. Quantitative Analysis and Optimization of Software Architecture. . under the consideration of the specified constraints in the MPlayer case study. the decomposition alternative # 5 as listed in Figure 5..

Demuxer. compare and select decompositions among many alternatives. To assess the correctness and accuracy of these techniques. where the module Gui. To be able to measure the availability achieved with these three implementations. we have performed real-time measurements from systems that are decomposed for local recovery. we have modified each module so that they fail with a specified failure rate (assuming an exponential distribution with mean MTTF ). each time it is activated. We have selected these decomposition alternatives because they have the lowest function dependency overhead and data dependency size.. We have implemented local recovery for a total of 3 decomposition alternatives of MPlayer. Quantitative Analysis and Optimization of Software Architecture. where all the modules are placed in a single RU ({ [Mplayer. Libao] [Gui] }) iii ) Local recovery with three RUs. Libao] [Gui] }). Gui. To implement local recovery.12 Evaluation The utilization of analytical models and optimization techniques helps us to analyze. where the module Gui is isolated from the rest of the modules ({ [Mplayer. The MTTF value of the corresponding module and the elapsed time is . Libao and the rest of the modules are isolated from each other ({ [Mplayer. Stream. Note that the 2nd and the 3rd implementations correspond to the decomposition alternatives # 0 and # 1 in Figure 5. the thread calculates the time elapsed since the initialization (Line 3).16. Libmpcodecs. After a module is initialized. Demuxer. Stream. Stream. we explain the performed measurements and the results of our assessments. i ) Global recovery. Libao] }) ii ) Local recovery with two RUs..Chapter 5. The operation of the thread is shown in Algorithm 2. Libvo. Libmpcodecs. Then. Libmpcodecs. Libvo. In this section. Algorithm 2 Periodically Activated Thread for Error Injection 1: time init ← currentT ime() 2: while TRUE do 3: time elapsed ← currentT ime() − time init 4: p ← 1 − 1/etime elapsed/M T T F 5: r ← random() 6: if p ≥ r then 7: injectError() 8: break 9: end if 10: end while The error injection thread first records the initialization time (Line 1). respectively. Demuxer. we have used a framework FLORA (Chapter 6). Libvo. it creates a thread that is periodically activated every second to inject errors. 109 5.

5732 98. Here.5.5. the design alternatives are ordered correctly with respect to their estimated availabilities. measured availabilities and the objective function used for optimization. As shown in Table 5.7537 99. Table 5. In Line 5. Table 5. To amplify the difference between the decomposition alternatives with respect to availability. Libao.5834 98. used for calculating the probability of error occurrence (Line 4). we have processed the log files to calculate the times. from a uniform distribution. The Recovery Manager component of FLORA (Chapter 6) logs the initialization and failure times of RUs to a file during the execution of the system. For the error injection. the rest 97. respectively. we have let the system run for 5 hours.4: Comparison of the estimated availability with analytical models. 2791 12. For each of the implemented alternatives. Decomposition Measured Estimated Heuristic Alternative Availability Availability Objective Function all modules in 1 RU 97. we have fixed MTTF to 1800s (= 0.4. Possibly an error is injected by basically creating a fatal error with an illegal memory operation. The second column shows the availability estimated with the analytical models for the corresponding decomposition alternatives. 5hrs) for all the modules but Libao and Gui. a sample value r ∈ [0. on which the module is running (Line 7).. random() returns. This value is compared to the calculated probability to decide whether or not to inject an error (Line 6).5hrs) as utilized by the analytical models and heuristics as explained in section 5. Then. This error crashes the process.57 In Table 5. when the core system module. the first column shows the availability measured from the running systems. For these two modules.4 lists the results for the three decomposition alternatives. Mplayer has been down.5880 6. and such evaluate the results based on analytical models better. The third column shows the objective function value calculated to compare these alternatives during the application of optimization techniques.5 .110 Chapter 5. We have calculated the availability of the system based on the total time that the system has been running ( 5 hours). we see that the measured availability and the estimated availability are quite close to each other. we have specified MTTF values as 60s and 30s.72 Gui. the rest 97. the heuristic outcome and the measured availability.38 Gui. Quantitative Analysis and Optimization of Software Architecture. 1]. Table 5. we have repeated our measurements and estimations for varying MTTF values. In addition.. we have used the same MTTF values (0.6 shows the results. when we have used the MTTF values shown in Table 5. 9808 9.

comparing and selecting decomposition alternatives for local recovery. . In fact. We can also see that the measured availability and the estimated availability values (in %) are quite close to each other.31 97. the various communications between the software modules. However. Modules MTTF (s) Libao 60 Libmpcodecs 1800 Demuxer 1800 Mplayer 1800 Libvo 1800 Stream 1800 Gui 30 both for the analytical models and the error injection threads. the heuristic outcome and the measured availability in the case of varying MTTF s.83 Again..25 98..5 1.75 Estimated Availability 83. the measured availability is lower than the estimated availability.6: Comparison of the estimated availability with analytical models. 111 Table 5. Quantitative Analysis and Optimization of Software Architecture. diagnosis and communication among multiple processes. measured availabilities and the objective function used for optimization. Table 5. This is due in part to the communication delays in the actual implementation which are not accounted for in the analytical models. the communication is instantaneous). we can conclude that our approach can be used for accurately analyzing. in reality.70 Heuristic Objective Function 0. the rest Measured Availability 83. abstract away any communicaion time delay (i.60 93. Decomposition Alternative all modules in 1 RU Gui. we observe that the design alternatives are ordered correctly with respect to their estimated availabilities. which are modeled using interactive transitions in the I/O-IMC models.Chapter 5.24 2. Based on these results. the recovery time includes the time for error detection.6.27 92. In general. Libao. in Table 5.e. the rest Gui.5: M T T F values that are used for the repeated estimations and measurements (MTTR and Criticality values are kept as same as before). which are subject to delays due to process context switching and inter-process communication overhead.

. There are also partial failures. Frequency of data access In data dependency analysis (Section 5. where data is accessed through function calls. In these cases. we have specified deployment constraints (number of possible recoverable units). the failure of the whole system would be the most annoying of all. on the other hand. The shared data size is important since such data should be synchronized back after each recovery. the corresponding parts of the system are refactored to fullfill this requirement. Quantitative Analysis and Optimization of Software Architecture. The number of data access (column count in Table 5. to take into account different type of partial failures. experiments are needed to be conducted to determine their actual effect on users [28].112 Chapter 5. The more and the better we can specify the corresponding constraints the more we can reduce the design space of decomposition alternatives. These are also failures from the user’s perspective but we have neglected these failures to have a common basis for comparison with global recovery..2). GUI panel vanishes and audio stops playing until the corresponding RUs are recovered. In this work. .8.15) shows this. For each direct data access without a function call. So. Specification of decomposition constraints For defining the decomposition alternatives an important step is the specification of constraints. we have evaluated the size of shared data among modules. In any case. if there are any. The correlation between function dependency overhead and data dependency size (Figure 5. is important due to redirection of the access through IPC. Also different users might have different perceptions. This is mostly covered by the function dependency analysis.2). In fact. where RU GUI and RU AUDIO are restarted. the user might get annoyed differently from partial failures. After this refactoring. the function dependency analysis can be repeated to get more accurate results with respect to the expected performance overhead. 5. functional importance of the failed module and the recovery time. Depending on the application. domain constraints and feasibility constraints (performance overhead thresholds). our implementation of local recovery (explained in Chapter 6) requires that all data access must be performed through explicit function calls.13 Discussion Partial failures We have calculated the availability of the system according to the proportion of time that the core system module. Mplayer has been down.

The part of our approach. Nevertheless. The goal of the clustering is to reverse engineer complex software systems by creating views of their structure. multiple scenarios can be repeated in a period of time and the overhead can be calculated based on the profile information collected during this time period. These are general-purpose approaches. we have performed our measurements for the videoplaying scenario. execution times of functions and the data access profile can vary depending on the usage scenario and the inputs that are provided to the system.Chapter 5. In principle. In our case study. a possible extension of the approach is to improve the expressiveness power of the constraint specification. we have proposed a dedicated analysis approach that is required for optimizing the decomposition of architecture for local recovery in particular.g. In [83] several optimization techniques are utilized to cluster the modules of a system. where we utilize optimization techniques as explained in section 5. These approaches help to identify the risks. where statistically combining the results would be wrong. which are not dedicated to evaluating the impact of software architecture decomposition on particular quality attributes. sensitivity points and trade-offs in the architecture. 113 The domain constraints are specified with requires and mutex relations. The analysis process will remain the same although the input profile data can be different. using firstorder logic can be an option though in certain cases. The results obtained from several system runs are statistically combined by the tools [40]. If there is a high variance among the execution of scenarios. For example. As such. Quantitative Analysis and Optimization of Software Architecture. 46]).. this would require selection and prioritization of a representative set of scenarios with respect to their importance from the user point of view (user profile) [28]. This has shown to be practical for designers who are not experts in defining complicated formal constraints. . 5. In this work. 31. However. Impact of usage scenarios on profiling results The frequency of calls. the evaluation of the logical expressions and checking for conflicts can become NP-complete (e.. it is possible to take different types of usage scenarios into account.11 is very similar to and inspired from [83]. which is a common usage scenario for a media player application. workarounds and simplifications can be required to keep the approach feasible and practical. the satisfiability problem).14 Related Work For analyzing software architectures a broad number of architecture analysis approaches have been introduced in the last two decades ([19.

As far as local-recovery strategies are concerned. Using this specification. Consequently. In general however. . Formal consistency analysis of error containment wrappers is performed in [65]. However.8) is discussed in [25]. that use different analytical models. these models are specified manually for specific system designs and the methodology lacks a comprehensive tool-support.. For instance. There are several techniques for estimating reliability and availability based on formal models [121]. [122] presents a 3-state semi-Markov model to analyze the impact of rejuvenation6 on software availability and in [79]. we focus on different qualities. In our tool. whereas modularity is the main focus of [83]. the propagation of data (wrong value) errors is analyzed and a globally consistent set of wrappers are generated based on assertion checks. Although the underlying problem (set partitioning) is similar. In this approach. we utilize different objective functions for the optimization algorithms.. [15] 6 Proactively restarting a software component to mitigate its aging and thus its failure. There are similar approaches that are based on different ADLs or. which are generated based on an architectural model expressed in an extended xADL. making these models less practical to use. In this work. Quantitative Analysis and Optimization of Software Architecture. where a method and a set of guidelines are presented for wrapping COTS (commercial off-the-shelf) components based on their undesired behavior to meet the dependability requirements of the system. architectural models that are expressed in AADL (Architecture Analysis and Design Language) has been used as a basis for generating Generalized Stochastic Petri Nets (GSPN) to estimate the availability of design alternatives [102]. their evaluation techniques are different: whereas we predict the availability at design time based on formal models. augmented UML diagrams are used in [80] for generating Timed Petri Nets (TPN). As a result. the work in [15] is similar to ours in the sense that several decomposition alternatives are evaluated for isolating system components from each other. [72] presents a Markov model to compute the availability of a redundant distributed hardware/software system comprised of N hosts.114 Chapter 5. the process rather slow and error-prone. the authors use stochastic Petri nets to model and analyze fault-tolerant CORBA (FT-CORBA) [89] applications. system components and their composition are expressed based on the concepts of category theory. A recoverable unit can be considered as a wrapper of a set of system modules to increase system dependability. For instance. quantitative availability analysis is performed with Markov models. we generate formal models for different decomposition alternatives for local recovery automatically. The utilization of the iC2C (See section 4. For example. UML based architectural models have also been used for generating analytical models to estimate dependability attributes. Our goal is to improve availability.

Chapter 5.10. Quantitative Analysis and Optimization of Software Architecture. porting or integration (e.1. with recovery mechanisms in this case). 5. 115 uses heuristics to choose a decomposition alternative and evaluate it by running experiments on the actual implementation. The approach is supported by an integrated set of tools.g. There exist many alternatives to select the set of recoverable units and each alternative has an impact on availability and performance of the system.g. These tools in general can be utilized to support the analysis of a legacy system that is subject to major modification. We have utilized implementations of three decomposition alternatives to validate our analysis approach and the measurements performed on these systems have confirmed the validity of the analysis results. The assumptions listed in section 5.. However. We have seen that the performance overhead introduced by decomposition alternatives can be easily observed. The outcome of the analysis is also dependent on the provided MTTF values for the modules. In this chapter. we have proposed a systematic approach to evaluate such decomposition alternatives and balance them with respect to availability and performance.. we have used fixed values and what-if analysis. The estimation of actual failure rates is usually the most accurate way to define the MTTF values. a problem database). .15 Conclusions Local recovery requires the decomposition of software architecture into a set of isolated recoverable units. this requires historical data (e. The proposed approach requires the existence of the source code of the system for performance overhead estimation. which can be missing or not accessible. This estimation is subject to the basic limitations of dynamic analysis approaches as described in section 5.1 constitute another limitation of the approach concerning the modelbased availability analysis. In our case study. We have illustrated our approach by analyzing decomposition alternatives of MPlayer.

116 Chapter 5.. Quantitative Analysis and Optimization of Software Architecture. ..

2. Section 6.4. new supplementary architectural elements and relations should be implemented to enable local recovery. we introduce a framework called FLORA to reduce this effort. we first outline the basic requirements for implementing local recovery. 117 .3 illustrates the application of FLORA with the MPlayer case study. introducing local recovery to a system leads to additional development and maintenance effort. In seciton 6. In the following section. After one of the design alternatives is selected. In addition. the software architecture should be structured accordingly. In this chapter. As a result.Chapter 6 Realization of Software Architecture Recovery Design In the previous two chapters. we introduce the framework. FLORA supports the decomposition and implementation of software architecture for local recovery. The chapter is organized as follows. limitations and related work. we have explained how to document and analyze design alternatives for local recovery. We evaluate the application of FLORA in section 6. We conclude the chapter after discussing possibilities.

To prevent this failure and support local recovery we need to be able to decompose the system into a set of recoverable unit (RU) that can be isolated (e. for example. for instance. Realization of Software Architecture Recovery Design 6.1 Requirements for Local Recovery Introducing local recovery to a system imposes certain requirements to its architecture. To prevent this interference. 6. • System-Recovery Coordination: In case recovery actions need to take place while the system is still operational. which do not directly communicate to each other. the required recovery actions should be coordinated and the communication control should be leaded according to the applied recovery actions. interference with the normal system functions can occur. Based on the literature and our experiences in the TRADER [120] project we have identified the following three basic requirements: • Isolation: An error (e.118 Chapter 6. Therefore. • Communication Control : Although a RU is unavailable during its recovery.g. In [16].g. Isolation is usually supported by either the operating system (e. by queuing and retrying messages or by generating exceptions. crash-only design) [16]. process isolation [59]) or a middleware (e. the other RUs might still need to access it in the mean time.2 FLORA: A Framework for Local Recovery To reduce the development and maintenance effort for introducing local recovery while preserving the existing decomposition we have developed the framework FLORA. the communication between RUs must be captured to deal with the unavailability of RUs. It relies on the capabilities of the platform to be able to . hierarchical or centralized approaches. Similar to communication control. RUs should not directly communicate with each other. The framework has been implemented in the C language on a Linux platform (Ubuntu version 7. illegal memory access) occurring in one part of the system can easily lead to a system failure (e. crash). Similar to loosely coupled components of the C2 architectural style [114]. FLORA assigns RUs to separate operating system (OS) processes.g.g. the communication is mediated by an application server.g.04) to provide a proof-of-concept for local recovery. isolation of the memory space). In general.g. various alternatives can be considered for realizing the communication control like completely distributed. coordination can also be realized in different ways ranging from completely distributed to completely centralized solutions. encapsulation of Enterprise Java Bean objects) and the existing design (e.

which mediates all inter-RU communication and employs a set of communication policies (e. drop. restart) in accordance with the recovery strategy being executed. each RU uses the CM.Chapter 6.g.g. queue. Realization of Software Architecture Recovery Design 119 ! ## $$ ## $$ ## $$ " Figure 6.1: A Conceptual View of FLORA control/monitor processes (kill. detect a crash) and focuses only on transient faults. The CM . Figure 6. illegal instruction. Recovery strategies can be composed of a combination of other primitive or composite strategies and they can have multiple states. The framework comprises three main components: Recoverable Unit (RU). Each RU can detect deadlock errors via its wrapper. The RM also controls RUs (e. Communication Manager (CM) and Recovery Manager (RM). which detects if an expected response to a message is not received within a configured timeout period. They are also coupled with error types. communication control between RUs. retry messages). The RM uses the CM to apply these policies based on the executed recovery strategy. while the other RUs can remain available.1 shows a conceptual view of FLORA as a UML class diagram. FLORA implements the detection of two type of errors: deadlock and fatal errors (e. The framework includes a set of reusable abstractions for introducing error detection. invalid data access). diagnosis. FLORA restarts the corresponding RU in a new process and integrates it back to the rest of the system. from which they can recover.g. kill. Note that a communication policy can be composed of a combination of other primitive or composite policies. We assume that a process crashes or hangs in case of a failure of the RU that runs on it. restart. To communicate with the other RUs.

2(a)).7(a) and 4. In the corresponding recovery design.3. Realization of Software Architecture Recovery Design can discover the RU that fails to provide the expected message. It is the parent process of all RUs and it receives a signal from the OS when a child process is dead. where the module Gui. the module Gui and the rest of the modules are placed in different RUs (Figure 6. In Chapter 4.120 Chapter 6. a RU wrapper template. in Figure 6. we can see the 3 RUs.3(a) and 6. RU MPCORE. error detection and diagnosis mechanisms.2(c)) Figure 6. In addition. the recovery style was used for depicting these design alternatives.3(a). In this chapter.3 depicts the design of MPlayer for each of the decomposition alternatives after local recovery is introduced.3(c) actually represent the same designs that were depicted before in the figures 4. Diagnosis information is conveyed to the RM. RM can detect fatal errors.2(a) comprises only one RU. In the second alternative. By handling this signal. The fist decomposition alternative places all the modules in a single RU (Figure 6.3(c). The third decomposition alternative comprises 3 RUs. message serialization / de-serialization primitives. FLORA applies global recovery and restarts the whole system. Hence. . If either of the CM or the RM fails. the components CM and RM have been introduced by the framework. respectively. In the following section. FLORA comprises Inter-Process Communication (IPC) utilities. Libao and the rest of the modules are separated from each other (Figure 6.6 and 5. we explain how FLORA can be applied to adapt a given architecture for local recovery. 6. The first decomposition alternative shown in Figure 6. the CM is not used in this case and it does not take place in the corresponding recovery design shown in Figure 6. Then it restarts the corresponding RU in a new process.2(b)). Note that the figures 6. one central RM and one central CM that communicate with one or more instances of RU. which kills and restarts the corresponding RU. The third decomposition alternative was previously discussed in the previous two chapters and it was presented in the figures 4. we use the local recovery style to represent the same design alternatives in more detail.2). the RM can also discover the failed RU based on the associated process identifier.3 Application of FLORA We have used FLORA to implement local recovery for 3 decomposition alternatives of MPlayer (See Figure 6. The implementations of these alternatives have been used in the evaluation of our analysis approach as presented in Chapter 5.7(b). RU GUI and RU AUDIO.

Realization of Software Architecture Recovery Design 121 (a) all the modules are placed in one RU (b) Gui and the rest of the modules are placed in different RUs (c) Gui.6) . Libao and the rest of the modules are placed in different RUs Figure 6.2: Realized decomposition alternatives of MPlayer (KEY: Figure 4.Chapter 6.

3: Recovery views of MPlayer based on the realized decomposition alternatives (KEY: Figure 4. Libao and the rest of the modules are placed in different RUs Figure 6.122 Chapter 6.4) . Realization of Software Architecture Recovery Design (a) all the modules are placed in one RU (b) Gui and the rest of the modules are placed in different RUs (c) Gui.

Figure 6. Hereby. function calls are redirected through IPC to the corresponding interface with C MACRO definitions. allocated resources) can be specified (lines 10-12) as a preparation for recovery. etc. Each interface defines a set of functions that are marshaled [32] and transferred through IPC. Later. .h” that is related to check-pointing. Realization of Software Architecture Recovery Design 123 Each RU can detect deadlock errors. In Figure 6. It wraps and isolates the Gui module of MPlayer. Post-recovery initialization (lines 14-20) by default includes maintaining the connection with the CM and the RM (line 15). On reception of these calls. we will show and explain a part of the utilities provided by “recunit.h” includes a set of standard utilities for RUs. Additional RU-specific initialization actions can also be specified here. To apply FLORA. which kills RUs and/or restarts dead RUs. To make use of this utility a set of state variables.g. FLORA enables a combination of check-pointing and log-based recovery to preserve state. The wrapper includes the necessary set of utilities for isolating and controlling a RU (lines 1-3). string. On the other hand. the corresponding functions are called and then the results are returned (lines 28-31 and 34-37). which are captured based on the specification in the wrapper (lines 22-26). their size. where all calls to the function guiInit are redirected to the function mpcore gui guiInit (line 1).5 a code section is shown from one of the modules of RU MPCORE. provided that an appropriate weaver for the C language is available. If needed.Chapter 6. obtaining the check-pointed state variables (line 17) and start processing incoming messages from other RUs (line 19). Each RU provides a set of interfaces. RUs can also use stable storage utilities provided by FLORA to preserve some of their internal state after recovery. which can control the communication accordingly. All error notifications are sent to the CM.4 shows a part of the RU wrapper that is used for RU GUI. each RU is wrapped using the RU wrapper template. Messages that are sent from RUs to the CM are stored (i. queued) by RUs in case the destination RU is not available and they are forwarded when the RU becomes operational again. In all other RUs where this function is declared. which activates the corresponding interface (IN T ERF ACE GU I) instead of performing the function call (lines 4-6).h” and “recunit.h” includes the list of function signatures. whereas “util. “rugui. In principle we could also use aspect-oriented programming techniques (AOP) [37] for this.e. The CM conveys diagnosis information to the RM.) must be declared in the wrapper (lines 5-7). So. and format (integer. The RM can detect fatal errors. cleanup specific to the RU (e.

3%: % 2 %9 ! : 9 * :'A !" 5 %* . Realization of Software Architecture Recovery Design ! # ) / 0 1 $ % & % *+%.'4+ .A 9 * :'A ! " B ! ! 'A "B ". % 2 !"> 3%:! : %* . . % 2 " # ) / 0 1 8 2 % 8 + .5: Function redirection through RU interfaces . & '( 2 67 8 2 : ' 34!" 5 %3 4 ' 4 76 9 ! %3! +. " & '( . !" 5 93 .! ! 'A "B "'A " .=! : :@ ' %* . !" 5 93 .<&+ : 8 Figure 6. 93 ..4: RU Wrapper code for RU GUI !"#$ ) + * " %& ' ( Figure 6.< & = .< & %$ .124 Chapter 6. !"> = % .9%- " 5 +..= .+::" # ) / 0 1 &% 4 8 2 ' ? 9 : .A " 2 % # ) / 0 1 + .

6: Standard RU utilities defined in recunit.0' )*) 1*)*". .4) based on the message exchanges through the RU interface. These utilities are imported by each RU wrapper by default (Line 2 in Figure 6. The internal state of the RU is then updated based on the specified format (Line 7 in Figure 6. + < -" : ". Realization of Software Architecture Recovery Design 125 Figure 6. RM. CM) of messages that will be sent through the defined interfaces (Lines 22-27) . Stable storage runs as a separate process in FLORA and it accepts get/set messages for retaining/saving data. The definitions that are shown in Figure 6.7. The check-pointing locations are defined in the RU wrapper by the designer (Line 38 in Figure 6. Figure 6. 589 88 9 )1*)*" . the following information is included in the configuration file presented in Figure 6.4). • the set of RUs (Lines 2-5) • the set of interfaces (Lines 8-11) • the identifiers and initialization functions of RUs (Lines 14-17) • for each RU. -.6 are related to the checkpointing (Lines 3-4) and preservation of state (Lines 7-16). Figure 6. the destination (the other RUs. For example.Chapter 6. '( )*) + .4).h concerning check-pointing and state preservation Besides the definition of RU wrappers.7 shows a part of the configuration file that was used for the recovery design shown in Figure 6. Basically. ! # $ % & / - " ! '( )*) + 0' )*) 1*)* . The PRESERVE STATE MACRO obtains the saved data from the stable storage. which defines a mapping of the saved data to the internal state variables. # $ % 0*38 :89 398 '( )*) = 0*38 :89 398 0' )*) 1*)* > 3 0' )*) 1*)*".4) to the stable storage. The CHECKPOINT MACRO sends the data that is defined in the RU wrapper (Line 6 in Figure 6. '2*3 4"5*667' -".6 shows a part of the utilities provided by “recunit. the designer should configure FLORA according to the recovery design.h” as a set of C MACRO definitions.3(c).

However. Table 6. FLORA guarantees the correct execution and recovery of these RUs.$ .$ - 0 = = 1** *+ + < 7 /= < 7 ! < 7 # & ' ( Figure 6. ! " $ %" # "*+ "*+ "*+ .$ .$ - . 3 3 ! "4 ! "4 8 5 ! "6 7 5 -. we had to write approximately 1K LOC to apply FLORA for the third decomposition alternative.$ .7: Configuration of FLORA 6.4 Evaluation If all the function calls that pass the boundaries of RUs are defined.1. The LOC written for wrappers is negligible compared to the corresponding system parts that are wrapped. LOC of its wrapper (LOCRU wrapper ) and their ratio ((LOCRU wrapper /LOCRU ) × 100). The main effort is spent due to the definition of the RU wrappers.$ ."*+ :"*%. For the decomposition shown in Figure 4. 0 *+ 1/ *+ / 2 # & ' ( ) . As we can see from Table 6. Realization of Software Architecture Recovery Design # & ' ( ) . the specification of the RU boundaries with the RU wrapper template requires an additional effort. 1 . ! :"*%.126 Chapter 6.6.1 shows the LOC for each RU (LOCRU ). which are encapsulated mostly in Libmpcodecs.! 6 7 9 : !"* ."*+ :"*%. The size We have excluded the source code for the various supported codecs."*+ *% 7 "< 7 . we have measured this effort based on the lines of code (LOC) written for RU wrappers and the actual size of the corresponding RUs1 .

This is because the wrapper captures only the interaction of a RU with the rest of the system. We have calculated Equation 6.61% 0.1 for each possible partition and we determined the minimum and maximum LOC estimations with respect to the number of RUs. Therefore.6). of course. the CM.1. In fact.42% of the wrapper becomes even less significant for bigger system parts. increases.22% 1.8. For this reason. the wrapper size is independent of the size and internal complexity of the system part that is wrapped. .1) for calculating the number of function calls between the selected RU boundaries. There should be a wrapper for each RU with some default settings (Figure 6.Chapter 6. To calculate Equation 6.1: LOC for the selected RUs (as shown in Figure sponding wrappers and their ratio LOCRU LOCRU wrapper RU MPCORE 214K 463 RU GUI 20K 345 RU AUDIO 8K 209 TOTAL 242K 1017 127 4. we have considered only transient faults.1 estimates the LOC needs to be written for wrappers based on the following assumptions. In FLORA. we can see the range of LOC estimations with respect to the number 2 The performance overhead of these components at run-time. LOC for these components remains the same when we increase the number of RUs2 . mainly the specification of RU wrappers causes the extra LOC that should be written to apply FLORA. In this scope. We have generated all the set of possible partitions for varying number of RUs. there is a standard implementation for the RM. To be able to estimate the LOC to be written for wrappers. capturing and processing its arguments and return values.4). For each such function call we add a fix amount of LOC (15) taking into account the code for redirection of the function. LOCtotal = 30 × |RU | + 15 × r∈RU calls(r → r) (6.72% 2. all function calls between RUs must be defined in the corresponding wrappers. The results can be seen in Figure 6. In Figure 6. we have used the following equation.8.8. Realization of Software Architecture Recovery Design Table 6. which lead to a crash or hanging of a OS process. we have used MFDE (Section 5. the equation includes a fix amount of LOC (30) times the number of RUs (|RU |).1) Equation 6. LOC for the correratio 0. Hence. In addition. and the RU wrapper utilities regardless of the number of and type of RUs.

while preserving the existing decomposition.8 also marks the LOC written in the actual implemented case with 3 RUs as presented in Table 6. FLORA itself includes source code of size 1. the minimum and the maximum values are equal for these cases. Realization of Software Architecture Recovery Design 4000 3500 3000 Lines of Code 2500 2000 1500 1000 500 0 1 2 3 4 Number of RUs 5 6 7 1017 Figure 6. Libvo and Libmpcodecs) compared to the other modules. we can see that several decompositions with 6 RUs require less overhead compared to the maximum overhead caused by a decomposition with 2 RUs. the analysis can point out exceptional decompositions. Thus. This means that there are certain modules that are exceptionally highly coupled (e. Depending on how homogeneous the coupling between system modules is. Therefore. When the number of RUs is equal to 1.h”). Figure 6.128 Chapter 6.1. For instance. in the analysis of the MPlayer case. we can .h” and “recunit. which was reused as is. When the number of RUs is equal to 7. which is the number of modules in the system. we can provide an approximate prediction of the effort for using the framework before the actual implementation. there also exists only one possible partitioning of the system (each RU comprises a module of the system). the RM and standard utilities of RU wrappers (“util. there exists logically only one possible partitioning of the system.5 Discussion Refactoring the software architecture FLORA aims at introducing local recovery to a system.8: Estimated LOC to be written for wrappers with respect to the number of RUs of RUs. To decide on the partitioning of modules into RUs. This includes the source code of the CM. 6.5K LOC.g.

Candea et al. which keeps the status of the media player (e. where local recovery is applied to increase the availability of Java-based Internet systems. stopped).Chapter 6. However. they provide a good ratio of cost over improvement of the dependability. Restructuring for decoupling shared variables and removing function dependencies can be considered in case the architecture is already being refactored to improve certain qualities like reusability. Declared state variables are check-pointed and automatically restored after recovery by FLORA. maintainability or evolvability. one can also consider to refactor the software architecture to remove existing dependencies and shared data among some of the modules. particular states that are critical for modules are application dependent. According to the categorization of this survey. likewise other SV libraries [26] it does not support local recovery. in the wrapper of RU GUI (See Figure 6. which collects reusable software components for backward recovery. State preservation State variables that are critical and need to be saved can be declared in RU wrappers. SV libraries are said to be limited in terms of separation of concerns. In fact. this is a viable approach if the architects/developers have very good insight in the system. a survey of approaches for application-level fault tolerance is presented. syntactical adequacy and adaptability [26]. where the designer can reuse existing. For instance. The system can be analyzed to select a partitioning that requires minimal effort to introduce local recovery as discussed in Chapter 5 and in the previous section. The value of this variable is restored after the Gui module is restarted.Playing is declared (Line 6). introduced the microreboot [16] approach. On the other hand. it would be better to treat modules as black boxes and wrap them with FLORA. Microreboot . Realization of Software Architecture Recovery Design 129 take coupling and shared data among the modules into account.6 Related Work In [26]. Otherwise.4).g. An example SV library is libft [58]. The effort needed to refactor the architecture depends very much on the nature and semantics of the architecture. guiInfStruct is a C structure (struct [67]) and Playing is a variable within this structure. Such states must be known and explicitly declared in the corresponding wrappers by the developers. paused. On the other hand. FLORA falls into the category of single-version software fault tolerance libraries (SV libraries). long-tested and sophisticated pieces of software [26]. the state variable guiInfStruct. playing. 6. However.

the application software must be partitioned to be run on multiple processes. Erlang/OTP (Open Telecoms Platform) [38] is a programming environment used by Ericsson to achieve highly available network switches by enabling local recovery. called workers. A supervision tree assigns system modules in separate processes. In [53]. We have implemented FLORA basically using macro-definitions in the C language. where components are isolated from each other and their state information is kept in state repositories. Realization of Software Architecture Recovery Design aims at recovering from errors by restarting a minimal subset of components of the system. Erlang is a functional programming language and Erlang/OTP is used for (re)structuring Erlang programs in so-called supervision trees. Designs of many existing systems do not have these properties and it might be too costly to redesign and implement the whole system from the start. crash-only design [16]). Mach kernel [96] also provides a micro-kernel architecture and flexible multiprocessing support.e. FLORA is used for restructuring C programs and partitioning them into several processes with a central supervisor (i. We consider this as a possible future work. a micro-kernel architecture is introduced. These processes are arranged in a tree structure together with hierarchical supervisor processes at different levels of the tree. where device drivers are executed on separate processes at user space to increase the failure resilience of an operating system. We could also implement FLORA using techniques [37] in which usually a distinction is made between base code on which additional so-called crosscutting concerns are woven. In case of a driver failure. the corresponding process can be restarted without affecting the kernel.e. Singularity [59] proposes multiprocessing support in particular to improve dependability and safety by introducing the concept of sealed process architecture. Each supervisor monitors the workers and the other supervisors among its children and restarts them when they crash. FLORA provides a set of reusable abstractions and mechanisms to support the refactoring of existing systems to introduce local recovery. To be able to exploit the multiprocessing support of the operating system for isolation. which can be exploited for failure resilience and isolation. The design of the operating system must support isolation between the core operating system and its extensions to enable such a recovery [53]. To employ microreboot. the RM). while making use of the multiprocessing support of the Linux operating system. a system has to meet a set of architectural requirements (i.130 Chapter 6. FLORA supports the partitioning of an application software and reduces the re-engineering effort. In particular the function redirection calls to IPC could be automatically woven using aspects. . Progressively larger subsets of components are restarted as long as the recovery is not successful. which limits the scopes of processes and their capabilities with respect to memory alteration for better isolation.

Chapter 6. AMW currently does not support the restructuring and partitioning of legacy software to introduce local recovery. Realization of Software Architecture Recovery Design 131 Some fault tolerance techniques for achieving high availability are provided by middlewares like FT CORBA [89]. we have presented the framework FLORA. data maintenance and invocations of check-pointing APIs similar to those specified within RU wrapper templates. . The realization effort for applying the framework appears to be relatively negligible. AMW generates code based on the desired fault tolerance behavior that is specified with a domain specific language. Aurora Management Workbench (AMW) [13] uses model-centric development to integrate a software system with an high availability middleware. AMW allows software components (or servers) to be assigned to separate RUs (capsule in AMW terminology) and restarted independently. However. it still requires additional development and maintenance efforts to utilize the provided techniques. while preserving their existing structure. FLORA provides reusable abstractions for introducing local recovery to software architectures. By this way. However. it aims at reducing the amount of hand-written code and as such reducing the developer effort to integrate fault tolerance mechanisms provided by the middleware with the system being developed. Developers should manually write code only for component-specific initialization. which supports the realization of local recovery. We have used FLORA to implement local recovery for 3 decomposition alternatives of MPlayer.7 Conclusions In this chapter. 6.

Realization of Software Architecture Recovery Design .132 Chapter 6.

the solutions that we have proposed and possible directions for future work. fault tolerance techniques have been introduced. we summarize the problems that we have addressed. which enable systems to recover from failures of its components and continue operation. constantly changing hardware technologies and integration with other systems in networked environments. these trends make it harder. to prevent or remove all possible faults in a system. Many fault tolerance techniques are available but incorporating them in a system is not always trivial. In the following sections. The size and complexity of software in such systems are increasing with every new product generation due to the increasing number of new requirements. Together with time pressure. 133 .Chapter 7 Conclusion Most home equipments today such as television sets are software-intensive embedded systems. if not impossible. To cope with this challenge.

Conclusion 7. This is because additional elements should be introduced and the existing system must be refactored. • Modeling Software Architecture for Recovery: It turns out that introducing recovery mechanisms to a system has a direct impact on its architecture design. Moreover. ii ) prioritize them. • Early Reliability Analysis: To select and apply appropriate fault tolerance techniques. complex interactions and even a particular decomposition for error containment. 69.134 Chapter 7. we need reliability analysis methods to i ) analyze potential failures. 70] mostly capture functional aspects of a system and they are limited for explicitly representing elements. Existing viewpoints [56. and iii ) identify sensitivity points of a system accordingly. because implementing the software architecture is a costly process it is important to apply the reliability analysis as early as possible before committing considerable amount of resources. • Optimization of Software Decomposition for Recovery: Local recovery is an effective approach for recovering from errors. One of the requirements for introducing local recovery to a system is isolation. To prevent the failure of the whole system in case of a failure of one of its components. It enables the recovery of the erroneous parts of a system while the other parts of the system are operational. We need an adequate integrated set of analysis techniques to optimize the decomposition of software architecture for local recovery. Each alternative has an impact on availability and performance of the system. which results in several new elements. we have addressed the following problems in designing a fault-tolerant system. In the consumer electronics domain. There exist many alternatives to decompose a software architecture for local recovery. • Realization of Local Recovery: It appears that the optimal decomposition for local recovery is usually not aligned with the decomposition based on functional concerns and the realization of local recovery requires substantial development and maintenance effort. the prioritization of failures must be based on user perception instead of the traditional approach based on safety. interactions and design choices related to recovery.1 Problems In this thesis. . the software architecture must be decomposed into a set of isolated recoverable units.

the process of doing such an explicit analysis has provided better insight in the potential risks of the system. Conventional reliability engineering includes mature techniques for failure analysis and evaluation techniques. MPlayer. The proposed techniques tackle different architecture design issues (modeling. SARAH is a specific purpose analysis method focusing on the reliability quality attribute. maintenance) and as such are complementary. analysis. Despite most scenario-based analysis methods which usually do not focus on specific quality factors.1 Early Reliability Analysis To select and apply appropriate fault tolerance techniques.2 Modeling Software Architecture for Recovery We have introduced the recovery style in Chapter 4 to document and analyze recovery properties of a software architecture. These views have formed the basis for analysis and support for the detailed design of the recovery mechanisms.Chapter 7. We have also introduced a local recovery style and illustrated its application on the open source media player. Conclusion 135 7. Software architecture analysis methods provide useful techniques for early analysis of the system at the architecture design phase. Recovery views of MPlayer based on the recovery style have been used within our project to communicate the design of our local recovery framework (FLORA) and its application to MPlayer. realization) at different stages of software development life cycle (early analysis. Besides the outcome of the analysis.2. unlike conventional reliability analysis techniques which tend to focus on safety requirements SARAH prioritizes and analyzes failures basically from the user perception because of the requirements of consumer electronics domain. 7.2. SARAH integrates the best practices of conventional reliability engineering techniques [34] with current scenario-based architectural analysis methods [31]. 7. we explain how we have addressed the aforementioned problems. To provide a reliability analysis based on user perception we have extended the notion of fault trees and refined the fault tree analysis approach. SARAH has been illustrated for identifying the sensitive modules for the DTV and it has provided an important input for the enhancement of the architecture. in Chapter 3 we have introduced the software architecture reliability analysis method (SARAH). .2 Solutions In this section. Further.

which supports the whole process of decomposition optimization. and iii ) balancing the feasible alternatives with respect to availability and performance. ii ) reducing the alternative space with respect to domain and stakeholder constraints. We have used FLORA to implement local recovery for the three decomposition alternatives of MPlayer that are used for validating the analysis results based on analytic models and optimization techniques. The measured results turn out to be closely matching the estimated availability. We have illustrated our approach by analyzing decomposition alternatives of MPlayer. .2.2.4 Realization of Local Recovery In Chapter 6 we have presented the framework FLORA that provides reusable abstractions to preserve the existing structure and support the realization of local recovery. We have implemented local recovery for three decomposition alternatives to validate the results that we obtained based on analytical models and optimization techniques. the realization effort for applying the framework and introducing local recovery appears to be relatively negligible. Conclusion 7. while the alternatives were ordered correctly with respect to their estimated and measured availabilities and the objective function used for optimization.3 Optimization of Software Decomposition for Recovery We have proposed a systematic approach in Chapter 5 for analyzing an existing system to decompose its software architecture to introduce local recovery. In addition. The application of the framework. 7. In total three recoverable units (RUs) have been defined.136 Chapter 7. Each tool automates a particular step of the process and it can be utilized for other types of analysis and evaluation as well. We have provided the complete integrated tool-set. which were overlaid on the existing structure without adapting the individual modules. as such. which are based on actual measurements regarding the isolated modules and their interaction. The approach enables i ) depicting the alternative space of the possible decomposition alternatives. provides a reusable and practical approach to introduce local recovery to software architectures. We have seen that the impact of decomposition alternatives can be easily observed.

13.1 and/or modify any of the four basic I/O-IMC models (Appendix B. As for future directions regarding model-based availability analysis. closer to the global optimum) selection of a decomposition alternative. We have introduced the recovery style based on the module viewtype.6. new heuristics and alternative optimization algorithms can be incorporated in the tool-set for a faster or better (i. We have made use of spreadsheets to perform the required calculations in SARAH. In addition. The other inputs of the analysis can also be improved like severity values for failures based on user perception and fault occurrence probabilities. Conclusion 137 7. Another improvement can be about the estimation of function and data dependencies. for which we have performed our measurements during a single usage scenario. we can increase the expressiveness power of the constraint specification. we can introduce additional architectural styles to represent particular fault tolerance mechanisms. Several points can be improved in the analysis and optimization of decomposition alternatives for recovery.7. This effort can be reduced by using model-driven engineering and aspect-oriented software development techniques as discussed in section 6.1) to more accurately represent the real system behavior. whether they correctly capture the error propagation). We have introduced the local recovery style as a specialization of the recovery style to represent a local recovery design in more detail.Chapter 7. we provide several future directions regarding each of the methods and tools presented in this thesis. possibly with a visual editor. further case studies or experiments can be conducted to evaluate the methods and tools within the context of various application domains. FLORA can be improved by reconsidering the fault assumptions and related fault tolerance mechanisms to be incorporated. Similarly.g. As a future work concerning all the proposed techniques. A better tool-support.e. one can revisit some or all of the assumptions made in section 5. FLORA is basically a software library and it still requires effort to utilize the library. The analysis can take different types of scenarios into account based on the usage profile of the system. The scenario derivation process in SARAH can be improved by utilizing a software architecture model with more semantics to be able to check properties of derived scenarios (e. we can also derive a recovery style for the component & connector viewtype or the allocation viewtype to represent recovery views of run-time or deployment elements. .10. As discussed in 5. As discussed in 4. can be provided to reduce the analysis effort.3 Future Work In the following.

Conclusion .138 Chapter 7.

respectively. Number of Failures. Figure A. Failure scenarios are labeled with the IDs of the associated architectural elements and the user perceived failures are annotated with severity values.1 shows the corresponding Architectural-Level analysis results. Hereby. PF and WPF stand for Architectural Element ID.Appendix A SARAH Fault Tree Set Calculations In the following.1 depicts the Fault Tree Set that was used for the analysis presented in Chapter 3. Table A. 139 . Weighted Number of Failures. AEID. NF. Percentage of Failures and Weighted Percentage of Failures. WNF.

64 TXT 4 27 10.06 AMR 5 49 13.70 2.41 4.41 4.140 Chapter A.64 CA 1 3 2.41 3.12 VP 1 4 2.41 3.1: Fault Tree Set Table A.70 2.70 2.51 25.58 PI 2 5 5.1: SARAH Architecture-Level Analysis results AEID NF WNF PF WPF AEID NF WNF PF WPF AC 2 6 5.73 VC 2 6 5.70 2. SARAH Fault Tree Set Calculations Figure A.26 GC 1 4 2.58 AP 1 4 2.09 EPG 2 8 5.41 4.41 7.70 2.81 13.11 7.06 PM 2 9 5.70 1.58 DDI 2 15 5.58 CH 2 9 5.41 4.70 2.06 .12 T 1 5 2.92 CMR 3 14 8.70 2.55 LSM 1 4 2.06 AO 1 5 2.06 CB 2 8 5.41 2.70 2.09 G 1 4 2.22 VO 1 5 2.

10.1.1. the starting state of any I/O-IMC is state 0 and the RUs are numbered starting from 1.1 Example I/O-IMC Models In this section.Appendix B I/O-IMC Model Specification and Generation B. we provide details on the 4 basic I/O-IMC models that we have used for availability analysis of decomposition alternatives for local recovery. As briefly explained in section 5. we use the running example with 3 modules and 2 RUs as depicted in Figure B. This example consists of a RM and two recoverable units (RUs). and the recovery manager (RM) I/O-IMC. these are the module I/O-IMC. the failure interface I/OIMC. By convention. the recovery interface I/O-IMC. To explain these models. 141 . RU 1 has one module A and RU 2 has two modules B and C.

In fact.2 The failure interface I/O-IMC Figure B.142 Chapter B.e. So. and once this happens it outputs an ‘up’ signal notifying the failure interface about its recovery (i. this interface behaves as an OR gate in a fault tree. transition from state 3 to state 0). The failure interface simply listens to the failure signals of modules B and C. B. ‘failed 2’) upon the receipt of any of these two signals. . this corresponds to modules B and C being initially operational. In state 1. I/O-IMC Model Specification and Generation & & & & ! "#$ % "#$ " Figure B. and outputs a ‘failure’ signal for the RU (i.e.e. ‘up 2’) when all the failed modules have output their ‘up’ signals.e. receiving signal ‘recovered B’ from the recovery interface). In state 2. the failure interface outputs an ‘up’ signal for the RU (i. 1. and it can fail with rate 0.2 shows the I/O-IMC of module B.3 shows the I/O-IMC model of RU 2 failure interface. consider the following sequence of states: 0. transition from state 2 to state 1).e. Signal ‘recovering 2’ is received from the recovery interface indicating that a recovery procedure of RU 2 has been initiated. the module awaits to be recovered (i. For instance. 7.1. The remaining input transitions are necessary to make the I/O-IMC input-enabled. and 0. 4.1 The module I/O-IMC Figure B.1: The running example with 3 modules and 2 RUs B. The module is initially operational in state 0.2 and move to state 2. the module notifies the failure interface of RU 2 about its failure (i. then B fails.1.

2 2 1 recovering_2? recovered_B? recovering_2? failed_B! Figure B. recovering 2 }. Output = {failed B. allowing it to start the RU’s recovery. The recovery interface receives a ‘start recover’ signal from the RM (transition from state 0 to state 1).3 The recovery interface I/O-IMC Figure B. then signal ‘up B’ is received from module B.4 shows the I/O-IMC model of RU 2 recovery interface. . A ‘recovering’ signal is then output (transition from state 1 to state 2) notifying all the modules within the RU that a recovery phase has started (essentially disallowing any remaining operational module to fail).Chapter B. of B and C ) take place both with rate 1 (transitions from state 2 to state 3 and from state 3 to state 4). Input = {recovered B. up B }. and finally RU 2 outputs its own ‘up’ signal.e. B.1. followed by RU 2 outputting its ‘failed’ signal. followed by two sequential ‘recovered’ notifications (transitions from state 4 to state 5 and from state 5 to state 0). Then two sequential recoveries (i.2: The module I/O-IMC model. I/O-IMC Model Specification and Generation recovered_B? 3 143 up_B! recovered_B? 0 recovering_2? recovered_B? recovering_2? rate 0.

up B. up C }.144 Chapter B. Output = {failed 2. . failed C. I/O-IMC Model Specification and Generation up_B? 7 up_C? up_C? failed_B? 6 failed_B? failed_2! failed_C? up_B? failed_C? failed_C? up_B? 5 failed_B? failed_C? failed_2! up_B? up_C? up_B? 2 failed_C? failed_C? up_C? up_B? 0 up_C? up_2! up_B? failed_B? 1 failed_C? failed_C? up_B? 3 failed_B? failed_B? up_C? failed_B? 4 up_C? failed_2! up_C? failed_B? Figure B.3: The failure interface I/O-IMC model. Input = {failed B. up 2 }.

one might modify the RM by implementing a different recovery policy. Input = {start recover 2 }. .5 shows the I/O-IMC model of the RM. to a certain extent. it is granted the ‘start recover’ signal (transition from state 4 to state 7). Output = {recovering 2. The RM monitors the failure of RU 1 and RU 2. the RM receives ‘up 2’ and both RUs are operational again.Chapter B. this is referred to as a first-come-first-served (FCFS) policy. recovered C }. this corresponds to both RUs being initially operational. The RM has internally a queue of failing RUs that keeps track of the order in which the RUs have failed. The recovery policy of RM is to grant a ‘start recover’ signal to the first failing RU. I/O-IMC Model Specification and Generation rate 1. locally modified without affecting the remaining models. 2. Finally. immediately followed by an RU 2 failure. the RM grants its recovery by outputting a ‘start recover’ signal. Since RU 1 failed first. RM grants the ‘start recover’ signal to RU 2 since RU 2 is still in the queue of failing RUs (transition from state 2 to state 6). For instance. then RU 1 fails. consider the following sequence of states: 0. B. and 0. 6. 7.4 The recovery manager I/O-IMC Figure B. In queuing theory literature. Note that any of the 4 I/O-IMC models presented here can be.1. 4. As an example.0 3 start_recover_2? recovering_2! start_recover_2? 5 start_recover_2? 1 start_recover_2? start_recover_2? 0 recovered_C! rate 1.Then. and when an RU failure is detected. recovered B.4: The recovery interface I/O-IMC model.0 2 start_recover_2? recovered_B! 4 145 Figure B. the RM waits for the ‘up’ signal of RU 1. 1. and once received.

to describe complex system model behaviors.5: The recovery manager I/O-IMC model. output actions. I/O-IMC Model Specification and Generation failed_2? 8 7 failed_1? failed_2? up_1? failed_1? failed_2? start_recover_2! failed_1? up_2? up_1? start_recover_1! up_2? failed_1? up_1? 6 failed_2? 5 4 failed_2? 3 up_2? failed_1? start_recover_2! up_1? up_1? failed_2? 2 failed_2? up_2? failed_2? up_2? up_1? 0 up_2? up_1? up_2? failed_2? start_recover_1! up_2? failed_1? up_1? 1 up_2? failed_1? up_1? failed_1? Figure B. Input = {failed 1.146 failed_1? Chapter B. failed 2. The MIOA language is used for describing concisely and formally an I/O-IMC. in [45] and for which. internal actions . up 1. input actions.6 and it is divided into 3 sections: i ) signature ii ) states. and iii ) transitions. In the signature section.} B. up 2 }. Lynch et al.2 I/O-IMC Model Specification with MIOA We use a formal language called MIOA [71] to describe I/O-IMC models. start recover 2 . such as control structures and data types. A MIOA specification is structured as shown in Figure B. Output = {start recover 1. in addition to interactive transitions. MIOA is based on the IOA language defined by N. Markovian transitions have been added. It provides programming language constructs.

The MIOA specification of the failure interface I/O-IMC model is shown in Figure B. the last transition (Lines 21-24) indicates that an RU ‘up’ signal (up RU!) is output if ‘set’ is empty (i. The set of data types specified in the states section determines the states of the I/O-IMC. Since I/O-IMCs are input-enabled. where ‘set’ (of size n) holds the names of the modules that have failed and ‘rufailed’ indicates if the RU has or has not failed. all modules 1 MIOA actions correspond to I/O-IMC signals. one output signal for the ‘failed’ signal of the RU and one output signal for the ‘up’ signal of the RU (Line 4). the precondition has to hold.e. for instance. ! " # $ ' ( && % & ! % && # $ ' ( % % && ! " % && % % Figure B. A precondition can be any expression that can be evaluated either to true or false. There are four kind of possible transitions. for example.Chapter B. The signature consists of actions that correspond to the failed/up signals of the n modules belonging to the RU (Line 3). Possible transitions of the I/O-IMC are specified in a precondition-effect style.6: Basic building blocks of a MIOA specification. there is no precondition specified for input actions. The initial state is also defined in the states section. The states of the failure interface I/O-IMC are defined (Lines 5-7) using Set and Bool data types.7 as an example. the failure interface initial state is composed of ‘set’ being empty (Line 6) and ‘rufailed’ being false (Line 7). In order for the transition to take place. . I/O-IMC Model Specification and Generation 147 and Markovian rates are specified1 .

3 for details). In fact. - # $ * % & & '( . ! " ) + . For instance. 285 transitions. and the module I/O-IMC size is constant. I/O-IMC Model Specification and Generation are operational) and the RU has indeed failed at some point (i. the RM I/O-IMC that coordinates 7 RUs has 27. . automatically deriving the I/O-IMC models becomes essential as the models grow in size. Once a MIOA specification/description of an I/O-IMC model has been laid down. In our modeling approach. and the effect of the transition is to set ‘rufailed’ to false. the RM I/O-IMC size depends on the number of RUs. ! " ) + . - / 0 .e. 1 & 1 & 2 & & 2 3 & Figure B.148 Chapter B.7: MIOA specification of the failure interface I/O-IMC model. 399 states and 397. ‘rufailed’ = true). an algorithm can explore the state-space and generate the corresponding I/O-IMC model (see Section B. the failure/recovery interface I/O-IMC size depends on the number of modules within the RU.

Chapter B.add(s.add(s) while statesT oBeP rocessed = {} do s ← statesT oBeP rocessed. Algorithm 3 shows how the states and transitions of an I/O-IMC are generated based on its MIOA specification.add(snew) end if transitionSet.precondition) = TRUE then snew ← s.id.states stateSet.id. An initial state is created which comprises the initialized state variables as defined in the MIOA specification (Line 4). on which the effects of the transition are reflected (Lines 10-11). I/O-IMC Model Specification and Generation 149 B. Then.transition t do if s. a new state is created. t.signal) else if t. the algorithm iterates over the states in the set of states to be processed until there is no state left to be processed (Lines 7-8).id. For each state.add(s) statesT oBeP rocessed. all the transitions that are specified in the MIOA specification are evaluated (Line 9).3 I/O-IMC Model Generation We have implemented the I/O-IMC model generation based on the MIOA specifications. Algorithm 3 State space exploration and I/O-IMC generation based on MIOA specification 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: stateSet ← {} statesT oBeP rocessed ← {} transitionSet ← {} s ← intialize mioa. If the precondition of the transition holds. The initial state is added to the state set and the set of states to be processed (Lines 5-6).check(t. snew. it is added to the set of . the states that are yet to be evaluated and the set of transitions generated.apply(t.add(s.signal) end if end if end for end while The algorithm keeps track of the set of states. which are all initialized as empty sets (Lines 1-3). t. s.id.signal = input then transitionSet.removeLast() for each mioa.effect) if snew ∈ stateSet then / stateSet.add(snew) statesT oBeP rocessed. If the resulting state does not already exist.

. B. -%# . the module A I/O-IMC 150 .1 states and the set of states to be processed (Lines 12-15).4 Composition and Analysis Script Generation All the generated I/O-IMC models are output in the Aldebaran .1).! " & ( ) * + ..8: SVL script that composes the I/O-IMC models according to the example decomposition as shown in Figure B. If the precondition of the transition does not hold for an input signal. then a self transition is added to the set of transitions to ensure that the generated I/O-IMC is input-enabled (Lines 17-22). the I/O-IMCs of modules B and C are composed with the recovery interface I/O-IMC of RU 1 (Lines 6-10). in Figure B.8. The generated composition script first composes the recovery interface I/O-IMCs of RUs with the module I/O-IMCs that are comprised by the corresponding RUs. a composition and analysis script is also generated. #$ ! " & ( ) * + #$ ' #$ #$ %# ' %# . #$ ' . %# ' -%# - %# Figure B. Figure B.. a new transition from the original state to the resulting state is added to the set of transitions (Line 16). For instance.aut file format so that they can be processed with the CADP tool-set [42]. Also.8 shows a part of the generated SVL script. %# #$ #$ . which conforms to the CADP SVL scripting language. Similarly.. In addition to all the necessary I/O-IMCs. which composes the I/O-IMC models for the example decomposition with 3 modules and 2 RUs (See Figure B.

Finally. The execution of the generated SVL script within CADP composes and aggregates all the I/O-IMCs based on the modules decomposition.is composed with the recovery interface of RU 2 (Lines 16-18). these actions are used for the composition of I/O-IMCs and eliminated in the resulting I/O-IMC. the common input/output actions that are only relevant for the I/O-IMCs being composed are “hidden”. That is. 151 . The resulting I/OIMCs are then composed with the failure interface I/O-IMCs of the corresponding RUs. At each composition step. and computes the steady-state availability. all the resulting I/O-IMCs are composed with the RM I/O-IMC. reduces the final I/O-IMC into a CTMC.

152 .

N. Prieto-Diaz. Addison-Wesley. Journal of Systems Software. [9] D. Amsterdam. Pittsburgh. Washington. Architecture-level modifiability analysis (ALMA). Lassing. DC. 1(1):11–33. Beaton and J. In Proceedings of the Conference on Future of Software Engineering (FOSE). 2006. Deriving Architectural Tactics: A Step Toward Methodical Architectural Design. [5] L. J. Software Architecture in Practice. 1998. Technical report. Aho. Compilers. USA. IEEE Transactions on Dependable and Secure Computing. [8] P. Architecture level prediction of software maintenance. R. USA. and Tools. Eclipse platform technical overview. B. Bengtsson and J. Bass. Software Reusability. pages 139–147. and R. 2004.D. editors. des Rivieres. Avizienis. SEI. Bosch. and M. 2004. 2007. Basic concepts and taxonomy of dependable and secure computing. and M. Landwehr. Laprie. 1986. 1999. Techniques. Klein. IBM. Domain analysis methods.V.Bibliography [1] A. [6] W. J. and C. Bas. In Proceedings of the Third European Conference Software Maintenance and Reengineering (CSMR). Bengtsson. Clements. Ullman. In Schafer. P. [7] P. Principles. and H.-C.O. Randell. Binkley. pages 104–119. Sethi.eclipse. Matsumoto. http://www. The Netherlands. and J. Arrango. 153 . Kazman. 69(1-2):129–147. [3] A. [2] G. Bachman. Source code analysis: A road map. Ellis Horwood. Addison-Wesley. R. L. [4] F. Technical Report CMU/SEI-2003TR-004. PA. pages 17–49. 2003. 1994. Bosch.org/articles/. van Vliet.

In Proceedings of the 14th Working Conference on Reverse 154 . Japan. R. Haverkort. G. [15] G. Boudali. Little. Addison-Wesley.J. and M. Sensitivity and Uncertainty Analysis: Theory. Improving availability with recursive microreboots: A soft-state system case study. In Proceedings of the 38th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). B. 2002. Gonzalez. Chandy and L. R. 1996. 2002. Chapman & Hall. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI). [14] D. Stal. Ivers. R. Dynamic analysis techniques for the reconstruction of architectural views. 2004. F. and J. Sommerlad. Architectural dependability evaluation with arcade. Crouzen. P. R. de Lemos. A System of Patterns. volume 4762 of Lecture Notes in Computer Science. M. Evaluating software architectures: methods and case studies. Lamport. [19] P. D. H. P. Buschmann. 2007. J.I. editors. Candea. Tokyo. In R. Alaska. Nord. and M. 2003. San Francisco. Garlan. [13] R. Springer-Verlag. Cacuci. Crouzen. Candea. and M. Clements. Meunier. Cutler. Friedman. pages 441–456. [12] F. Fox. Anchorage. Wiley. pages 31–44. J. CA. pages 512–521. Kazman. Microreboot: A technique for cheap recovery. PatternOriented Software Architecture. [16] G. Kuntz. Buskens and O. Distributed snapshots: determining global states of distributed systems. [11] H. Fox. Bass. 3(1):63–75. Fujiki. Model-centric development of highly available software systems. Documenting Software Architectures: Views and Beyond. volume 1. 1985. Addison-Wesley. Rohnert. 2008. C. [17] K. [20] B. volume 4615 of Lecture Notes in Computer Science. and A. Springer-Verlag. S. and A. and A. Bachmann. 2007. Romanovsky. In Proceedings of the 5th International Symposium on Automated Technology for Verification and Analysis (ATVA). Stafford.[10] F. 56(1-4):213– 248. USA. [18] P. R. Y. 2004. Stoelinga. Klein. pages 409–433. Cornelissen. Performance Evaluation. Gacek.A. USA. Clements. Stoelinga.M.G. Architecting Dependable Systems IV. Boudali. ACM Transactions on Computer Systems. P. L. A compositional semantics for dynamic fault trees in terms of interactive markov chains. and M. Kawamoto.

Charleston. volume 2667.Engineering (WCRE). Standard MIL-STD-1629A. FL. [28] I. Romanovsky. Tools. Architecting Dependable Systems. Addison-Wesley. Department of Defense.A. IEEE Computer Society. PA. and A. A fault-tolerant architectural approach for dependable systems.N. Lauwereins. [30] Department of Defense. ACM. Vancouver. IEEE Software. The Netherlands. and R. NC. Gacek. Woodside.A. pages 144–166. USA. BC.M. Survey of backward error recovery techniques for multicomputers based on checkpointing and rollback. Towards architecturebased self-healing systems. An infrastructure for the rapid development of XML-based architecture description languages. A dependable architecture for cots-based software systems using protective wrappers. Dashofy. pages 132–141. The fault-tolerant layered queueing network model for performability of distributed systems. de Visser. [21] K. 2004. Analyzing User Perceived Failure Severity in Consumer Electronics Products. Peperstraete. pages 266–276. and R. 2002. A survey of linguistic structures for applicationlevel fault tolerance. 2000. Washington DC. [26] V. C. P.M. In Proceedings of the first workshop on Self-healing systems (WOSS). and J. [23] E. USA. 2007. pages 21–26. de Lemos. Czarnecki and U. effects and criticality analysis. PhD thesis. Generative Programming: Methods. USA. In Proceedings of the International Performance and Dependability Symposium (IPDS). Das and C. Orlando. Romanovsky.F.N. [24] E. 23(2):80–87. pages 281–284. Durham. USA. Springer-Verlag. Dashofy. 2008. In R. [27] R. Taylor. van der Hoek. C. de Castro Guerra. Military standard: Procedures for performing a failure modes. Taylor. [29] G. de Florio and C. 2002. R. J. pages 262–265. Rubira. SC. 1993. ACM Computing Surveys. Technische Universiteit Eindhoven. Blondia. and Applications. Deconinck. In Proceedings of the 22rd International Conference on Software Engineering (ICSE). Eisenecker. de Lemos. 1984. A. 2003. Rubira. [25] P. editors. de Lemos. Eindhoven. A. Canada. Vounckx. 2008. A. 155 . Guerra. [22] O. In Proceedings of the IASTED International Conference on Modeling and Simulation. Pittsburgh. USA.M. van der Hoek. and R. 40(2):1–37. and C.M. 1998.

Mateescu. http://www. In M. [40] J. In Proceedings of the ASME Design Theory and Methodology Conference (DETC). and A. 156 . Lyu. Communications of the ACM. Dugan. [35] J.L. [34] J. Ishil. Spain. F. R. Vlissides. L. 1997. A survey on software architecture analysis methods. GNU gprof: the GNU profiler. Bader. number DTM-02 in 97-DETC. Johnson. R. Dependability modeling for fault-tolerant software and systems.B. Garavel. Free Software Foundation. 34(3):375–408. Software Fault Tolerance. chapter 15. 1998. Las Palmas de Gran Canaria. Serwe. Advanced failure modes and effects analysis using behavior modeling. Lyu. McGraw-Hill.R.B. Aspect-oriented programming. Wang. A survey of rollbackrecovery protocols in message passing systems. volume 4590 of Lecture Notes in Computer Science. Eubanks. In Proceedings of the 19th International Conference on Computer Aided Verification (CAV). 2009. Dollimore. and G. pages 148–157.org/. Stallman. Addison Wesley. R. Distributed Systems: Concepts and Design. de Oliveira. Handbook of Software Reliability Engineering. Kmenta. Addison-Wesley. T. Lyu. [38] Erlang/OTP design principles. Fillman. 2005. 1995. Johnson. Sacramento. pages 109–138. Design Patterns: Elements of Reusable Object-Oriented Design. and W. and D.org/doc/. [33] J. Coulouris. Germany.[31] L. [37] T. [41] E. In Proceedings of the Second International ESPRIT ARES Workshop. Software system analysis using fault trees. Y.gnu.R. Helm. 2002. pages 158–163. Dugan and M. [36] M. ACM Computing Surveys. and J. Lang. pages 615–659. Springer-Verlag. 1995. CA. R.F. 2001. and J. Elrad. 1996. Fenlason and R. and K. [32] J. editor.B. Berlin. W. [39] C. John Wiley & Sons. Kindberg. Alvisi. S. 2000. Duenas.C. R. Dobrica and E. USA. In M. Gamma. de la Puente. A software architecture evaluation model. CADP 2006: A toolbox for the construction and analysis of distributed processes. 2002. http://www. Elnozahy. IEEE Transactions on Software Engineering. 2007.erlang. Niemela. chapter 5. [42] H. editor. 28(7):638–654.A. 44(10):29–32.

H. Kharchenko. and M.A. Architectural mismatch: why reuse is so hard. Schmerl. Herder. OckerBloom. Bos. MIT CSAI Laboratory. and A. 2000. Homburg. Increasing system dependability through architecture-based self-repair. Allen. Romanovsky. Architecture-based software reliability analysis: Overview and limitations. Springer-Verlag. Trivedi. dependability and performability: A user-centered view. Trivedi. Springer-Verlag. B. [45] S. Failure resilience for device drivers. 2005. editors. Gorbenko. and A. Discrete Event Dynamic Systems. Goseva-Popstojanova and K. Tarasyuk. Garland. and P. IEEE Transactions on Dependable and Secure Computing. J. [48] K. Sabetta. R. R. [53] J. J. Cheng. 2001.J. Gras.[43] D. Butler. and B. Schroeder. [51] B. 2006. A. Overhage. S. Mayer. and J. In R. Cambridge. Garlan.S. Boston University. volume 4157 of Lecture Notes in Computer Science. C. 4(1):32–40. FMEA. Hirst. Grassi. C. and M. 2004. Romanovsky. Architecting Dependable Systems. 3(2-3):219–247. and O. Technical Report BU-CS-97-011. 1995. Mirandola. MA. Tanenbaum. [52] A. Helal. [46] S. Mossinghoff. Tauber. Rigorous Development of Complex Fault-Tolerant Systems.J. volume 3712 of Lecture Notes in Computer Science. 1997. Springer-Verlag. IEEE Software. Specification techniques for markov reward models. V. pages 153–167. [44] D. pages 71–87. USA. Jones. [50] J. Performance Evaluation. Reussner. 2006. S. [47] A.R.S. editors. editors. 45(2-3):179– 204. In R.N. volume 2677 of Lecture Notes in Computer Science. Troubitsyna. and A. Springer-Verlag. N. An XML-based language to support performance and reliability modeling and analysis in software architectures. and E. QoSA/SOQUA. 2007. In M. [49] V. Gacek. J. Reliability. Combinatorics and Graph Theory. Architecture-based approach to reliability assessment of software systems.S. P. In Proceedings of the 37th Annual IEEE/IFIP 157 . Harris.technique of web services analysis and dependability ensuring. Stafford. J. 12:17–26. Heddaya and A. Technical Report MIT-LCS-TR-961. de Lemos. availability.M. Becker.L. Garlan. Vaziri. Lynch. 2003. Haverkort and K. IOA user guide and reference manual. S. Gokhale. pages 61–89.

Steensgaard. Software Fault Tolerance. Hind. Maui. and J. Snowbird. page 9058. Rumbaugh. 2001. Nissanke. [58] Y. [62] U. volume 2428 of Lecture Notes in Computer Science. 1971. [54] H. IEEE. Standard 610. [64] I. UT. 1990. 2007. Dynamic probability systems. [56] C. Booch.C. Interactive Markov Chains.12-1990.P. Fhndrich. Isaksen. Niemel¨. pages 41–50. 2008. Kintala. Nord. [57] R. Volume 1: Markov models. Howard. G. Jacobson. IEEE standard glossary of software engineering terminology. Applied Software Architecture. Addison-Wesley. 1995. System and software safety in critical systems. Immonen and E. 2007. 2002. Bowen. [55] M. The Unified Software Development Process. Hunt. 158 . R. John Wiley & Sons. Banatre. Huang and C. USA. and D. M. and T. Pointer analysis havent we solved this problem yet? In Proceedings of the ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE). R. S. Aiken. Software and System Modeling. UK. UK. SIGOPS Operating Systems Review. Larus. Tarditi. In Proceedings of the 34th Annual Hawaii International Conference on System Sciences (HICSS). USA. Wobber. J. Sealing OS processes to improve dependability and safety. Hawblitzel. B. [61] A. 1997. M. Survey of reliability and availability prediction a methods from the viewpoint of software architecture. pages 231–248. Lyu. editor. Hodson. Springer-Verlag. Hermanns. Software fault tolerance in the application layer. [59] G. J. 1999. Levi. Edinburgh. [63] V. D. 7(1):49–65. pages 54–61. Soni. 1999. John Wiley & Sons. O. and N. 2001. 41(3):341–354. chapter 10.A. Hofmeister. AddisonWesley. Architecture-based exception handling. [60] IEEE. C. The University of Reading. In M.International Conference on Dependable Systems and Networks (DSN). Technical Report RUCS/97/TR/062/A. Issarny and J. Hawaii.

[65] A. Safety verification of ada programs using software fault trees. [66] K. and K. pages 111–128. Lai. Feature-oriented domain analysis (FODA) feasibility study.M. [73] M. Hiller. The 4+1 view model of architecture. Kruchten. [68] S. [69] P. pages 1103–1581. Suri. Oldenburg. Formal dependability engineering with MIOA.M. Y. The Rational Unified Process: An Introduction. Addison-Wesley. Kruchten. and N. Cohen. Shimeall. and T. IEEE Software. C. Washington.D. Gannon. Ronneby. M. 44(6):343–350. 1988. [76] N. L. A model for availability analysis of distributed software/hardware systems. 52(1):44–52. Laprie. 159 . SEI. Kuntz and B. Cha.S.S. Yin. and K. University of Twente. [67] B. Technical Report CMU/SEI-90TR-21. 1991. and A. complexity of changes: Size isn’t everything. Leveson. In Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing (GRID).L. S. Beounes. Kang. chapter 3. USA. R. Lanus. Jhumka. Lassing. 2008. 2004. Technical Report TR-CTIT-08-39. Hess. 2003. [75] N. M.J. editor. IEEE Software. 8(4):48–59. Kanoun. Sweden. Kernighan and D. In M.C. Information and Software Technology. DC. Software Fault Tolerance. Yang. Haverkort. D. K. Peterson. Krishnan and D. Rijsenbrij.R. J. [70] P. pages 281–288.G. pages 47–80. 2002. Second Edition. J. 1999. Poh. Hierarchical composition and aggregation of state-based availability and performability models. In Proceedings of the 7th International Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems (FTRTFT). Ritchie. The C Programming Language. W.-C. Component-based synthesis of dependable embedded software. and P. Germany. van Vliet. Xie. 1990. Arlat. Trivedi. Dai.S. Lyu. S. [72] C. On software architecture analysis of flexibility. Architectural issues in software fault tolerance. 2000. [71] G. Checkpoint and restart for distributed components. In Proceedings of the Second Nordic Software Architecture Workshop (NOSA). Prentice Hall. John Wiley & Sons. 2002. 1995. [74] J. IEEE Transactions on Reliability.H. Novak. and H. 12(6):42– 50.W. 1995.

IEEE Computer.hu/. Kalaichelvan. Tuttle. 34(4):107–109. and A. Valgrind: a framework for heavyweight dynamic binary instrumentation. Majzik. 1997. [83] B. 2006. chapter 14. http://arxiv. 2(3):219–246. Emery.. pages 1103–1581. pages 121–139. [82] N. 2005. Mitchell and S. Taylor. 2003. Bot. and R. S. volume 2485 of Lecture Notes in Computer Science. In Proceedings of the 4th European Dependable Computing Conference. [86] N. McGraw-Hill.F. 1989. Kazman. Nethercote and J. 2007. 1996. A classification and comparison framework for software architecture description languages. and R. Lyu. Toronto. 1999. Canada. Seward. [85] MPlayer official website. Huszerl. Springer-Verlag. Molter. Pataricza. Architecting Dependable Systems. Computing Research Repository (CoRR). 2002. editor. An approach to software architecture analysis for evolution and reusability. abs/cs/0503068. In Proceedings of the Conference of the Center for Advanced Studies on Collaborative Research (CASCON). 160 .R. [80] I. 32(3):193– 208. Maier. A. Sweden. IEEE Transactions on Software Engineering. 42(6):89–100.mplayerhq. Handbook of Software Reliability Engineering. de Lemos. Lung. Ronneby. Majzik and G.[77] C. Mancoridis. A survey of reverse engineering and program comprehension. and A. 2000. CWI Quarterly. Nelson. Stochastic dependability analysis of system architecture based on UML models. editors. [84] G.W. pages 219–244. pages 144–154. 26(1):70–93. N. Fault-tolerant software reliability engineering.org/abs/cs/0503068. Towards dependability modeling of FT-CORBA architectures. In Proceedings of the Second Nordic Workshop Software Architecture (NOSA). Lynch and M. volume 2667 of Lecture Notes in Computer Science. Software architecture: Introducing IEEE Standard 1471. SIGPLAN Notices. Bondavalli. C. Springer-Verlag. Vouk. Gacek. K. [81] D. S. pages 567–613. Medvidovic and R. [79] I. [78] M. D. On the automatic modularization of software systems using the bunch tool. McAllister and M. In M. [87] M. An Introduction to Input/output Automata. 2009. Romanovsky. http://www. Integrating saam in domain-centric and reuse-based development processes.L. In R. [88] N. IEEE Transactions on Software Eng. 2001.A. Hilliard.

UK. Fink. [101] S. Roland and B. 43(2):95–106. R-28(3):247–249. Addison-Wesley. Object Management Group. John Wiley & Sons. 1999. Scott. 2000. 2001. In Proceedings of the Workshop on Architecting Dependable Systems (WADS). [96] R. 1979. San Francisco. 2004. 12(3):139–144. D. S. Technical Report OMG Document formal/2001-09-29. An architectural approach for improving availability in web services. 46(2):265–288. In Proceedings of the 34th Computer Society International Conference (COMPCON). [91] E. 161 . Julin. FL. Rakic and N. Colub.[89] Object Management Group.. 26(3):11–18. Moriarty. Ross. Introduction to Probability Models. CA. R.J. Jones.M. Forin. [95] B. [99] E. Parker. ACM SIGSOFT Software Engineering Notes. [94] B. USA. CA. Edinburgh. Orr. R. USA. [98] D. 1990. A. System structure for software fault tolerance. Fault tolerant CORBA. Baron. 2007. Exploring subjectivity in hazard analysis. D. Randell. USA. S.J. D. A survey of static analysis methods for identifying security vulnerabilities in software systems. 2001. pages 310–311. Reifer. [90] Y. Mach: A system software kernel. and M. Chandra. [92] M. Turing memorial lecture: Facing up to faults. Redmill. 2004. Yahav. pages 37–41. Use Case Driven Object Modeling with UML: A Practical Approach. Failure mode and effects analysis. Computer Journal. [100] D. IBM Systems Journal. Papadopoulos. Rosenberg and K. pages 437–449. D. Medvidovic. 2002. In Proceedings of the international conference on Reliable software. Randell. Grante. IEEE Transactions on Reliability. In System Safety Engineering and Management. Rashid. Parchas and R. In Proceedings of the 8th IEEE International Symposium on High Assurance Systems Engineering (HASE). Elsevier Inc. Increasing the confidence in off-the-shelf components: a software connector-based approach. Pistoia. [93] M. and E. IEE Engineering Management Journal. Sanzi. Tampa. pages 176–178. Software failure modes and effects analysis. Los Angeles. 1975. [97] F. Automating the failure modes and effects analysis of safety critical systems. de Lemos. 2007. 1989. and C.

Sozer. In R. Sozer and B. Saridakis. The Unified Modeling Language reference manual. 1996. Jacobson. Tekinerdogan. 1995. and G. Rugina. Rumbaugh. Rumbaugh. 2003. Combinatorial Generation. Garlan. pages 167–176. Extending failure modes and effects analysis approach for reliability analysis at the software architecture design level. and A. Addison-Wesley. [110] H. Kaniche. Canada. On-line monitoring: A tutorial. Springer-Verlag. [109] M. de Lemos. Germany. volume 762 of Lecture Notes in Computer Science. Kanoun. A. Shaw and D.-E. pages 409–433. editors. Jacobson. Booch. C. Prentice Hall. Tekinerdogan. Canada. Vancouver. Victoria. pages 201–208. Software Architecture: Perspectives on an Emerging Discipline. Architecting Dependable Systems IV. editors. Tekinerdogan. BC. In R. [108] B. Software: Practice and Experience. IEEE Computer. [106] F. and M. Romanovsky. K. BC. Introducing recovery style for modeling and analyzing system recovery. Aksit. Manuscript CSC-425/520. editors. Aksit. [104] J. Springer-Verlag. Accepted.[102] A. Architecting Dependable Systems IV. and M. The Unified Modeling Language Reference Manual. 1993. 1998. [111] H. Addison-Wesley. 2009. Booch. I. and A. 2008. Sozer. 2005. In Proceedings of the 10th European Conference on Pattern Languages and Programs (EuroPloP). and G. 1999. volume 4615 of Lecture Notes in Computer Science. 162 . Design patterns for graceful degradation. Ruskey. [112] H. Irsee. Ruskey. B. Gacek. 28:72–78. volume 4615 of Lecture Notes in Computer Science. [105] F. FLORA: A framework for decomposing software architecture to introduce local recovery. Schroeder. B. In Proceedings of the 4th International Symposium on Algorithms and Computation (ISAAC). de Lemos. pages 14–38. pages E7/1–18. and M. 2007. Romanovsky. In Proceedings of the 7th Working IEEE/IFIP Conference on Software Architecture (WICSA). C. University of Victoria. I. Simple combinatorial gray codes constructed by reversing sublists. A system dependability modeling framework using AADL and GSPNs. Gacek. [107] T. [103] J. Springer-Verlag. 2006.

University of Twente. IEEE Transactions on Software Engineering. ESI. Asaam: Aspectual software architecture analysis method. J. 2005. A comprehensive model for software rejuvenation. Taylor. Dubrow.S. H. Trivedi and M.and messagebased architectural style for GUI software. 22(6):390–406. 18(8):155–158. Anderson. H. pages 5–14. [117] B. Vaidyanathan and K.N. Robbins. R. [120] Trader project.[113] F.F. Software architecture reliability analysis using failure scenarios. Jr. and M. Malhotra.S. [121] K. Schaufler. Aachen. PA. and M. Pittsburgh. pages 27–48. Architecting Dependable Systems. K. E. In In Proceedings of the 7th Conference on Measurement. Sozer. Springer-Verlag. C. and A. [115] B. [119] R. Reliability and performability techniques and tools: A survey.A. IEEE Transactions on Dependable and Secure Computing. Medvidovic. 1989. Whitehead. K. Sozer. Tekinerdogan. Tekinerdogan. In Proceedings of the 4th Working IEEE/IFIP Conference on Software Architecture (WICSA). volume 2677 of Lecture Notes in Computer Science. 163 . Levy. Enschede. Aksit. Prentice Hall. 2005. 2009. Dependability in the web services architectures. Aksit. van der Lans. and D. Payne.L. [122] K. 1993. Romanovsky. Journal of Systems and Software. pages 90–109. Nies.M.nl. [114] R. [116] B. Germany. Tischler. http://www.esi. and Evaluation of Computer and Communication Systems (MMB). 1996. The Netherlands. Tekinerdogan. The SQL Standard: a Complete Guide Reference. 1983. 81(4):558– 575. Trivedi. Software architecture reliability analysis using failure scenarios. pages 203–204. Static analysis of programs as an aid to debugging. Oreizy. 2003. USA. SIGPLAN Notices. and C. [123] R. de Lemos. Synthesis-Based Software Architecture Design. 2000. 2008. Norway. Tartanoglu. In R. editors. Issarny. Tekinerdogan. In Proceedings of the 5th Working IEEE/IFIP Conference on Software Architecture (WICSA). A. and N. Modeling. PhD thesis.E. Romanovsky. N. 2(2):124– 137. [118] B. P. Oslo. 2004. V.J. Gacek. A component.

[128] J.M. 2006. pages 28–29. 2004. A scenario-based reliability analysis approach for component-based software. In Proceedings of the 5th Working IEEE/IFIP Conference on Software Architecture (WICSA). Modular architectural representation and analysis of fault propagation and transformation. 2005. pages 53–71. pages 99–103. Woods and N. and D. China. UK. Wallace. In Proceedings of the Second International Workshop on Formal Foundations of Embedded Software and Component-based Software Architectures (FESCA). Ammar. A survey of coverage based testing tools. 164 . Edinburgh.H. 2005. Shanghai. [126] S. Pittsburgh. [125] E. In Proceedings of the International Workshop on Automation of Software Test (AST). B. Li. [127] Q. USA. Cukic. China.J. Stalhane.[124] M. 53(14):465–480. Yang. 2004. PA. J. Yacoub. pages 25–35. In Proceedings of the 28th International Computer Software and Applications Conference (COMPSAC). Weiss. Hong Kong. Using architectural perspectives. Using FMEA for early robustness analysis of webbased systems. and H. IEEE Transactions on Reliability. Rozanski. Zhou and T.

communiceren en analyseren van ontwerpbeslissingen die direct zijn gerelateerd aan fouten tolerantie. vereist het grote inspanning om een fouten-tolerante ontwerp te realiseren en te onderhouden. Ten derde zijn er geen adequate analyse technieken die de impact van de introductie van foutentolerantie technieken op de functionele decompositie van de software architectuur kunnen meten. zonder hierbij een implementatie van het systeem te vereisen. Ten slotte. Ten tweede zijn bestaande architectuur stijlen niet geschikt voor het specificeren. Ten eerste behandelen de huidige fouten-tolerantietechnieken niet expliciet fouten die door de eindgebruiker waarneembaar zijn. en staan daarom toleranter tegenover mogelijke fouten in het systeem. Daarbij zorgen deze technieken ervoor dat het systeem zich kan herstellen indien fouten in het systeem zijn gedetecteerd.Samenvatting De toenemende grootte en de complexiteit van softwaresystemen vermoeilijken dat alle mogelijke fouten kunnen worden verhinderd dan wel tijdig worden verwijderd. Om het eerste probleem aan te pakken introduceren we de methode SARAH die gebaseerd is op scenario-gebaseerde analyse technieken en de bestaande betrouwbaarheidstechnieken (FMEA en FTA) ten einde een software architectuur te analyseren met betrekking tot betrouwbaarheid. Deze technieken gaan ervan uit dat niet alle fouten verhinderd kunnen worden. richt SARAH zich uitdrukkelijk op het identificeren van mogelijke fouten die waarneembaar zijn door de eindgebruiker. De fouten die in het systeem blijven bestaan kunnen er uiteindelijk toe leiden dat het gehele systeem faalt. Naast het integreren van architectuur analyse methoden met bestaande betrouwbaarheidstechnieken. Hoewel deze fouten-tolerantietechnieken zeer bruikbaar kunnen zijn is het echter niet altijd eenvoudig ze in het systeem te integreren. In dit proefschrift richten we onze aandacht op de volgende problemen bij het ontwerpen van fouten-tolerante systemen. Teneinde hiermee om te kunnen gaan zijn er in de literatuur verscheidene zogenaamde fouten-tolerantietechnieken ge¨ ıntroduceerd. 165 . en hiermee het identificeren van gevoelige punten.

Om de methode te ondersteunen. hebben wij een gentegreerd reeks van gereedschappen ontwikkeld die gebruik maken van optimalisatietechnieken. Voor het oplossen van het derde probleem introduceren we een systematische methode voor het optimaliseren van de software architectuur voor lokale herstelling van fouten (local recovery). opdat de beschikbaarheid van het systeem wordt verruimd. Om de inspanningen voor de ontwikkeling en onderhoud van fouten-tolerante aspecten te reduceren is er een framework. 166 . Deze wordt gebruikt voor het communiceren en analyseren van architectuurontwerpbeslissingen en voor het ondersteunen van gedetailleerd ontwerp met betrekking tot het herstellen van het systeem. De methode ondersteunt de volgende aspecten: modelleren van de ontwerpruimte van de mogelijke architectuur decompositie alternatieven. het reduceren van de ontwerpruimte op basis van de domein en stakeholder constraints. Het framework verschaft herbruikbare abstracties voor het defini¨ren van herstelbare e eenheden en voor het integreren van de noodzakelijke co¨rdinatie en communicatie o protocollen die vereist zijn voor herstelling van fouten.Voor het specificeren van fouten-tolerantie aspecten van de software architectuur introduceren wij de zogenaamde Recovery Style. toestandsgebaseerde analytische modellen (CTMCs) en dynamische analyse van het systeem. ontwikkeld die de decompositie en de implementatie van softwarearchitectuur voor lokale herstelling ondersteunt. en het balanceren en selectie van de alternatieven met betrekking tot de beschikbaarheid en prestatie (performance) metrieken. FLORA.

23. 109. 85. 13. 75 Compensation. 5 availability. 123 log-based recovery. 85 SVL scripting language. 1 backward recovery. 123.Index AADL. 24 dynamic analysis. 80 dependability. 101 data dependency. 125 FMEA. 1. 92 GNU nm. 22 FMECA. 6. 110. 9 DTMC. 110. 4. 86–88 CTMC. 14. 12 error handling. 13 CADP. 86. 131 FTA. 12 fault tree. 11 fault tolerance. 23 function dependency. 87 architectural tactics. 22 failure cause. 85. 9 fault forecasting. 54 severity. 13 Criticality. 13 graceful degradation. 123 recovery blocks. 86 Arch-Edit. 2. 10. 142 fault-tolerant design. 22 failure effect. 11 fault removal. 87 reliability interface. 11 error detection. 109 GNU gprof. 12 fault handling. 75. 14 FT-CORBA. 150 code coverage. 1 FCFS. 119 stable storage. 102. 87 constructive parameter. 13 check-pointing. 114 AOP. 80 available. 5. 22. 13 stable storage. 145 feature diagram. 103. 130 Arch-Studio. 74 error. 119 Recoverable Unit. 22 failure mode. 33. 1. 13. 80 global recovery. 13 exception handling. 11 error recovery. 118 Communication Manager. 75. 9 failure. 11. 119 Recovery Manager. 86 xADL. 87 analytic parameter. 9 failure interface. 141 fault. 1. 84. 102. 93 167 . 75 DTV. 12 fault prevention. 22 forward recovery. 1. 11 FLORA. 114. 75. 76.

89 Domain Constraints.. 10 MIOA. 11 program analysis. 109 local recovery style. 63 local recovery style. 95 MDDE. 99. overhead. 149 input action. 107 neighbor decompositions. 60 MTTF. 60 RU AUDIO. 93–95. 66 RU MPCORE. 89. 4. 60 Gui. 97. 61 Libao. 150 internal action. 84. 141 Module Dependency Graph. 146 transitions. 97. 95 data access profile. 92 MFDE. 141 recovery style. 88. 66 Stream. 127 Optimizer. 85. 57 maintainability. 98 Analytical Model Generator. Feasibility Constraints. 90 Data Dependency Analyzer. 69 applies-recovery-action-to. 92. 64 recovery manager. 77 output actions. 64 provides-diagnosis-to. 6. 141 recovery manager. 85. 77 input actions. 75. 118 Recovery Designer. 98 I/O-IMC. 85. depend. 109 function call profile. 93 MPlayer. 123 local recovery. 62. 75. 85. 107 hill-climbing algorithm. 142. 103 Function Dependency Analyzer. 75. 98 func. 14. 61 Libmpcodecs. 146. 77. 103. 86–88 MTTR. 77. 92. 64 sends-queued message-to. 89. 66. 85. 100 Constraint Evaluator. 60 Libvo. 146 module. 85. 146 signature. 86–88 N-version programming. 95. 76. 92. 63 conveys-information-to. 90 Deployment Constraints. 77 integrity. 78. 146 Markovian rate. 146 states. 64–66 restarts. 77 Markovian transitions. 120. 85. 13 performability. 77 internal actions. 106 exhaustive search. 101. 75. 60. 10 IPC. 5. 107 recovery interface.gnuplot. 150 interactive transitions. 91. 95 Design Alternative Gen. 60 Mplayer. 64–66 kills. 147. 64 notifies-error-to. 64 168 . 5. 78 input-enabled. 98. 146 input signal. 100 compositional aggregation. 109 Demuxer. 74 recoverable unit. 60. 60. 66 RU GUI. 64 communication manager. 4. 57. 147 output action. 89 Perf. 98 DAI.

58. 17. 28 TRADER. 48 state space explosion. 59 module viewtype. 79 cardinality. 2 direct scenarios. scenario-based analysis. 39 UML. 10 software architecture analysis safety-critical systems. 63. 95 partition. 24. 85. 64 recovery view. 14 ADL. 59. 15. 66 reliability. 69 Views and Beyond approach. 91 Valgrind. 17 failure severity. 59 view. 59 allocation viewtype. 18 architectural style. 20 42 SQL. 2. 20 SARAH. 2. 17 architectural element level analysis. 39 state-based models. 39 association. 59 component & connector viewtype. 69 safety. 5. 70 stakeholder. 45 static analysis. 15. 64 recoverable unit. 74 failure domain model.non-recoverable unit. 75 failure analysis report. 1. 37 aggregation. 14. 58 viewtype. 79 restricted growth strings. 63. 27 indirect scenarios. 69 concern. 35 questioning techniques. 76 architecture-level analysis. 10 reliable. 70 architecture description. 20 measuring techniques. 24. 16 IEEE 1471. 63. 58 169 . 29 failure scenario. 75. 15 lexicographic algorithm. 33 sensitive elements. 16 sensitivity analysis. 59. 15 Bell number. 60. 2 fault tree set. 79 role. 91 Stirling numbers of the second kind. 15 perspectives. 79 software architecture. 15 style. 18 architectural tactics. 14 architectural pattern. 1 viewpoint. 58. 94 architectural element spot. 24. 15 set partitioning problem. 14. 5. 15 WPF.

2002-04 R. Luttik. Faculty of Science. Bos and J. UL. UL. Faculty of Mathematics and Natural Sciences. and Computer Science.I. Choice Quantification in Process Algebra. Faculty of Mathematics and Computer Science and Faculty of Mechanical Engineering. Probabilistic Extensions of Semantical Models. van Hemert. UvA. 2002-12 L. Models and Logics for Process Algebra. TU/e.A. UL. UvA. 2002-13 J. and Computer Science. Tauritz. TU/e. Mathematics. UvA. Mathematics and Computer Science. Faculty of Mathematics and Computer Science. van der Zwaag. Faculty of Mathematics and Natural Sciences. Formal Specification and Analysis of Industrial Systems. KUN. Faculty of Science. UvA. TU/e.J. Andova. 2002-02 T. Faculty of Natural Sciences. TU/e. and Computer Science. 2002-05 M. Adaptive Information Filtering: Concepts and Algorithms. Vilius. Stoelinga.J. KUN. 2002-14 S.T. Faculty of Mathematics and Natural Sciences. Applying Evolutionary Computation to Constraint Satisfaction and Data Mining. UL. On-line Scheduling and Bin Packing. Melius: Guiding and Cost-Optimality in Model Checking of Timed and Hybrid Systems. Probabilistic Process Algebra. Fehnker. Faculty of Mathematics and Natural Sciences. Mathematics. UL. Willemen. Faculty of Natural Sciences. Faculty of Sciences.S. Models of Molecular Computing. Citius.I. 2002-06 N. Moonen. Faculty of Mathematics and Computer Science. 2002-08 R. VUA. Faculty of Mathematics and Natural Sciences. Exploring Software Systems. Kuipers. Mathematics. 2002-11 J. Techniques for Understanding Legacy Software Systems. Faculty of Natural Sciences. Mathematics and Computer Science. Division of Mathematics and Computer Science. Linearization in µCRL. 2002-07 A. School Timetable Construction: Algorithms and Complexity. Alea Jacta Est: Verification of Probabilistic. 2002-03 S. van Stee. Mathematics and Computer Science. Neural Networks for Intelligent Data Analysis: theoretical and experimental aspects. Faculty of Natural Sciences.P. 2002-09 D. 2002-01 V.Titles in the IPA Dissertation Series since 2002 M. Usenko.C.I. van Vugt. van Wezel. 2002-16 . 2002-10 M. 2002-15 Y. Kleijn. Faculty of Mathematics and Computer Science.B. Real-time and Parametric Systems. den Hartog.

Nedea. Mathematics. Qian. Leijen.A. Visser.J. UL.W. Analysis and Simulations of Catalytic Reactions.J. 2004-03 S.J.H. TU/e. 2004-02 P. ter Beek. Faculty of Electrical Engineering. Faculty of Natural Sciences. Lijding. 2003-02 J. Jojgov. Faculty of Mathematics and Computer Science. TU/e. Mathematics & Computer Science. Mathematics & Computer Science. UL. 2004-04 Y.M. de Jonge. Team Automata – A Formal Approach to the Modeling of Collaboration Between System Components. 2003-03 S. Frisco. 2003-09 M. Michiels. Faculty of Mathematics and Computer Science.M. TU/e.A. Faculty of Mathematics and Natural Sciences. Maneth. 2004-05 . 2003-10 D. UvA. UvA. 2003-06 M.E. To Reuse or To Be Reused: Techniques for component composition and construction. The λ Abroad – A Functional Approach to Software Components.M. UT. 2003-05 S. 2003-08 D. Mathematics & Computer Science. UT. Incomplete Proofs and Terms and Their Use in Interactive Theorem Proving. Generic Traversal over Typed Source Code Representations. 2003-11 W.V. On Modelchecking the Dynamics of Object-based Software: a Foundational Approach. Faculty of Mathematics and Computer Science. 2003-04 T. Faculty of Electrical Engineering.I. Random Redundant Storage for Video on Demand. Data Synchronization and Browsing for Home Environments. Faculty of Mathematics and Computer Science and Faculty of Industrial Design. UT. Faculty of Electrical Engineering.P. Faculty of Mathematics and Computer Science. 2003-01 M. Willemse.P. Faculty of Natural Sciences. Models of Tree Translation.P. Spiking Neural Networks. 2004-01 G. Theory of Molecular Computing – Splicing and Membrane systems. and Computer Science. Real-time Scheduling of Tertiary Storage. TU/e. Performance Ratios for the Differencing Method. Faculty of Mathematics and Natural Sciences. and Computer Science. TU/e. Mathematics. Faculty of Mathematics and Natural Sciences. TU/e. Aerts. Faculty of Mathematics and Computer Science. Faculty of Mathematics and Computer Science. Casual Multimedia Process Annotation – CoMPAs.D. UU. UL. UL. Distefano.J. 2003-07 H. Faculty of Mathematics and Natural Sciences. Bohte. Semantics and Verification in Process Algebras with Data and Timing. Benz.C.

C. 2004-12 R.O. TU/e. Faculty of Sciences. Bril. TU/e. 2004-18 E.J. VUA. 2004-07 E. Mathematics and Computer Science. Formalising Exact Arithmetic: Representations. 2004-14 F. Faculty of Math- ematics and Computer Science. Eskenazi and A. Division of Mathematics and Computer Science. 2004-08 N. Faculty of Mathematics and Computer Science. Faculty of Mathematics and Computer Science. Faculty of Technology Management. Bartels. TU/e. Faculty of Science. Goga. Constructive Real Analysis: a Type-Theoretical Formalization and Applications. and Business Applications.A Presentation-oriented Editor for Structured Documents. TU/e. KUN. Route Planning Algorithms for Car Navigation. Schrage. Niqui. Algorithms and Proofs. Formal Verification of Distributed Systems. TU/e. Orzan. TU/e.F. Faculty of Mathematics and Computer Science.H. Control and Selection Techniques for the Automated Testing of Reactive Systems.J. 2004-20 N. Cuijpers. UU. Real-time Scheduling for Media Processing Using Conditionally Guaranteed Budgets. Faculty of Sciences. TU/e. On Generalised Coinduction and Probabilistic Specification Formats. 200417 M. 2004-11 I. Dijk. 2004-13 J. Autonomous Agents in Bargaining Games: An Evolutionary Investigation of Fundamentals. Faculty of Mathematics and Computer Science. Quantitative Prediction of Quality Attributes for Component-Based Software Architectures. 2004-10 A. Faculty of Mathematics and Computer Science. Faculty of Technology Management. Evolutionary AgentBased Economics. Division of Mathematics and Computer Science. Mathematics and Computer Science. 2004-19 P.M. Gerding. Fyukov. Cruz-Filipe. 2004-15 E. Alkemade. VUA. Faculty of Sciences. Faculty of Science. On Distributed Verification and Verified Distribution. 2004-09 M. L¨h. 2004-16 S.M. Pang. Strategies. Supervisory Machine Control by Predictive- . o Faculty of Mathematics and Computer Science. RU.M. 2004-06 L. VUA. Hybrid Process Algebra.J. Indoor Ultrasonic Position Estimation Using a Single Base Station.L. Faculty of Mathematics and Computer Science. Exploring Generic Haskell. Division of Mathematics and Computer Science. van den Nieuwelaar. Flinsenberg. TU/e. Proxima . UU.M.

Kurtev. Faculty of Mathematics and Natural Sciences. Faculty of Mathematics and Computing Sciences. Faculty of Science. Ruimerman. Mathematics & Computer Science. An Assertional Proof a System for Multithreaded Java -Theory and Tool Support. UL. UL. Ionita. Faculty of Mathematics and Computing Sciences. Compositional Verification of Hybrid Systems using Simulation Relations. Scenario-Based System Architecting .Expression and Enforcement.A. Design and Verification of Lock-free Parallel Algorithms. Mathematics and Computer Science. Faculty of Electrical Engineering. Experiments in Rights Control .M.M. TU/e.R. Decision Procedures for Equality Logic with Uninterpreted Functions. 2005-12 B. van Beek. 2005-15 . Mathematics & Computer Science. 2005-09 O.. RU.L. Liekens.A Systematic Approach to Developing Future-Proof System Architectures. 2005-08 T. Data Mining using Genetic Programming: Classification and Symbolic Regression. Faculty of Biomedical Engineering. UU. Mousavi. Faculty of Mathematics and Computer Science. Wolle. Eggermont. Evolution of Finite Populations in Dynamic Environments. 2005-10 A. Faculty of Mechanical Engineering. TU/e. TU/e.T. UT. 2005-13 G. Faculty of Science. 2005-03 H. Faculty of Electrical Engineering. 2005-04 H. Adaptability of Model Transformations. Mathematics & Computer Science. Faculty of Biomedical Engineering. Lenzini. RUG. TU/e. 2005-07 I. Heeren. 200506 G. TU/e. 2005-14 M. Top Quality Type Error Messages. Gao. 200502 C. Frehse. Specification and Analysis of Internet Applications. 200501 R. Chong.J.F. Abrah´m. UT. Faculty of Mathematics and Computer Science. Tveretina. 2005-05 M. Structuring Structural Operational Semantics. Integration of Analysis Techniques in Security and FaultTolerance. TU/e. Modeling and Remodeling in Bone Tissue. Faculty of Mathematics and Computer Science.Lower Bounds and Network Reliability. UU.Reactive Scheduling.N. 2004-21 ´ E. 2005-11 J. Faculty of Science. Faculty of Mathematics and Natural Sciences. UT. Computational Aspects of Treewidth . TU/e. Faculty of Electrical Engineering.

TU/e. Faculty of Science. Corin. Effective Models for the Structure of pi-Calculus Processes with Replication. 2005-16 T.J. Faculty of Sciences. Division of Mathematics and Computer Science. Faculty of Mathematics and Natural Sciences. Faculty of Mathematics and Computer Science and Faculty of Mechanical Engineering. VUA. Nijssen. Composing Constraint Solvers. Gelsema. TU/e. RU. Man and R.L. TU/e. Mathematics. Faculty of Science. 2006-08 B. 2005-20 A. The Purely Functional Software Deployment Model. B¨hm-Like Trees for o Rewriting. Zoeteweij. UL. UL.G. 2005-18 J.W.R. Towards Hybrid Molecular Simulations. Analysis Models for Se- curity Protocols. VUA. Faculty of Natural Sciences. Mathematics.R. Formal Specification and Analysis of Hybrid Systems. Law.Techniques and Applications. Stepping through Haskell. 2006-01 R. Hendriks. Sokolova. Markvoort. 200609 S. Vinju.A. Kyas. Faculty of Science. UT. Modal Abstraction and Replication of Processes with Data. Faculty of Electrical Engineering. UU. 2006-04 M. Mathematics and Computer Science.J. UvA. Key management and link-layer security of wireless sensor networks: energy-efficient attack and defense. Ketema. 2006-06 J. Mathematics & Computer Science. 2006-02 P. Dijkstra. Faculty of Science. Verbaan. 2006-03 K.R. 2005-19 M.-B. Faculty of Electrical Engineering. 2006-07 C. UvA. 2005-22 E. Analysis and Transformation of Source Code by Parsing and Rewriting. RU. Mathematics and Computer Science. On JML: topics in tool-assisted verification of JML programs. Faculty of Science. Schiffelers. UU. The Computational Complexity of Evolving Systems.H. and Computer Science. Model Checking Timed Automata . Faculty of Biomedical Engineering. and Computer Science.A. Faculty of Mathematics and Natural Sciences. Coalgebraic Analysis of Probabilistic Systems. Faculty of Mathematics and Computer Science. Verifying OCL Specifications of UML Models: Tool Support and Compositionality. Dolstra. Faculty of Sciences. 2005-21 Y. Breunesse. Mining Structured . Faculty of Natural Sciences. 200605 M. UU. Mathematics & Computer Science. 2005-17 P. UT.Valero Espada.

Semantics and Applications of Process and Program Algebra. Hybrid Techniques for Hybrid Systems. Faculty of Mathematics and Computing Sciences. Separation and Adaptation of Concerns in a Shared Data Space. Faculty of Electrical Engineering. Faculty of Natural Sciences. 2006-12 B. Mathematics. Reconciling Nondeterministic and Probabilistic Choices. Guillen Scholten. UL. Mathematics and Computer Science. TU/e. Mathematics & Computer Science. 2006-19 C. Faculty of Mathematics and Computer Science.D. UT. Faculty of Science. Mathematics and Computer Science. Faculty of Science. UL. Theories for a Model-based Testing: Real-time and . Cremers. Kavaldjiev. Brand´n Briones.A. VUA. Faculty of Mathematics and Natural Sciences. TU/e. RU. Faculty of Mathematics and Natural Sciences. UT. and Computer Science. 2007-04 L. Mathematics. Faculty of Electrical Engineering. 2006-21 H. Faculty of Mathematics and Computer Science. TU/e. Formalising Interface Specifications. RU. and Computer Science. 2006-14 T. Mathematics and Computer Science. 2006-10 G. RUG. Language Based Security for Java and JML. Mobile Channels for Exogenous Coordination of Distributed Systems: Semantics. Mathematics & Computer Science. A run-time reconfigurable Network-on-Chip for streaming DSP applications. 2006-16 V. Faculty of Mathematics and Computer Science. Gebremichael. Scyther . Mooij. UT. 2006-20 J. Flexible Heterogeneous Software Systems. 2006-13 A. van Veelen. Implementation and Composition. Faculty of Electrical Engineering.F. UvA. 2006-18 L. 2006-17 B. Mathematics & Computer Science. Faculty of Natural Sciences. Vu. 2006-11 L. de Jong. TU/e.J. Division of Mathematics and Computer Science.M. Cheung. Faculty of Mathematics and Computer Science. Warnier.E. Expressivity of Timed Automata Models.K.Semantics and Verification of Security Protocols. UvA.V. Faculty of Science. Sundramoorthy. Faculty of Sciences. 2007-02 M.Data. Constructive formal methods and protocol standardization. Russello. At Home In Service Discovery. 2007-03 T. van Gool. 2007-01 N. Krilavicius. 2006-15 M.C. Verification techniques for Extensions of Equality Logic. RU.J. Badban. Considerations on Modeling for Early Detection of Abnormalities in Locally Autonomous Distributed Systems.

Brinkman. Faculty of Mechanical Engineering. What to do Next?: Analysing and Optimising System Be- haviour in Time. TU/e.F. 2007-19 W. New Data Structures and Algorithms for Mobile Data. Faculty of Natural Sciences. Faculty of Electrical Engineering. UT. Jarnikov. Faculty of Mathematics and Computer Science. TU/e. TU/e. Searching in encrypted data. Natural Deduction: Sharing by Presentation. Mathematics & Computer Science. UT. Loeb. Mathematics. Faculty of Electrical Engineering. Imperfect Information in Software Development Processes.J.S.R. Mathematics and Computer Science. Logical Calculi for Reasoning with Binding. Faculty of Mathematics and Computer Science. Noppen. RU. Pieters. Assessing and Improving the Quality of Modeling: A Series of Empirical Studies about the UML. A.A. Streppel. Faculty of Science. Multifunctional Geometric Data Structures. VUA. TU/e.Coverage. Mathematics. Trˇka. Lange. van Weelden. Silent Steps in Transition c Systems and Markov Chains. TUD. 2007-16 A.UvA. 2007-12 A. 2008-01 . Mathematics and Computer Science. Integration and Test plans for Complex Manufacturing Systems.J. Wijs. 2007-11 R. 2007-08 R.H. Faculty of Sciences. TU/e. 2007-07 N. 2007-13 C. Graaf. 2007-18 M. Integration and Delivery. RU. Faculty of Mathematics and Computer Science. Mathijssen. and Computer Science. Model-Driven Evolution of Software Architectures. 2007-14 T. La Volont´ Machinale: e Understanding the Electronic Voting Controversy. Faculty of Mathematics and Computer Science. Abam. 2007-17 D. TU/e.A. Division of Mathematics and Computer Science. Faculty of Science.J. Mathematics & Computer Science. Faculty of Mathematics and Computer Science. TU/e. Mathematics and Computer Science. 2007-06 M. 2007-15 B. and Computer Science. Boumen. QoS framework for Video Streaming in Home Networks. RU. Putting types to good use. van der Storm. Faculty of Electrical Engineering. 2007-09 A.W. Mathematics & Computer Science. Faculty of Science. 2007-10 J. 2007-05 I. Faculty of Mathematics and Computer Science. Component-based Configuration. UT. Faculty of Electrical Engineering.

Farshi. Tracing Anonymity with Coalgebras. UT. Parsing. A. Faculty of Science. TUD. 2008-05 M. Mak.S. 2008-07 I. Gulesir. de Jong. 2008-14 P. Mathematics & Computer Science. Mathematics. 2008-03 A. Faculty of Mechanical Engineering. D¨ rr. and Computer Science. Mathematics and Computer Science. UT. TU/e. 2008-11 M. Faculty of Electrical Engineering. Mathematics & Computer Science. Faculty of Mathematics and Computer Science. Formal and Computational Cryptography: Protocols. Bravenboer. Practical Automaton Proofs in PVS. Hashes and Commitments. Marin. Keeping Fairness Alive: Design and Formal Verification of Optimistic Fair Exchange Protocols. Zapreev.S. 2008-13 F. Resource-based Veriu fication for Robust Composition of Aspects. 2008-15 E. TU/e. Faculty of Mechanical Engineering. Exercises in Free Syntax: Syntax Definition. Faculty of Science. A Theoretical and Experimental Study of Geometric Networks.W. Faculty of Electrical Engineering. Design and Performance Analysis of Data-Independent Stream .M.A.H.W.L. Faculty of Mechanical Engineering.M. 2008-09 L. 2008-02 M. Modelbased Integration and Testing of Hightech Multi-disciplinary Systems. Renovation of Idiomatic Crosscutting Concerns in Embedded Systems. Mathematics & Computer Science. 2008-04 N. 2008-10 I. Cleophas. TU/e. Bruntink. E. Faculty of Electrical Engineering. Division of Mathematics and Computer Science.G. and Assimilation of Language Conglomerates.M. Evolvable Behavior Specifications Using Context-Sensitive Wildcards. Faculty of Science. Faculty of Mathematics and Computer Science. Faculty of Science. An Integrated System to Manage Crosscutting Concerns in Source Code. Math- ematics and Computer Science. Braspenning. VUA. Faculty of Electrical Engineering. RU. Bortnik.D. Formal Methods in Support of SMC Design. UU. Mathematics. 2008-16 R. Model Checking Markov Chains: Techniques and Tools. TUD. UT. Tree Algorithms: Two Taxonomies and a Toolkit. RU. 2008-08 I.A. Faculty of Sciences. Garcia. Mathematics and Computer Science. RU. TU/e. Faculty of Electrical Engineering. Torabi Dashti. TU/e. de Groot.M.C. Hasuo. and Computer Science. 2008-12 G. 2008-06 M. Integration and Test Strategies for Complex Manufacturing Machines.

Faculty of Electrical Engineering. TU/e. a proof assistant for Clean. Algorithms for Fat Objects: Decompositions and Applications. Drawing Graphs for Cartographic Applications. Khadim. 2009-01 M. van der Horst. 2008-23 A. Lormans. Faculty of Electrical Engineering. TU/e. TU/e. UT.M. 2008-29 M. Mathematics & Computer Science. Faculty of Mathematics and Computer Science. Mathematics & Computer Science. Mathematics and Computer Science. 2008-17 M. Faculty of Mathematics and Computer Science. Faculty of Mathematics and Computer Science. Faculty of Mathematics and Computer Science. Faculty of Mathematics and Computer Science. Modeling and Validating Distributed Embedded Real-Time Control Systems. Cryptographic Keys from Noisy Data Theory and Applications. Buhan. Models of Natural Computation: Gene Assembly and Membrane Systems. UT. Mathematics & Computer Science.G. 2008-24 U.H. Scalable Block Processing Algorithms. 2008-18 C. UT. Faculty of Science. TU/e. Faculty of Mathematics and Computer Science. Real and Stochastic Time in Process Algebras for Performance Evaluation. Faculty of Mathematics and Natural Sciences. 2008-19 J. Faculty of Electrical Engineering. 2008-25 J. TU/e.R. RU. UT. Faculty of Electrical Engineering. 2008-27 I. Marin-Perianu. UL. Managing Requirements . Reasoning about Functional Programs: Sparkle. Gray. Testing Reactive Syse tems with Data . Calam´. de Mol. Faculty of Mathematics and Natural Sciences. UL. Faculty of Mathematics and Computer Science. Mining Semistructured Data. 2009-02 M.R. 2008-20 E. RU. Mumford. 2008-22 R. 2008-26 H.Enumerative Methods and Constraint Solving. TU/e.H. Theoretical and Experimental Aspects of Pattern Evaluation. 2008-21 E. de Graaf. Wireless Sensor Networks in Motion: Clustering Algorithms for Service Discovery and Provisioning. Process Algebras for Hy- brid Systems: Comparison and Development. Koprowski. Kastenberg. Brijder. Markovski. 2008-28 R. TU/e. Graph-Based Software Specification and Verification. Termination of Rewriting and Its Certification.Processing Systems. Mathematics & Computer Science. Mathematics and Computer Science. Faculty of Science.S. Verhoef.

Faculty of Electrical Engineering. Automated Model-based Testing of Hybrid Systems.Evolution.W. UT. Architecting Fault-Tolerant Software Systems.J. Faculty of Electrical Engineering. 2009-04 H. 2009-03 M. and Computer Science.P. TU/e. Mathematics. Sozer. Mathematics & Computer Science. TUD. van Osch. Faculty of Mathematics and Computer Science. 2009-05 .

Sign up to vote on this title
UsefulNot useful

Master Your Semester with Scribd & The New York Times

Special offer for students: Only $4.99/month.

Master Your Semester with a Special Offer from Scribd & The New York Times

Cancel anytime.