Abstract: A system that is highly dependable under hostile conditions but whose dependability cannot be easily evaluated prior to the deployment of applications is less desirable than a system with lower but predictable dependability. This is because decision-making on the deployment of high-assurance systems is often based on a risk analysis of application failures. For systems implemented on a cloud, the problem of system certification assumes added importance because of third-party control of cloud resources and the attendant problems of faults, QoS degradations, and security violations. In this light, our paper focuses on: i) formulating metrics to quantify the dependability of cloud-based applications; and ii) identifying techniques to measure these metrics prior to deployment of applications. The paper treats system dependability as an application-level QoS for management purposes, and advocates a probabilistic evaluation of dependability. Our approach is corroborated by measurements on system-level prototypes and simulation analysis of system models in the face of hostile environment conditions. A case study of a replicated data service anchored on cloud infrastructures is also described.

I. INTRODUCTION

We consider an application running on top of the computational and communication services realized over one or more cloud infrastructures. The system as a whole implements a core functionality, with augmentations from the service provider to support a variety of para-functional behaviors. For instance, data replication and content distribution may be offered as core services that are often associated with, say, performance, security, and timeliness attributes. Here, the problem of system certification (i.e., reasoning about whether a system behaves in the way it is supposed to) has become important because of the third-party control of cloud resources and the attendant issues of fault-handling, security, maintenance, availability, and the like [1].

The dependability of a cloud-based application system S is a measure of how well S meets its intended QoS objectives under uncontrolled external environment conditions incident on S. Say, for example, S is a content distribution network (CDN) that advertises content delivery to clients within 5 sec of a request: say, a news download. Suppose S achieves the best QoS of the 5 sec guarantee only with a probability of 0.4, and achieves a latency distributed between 5 and 15 sec in other cases. Under simplifying assumptions, the dependability of S in meeting latency specs is estimated as 0.7 on a normalized scale [0, 1]. If the content storage/delivery backlog becomes severe, the 5 sec latency is less sustainable (assuming that other parameters of the storage/delivery mechanism do not change), and hence S is now much less than 70%-capable in meeting the latency specs. The system capability may however be enhanced by installing additional proxy server nodes along the content distribution topology. Concomitant with this notion of system dependability is a safety aspect, namely, S should not reach unsafe states while meeting its objectives: say, in this example, the client connectivity to a content getting disrupted.

The goal of our paper is to design the software engineering methods and tools to quantify dependability, and identify the system-level techniques therein to assess the dependability of S. Analyzing the dependability of S involves verifying that the safety requirements are met as S strives to meet its QoS objectives under various external environment conditions incident on the underlying cloud services.

Our notion of dependability of S is at the confluence of QoS, timeliness, and fault-tolerance attributes of S. It is divorced from a traditional view where the dependability of S is rigidly tied to the fault-tolerance of S. The QoS feature depicts an ability of S to control its performance in response to an underlying infrastructure resource allocation or a change in the external environment conditions. The QoS-to-resource mapping relationship should be established in a quantitative manner under specific environment conditions, in order to meet the performance objectives in a predictable way. An example is the determination of content delivery latency over a distribution network set up on a geographically spread-out cloud of content storage nodes, in the presence of node failures. Virtualization, which allows realizing the distribution network as a core service from the cloud provider, does not by itself prevent fluctuations in the latency behavior (e.g., jitter) induced by node failures and outages. Here, a para-functional goal is to reduce the latency jitter by resorting to content caching techniques, thereby assuring a stable behavior of applications. The mapping between the output of S and platform resources should be known with reasonable accuracy: either as a closed-form model of S or through a series of incremental allocate-and-observe invocations on S [2].

The domain-specific core adaptation function in a cloud-based application system S is viewed as a control-theoretic feedback loop acting on a reference input P_ref: say, for cloud resource management (such a view is also advocated in [3]). Figure 1 concretizes this view. The controller C generates its actions I based on a computational model of S, denoted
as: g(I, O, s, E), where O is the plant output in response to the trigger I, s is the plant state prior to the incidence of I, and E depicts the environment condition. Since the true plant model g*(I, O*, s*, E*) is not completely known, C refines its input action in the next iteration based on the deviation of the observed output O* from the expected output O when action I occurs. Upon S reaching a steady-state (over multiple control iterations) with output P', the output tracking error |P_ref − P'| is analyzed to reason about the system dependability. Our approach is guided by the concepts and taxonomy of dependable computing presented in [4].

[Fig. 1. Structure of adaptation processes in a network application system. Legend: an external management entity H assesses system dependability by observing system behavior; system input is supplied as P_ref and system output is observed as P'; the core system is g*(I, O*, s*, E*), with I: input, O*: output, s*: internal state, E*: parametric representation of external events (e.g., errors, threat, outage).]

We treat the dependability as a management attribute of cloud-based adaptive systems. An external management entity H views the system S as made up of adaptation processes A'_p wrapped around a core system g*(···), i.e., S ≡ A'_p ⊗ g*(···), where '⊗' denotes the composition in an object-oriented software view. A'_p is embodied in a distributed agent-based software module that forms the building-block to structure S ([5] provides an architecture for distributed realization of the adaptation logic of A'_p). S interacts with its (hidden) external environment through the core elements g*(···): e.g., responding to client queries on a web server, and delivering content over a network transport connection. Here, the meta-level signal flows between A'_p and g*(···) are visible to H.

The layered software structure of S intrinsic to cloud-based systems: viz., the infrastructure, service-oriented algorithms, and adaptive application, stacked in that hierarchy and separated across well-defined interfaces^1, lends itself well for [...] detection). These core layers together constitute g*(···), with A'_p housed in the application layer for behavior monitoring and control. The latter involves exercising the underlying algorithm layer (and in turn, the infrastructure layer) via signaling points defined in the service interfaces. The dependability of S [...] a case study of cloud-based content distribution networks, with a focus on system-level dependability.

1 The system layers correspond to IaaS, PaaS, and SaaS, in the cloud computing arena [6].

The paper is organized as follows. Section II treats system dependability as compliance to the stated non-functional attributes (e.g., QoS). Section III describes a sample application: replicated data service, from a standpoint of the QoS compliance issues in cloud-based realizations. Section IV presents our model-based approach to assess the dependability of cloud-based systems, with a focus on the adaptive fault-tolerance of replicated data servers. Section V discusses the relationship to existing system management frameworks. Section VI concludes the paper.

II. DEPENDABILITY OF CLOUD-BASED SYSTEMS

Suppose G depicts the goal to be met by a cloud-based network system S. The goal G may include a prescription of one or more non-functional attributes associated with the QoS delivered by S to an application: such as resilience to external disturbances, stability against resource fluctuations, and responsiveness to user-triggered requirements. The dependability of S is prescribed in terms of such non-functional attributes, and is hence distinct from the correctness goal of S, which is a functional attribute (yielding a YES/NO result). This section provides a non-functional characterization of system dependability. Here, the dependability of S is a measure of how well the adaptation functions programmed into S adjust to the changing external environment conditions in meeting an application-level goal G specified for S.

A. Dependability as application-level QoS

We treat the dependability of a cloud-based system S as a meta-attribute of application-level QoS achievable under the current operating conditions. An example is how stable the content delivery latency achieved on a distribution link is in the face of bursty demands and resource costs. Another example is how good the availability of a web service is in the presence of failures of one or more server replicas. Dependability assessment is anchored on three interwoven properties associated with the behavior of S:
• Measurability, wherein S can map its output/state variables onto high-level metrics over a wide range of input conditions (for analysis and reasoning);
• Predictability, wherein S can compute its expected output under various input and environment conditions (with reasonable accuracy);
• Adaptability, wherein S can adjust the utility extracted from its output as the environment conditions become harsher, thereby heavily dwindling its resources.

These properties depict that S is programmable, i.e., an external management entity H can exercise control over the behavior of S in a concrete manner. For instance, even if S is 100% fault-tolerant, an inability to reason about this property in various fault scenarios reduces the dependability of S from an application standpoint. In the earlier example of a replicated web service, a 95%-availability depicts a verifiable assurance, over a suitable time-scale, that no more than 5% of the client queries incur a latency higher than a set limit, say, 12 sec, even in the presence of server crashes. Thus, a characterization of S in terms of behavioral properties improves the reasoning [...]

[...] functionalities: i) core components and resources set up over the cloud infrastructure that collectively export a service to the application, and ii) a controller C that exercises the core components/resources to meet the application needs P_ref. An application interacts with the system core through a service-oriented interface (made up of APIs), with a certain trust on the service delivery vis-a-vis its quality. A model of S captures the following elements (c.f. Figure 1):
• Plant g*(I, O*, s*, E*) made up of computational algorithms that map the infrastructure resources and components onto a unified service delivered to applications;
• Controller C that decides on an input I to g*(·): say, resource allocation and/or component assignment in the infrastructure, based on the current service state s ≈ s*^2 and environment condition e ∈ E*, to make the observed output O reach close to P_ref in a steady-state;
• Sensor that maps the service state s onto the observed service output O to report back to C;
• Actuator to realize a resource/component allocation decision I as domain-specific actions on the infrastructure.

2 Observed service interface state is mapped from service-internal state s*.

The service interface basically wraps around the core system g*(···). A function g(···) that approximates g*(···) and the errors in sensing/actuation processes are factored into a system model programmed into the controller C.

A dependability metric, denoted as D_S(e, G), is a measure of how well the system S meets its goal G under the environment condition e ∈ E*. Here, G is prescribed as a behavior of S that moves the physical plant to a desired state. Note that a certification authority quantifying D_S(e, G) may be external to the cloud provider and the application-level user.

The external events E* incident on S can emanate from the application, service-layer algorithms, and cloud infrastructure. E* includes the errors in QoS specification (e.g., incomplete specs), algorithm formulation (e.g., unspecified events), and resources/components (e.g., a node crash) respectively.

We cast the notion of system dependability in a broader context, as advocated in [4], than the currently prevalent approaches that rigidly tie it to the system-level mechanisms for recovery from component-level faults occurring in a system.

B. System dependability, trust, and fault-tolerance

[4] characterizes dependability as a form of trust bestowed on a system S by its users across the service interface to S, where a user may be a human entity or a physical world process or a computational sub-system that acts as the client of S. It depicts how trustable the service exported by S is, as perceived by those who depend on S (i.e., the users of S).

We project the output tracking error ε = |P_ref − P'| onto a dependability measure (note: P' is the final converged O*). The following correspondences can be readily established:
• ε > ε_m depicts a service failure (i.e., S is of no value);
• δ < ε ≤ ε_m depicts a service degradation (but S may [...]

[...] representation of s* in the model programmed into C).

The environment E* covers anything outside the control regime of S but which impacts the operations of S: e.g., the abrupt shut-down of a business unit in an enterprise system causing delays in the servicing of customer demands (or even scaling down system operations).

Under the above framework, the conventional system-level faults (e.g., node crashes) can be subsumed as part of an uncontrolled external event space E*. QoS degradations can also be viewed as faults that constitute a part of E* (e.g., packet loss along a network path). How S reacts to the hostility of an external event e ∈ E* and shields the application from the effects of e allows us to delineate the hitherto qualitative notions of system robustness and resilience.

Robustness depicts the ability of S to present a service behavior to applications as if there are no failures or hostile events. Here, the application does not see the impact of hostile events, as captured by: P' ≈ P_ref. This is enabled by strong service-layer guarantees provided by S using, say, resource reservations and/or redundant component deployments in the infrastructure, and the underlying algorithms to correctly manage these system-level mechanisms. The signaling interaction between the application and service layers occurs only when an application starts and when it terminates.

In contrast, resilience depicts the ability of S to reconfigure its internal functions and/or parameters, in a way to continue operations in a degraded mode. Here, the application does see the impact of hostile events, albeit indirectly, by scaling down its operations to a reduced level, which is captured as: 0 ≪ (P_ref − P') ≤ ε_m. The service-layer is programmable, with S employing parameterizable system-level mechanisms that are invoked by the application at various time-scales during run-time as part of its reconfiguration strategies. Accordingly, a richer API is required to support the signaling
interaction needed between the application and service layers: namely, notifying the quality degradation |P_ref − P'| to the application and controlling the parametric and/or algorithm changes needed in the service-layer^3.

C. Tracking error based dependability specification

Consider an application system S ≡ A'_p ⊗ g*(I, O*, s*, E*) (c.f. Figure 1). A controller C embodied in A'_p operates on the physical plant g*(···) to generate an output close to P_ref. E* depicts the external environment, modeled as a set of [...]

[Figure: plant behavior relative to P_ref at different operating points op1, op2, op3. op3 is more hostile than op2, and op2 is more hostile than op1; op3 sustains a lower e than op2, and op2 sustains a lower e than op1: e(3m) < e(2m) < e(1m).]
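The correspondences between the tracking error ε = |P_ref − P'| and the service condition (Section II-B) can be sketched in code. The Python sketch below is illustrative only: the names `ServiceCondition` and `classify_tracking_error`, the numeric thresholds in the usage lines, and the treatment of ε ≤ δ as the robust case (P' ≈ P_ref) are assumptions, not definitions taken from the paper.

```python
from enum import Enum


class ServiceCondition(Enum):
    ROBUST = "robust"      # assumed: eps <= delta, i.e. P' is close to P_ref
    DEGRADED = "degraded"  # delta < eps <= eps_m: service degradation
    FAILED = "failed"      # eps > eps_m: service failure (S is of no value)


def classify_tracking_error(p_ref: float, p_out: float,
                            delta: float, eps_m: float) -> ServiceCondition:
    """Map the tracking error eps = |P_ref - P'| onto a service condition.

    delta and eps_m are application-chosen thresholds (hypothetical values
    are used in the usage example below).
    """
    if not 0.0 <= delta <= eps_m:
        raise ValueError("thresholds must satisfy 0 <= delta <= eps_m")
    eps = abs(p_ref - p_out)
    if eps > eps_m:
        return ServiceCondition.FAILED
    if eps > delta:
        return ServiceCondition.DEGRADED
    return ServiceCondition.ROBUST


if __name__ == "__main__":
    # Hypothetical CDN latency target of 5 sec, with delta = 0.5 and eps_m = 7.
    for observed in (5.2, 9.0, 15.0):
        print(observed, classify_tracking_error(5.0, observed, 0.5, 7.0).value)
```

A management entity could apply such a classification to each steady-state output P' it observes; how the thresholds are chosen is application-specific and is left open here.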
[Fig. 3. Cloud-based replicated data repository service. Client applications issue requests to a replicated data service (provider of the value-added service: redundancy, security, ...) whose server instances run on VMs leased from a cloud infrastructure (provider of the raw service; manages resources: VM, storage, ...). A trusted, neutral certifier H reasons about a log of meta-data for QoS analysis. In the depicted instance, N = 3 VMs run server instances out of K = 6 leased VMs, and f_a = 2 VMs suffer failure.]

[...] replicas f_a, it is reasonable to assume that a probability distribution h_ATT(f_a) is known. This statistical information can guide a more accurate choice of f_m such that f_a ≤ f_m with a high probability (and hence N). The QoS is prescribed in terms of non-functional attributes of the replicated web system: availability, integrity, performance, and resilience.

Consider the return of a query result by the voting module. Here, an integrity violation may possibly occur if more than f_m server modules are attacked, i.e., f_a > f_m. This is because there is a chance that the query result is incorrect: the result proposed by a faulty servlet may have generated enough consents for the voting module to declare the faulty proposal as a correct one. In another scenario, it is possible that none of the first (2f_m + 1) iterations produce a result enjoying f_m + 1 consents. Here, no result will be delivered to the client, which reduces the service availability.

The case of a hostile event e ≡ [f_a > f_m] when the voting algorithm is itself vulnerable to failure is our interest. It may manifest as, say, a non-completion of client queries and a return of incorrect results, which lowers the availability and integrity respectively of the 'data service' provided by S.

B. Analytical modeling of server replication

The algorithm designer for the voting-based replicated processing method may conservatively set f_m = ⌈N/2⌉ − 1 for a given N, even though the probability distribution h_ATT(f_a) may instill confidence in the designer to set f_m optimistically to a lower value. Though the probability of a correct decision is increased in the conservative case (by virtue of the increased margin of consent voting), the voting algorithm may still fail to produce a correct result in a case when f_a ≥ ⌈N/2⌉. Thus, the query failure probability P_qf may be estimated in terms of the probability that at least f_m + 1 of the servlets get attacked and exhibit faulty behavior (P_qf = 0 when f_a ≤ f_m). Depicting [...]

$$P_{qf} \;=\; \sum_{f_a = f_m + 1}^{K} r^{f_a} \times \frac{\binom{K - f_m - 1}{f_a - f_m - 1}}{\binom{K}{f_a}} \qquad (1)$$

[...] example of N = 5, K = 8, f_m = 2, r = 0.2, and uniform distribution of h_ATT(f_a) over the interval [1, 8]. The estimated query failure probability is: P_qf = 223.3 × 10^−5. By increasing the degree of replication, say as N = 7, P_qf = 36 × 10^−5.

Equation (1) is based on a stochastic analysis of the voting system in terms of its internal state variables. Under certain assumptions about the statistical independence of failures, the results allow a quick estimate of the extent of faults occurring in the system and the impact on client-perceived QoS: namely, difficult-to-measure parameters such as the query failure probability.

IV. EVALUATING ADAPTATION IN CLOUD-BASED SYSTEMS

We employ model-based analysis of the control-theoretic adaptation loops in a complex system S [9] to reason about system dependability in a cloud setting. Recall the system compositional view: S ≡ A'_p ⊗ g*(I, O*, s*, E*), where A'_p and g*(I, O*, s*, E*) depict the adaptation and core computational processes respectively. The reasoning functionality is embodied in the management module H (c.f. Figure 1).

A. Model-based reasoning about system behavior

H embodies the functionality to assess the effectiveness of A'_p by monitoring the tracking error ε = |P_ref − P'|. It is possible that A'_p embodies autonomic elements to morph the controller algorithms in C and/or the plant parameters, if needed, to reduce ε (and hence increase dependability). This involves predicting the future plant conditions and environment, and determining the anticipated control actions therein: such as dynamic switching of the push/pull algorithm and/or its parameters in a CDN to meet the changes in client demands [11]. The control-theoretic elements needed for a self-managing behavior of S [12], if any, and the computational intelligence aspects of A'_p therein [13], are^6 captured in an assessment of ε vis-a-vis e (say, dε/de); c.f. Figure 2.

6 We treat failures as causing system-level output errors that are indistinguishable from the errors arising due to system modeling inaccuracies. This unifying view allows S to employ feedback-based control techniques to deal with the failure-induced issues as well.

In the absence of exact knowledge about E* in the controller C, we capture the variation of tracking error TE with respect to the unknown inputs of the plant (E* − E) by treating TE as a random variable with a probability distribution that is derived from the native distributions of (E* − E). This is combined with an application-specific characterization of how TE = ε captures the usefulness of S in various operating conditions.
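The probabilistic treatment above can be illustrated with the CDN example from Section I. The Python sketch below treats the delivery latency as a random variable and averages an application-specific utility over it. Two assumptions are made that the paper does not state: the latency in the non-best case is uniformly distributed over [5, 15] sec, and the utility falls linearly from 1 at 5 sec to 0 at 15 sec. Under these assumptions the expected utility works out to 0.4 × 1 + 0.6 × 0.5 = 0.7, which agrees with the dependability value quoted in Section I. All function names are hypothetical.

```python
import random

BEST_LATENCY = 5.0    # sec, advertised delivery latency
WORST_LATENCY = 15.0  # sec, upper end of the observed latency range
P_BEST = 0.4          # probability of achieving the best-case latency


def utility(latency: float) -> float:
    """Assumed linear utility: 1 at 5 sec, 0 at 15 sec, clamped to [0, 1]."""
    frac = (WORST_LATENCY - latency) / (WORST_LATENCY - BEST_LATENCY)
    return max(0.0, min(1.0, frac))


def sample_latency(rng: random.Random) -> float:
    """Draw one latency: best case with probability P_BEST, else uniform (assumed)."""
    if rng.random() < P_BEST:
        return BEST_LATENCY
    return rng.uniform(BEST_LATENCY, WORST_LATENCY)


def estimate_dependability(n_samples: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo estimate of the expected utility over the latency distribution."""
    rng = random.Random(seed)
    total = sum(utility(sample_latency(rng)) for _ in range(n_samples))
    return total / n_samples


def exact_dependability() -> float:
    """Closed form under the same assumptions: mean utility of U(5, 15) is 0.5."""
    return P_BEST * 1.0 + (1.0 - P_BEST) * 0.5


if __name__ == "__main__":
    print("closed form:", exact_dependability())
    print("Monte Carlo:", estimate_dependability())
```

The Monte Carlo route is shown because it carries over to cases where the distribution of the tracking error is only available through samples, for example from prototype measurements.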
We surmount the issue of combinatorial complexity faced in analyzing a networked system S by resorting to a model-based engineering (MBE) approach that combines a tractable performance analysis with a control-theoretic feedback process, in order to converge to a (sub-)optimal resource allocation in the underlying infrastructure ([14] describes the role of feedback loops in self-adaptive systems). The MBE approach takes into account the cross-layer dependencies by featuring the functional elements of control loops residing in various layers with a holistic system-level model of S [15].

[Figure: a model-based controller C takes QoS specs and allocates resources based on a system model; a distributed network system schedules tasks to resources at system nodes under actual traffic and actual environment changes. The figure labels a tracking error, a modeling error, and an observation error on the observed performance.]
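The interplay between a system model and a feedback process can be illustrated with a toy allocate-and-observe loop. The Python sketch below is hypothetical throughout: the latency model `model_latency`, the stand-in plant `plant_latency`, their constants, and the one-node-per-iteration correction rule are invented for illustration and do not represent the paper's MBE implementation. The controller picks an initial number of proxy nodes from its approximate model g, corrects the allocation from the observed output of the plant g*, and reports the final tracking error |P_ref − P'|.

```python
def model_latency(nodes: int) -> float:
    """Controller's approximate model g: latency (sec) for a given node count."""
    return 40.0 / nodes


def plant_latency(nodes: int) -> float:
    """Stand-in for the true plant g*, which the controller can only observe."""
    return 55.0 / nodes + 0.5


def control_loop(p_ref: float, max_nodes: int = 50):
    """Return (nodes, observed latency P', tracking error |P_ref - P'|)."""
    # Step 1: model-based initial allocation (smallest node count the model accepts).
    nodes = 1
    while model_latency(nodes) > p_ref and nodes < max_nodes:
        nodes += 1
    # Step 2: feedback correction, adding one node per iteration while the
    # observed latency still exceeds the reference.
    observed = plant_latency(nodes)
    while observed > p_ref and nodes < max_nodes:
        nodes += 1
        observed = plant_latency(nodes)
    return nodes, observed, abs(p_ref - observed)


if __name__ == "__main__":
    nodes, observed, error = control_loop(p_ref=5.0)
    print(f"nodes={nodes} observed={observed:.3f} tracking_error={error:.3f}")
```

With these made-up constants the model alone would allocate 8 nodes, while the feedback step settles on 13, which shows how modeling error is absorbed by the loop at the cost of extra iterations.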
[...] pricing) [6]. The service layer cost is tied to the development and running of distributed algorithm software, and hence is domain-specific. In a push/pull algorithm for CDN, N proxy [...] interconnects and N' content storages (where 1 ≤ N' ≤ N), besides the base software itself to infuse the proxy capability in the N nodes. In a replicated web service layer, the needed software diversity among N servlet replicas (to deter failures) incurs O(N^2) cost, besides the base software itself running on the (N + 1) nodes and N interconnects. The service layer cost often exhibits a monotonic convex behavior with respect to N, and it dominates over the infrastructure costs because of the specialized software development efforts involved.

The combined cost incurred at the infrastructure and service layers gets assigned to the applications in one form or the other. The incurrence of per-node costs may compel a service provider to limit the number of nodes N participating in an algorithm execution, thereby lowering the attainable QoS. The [...]

[Figure: QoS-adaptive web-based information system built on a replicated web service with a voting protocol sub-system and a replication control module, with prototype system results (on UNIX-based LAN) for the time-to-complete query. Legend: external environment E* ≡ [f_a, r]; f_m: max. # of servlet failures assumed by voting protocol; r: degree of faulty behavior of a failed servlet (0 < r ≤ 1); f_a: actual # of servlet failures; K: total servlet pool size (K ≥ N). A delayed query result is less useful and reduces "service availability".]
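The trade-off between per-node cost and the attainable fault-tolerance can be sketched numerically. In the Python sketch below, the cost coefficients, the budget values, and the function names are hypothetical; only the shape of the cost (an O(N^2) diversity term plus terms for the (N + 1) nodes and N interconnects) and the conservative setting f_m = ⌈N/2⌉ − 1 come from the text.

```python
import math


def service_layer_cost(n: int, c_diversity: float = 2.0,
                       c_node: float = 10.0, c_link: float = 1.0) -> float:
    """Assumed cost model: diversity ~ N^2, base software on N+1 nodes, N links."""
    return c_diversity * n ** 2 + c_node * (n + 1) + c_link * n


def max_replicas_within_budget(budget: float, n_max: int = 100) -> int:
    """Largest N whose cost fits the budget (0 if even N = 1 is too costly)."""
    best = 0
    for n in range(1, n_max + 1):
        if service_layer_cost(n) > budget:
            break  # cost grows monotonically with N, so no larger N can fit
        best = n
    return best


def tolerated_failures(n: int) -> int:
    """Conservative f_m = ceil(N/2) - 1 for a replica count N >= 1."""
    if n < 1:
        raise ValueError("need at least one replica")
    return math.ceil(n / 2) - 1


if __name__ == "__main__":
    for budget in (120.0, 150.0, 200.0):
        n = max_replicas_within_budget(budget)
        print(f"budget={budget}: N={n}, f_m={tolerated_failures(n)}")
```

Under such a model, a tighter budget caps N and therefore the number of servlet failures the voting protocol can be configured to tolerate, which is one concrete way the per-node cost lowers the attainable QoS.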
• Software engineering for the verification of application requirements (including para-functional ones) [17], [18].

In contrast, our work focuses on model-based assessment of the dependability of cloud-based systems (our earlier work is on an assessment of Internet-type network systems [19]).

[20] provides a QoS-aware network service composition and adaptation to infuse a core self-management intelligence for autonomic operations across heterogeneous networks. However, cloud components are under third-party control with less-than-100% trust, which poses challenges in guaranteeing system-level reliability and robustness.

Chisel [21] provides an adaptation framework by separating the core system functionality from those dealing with non-functional behaviors. This allows a seamless and transparent infusion of new behaviors at run-time to deal with unanticipated external events e ∈ E*, and a removal of those behaviors when the conditions causing e cease. The dynamic frameworks discussed therein are applicable to cloud-based systems, given the uncontrolled nature of third-party administered cloud resources. Our emphasis however is on the decision-theoretic mechanisms to control the behavior of cloud-based systems.

[22] advocates dependability differentiation to identify distinct classes of cloud-based applications (with appropriate scheduling of the underlying cloud resources). Here, system assessment is feasible with our probabilistic treatment of a control error vis-a-vis its impact on system dependability [23].

VI. CONCLUSIONS

We treat system dependability as application-level QoS at a meta-level (regardless of the system complexity). Dependability assessment of a cloud-based system involves three aspects: measurability, predictability, and adaptability of system behavior, which are enabled by the programmability of cloud-based infrastructures and services. Guided by the concepts provided in [4], the paper studied model-based engineering techniques to quantify system dependability, and certify cloud-based systems therein. System-level fault-tolerance mechanisms, hitherto treated separately, are subsumed in the broader notion of system dependability.

Given a system S realized on a cloud-based infrastructure, the paper employed model-based engineering techniques to analyze the output behavior of S relative to what is expected of S under uncontrolled environment conditions incident on S (such as attacks and failures in the infrastructure). A management entity H external to S incorporates the observation logic to reason about the capability of S by quantification of control errors and probabilistic QoS assessment. We also undertook a case study of replicated data servers to assess the quality of fault-tolerance. The focus of study was on the assessment of how good the adaptation processes are against the imperfect knowledge about system operating conditions: such as server failures and available network bandwidth.

An advantage of our management structure for the assessment of cloud-based systems is the reduction in development cost of distributed control software for system management via model reuse and service-level programming.

REFERENCES

[1] Panel Discussion (K. R. Joshi, G. Bunker, F. Jahanian, A. V. Moorsel, J. Weinman). Dependability in the Cloud: Challenges and Opportunities. In IEEE/IFIP Intl. Conf. on Dependable Systems and Networks, June 2009.
[2] B. Li, K. Nahrstedt. A Control-based Middleware Framework for Quality of Service Adaptations. In IEEE Journal on Selected Areas in Communications, vol. 17, no. 9, Sept. 1999.
[3] C. C. Lamb, P. A. Jamkhedkar, G. L. Heileman, and C. T. Abdallah. Managed Control of Composite Cloud Systems. In proc. IEEE Intl. Symp. on Service-oriented System Engineering, Irvine (CA), Dec. 2011.
[4] A. Avizienis, J. C. Laprie, B. Randell, C. Landwehr. Basic Concepts and Taxonomy of Dependable and Secure Computing. In IEEE Transactions on Dependable and Secure Computing, 1(1), pp. 11-33, Jan. 2004.
[5] P. G. Bridges, M. Hiltunen, and R. D. Schlichting. Cholla: A Framework for Composing and Coordinating Adaptations in Networked Systems. In IEEE Transactions on Computers, 58(11), pp. 1456-1469, Nov. 2009.
[6] Sun Microsystems. Introduction to Cloud Computing architecture. In White Paper, 1st ed., June 2009.
[7] T. Metsch. Open Cloud Computing Interface: Use Cases and Requirements for a Cloud API. Open Grid Forum, Sept. 2009.
[8] K. Ravindran, K. A. Kwiat, A. Sabbir, and B. Cao. Replica Voting: a Distributed Middleware Service for Real-time Dependable Systems. In proc. IEEE/ACM COMSWARE'06, New Delhi (India), January 2006.
[9] M. Cordier, P. Dague, M. Dumas, F. Levy, A. Montmain, M. Staroswiecki, and L. Trave-massuyes. A comparative analysis of AI and control theory approaches to model-based diagnosis. In proc. 14th European Conference on Artificial Intelligence, pp. 136-140, 2000.
[10] F. S. Hillier and G. J. Lieberman. "Non-linear Programming" and "Metaheuristics". Chap. 12, 13, Introduction to Operations Research, McGraw-Hill publ. (8th ed.), pp. 547-616, 2005.
[11] K. Ravindran. Dynamic Protocol-level Adaptations for Performance and Availability of Distributed Network Services. In Modeling Autonomic Communication Environments, Multicon Lecture Notes, Oct. 2007.
[12] Y. Diao, J. L. Hellerstein, G. Kaiser, S. Parekh, and D. Phung. Self-Managing Systems: A Control Theory Foundation. In IBM Research Report, RC23374 (W0410-080), Oct. 2004.
[13] R. C. Eberhart and Y. Shi. Computational Intelligence. In chap. 2, Computational Intelligence: Concepts to Implementations, Morgan Kaufmann Publ., 2007.
[14] Y. Brun, G. D. M. Serugendo, C. Gacek, H. Giese, H. Kienle, M. Litoiu, H. Muller, M. Pezze, and M. Shaw. Engineering Self-Adaptive Systems Through Feedback Loops. Book Chapter, Self-Adaptive Systems, LNCS 5525, Springer-Verlag, pp. 48-70, 2009.
[15] C. G. Cassandras and S. Lafortune. Introduction to Discrete Event Systems, Springer-Verlag publ., 2007.
[16] C. Lu, Y. Lu, T. F. Abdelzaher, J. A. Stankovic, S. H. Son. Feedback Control Architecture and Design Methodology for Service Delay Guarantees in Web Servers. In IEEE Trans. on Parallel and Distributed Systems, 17(7), Sept. 2006.
[17] I. Schaefer and A. P. Heffter. Slicing for Model Reduction in Adaptive Embedded Systems Development. In Workshop on Software Engineering for Adaptive and Self-managing Systems (SEAMS 2008), May 2008.
[18] J. Yi, H. Woo, J. C. Browne, A. K. Mok, F. Xie, E. Atkins, and C. G. Lee. Incorporating Resource Safety Verification to Executable Model-based Development for Embedded Systems. In IEEE Real-time and Embedded Technology and Applications Symp., pp. 137-146, 2008.
[19] K. Ravindran. Model-based Engineering for Certification of Complex Adaptive Network Systems. In proc. IEEE Workshop on Cyber-Physical Networking Systems, ICDCS'12, Macau (China), June 2012.
[20] J. Xiao and R. Boutaba. QoS-Aware Service Composition and Adaptation in Autonomic Communication. In IEEE Journal on Selected Areas in Communications, 23(12), Dec. 2005.
[21] J. Keeney and V. Cahill. Chisel: A Policy-Driven, Context-Aware, Dynamic Adaptation Framework. In proc. IEEE Intl. Workshop on Policies for Distributed Systems and Networks (POLICY'03), June 2003.
[22] Ameen Chilwan. Dependability Differentiation in Cloud Services. Master's Thesis, Dept. of Telematics, Norwegian University of Science and Technology, Advisor: P. E. Heegaard, July 2011.
[23] K. Ravindran. Managing Robustness of Distributed Applications Under Uncertainties: An Information Assurance Perspective. In proc. Cyber Security and Information Intelligence Research Workshop (CSIIRW), ACM-ICPS, Oak Ridge National Laboratory (TN), April 2010.