
17th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications

Self-Assessment and Reconfiguration Methods for Autonomous Cloud-based Network Systems
Kaliappa Ravindran
Department of Computer Science
The City University of New York
New York, NY 10031, USA.
Email: ravi@cs.ccny.cuny.edu

Abstract: A system that is highly dependable under hostile conditions but whose dependability cannot be easily evaluated prior to the deployment of applications is less desirable than a system with lower but predictable dependability. This is because decision-making on the deployment of high-assurance systems is often based on a risk analysis of application failures. For systems implemented on a cloud, the problem of system certification assumes added importance because of third-party control of cloud resources and the attendant problems of faults, QoS degradations, and security violations. In this light, our paper focuses on: i) formulating metrics to quantify the dependability of cloud-based applications; and ii) identifying techniques to measure these metrics prior to deployment of applications. The paper treats system dependability as an application-level QoS for management purposes, and advocates a probabilistic evaluation of dependability. Our approach is corroborated by measurements on system-level prototypes and simulation analysis of system models in the face of hostile environment conditions. A case study of a replicated data service anchored on cloud infrastructures is also described.

I. INTRODUCTION

We consider an application running on top of the computational and communication services realized over one or more cloud infrastructures. The system as a whole implements a core functionality, with augmentations from the service provider to support a variety of para-functional behaviors. For instance, data replication and content distribution may be offered as core services that are often associated with, say, performance, security, and timeliness attributes. Here, the problem of system certification (i.e., reasoning about whether a system behaves in the way it is supposed to) has become important because of the third-party control of cloud resources and the attendant issues of fault-handling, security, maintenance, availability, and the like [1].

The dependability of a cloud-based application system S is a measure of how well S meets its intended QoS objectives under uncontrolled external environment conditions incident on S. Say, for example, S is a content distribution network (CDN) that advertises content delivery to clients within 5 sec of a request: say, a news download. Suppose S achieves the best QoS of the 5 sec guarantee only with a probability of 0.4, and achieves a latency distributed between 5 and 15 sec in other cases. Under simplifying assumptions, the dependability of S in meeting latency specs is estimated as 0.7 on a normalized scale [0, 1]. If the content storage/delivery backlog becomes severe, the 5 sec latency is less sustainable (assuming that other parameters of the storage/delivery mechanism do not change), and hence S is now much less than 70%-capable in meeting the latency specs. The system capability may however be enhanced by installing additional proxy server nodes along the content distribution topology. Concomitant with this notion of system dependability is a safety aspect, namely, S should not reach unsafe states while meeting its objectives: say, in this example, the client connectivity to a content getting disrupted.

The goal of our paper is to design the software engineering methods and tools to quantify dependability, and identify the system-level techniques therein to assess the dependability of S. Analyzing the dependability of S involves verifying that the safety requirements are met as S strives to meet its QoS objectives under various external environment conditions incident on the underlying cloud services.

Our notion of dependability of S is at the confluence of QoS, timeliness, and fault-tolerance attributes of S. It is divorced from a traditional view where the dependability of S is rigidly tied to the fault-tolerance of S. The QoS feature depicts an ability of S to control its performance in response to an underlying infrastructure resource allocation or a change in the external environment conditions. The QoS-to-resource mapping relationship should be established in a quantitative manner under specific environment conditions, in order to meet the performance objectives in a predictable way. An example is the determination of content delivery latency over a distribution network set up on a geographically spread-out cloud of content storage nodes, in the presence of node failures. Virtualization, which allows realizing the distribution network as a core service from the cloud provider, does not by itself prevent fluctuations in the latency behavior (e.g., jitter) induced by node failures and outages. Here, a para-functional goal is to reduce the latency jitter by resorting to content caching techniques, thereby assuring a stable behavior of applications. The mapping between the output of S and platform resources should be known with reasonable accuracy: either as a closed-form model of S or through a series of incremental allocate-and-observe invocations on S [2].

The domain-specific core adaptation function in a cloud-based application system S is viewed as a control-theoretic feedback loop acting on a reference input P_ref: say, for cloud resource management (such a view is also advocated in [3]). Figure 1 concretizes this view. The controller C generates its actions I based on a computational model of S, denoted

1550-6525/13 $26.00 © 2013 IEEE


DOI 10.1109/DS-RT.2013.37
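The CDN example in the Introduction quotes a dependability of 0.7 "under simplifying assumptions" without listing them. The sketch below shows one set of assumptions that reproduces that figure; both are ours, not the paper's: late deliveries are taken as uniformly distributed over 5 to 15 sec, and the usefulness of a delivery is taken to fall linearly from 1 at 5 sec to 0 at 15 sec. The function names are hypothetical.

```python
def latency_utility(latency_s, target_s=5.0, worst_s=15.0):
    """Assumed utility: 1.0 at or below target_s, falling linearly to 0.0 at worst_s."""
    if latency_s <= target_s:
        return 1.0
    if latency_s >= worst_s:
        return 0.0
    return (worst_s - latency_s) / (worst_s - target_s)


def expected_dependability(p_on_time, target_s=5.0, worst_s=15.0, steps=10_000):
    """Expected utility when a request is on time with probability p_on_time,
    and otherwise has a latency assumed uniform on [target_s, worst_s]."""
    width = worst_s - target_s
    # Midpoint-rule average of the utility over the late-latency range.
    late_avg = sum(
        latency_utility(target_s + (i + 0.5) * width / steps, target_s, worst_s)
        for i in range(steps)
    ) / steps
    return p_on_time * 1.0 + (1.0 - p_on_time) * late_avg


# 0.4 * 1.0 + 0.6 * 0.5 = 0.7 under the two assumptions above.
print(round(expected_dependability(0.4), 3))
```

A different utility curve or latency distribution would give a different value, which is why the mapping from latency behavior to a dependability score has to be stated explicitly.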
[Fig. 1. Structure of adaptation processes in a network application system. The figure shows an external management entity H that reasons about system dependability by observing the system input (supplied as P_ref) and the system output (observed as P′) of the adaptive application system S ≡ A′_p ⊗ g*(..). The core g*(I, O*, s*, E*) is modeled as a black-box computational function over the service-oriented algorithms and the cloud infrastructure, where I is the input, O* the output, s* the internal state, and E* the external environment conditions (e.g., errors, threats, outages).]

as: g(I, O, s, E), where O is the plant output in response to the trigger I, s is the plant state prior to the incidence of I, and E depicts the environment condition. Since the true plant model g*(I, O*, s*, E*) is not completely known, C refines its input action in a next iteration based on the deviation in observed output O* from the expected output O when action I occurs. Upon S reaching a steady-state (over multiple control iterations) with output P′, the output tracking error |P_ref − P′| is analyzed to reason about the system dependability. Our approach is guided by the concepts and taxonomy of dependable computing presented in [4].

We treat the dependability as a management attribute of cloud-based adaptive systems. An external management entity H views the system S as made up of adaptation processes A′_p wrapped around a core system g*(···), i.e., S ≡ A′_p ⊗ g*(···), where '⊗' denotes the composition in an object-oriented software view. A′_p is embodied in a distributed agent-based software module that forms the building-block to structure S ([5] provides an architecture for distributed realization of the adaptation logic of A′_p). S interacts with its (hidden) external environment through the core elements g*(···): e.g., responding to client queries on a web server, and delivering content over a network transport connection. Here, the meta-level signal flows between A′_p and g*(···) are visible to H.

The layered software structure of S intrinsic to cloud-based systems: viz., the infrastructure, service-oriented algorithms, and adaptive application, stacked in that hierarchy and separated across well-defined interfaces¹, lends itself well for the dependability analysis by H. Here, the service attribute exported by the computational algorithm to its applications rests upon the quality of lower-level service received from the underlying infrastructure (e.g., resource allocation, failure detection). These core layers together constitute g*(···), with A′_p housed in the application layer for behavior monitoring & control. The latter involves exercising the underlying algorithm layer (and in turn, the infrastructure layer) via signaling points defined in the service interfaces. The dependability of S may be quantified, with suitable metrics (for certification and control purposes), by analyzing the external state-machine level signal flows by H. Basically, H reasons about the output tracking error |P_ref − P′| under various environment conditions, and maps it onto a measure of the dependability of S. Say, H is a cloud management station, employing a dashboard-based supervisory controller. The paper embarks on a case study of cloud-based content distribution networks, with a focus on system-level dependability.

¹ The system layers correspond to IaaS, PaaS, and SaaS, in the cloud computing arena [6].

The paper is organized as follows. Section II treats system dependability as compliance to the stated non-functional attributes (e.g., QoS). Section III describes a sample application: replicated data service, from a standpoint of the QoS compliance issues in cloud-based realizations. Section IV presents our model-based approach to assess the dependability of cloud-based systems, with a focus on the adaptive fault-tolerance of replicated data servers. Section V discusses the relationship to existing system management frameworks. Section VI concludes the paper.

II. DEPENDABILITY OF CLOUD-BASED SYSTEMS

Suppose G depicts the goal to be met by a cloud-based network system S. The goal G may include a prescription of one or more non-functional attributes associated with the QoS delivered by S to an application: such as resilience to external disturbances, stability against resource fluctuations, and responsiveness to user-triggered requirements. The dependability of S is prescribed in terms of such non-functional attributes, and is hence distinct from the correctness goal of S which is a functional attribute (yielding a YES/NO result). This section provides a non-functional characterization of system dependability. Here, the dependability of S is a measure of how well the adaptation functions programmed into S adjust to the changing external environment conditions in meeting an application-level goal G specified for S.

A. Dependability as application-level QoS

We treat the dependability of a cloud-based system S as a meta-attribute of application-level QoS achievable under the current operating conditions. An example is how stable the content delivery latency achieved on a distribution link is in the face of bursty demands and resource costs. Another example is how good a web service availability is in the presence of failures of one or more server replicas. Dependability assessment is anchored on three interwoven properties associated with the behavior of S:

• Measurability, wherein S can map its output/state variables onto high-level metrics over a wide range of input conditions (for analysis and reasoning);
• Predictability, wherein S can compute its expected output under various input and environment conditions (with reasonable accuracy);
• Adaptability, wherein S can adjust the utility extracted from its output as the environment conditions become harsher, thereby heavily dwindling its resources.

These properties depict that S is programmable, i.e., an external management entity H can exercise control over the behavior of S in a concrete manner. For instance, even if S is 100% fault-tolerant, an inability to reason about this property in various fault scenarios reduces the dependability of S from an application standpoint. In the earlier example of replicated web service, a 95%-availability depicts a verifiable assurance, over a suitable time-scale, that no more than 5% of the client queries incur a latency higher than a set limit, say, 12 sec, even in the presence of server crashes. Thus, a characterization of S in terms of behavioral properties improves the reasoning about how dependable S is.

The adaptation processes in system S embody two main functionalities: i) core components and resources set up over the cloud infrastructure that collectively export a service to the application, and ii) a controller C that exercises the core components/resources to meet the application needs P_ref. An application interacts with the system core through a service-oriented interface (made up of APIs), with a certain trust on the service delivery vis-a-vis its quality. A model of S captures the following elements (cf. Figure 1):
• Plant g*(I, O*, E*, s*) made up of computational algorithms that map the infrastructure resources and components onto a unified service delivered to applications;
• Controller C that decides on an input I to g*(.): say, resource allocation and/or component assignment in the infrastructure, based on the current service state s ≈ s* and environment condition e ∈ E*, to² make the observed output O reach close to P_ref in a steady-state;
• Sensor that maps the service state s onto the observed service output O to report back to C;
• Actuator to realize a resource/component allocation decision I as domain-specific actions on the infrastructure.

² Observed service interface state is mapped from service-internal state s*.

The service interface basically wraps around the core system g*(···). A function g(···) that approximates g*(···) and the errors in sensing/actuation processes are factored in a system model programmed into the controller C.

A dependability metric, denoted as D_S(e, G), is a measure of how well the system S meets its goal G under the environment condition e ∈ E*. Here, G is prescribed as a behavior of S that moves the physical plant to a desired state. Note that a certification authority quantifying D_S(e, G) may be external to the cloud provider and the application-level user. The external events E* incident on S can emanate from the application, service-layer algorithms, and cloud infrastructure. E* includes the errors in QoS specification (e.g., incomplete specs), algorithm formulation (e.g., unspecified events), and resources/components (e.g., a node crash) respectively.

We cast the notion of system dependability in a broader context, as advocated in [4], than the currently prevalent approaches that rigidly tie to the system-level mechanisms for recovery from component-level faults occurring in a system.

B. System dependability, trust, and fault-tolerance

[4] characterizes dependability as a form of trust bestowed on a system S by its users across the service interface to S, where a user may be a human entity or a physical world process or a computational sub-system that acts as the client of S. It depicts how trustable the service exported by S is, as perceived by those who depend on S (i.e., the users of S).

We project the output tracking error ε = |P_ref − P′| onto a dependability measure (note: P′ is the final converged O*). The following correspondences can be readily established:
• ε > ε_m depicts a service failure (i.e., S is of no value);
• δ < ε ≤ ε_m depicts a service degradation (but S may still be usable, albeit in a degraded form);
• s depicts the service-external state (s is a computational representation of s* in the model programmed into C).

The environment E* covers anything outside the control regime of S but which impacts the operations of S: e.g., the abrupt shut-down of a business unit in an enterprise system causing delays in the servicing of customer demands (or, even scaling down system operations).

Under the above framework, the conventional system-level faults (e.g., node crashes) can be subsumed as part of an uncontrolled external event space E*. QoS degradations can also be viewed as faults that constitute a part of E* (e.g., packet loss along a network path). How S reacts to the hostility of an external event e ∈ E* and shields the application from the effects of e allows us to delineate the hitherto qualitative notions of system robustness and resilience.

Robustness depicts the ability of S to present a service behavior to applications as if there are no failures or hostile events. Here, the application does not see the impact of hostile events, as captured by: P′ ≈ P_ref. This is enabled by strong service-layer guarantees provided by S using, say, resource reservations and/or redundant component deployments in the infrastructure, and the underlying algorithms to correctly manage these system-level mechanisms. The signaling interaction between the application and service layers occurs only when an application starts and when it terminates.

In contrast, resilience depicts the ability of S to reconfigure its internal functions and/or parameters, in a way to continue operations in a degraded mode. Here, the application does see the impact of hostile events, albeit indirectly, by scaling down its operations to a reduced level, which is captured as: 0 ≪ (P_ref − P′) ≤ ε_m. The service-layer is programmable, with S employing parameterizable system-level mechanisms that are invoked by the application at various time-scales during run-time as part of its reconfiguration strategies. Accordingly, a richer API is required to support the signaling

interaction needed between the application and service layers: namely, notifying the quality degradation |P_ref − P′| to the application and controlling the parametric and/or algorithm changes needed in the service-layer³.

We project the management-oriented view of adaptive applications through a prism of model-based system assessment, and characterize therein the notion of system dependability.

C. Tracking error based dependability specification

Consider an application system S ≡ A′_p ⊗ g*(I, s*, E*, O*) (cf. Figure 1). A controller C embodied in A′_p operates on the physical plant g*(···) to generate an output close to P_ref. E* depicts the external environment, modeled as a set of parameters, incident on g*(···) to disturb the output behavior of S (|E*| ≫ 1). C is programmed with a plant model g(I, s, E, O) that is an approximate representation of the true model g*(···), where E ⊂ E*. C uses the model g(···) to compute the plant input I. If P′ is the output sustainable by A′_p (i.e., stable converged value of O*) in response to a trigger input P_ref, then ε = |P_ref − P′| denotes the output tracking error of S under the current environment conditions E* incident on the plant g*(···): e.g., failures and outages⁴.

For adaptive QoS management systems, P′ depicts the stable QoS sustainable by S when attempting to achieve a desired QoS P_ref. In an example of content delivery over a geographically spread-out distribution network, S attempts to keep the content latency L′ less than a prescribed limit L, as determined by various factors: such as the capacity of content proxy nodes, the bandwidth available on the path leading to clients, and the algorithm employed to move content from a node to its downstream clients. The output behavior of S may be associated with non-functional attributes, such as boundedness and jitter. The above attributes capture the ability of S to provide a sustainable output behavior in the presence of changes in user-requirements and/or fluctuations in external environment incident on the physical plant, and even under modifications to the plant itself (e.g., changes in proxy node placement in a CDN). Figure 2 empirically shows how the tracking error of S varies relative to the hostility of an environment condition e ∈ E*. The system S is deemed too risky to be trusted (i.e., dependability ≈ 0) when |P_ref − P′| > ε_m. The convex behavior depicts the increased additional effort (or, cost) incurred by A′_p to counter the disruption caused by an environment change from e to e + δe: say, extra resource allocation in the infrastructure, deployment of robust infrastructure components, and/or service-layer algorithm reconfigurations.

[Fig. 2. Output tracking error relative to external environment conditions. Vertical axis: output tracking error (TE) of A′_p during control of g*(..), ranging from 0 to P_ref, with ε_m bounding the useful range of system behavior. Horizontal axis: hostility of external environment condition e ∈ E* (higher e means more hostile condition). Curves are shown for three operating points op1, op2, op3 of the plant, where op3 is more hostile than op2 and op2 is more hostile than op1, so that op3 sustains a lower e than op2, and op2 a lower e than op1: e(3m) < e(2m) < e(1m).]

The goal G to be met by S is specified in a declarative form based on the TE observed across one or more input-output variables P_ref and P′, denoted as G ≡ Φ({|P_ref − P′|}). A sample spec of G may take an arbitrary form, such as:

G ≡ {|P_ref(1) − P′(1)| ≤ ε′(1) ∧ |P_ref(2) − P′(2)| ≤ ε′(2)} ∨ {|P_ref(3) − P′(3)| ≤ ε′(3)}

in a case of three outputs generated by S. The spec Φ(···) is wired into the control logic executed by A′_p, with the signal flows between A′_p and g*(···) validated against Φ(···).

When the output of S deviates from compliance to G, a tolerance to this deviation is based on the perceived usefulness of S to the application in those cases. With this idea, we derive a generalized measure of system dependability⁵.

III. CASE STUDY: ADAPTIVE FAULT-TOLERANCE OF WEB SERVICE

The web service is replicated, with the servlet replicas capable of independently processing the queries and computing the results returned to the client. The replication purports to improve the web service availability and performance experienced by clients, in the presence of server failures and workload fluctuations. The system employs voting among server replicas that handle a client query, in order to generate a timely and accurate response [8]. See Figure 3. Our study focuses on MBE elements in an autonomic management of the server replication process.

A. System-level parameters for server replication

The parameter f_m, which depicts the maximum number of replicas that can be assumed to fail, impacts the choice of N: the total number of replicas. Though the system designer does not have exact knowledge of the actual number of failed

³ Mapping the domain-specific signaling onto a generic cloud API (such as OCCI) requires a study of many use-cases and requirements [7].
⁴ The tracking error ε depicts the ability of system S to generate a desired output P_ref. Whereas, the parameter: γ = |O − O*| depicts the modeling error, i.e., how accurate the computational model g(I, O, s, E), embodied in the controller C of S, is in predicting the actual behavior generated by the system processes: g(I, O*, s*, E*).
⁵ Dependability of S depicts a continuous behavioral property, characterizing various levels of robustness of S against hostile conditions (instead of a binary thing [4]). Fault-tolerance is subsumed in this notion of robustness.

[Fig. 3. Cloud-based replicated data repository service. The figure shows client applications issuing server task requests (response time; fault-tolerance) across the service interface to the data service, which is the provider of the value-added service and manages server redundancy, security, and so on. A server replication control algorithm, parameterized by f_m, instantiates servers on N of the K VMs leased from the cloud infrastructure, which is the provider of the raw service (K > N ≥ 2f_m + 1, assuming no byzantine failures; N = 3 and K = 6 in the figure, with f_a = 2 VMs suffering failure). A trusted, neutral certifier H logs meta-data for QoS analysis and reasons about how well the QoS obligations are met.]

replicas f_a, it is reasonable to assume that a probability distribution h_ATT(f_a) is known. This statistical information can guide a more accurate choice of f_m such that f_a ≤ f_m with a high probability, and hence of N. The QoS is prescribed in terms of non-functional attributes of the replicated web system: availability, integrity, performance, and resilience.

Consider the return of a query result by the voting module. Here, an integrity violation may possibly occur if more than f_m server modules are attacked, i.e., f_a > f_m. This is because the result proposed by a faulty servlet may have gathered enough consents for the voting module to declare the faulty proposal as a correct one. In another scenario, it is possible that none of the first (2f_m + 1) iterations produce a result enjoying f_m + 1 consents. Here, no result will be delivered to the client, which reduces the service availability.

The case of a hostile event e ≡ [f_a > f_m] when the voting algorithm is itself vulnerable to failure is our interest. It may manifest as, say, a non-completion of client queries and a return of incorrect results, which lower the availability and integrity respectively of the 'data service' provided by S.

B. Analytical modeling of server replication

The algorithm designer for the voting-based replicated processing method may conservatively set f_m = ⌈N/2⌉ − 1 for a given N, even though the probability distribution h_ATT(f_a) may instill confidence in the designer to set f_m optimistically to a lower value. Though the probability of a correct decision is increased in the conservative case (by virtue of the increased margin of consent voting), the voting algorithm may still fail to produce a correct result in a case when f_a ≥ ⌈N/2⌉. Thus, the query failure probability P_qf may be estimated in terms of the probability that at least f_m + 1 of the servlets get attacked and exhibit faulty behavior (P_qf = 0 when f_a ≤ f_m). Depicting this condition, we have:

$$P_{qf} = \sum_{f_a = f_m + 1}^{K} r^{f_a} \times \left[ h_{ATT}(f_a) \times \frac{\binom{K - f_m - 1}{f_a - f_m - 1}}{\binom{K}{f_a}} \right]. \qquad (1)$$

The web service integrity is then: [1 − P_qf]. Consider the earlier example of N = 5, K = 8, f_m = 2, r = 0.2, and uniform distribution of h_ATT(f_a) over the interval [1, 8]. The estimated query failure probability is: P_qf = 223.3 × 10⁻⁵. By increasing the degree of replication, say as N = 7, P_qf = 36 × 10⁻⁵.

Equation (1) is based on a stochastic analysis of the voting system in terms of its internal state variables. Under certain assumptions about the statistical independence of failures, the results allow a quick estimate of the extent of faults occurring in the system and the impact on client-perceived QoS: namely, difficult-to-measure parameters: query failure probability.

IV. EVALUATING ADAPTATION IN CLOUD-BASED SYSTEMS

We employ model-based analysis of the control-theoretic adaptation loops in a complex system S [9] to reason about system dependability in a cloud setting. Recall the system compositional view: S ≡ A′_p ⊗ g*(I, O*, s*, E*), where A′_p and g*(I, O*, s*, E*) depict the adaptation and core computational processes respectively. The reasoning functionality is embodied in the management module H (cf. Figure 1).

A. Model-based reasoning about system behavior

H embodies the functionality to assess the effectiveness of A′_p by monitoring the tracking error ε = |P_ref − P′|. It is possible that A′_p embodies autonomic elements to morph the controller algorithms in C and/or the plant parameters, if needed, to reduce ε (and hence increase dependability). This involves predicting the future plant conditions and environment, and determining the anticipated control actions therein: such as dynamic switching of the push/pull algorithm and/or its parameters in a CDN to meet the changes in client demands [11]. The control-theoretic elements needed for a self-managing behavior of S [12], if any, and the computational intelligence aspects of A′_p therein [13], are⁶ captured in an assessment of ε vis-a-vis e (say, de) (cf. Figure 2).

⁶ We treat failures as causing system-level output errors that are indistinguishable from the errors arising due to system modeling inaccuracies. This unifying view allows S to employ feedback-based control techniques to deal with the failure-induced issues as well.

In the absence of exact knowledge about E* in the controller C, we capture the variation of tracking error TE with respect to the unknown inputs of the plant (E* − E) by treating TE as a random variable with probability distribution that is derived from the native distributions of (E* − E). This is combined with an application-specific characterization of how TE = ε captures the usefulness of S in various operating conditions.
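The idea in Section IV-A of treating the tracking error TE as a random variable, and weighting it by an application-specific usefulness, can be sketched as a Monte Carlo estimate. Everything concrete below is an assumption made for illustration: the linear usefulness curve, the uniform distribution of the unmodeled disturbance, the toy mapping from disturbance to tracking error, and all function names.

```python
import random


def usefulness(eps, eps_m):
    """Assumed usefulness: 1.0 at zero tracking error, falling linearly to 0.0 at eps_m."""
    return max(0.0, 1.0 - eps / eps_m)


def estimate_dependability(sample_unmodeled, tracking_error, eps_m, n=20_000, seed=1):
    """Average usefulness of S over samples of the unmodeled environment inputs (E* - E)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    total = 0.0
    for _ in range(n):
        e = sample_unmodeled(rng)                   # draw one unmodeled disturbance
        total += usefulness(tracking_error(e), eps_m)
    return total / n


# Toy setup: disturbance uniform on [0, 1], tracking error equal to twice the disturbance.
estimate = estimate_dependability(
    sample_unmodeled=lambda rng: rng.uniform(0.0, 1.0),
    tracking_error=lambda e: 2.0 * e,
    eps_m=1.0,
)
print(estimate)  # close to the analytical value 0.25 for this toy setup
```

In practice the sampler would draw from the native distributions of the unmodeled inputs, and the tracking-error function would come from measurements or a simulation of the plant rather than a closed-form toy mapping.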

We surmount the issue of combinatorial complexity faced in analyzing a networked system S by resorting to a model-based engineering (MBE) approach that combines a tractable performance analysis with a control-theoretic feedback process, in order to converge to a (sub-)optimal resource allocation in the underlying infrastructure ([14] describes the role of feedback loops in self-adaptive systems). The MBE approach takes into account the cross-layer dependencies by featuring the functional elements of control loops residing in various layers with a holistic system-level model of S [15].

[Fig. 4. Functional elements for model-based system analysis & control. The figure shows a model-based controller C that takes the QoS specs γ and computes a resource allocation based on the system model G(Λ, γ, s, E); the actual networked system (true model G*(Λ*, γ*, s*, E*)), driven by the actual traffic Λ* and the hostile environment E*, which yields the performance γ′ (subject to observation error); and a model-based system description, fed by a Poisson-model workload generator with flow specs Λ = {λi, λj, λk, ...}, which yields the model-estimated performance γ″. E is the parameter space assumed for E*. The tracking error is ε = [γ − γ′]; the modeling error ψ is the difference between γ′ and γ″.]

B. Control-theoretic feedback process

The determination of optimal resource allocation is a step-by-step process, with each step analyzing an incremental allocation in terms of performance indices (such as service latency and overhead) using a computational model of S, and using the analysis to refine the resource allocation in a next step for improved performance. The goal is to finally make the actual system output γ′ as close as possible to the QoS-specs γ. The computational model of target system employed as a building-block should be tractable (in a mathematical sense): say, closed-form mapping of the resource allocation parameters
to achievable performance indices γ″ (the latter is the model-estimated output of system). A closed-form modeling of S may be feasible: say, from a consideration of Poisson-based workload traffic: say, the client requests on web service. With a closed-form as elementary block, the properties of Markovian processes and traffic aggregation can be exploited to yield Queuing Analysis based formulas for performance indices. The idea is based on treating an arbitrary workload traffic process incident on S as a piece-wise linear superposition of Poissonian traffic processes, with model-based corrections exercised on the resource allocations in various control-theoretic steps. Here, a model-based correction involves observing the actual performance accrued vis-a-vis the Poisson-estimated performance and using this observed error in refining the resource allocation decision in a subsequent control step. See Figure 4. The controller C implements a Partially-observable Markov Decision Process (PO-MDP) [10] to determine a resource allocation based on the observed modeling error (ME): ψ = [γ′ − γ″]. The resource allocation decision is exercised on the actual system S, with the resulting (steady-state) system output as γ′. The system output tracking error

determined by the controller parameters⁷.

C. Cost considerations for application services

In a cloud-based setting, a distributed application involves the use of infrastructure resources, viz., compute nodes and node interconnects, and the running of a domain-specific distributed algorithm software on them. This incurs costs that arise from two distinct aspects:
1) Leasing of K nodes and multiple node-interconnects from the infrastructure provider, out of which N nodes and their interconnects provide the service needed by applications, where 2 ≤ N < K;
2) Software development effort in a service-layer algorithm running on the N-out-of-K nodes and their interconnects, with the remaining (K − N) nodes and interconnects functioning as redundant units.

The component redundancy managed in algorithm layer purports to minimize the occurrence of service degradations and
(TE) is given as: ² = [γ − γ 0 ]. The ME ψ is combined with failures: say, by tolerating up to fm node crashes — where
the TE ² by the controller C in its decision-making. 1 ≤ fm < N . Referring to the example of data server
replication in Figure 3, the internal parameters of replication
The inter-leaving of a closed-form analysis with the piece-
algorithm are: K = 6 and N = 3 as depicted in the cloud
wise linearity of Poissonian-type of traffic and a control-
infrastructure, with fm = 1 determined therefrom.
theoretic error correction with system observation-based feed-
The total leasing cost may increase either linearly or
back over multiple steps allows us to solve the otherwise-
monotonically concave with K — which means a constant
intractable problem of system analysis for arbitrary traffic
or a decreasing per-node cost with an increase in K, as
patterns (i.e., workload) incident on S. The above interleaving
negotiated with the infrastructure provider (e.g., pay-by-use
of control and modeling steps aids the traversal towards a final
solution. This approach yields a reasonable convergence to 7 The cumulative model-based correction mechanism in resource allocation
a (sub-)optimal resource allocation, with the system stability decisions is an instance of machine-learning embodied in the controller.

92
pricing) [6]. The service layer cost is tied to the development and
running of distributed algorithm software, and hence is
domain-specific. In a push/pull algorithm for CDN, N proxy nodes in a
distribution tree require (N − 1) content forwarding interconnects and
N′ content storages (where 1 ≤ N′ ≤ N), besides the base software
itself to infuse the proxy capability in the N nodes. In a replicated
web service layer, the needed software diversity among N servlet
replicas (to deter failures) incurs O(N²) cost, besides the base
software itself running on the (N + 1) nodes and N interconnects. The
service layer cost often exhibits a monotonic convex behavior with
respect to N, and it dominates over the infrastructure costs because of
the specialized software development efforts involved.

The combined cost incurred at the infrastructure and service layers
gets assigned to the applications in one form or the other. The
incurrence of per-node costs may compel a service provider to limit the
number of nodes N participating in an algorithm execution, thereby
lowering the attainable QoS. The resulting loss of some service value
to the application may itself manifest as a cost for the service
provider that possibly outweighs the cost savings arising from a lower
N and/or K.
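The trade-off described above can be sketched numerically. The
following is a minimal illustration, assuming a concave leasing cost in
K, a convex O(N²) service-layer cost, and a lost-service-value term
that shrinks as N grows; all function names, coefficients, and
functional forms are hypothetical and chosen only to show the shape of
the trade-off.

```python
def leasing_cost(k, unit_price=10.0, exponent=0.8):
    """Leasing cost of K nodes: linear if exponent == 1, concave if exponent < 1."""
    return unit_price * k ** exponent


def service_cost(n, base=50.0, diversity=2.0):
    """Service-layer cost: base software plus an O(N^2) diversity term (convex in N)."""
    return base + diversity * n ** 2


def qos_loss(n, value=400.0):
    """Assumed cost of lost service value, shrinking as more nodes participate."""
    return value / n


def total_cost(n, k):
    """Combined infrastructure, service-layer, and lost-value cost."""
    return leasing_cost(k) + service_cost(n) + qos_loss(n)


def best_n(k):
    """Search N over the admissible range 2 <= N < K for the lowest total cost."""
    return min(range(2, k), key=lambda n: total_cost(n, k))


k = 10
n = best_n(k)
print(f"K={k}: best N={n}, total cost={total_cost(n, k):.1f}")
```

With these assumed coefficients, a very small N is penalized by the
lost service value and a large N by the convex software cost, so the
minimum falls at an intermediate N.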
D. Revisit: Autonomic control of web server replication

The formulation (1) is programmed into the controller C to exercise a
model-based control of the web server system. Since the environment
parameter f_a is not known, C may resort to state-based online
parameter identification to infer f_a. The service-external state
observed is the number of voting steps, L, needed to deliver a query
result to the client. Here, a voting step involves the proposed
delivery of a query result by a servlet, and the collection of votes
from the other servlets for this result. If Q_K is the probability that
it takes K steps to decide on delivering a query result to the client,
then the mean number of voting steps is:

$$\bar{K} = \sum_{K=1}^{2 f_a + 1} Q_K \times K. \qquad (2)$$

A higher f_a leads to an increase in K, with a worst-case failure
scenario incurring K = (2 f_a + 1) voting steps when all the f_a faulty
servlets behave erroneously. Under certain assumptions about the
failure behavior, the observed K is mapped onto an inference about f_a.
Thereupon, P_qf is estimated to enable online control of the web server
system. C then sets the control parameters N and f_m, as exercised on
the replica infrastructure and voting algorithm layers.

We employ PO-MDP to deal with the uncertainties in a replicated web
service system caused by server failures. Equations (1) and (2) are
used as black-box relations in the PO-MDP tools when considering
various failure probability distributions: h_ATT(f_a). The analysis
allows changing the number of servers in the replica pool (i.e.,
increasing or decreasing N) dynamically. As part of changes in the
service infrastructure, the voting algorithm may also increase or
decrease f_m, as determined by the observed fault severity parameters
[f_a″, r″].

Both the replication control parameters [N, f_m] and the infrastructure
network bandwidth B strongly impact the query latency performance, and
hence the service availability. Figure 5 shows the results from our
experimental study on a UNIX-based replicated data service
implementation. For a given N and B, the query latency exhibits a
convex increase with respect to f_m. On the other hand, an increase in
B causes a less-than-linear decrease in the latency. These results,
which agree with the analytical model (2), corroborate our empirical
observation on how a system output error ε varies with respect to the
hostility of environment condition e (cf. Figure 2).

Fig. 5. Impact of [N, f_m] on replication QoS: Experimental study
(Plot: time-to-complete query, TTC in msec, versus the number of
servlets N participating in web replication, for f_m = 2 and f_m = 3 at
B = 300 kbps, against a 500 msec threshold beyond which a delayed query
result is less useful and reduces service availability; prototype
results on a UNIX-based LAN. Legend: f_a is the actual number of
servlet failures; f_m is the maximum number of servlet failures assumed
by the voting protocol; r is the degree of faulty behavior of a failed
servlet, 0 < r ≤ 1; K is the total servlet pool size, K ≥ N.)

Given the partial observability of the fault-severity parameters
[f_a, r], an optimal choice of [N, f_m] under cost constraints B
involves the use of heuristics to search the solution space, such as
greedy and genetic search methods. For model-based service assessment
by H, a specific choice of greedy versus genetic algorithms is however
less important, though the performance results shed insight into these
choices.

V. RELATED WORKS ON ADAPTIVE CLOUD SYSTEMS

Today's distributed systems embody both adaptation behaviors and
functional behaviors. The former deals with adjusting the system
operations according to the environment conditions (e.g., increasing
the number of server replicas to improve web service performance),
whereas the latter deals with requirements such as network robustness
and security. In this light, we broadly categorize the existing works
as dealing with:
• Systems engineering for the control-theoretic aspects of adaptation
  (such as stability and convergence) [2], [16];
• Software engineering for the verification of application
  requirements (including para-functional ones) [17], [18].
In contrast, our work focuses on model-based assessment of the
dependability of cloud-based systems (our earlier work is on an
assessment of Internet-type network systems [19]).

[20] provides a QoS-aware network service composition and adaptation to
infuse a core self-management intelligence for autonomic operations
across heterogeneous networks. However, cloud components are under
third-party control with less-than-100% trust, which poses challenges
in guaranteeing system-level reliability and robustness.

Chisel [21] provides an adaptation framework by separating the core
system functionality from those dealing with non-functional behaviors.
This allows a seamless and transparent infusion of new behaviors at
run-time to deal with unanticipated external events e ∈ E*, and a
removal of those behaviors when the conditions causing e cease. The
dynamic frameworks discussed therein are applicable to cloud-based
systems, given the uncontrolled nature of third-party administered
cloud resources. Our emphasis however is on the decision-theoretic
mechanisms to control the behavior of cloud-based systems.

[22] advocates dependability differentiation to identify distinct
classes of cloud-based applications (with appropriate scheduling of the
underlying cloud resources). Here, system assessment is feasible with
our probabilistic treatment of a control error vis-a-vis its impact on
system dependability [23].

VI. CONCLUSIONS

We treat system dependability as application-level QoS at a meta-level
(regardless of the system complexity). Dependability assessment of a
cloud-based system involves three aspects: measurability,
predictability, and adaptability of system behavior, which are enabled
by the programmability of cloud-based infrastructures and services.
Guided by the concepts provided in [4], the paper studied model-based
engineering techniques to quantify system dependability, and certify
cloud-based systems therein. System-level fault-tolerance mechanisms,
hitherto treated separately, are subsumed in the broader notion of
system dependability.

Given a system S realized on a cloud-based infrastructure, the paper
employed model-based engineering techniques to analyze the output
behavior of S relative to what is expected of S under uncontrolled
environment conditions incident on S (such as attacks and failures in
the infrastructure). A management entity H external to S incorporates
the observation logic to reason about the capability of S by
quantification of control errors and probabilistic QoS assessment. We
also undertook a case study of replicated data servers to assess the
quality of fault-tolerance. The focus of the study was on the
assessment of how good the adaptation processes are against the
imperfect knowledge about system operating conditions, such as server
failures and available network bandwidth.

An advantage of our management structure for the assessment of
cloud-based systems is the reduction in development cost of distributed
control software for system management via model reuse and
service-level programming.

REFERENCES

[1] Panel Discussion (K. R. Joshi, G. Bunker, F. Jahanian, A. V.
    Moorsel, J. Weinman). Dependability in the Cloud: Challenges and
    Opportunities. In IEEE/IFIP Intl. Conf. on Dependable Systems and
    Networks, June 2009.
[2] B. Li, K. Nahrstedt. A Control-based Middleware Framework for
    Quality of Service Adaptations. In IEEE Journal on Selected Areas
    in Communications, vol.17, no.9, Sept. 1999.
[3] C. C. Lamb, P. A. Jamkhedkar, G. L. Heileman, and C. T. Abdallah.
    Managed Control of Composite Cloud Systems. In proc. IEEE Intl.
    Symp. on Service-oriented System Engineering, Irvine (CA), Dec.
    2011.
[4] A. Avizienis, J. C. Laprie, B. Randell, C. Landwehr. Basic Concepts
    and Taxonomy of Dependable and Secure Computing. In IEEE
    Transactions on Dependable and Secure Computing, 1(1), pp.11-33,
    Jan. 2004.
[5] P. G. Bridges, M. Hiltunen, and R. D. Schlichting. Cholla: A
    Framework for Composing and Coordinating Adaptations in Networked
    Systems. In IEEE Transactions on Computers, 58(11), pp.1456-1469,
    Nov. 2009.
[6] Sun Microsystems. Introduction to Cloud Computing architecture. In
    White Paper, 1st ed., June 2009.
[7] T. Metsch. Open Cloud Computing Interface - Use Cases and
    Requirements for a Cloud API. Open Grid Forum, Sept. 2009.
[8] K. Ravindran, K. A. Kwiat, A. Sabbir, and B. Cao. Replica Voting: a
    Distributed Middleware Service for Real-time Dependable Systems. In
    proc. IEEE/ACM COMSWARE'06, New Delhi (India), January 2006.
[9] M. Cordier, P. Dague, M. Dumas, F. Levy, A. Montmain, M.
    Staroswiecki, and L. Trave-Massuyes. A comparative analysis of AI
    and control theory approaches to model-based diagnosis. In proc.
    14th European Conference on Artificial Intelligence, pp.136-140,
    2000.
[10] F. S. Hillier and G. J. Lieberman. "Non-linear Programming" and
    "Metaheuristics". Chap. 12, 13, Introduction to Operations
    Research, McGraw-Hill publ. (8th ed.), pp.547-616, 2005.
[11] K. Ravindran. Dynamic Protocol-level Adaptations for Performance
    and Availability of Distributed Network Services. In Modeling
    Autonomic Communication Environments, Multicon Lecture Notes, Oct.
    2007.
[12] Y. Diao, J. L. Hellerstein, G. Kaiser, S. Parekh, and D. Phung.
    Self-Managing Systems: A Control Theory Foundation. In IBM Research
    Report, RC23374 (W0410-080), Oct. 2004.
[13] R. C. Eberhart and Y. Shi. Computational Intelligence. In chap. 2,
    Computational Intelligence: Concepts to Implementations, Morgan
    Kaufmann Publ., 2007.
[14] Y. Brun, G. D. M. Serugendo, C. Gacek, H. Giese, H. Kienle, M.
    Litoiu, H. Muller, M. Pezze, and M. Shaw. Engineering Self-Adaptive
    Systems Through Feedback Loops. Book Chapter, Self-Adaptive
    Systems, LNCS 5525, Springer-Verlag, pp.48-70, 2009.
[15] C. G. Cassandras and S. Lafortune. Introduction to Discrete Event
    Systems, Springer-Verlag publ., 2007.
[16] C. Lu, Y. Lu, T. F. Abdelzaher, J. A. Stankovic, S. H. Son.
    Feedback Control Architecture and Design Methodology for Service
    Delay Guarantees in Web Servers. In IEEE Trans. on Parallel and
    Distributed Systems, 17(7), Sept. 2006.
[17] I. Schaefer and A. P. Heffter. Slicing for Model Reduction in
    Adaptive Embedded Systems Development. In Workshop on Software
    Engineering for Adaptive and Self-managing Systems (SEAMS 2008),
    May 2008.
[18] J. Yi, H. Woo, J. C. Browne, A. K. Mok, F. Xie, E. Atkins, and C.
    G. Lee. Incorporating Resource Safety Verification to Executable
    Model-based Development for Embedded Systems. In IEEE Real-time and
    Embedded Technology and Applications Symp., pp.137-146, 2008.
[19] K. Ravindran. Model-based Engineering for Certification of Complex
    Adaptive Network Systems. In proc. IEEE Workshop on Cyber-Physical
    Networking Systems, ICDCS'12, Macau (China), June 2012.
[20] J. Xiao and R. Boutaba. QoS-Aware Service Composition and
    Adaptation in Autonomic Communication. In IEEE Journal on Selected
    Areas in Communications, 23(12), Dec. 2005.
[21] J. Keeney and V. Cahill. Chisel: A Policy-Driven, Context-Aware,
    Dynamic Adaptation Framework. In proc. IEEE Intl. Workshop on
    Policies for Distributed Systems and Networks (POLICY'03), June
    2003.
[22] Ameen Chilwan. Dependability Differentiation in Cloud Services.
    Master's Thesis, Dept. of Telematics, Norwegian University of
    Science and Technology, Advisor: P. E. Heegaard, July 2011.
[23] K. Ravindran. Managing Robustness of Distributed Applications
    Under Uncertainties: An Information Assurance Perspective. In proc.
    Cyber Security and Information Intelligence Research Workshop
    (CSIIRW), ACM-ICPS, Oak Ridge National Laboratory (TN), April 2010.
