Project assignment for the Advanced Topics of Machine Learning PhD course; instructor: Prof. Alessandro Lazaric

Reinforcement Learning in Autonomic Computing


Davide Basilio Bartolini, PhD student
Politecnico di Milano, Dipartimento di Elettronica e Informazione
bartolini@elet.polimi.it

Abstract
Autonomic computing (AC) was born a decade ago as a response to the increasing complexity of computing systems at
the level of IT infrastructures, proposing to make such infrastructures able to autonomously manage (part of) their complexity,
thus easing the increasingly time– and expertise–demanding job of human developers and administrators. Since its birth, AC has
developed into a lively multidisciplinary research field, leveraging theories and techniques from different disciplines, such as
computer science, control theory, and artificial intelligence (AI). One of the branches of AI harnessed in AC is reinforcement
learning (RL), which tackles the problem of an agent learning through trial and error interaction with a dynamic environment;
RL is being employed in AC to automatically learn policies that control the runtime behavior of a computing system.
This survey, after presenting relevant concepts from the fields of both autonomic computing and reinforcement learning,
reviews relevant works in the literature that employ RL techniques to obtain self–management of (parts of) a computing system,
which is the ultimate goal of autonomic computing.

I. INTRODUCTION AND BACKGROUND


Prior to reviewing works coupling autonomic computing and reinforcement learning, it is convenient to report some basics
about the two topics, to define concepts that will be used in this survey. For this reason, this Section provides an overview of
the concepts at the base of AC and RL.

A. Autonomic Computing
Autonomic computing was born in 2001 with a manifesto by IBM researchers [3] , envisioning an IT industry where computing
systems management should be automatically handled with minimal human intervention. The idea was borrowed from the
autonomic nervous system in biological life, which autonomously maintains homeostasis within the organism with the central
nervous system being unaware of its workings. This idea has been developed at a theoretical level, leading to a well–defined
vision and to a characterization of the desired properties for an autonomic computing system [5,11] . The research field of
autonomic computing is still young and the realization of its vision has yet to be achieved, as many research challenges are
still unsolved [6]. In particular, the process of applying the ideas at the base of AC to real systems, moving beyond pure
theoretical investigation, has only recently begun to be tackled (e.g., the Metronome framework [12] applies concepts from AC
to performance management at the operating system level).
Even though autonomic computing can be applied at very different levels, a common feature of any such infrastructure is
the presence of a feedback control loop exploiting online information about the system and its environment to adapt towards
specified goals. As shown in Figure 1, there are different formalisms to describe such a control loop.

Figure 1. Three different representations of the control loop at the base of an autonomic computing system: (a) Self–adaptation loop [11]; (b) Monitor–Analyze–Plan–Execute with shared Knowledge base (MAPE-K) loop [5]; (c) Observe–Decide–Act (ODA) loop [12]

A first version of the autonomic control scheme is named Self-adaptation control loop [11] and is represented in Figure 1(a).
This representation emphasizes the separation between the detection and decision phases. The detection process is in charge
of analyzing the data coming from the sensors and detecting when something should be changed in order to restore the system
from an unwanted state into its desired working conditions. The decision process is in charge of determining what should be
changed, i.e., picking the right action to be performed. A second version of the autonomic control loop is called MAPE-K [5]
and is represented in Figure 1(b). When an autonomic element is described by means of the MAPE-K representation, the
component which implements the control loop is referred to as the autonomic manager, which interacts with the managed
element by gathering data through sensors and acting through actuators (or effectors). This control scheme emphasizes the fact
that shared knowledge about the system and its environment must be maintained in order to successfully execute the autonomic
control scheme. A third version of the autonomic control loop is named ODA loop [12] and is represented in Figure 1(c). This
representation is more general than the MAPE-K and Self-adaptation schemes and, thanks to this generality, it summarizes
the essence of the autonomic control loop. The steps of the ODA loop are: observation of the internal and environmental status;
decision of what action (or whether any action at all) is to be taken, based on the observations; and action, i.e., perturbation of
the internal or external status in order to move it towards a better condition for the system.
For the purpose of this survey, the most important phase in the autonomic control loop is indeed the one where decisions
are made (i.e., the decision stage, referring to the self-adaptation or to the ODA representations, or the analyzing and planning
phases, referring to the MAPE-K formalism). In fact, reinforcement learning provides the possibility of learning a decision–
making mechanism through an online trial and error interaction with the system; provided that information about the current
status, the desired goals, and knobs acting on system parameters is supplied by other autonomic components, the decision
phase is where RL techniques can be employed to serve AC.
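To make the role of the decision phase concrete, the following minimal sketch (written for this survey; the sensor, actuator, and decision interfaces are hypothetical, not taken from any cited framework) shows an ODA–style loop in which the decision stage is a pluggable component that could be backed by an RL–learned policy.

```python
import time

class ODAController:
    """Minimal Observe-Decide-Act loop; the decision stage is pluggable,
    so it can be backed by an RL-learned policy."""

    def __init__(self, sensors, actuators, decide):
        self.sensors = sensors      # dict: sensor name -> callable returning an observation
        self.actuators = actuators  # dict: action name -> callable perturbing the system
        self.decide = decide        # policy: observation dict -> action name (or None)

    def step(self):
        # Observe: gather the internal and environmental status.
        observation = {name: read() for name, read in self.sensors.items()}
        # Decide: pick an action (or no action at all) based on the observation.
        action = self.decide(observation)
        # Act: perturb the system status through the corresponding actuator.
        if action is not None:
            self.actuators[action]()
        return observation, action

    def run(self, period_s=1.0, steps=10):
        for _ in range(steps):
            self.step()
            time.sleep(period_s)
```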

B. Reinforcement Learning
Reinforcement learning is a branch of AI comprising theory and techniques for learning optimal policies in sequential
decision–making situations. Differently from other learning approaches (e.g., supervised learning), RL is based on the
assumption of a stochastic environment without the possibility of knowing examples of the best actions for specific situations.
In RL, an agent builds a policy for solving a certain problem through a trial and error process, receiving feedback from the
environment in the form of a reward associated with each tried action [9]. In more detail, the trial and error approach is taken
in active RL, whereas in passive RL the agent just observes the evolution of the world, trying to learn the utilities of being
in various states. This survey focuses on active RL, since it is a more powerful model, allowing explicit exploration of the
state space, and it is supported by the structure of the autonomic control loop (e.g., ODA – see Section I-A). Moreover, it is
possible to distinguish between RL algorithms that perform a search in the space of all the possible behaviors (e.g., genetic
programming), and algorithms leveraging statistical techniques to estimate the utility of states and actions [4] . Most of the works
connecting autonomic computing and reinforcement learning make use of algorithms in the second of these two classes.
In the standard model for RL, an agent is represented as an entity connected to its environment through perception and
action, as represented in Figure 2 (borrowed from Russell and Norvig [9]).

Figure 2. Generic representation of an agent [9]

This model is immediately recognizable as very similar to those given for the control loop in an autonomic computing system,
suggesting the applicability of RL to that context. In more detail, at each step of interaction with the environment, the agent is provided with an input i through its
sensors, observing some property of the current state of the environment s; then, the agent chooses an action a to be performed
through its actuators (or effectors). This action can change the state of the environment and this change is reflected back to
the agent with a scalar reinforcement signal r, usually in the form of a reward (not shown in Figure 2). The duty of the agent
is to learn a behavior B, through a certain algorithm based on a trial and error process, such that the long–run sum of the
rewards is maximized. More formally, assuming that the environment can be modeled according to a Markov Decision Process
(MDP) [7] , a RL model is described by [4] :
• a discrete set S of possible states for the environment;
• a discrete set A of possible actions the agent can perform;
• a set of scalar reinforcement signals, either binary (∈ {0, 1}) or real–valued (∈ R), representing the reward.
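As a concrete instance of the long–run measure of reinforcement the agent tries to maximize, a common choice (an illustrative assumption; the surveyed works may adopt different criteria, such as finite–horizon or average reward) is the expected discounted return:
\[
E\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right], \qquad 0 \le \gamma < 1,
\]
where r_t is the reward received at step t and the discount factor γ trades off immediate against future rewards.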
In the simplest scenario, the percepts gathered by the agent from the environment faithfully and completely describe its
current status s ∈ S; in this case the environment is said to be completely observable. More complex models (representing
the environment as a Partially–Observable Markov Decision Process – POMDP) can take into account the possibility that the
environment is only partially observable, i.e., the agent does not get a faithful and complete perception of the state of the
environment, but the status is filtered through an input function I. For instance, the perception of the agent could indicate
the environment to be in a certain state s_i with probability p_i, with p_i ∉ {0, 1}. The objective of the agent is to learn a policy π
which maps states to actions with the aim of maximizing a certain long–run measure of reinforcement (i.e., to maximize a
function of the overall achieved reward). What characterizes RL with respect to supervised learning approaches is that
the model assumes the agent receives only immediate feedback and that no examples of correct input/output pairs are preserved.
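To make the interaction loop concrete, the following sketch shows a tabular Q–learning agent, a standard representative of the second class of algorithms mentioned above (those estimating the utility of states and actions [4]); it is a generic illustration written for this survey, not code from any of the reviewed works.

```python
import random
from collections import defaultdict

class QLearningAgent:
    """Tabular Q-learning: learns utilities Q(s, a) from (s, a, r, s') experience."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # (state, action) -> estimated utility
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose_action(self, state):
        # Epsilon-greedy trade-off between exploration and exploitation.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Move Q(s, a) towards the one-step bootstrapped target r + gamma * max_a' Q(s', a').
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```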

II. REINFORCEMENT LEARNING TO AID AUTONOMIC COMPUTING


Autonomic computing systems classically rely on an autonomic manager leveraging knowledge on the system as formalized
in a model defined at design time. This approach, despite making use of runtime information through an autonomic control loop,
somehow fails to embody the full potential of AC, as the system model is predefined. In fact, it is ever more difficult to provide
accurate models of computing systems that allow an autonomic control mechanism to achieve the desired performance, and these
difficulties are a strong limiting factor for the adoption of self–management techniques in contemporary computing systems [14].
Machine learning has been seen as a very promising technique to address this issue, called the "knowledge bottleneck", as ML
can be leveraged to incrementally build a system model through online learning, requiring no (or very little) built–in prior
knowledge. Moreover, as already pointed out in Section I-B, using the RL operational model within the autonomic control loop
(e.g., the MAPE-K loop, considered by Tesauro [14]) appears direct and natural, assuming that the monitoring phase (realized
through sensors) provides relevant state descriptions and reward signals. The only major mismatch between the two models
is that RL policies are generally seen as reactive planners (i.e., they make immediate decisions, without explicit search or
forecasting of future states [14]), while the planning phase in the MAPE-K loop is more general. This mismatch may limit RL
applicability in some more complex cases, but it does not impair its use in common cases for AC, such as the management of
real–time applications, which lie in a reactive context [14].
The remainder of this Section provides an overview of some interesting works found in the literature that build RL into an AC
scenario to provide the decision phase (or, equivalently, the analyzing and planning phases) of the autonomic control loop.

A. Self–Optimization for QoS


Whiteson and Stone [16] , in 2004, were among the first to consider the use of a RL module to realize self–optimization in the
context of network routing. They use a learning approach to continuously improve the system performance, and a scheduling
algorithm relying on a heuristic to take into account packet specificities such as priorities. The routing schema is based on
an online learning technique called Q–routing, according to which each node of a network is enhanced with a RL module
maintaining a table of estimates about the time required to route packets in different ways. The algorithm is implemented in a
simulation environment modeling the interactions among the nodes of a network, represented as a graph. Transmitted packets
are modeled as jobs and a reward based on a utility function is associated with each completed job (i.e., routed packet). The
simulation results are positive, but this approach only deals with one QoS dimension: the routing time, thus not exploiting the
multi–objective capabilities of RL algorithms.
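For illustration, the per–node table update at the core of Q–routing could look as in the following sketch (an assumption based on the standard Q–routing update rule; variable names and the learning rate are illustrative, not taken from Whiteson and Stone's implementation).

```python
def q_routing_update(q_table, node, dest, next_hop, queue_time, trans_time,
                     neighbor_estimate, alpha=0.5):
    """One Q-routing update at `node` after forwarding a packet for `dest`
    through `next_hop`.

    q_table[(node, dest, next_hop)] estimates the remaining delivery time;
    `neighbor_estimate` is the next hop's own best estimate for `dest`;
    `queue_time` and `trans_time` are the locally measured delays.
    """
    old = q_table.get((node, dest, next_hop), 0.0)
    target = queue_time + trans_time + neighbor_estimate
    q_table[(node, dest, next_hop)] = old + alpha * (target - old)
    return q_table[(node, dest, next_hop)]
```

At routing time, each node would then pick the next hop with the lowest estimated delivery time for the packet's destination.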

B. Learning for Multiple Objectives


One of the strengths of RL applied to AC is the possibility of specifying a multi–objective reward function to obtain learning
towards more than one dimension. Amoui, Salehie, Mirarab, and Tahvildari [1] describe their work in the context of autonomic
software and build RL into a MAPE-K adaptation loop at the planning phase to learn a policy for selecting the best adaptive
action at a given time. The reasons put forth by the authors for employing RL are mainly four:
• the chosen RL algorithm provides multi–objective learning;
• the RL agent can be modified (by adding punishments when the goals are not satisfied) to perform both reactive and
deliberative decision making;
• RL provides dynamic online learning, providing the ability of adapting to previously unseen situations and of managing
uncertainty;
• RL can be very time–efficient, with algorithms for decision making performing in O(m) time for m possible actions on
a learned policy.
According to the authors, these reasons highlight how RL can be a promising solution to the problem of planning in an
autonomic manager. The authors also address the problem of exploration through trial and error, which may be problematic
in cases where making wrong decisions for the sake of learning is unacceptable. Three possible solutions are proposed:
• having a learning phase during the testing of the system to be used for exploration;
• initializing the learning algorithm with values determined by human experts, so that the initial exploration is more focused;
• relying on simulation to perform the learning phase before the actual system is implemented.
Based on this rationale, the authors propose a RL–based decision maker based on the State–Action–Reward–State–Action
(SARSA) algorithm [4]; the model for the decision–making process is represented in Figure 3.

Figure 3. Process model of the RL–based decision maker proposed by Amoui et al.; adapted from the original [1]

The monitoring process is
modeled as the measurement of a set of environment attributes at_i, with i ∈ {1, . . . , n}; the objectives are represented
as a set of k goals, with k binary variables G(s) = {g_1(s), . . . , g_k(s)} indicating whether each goal is being
met. The possible adaptation actions form a set AC = {a_1, . . . , a_m}. The major modules involved in the process
are: a state generator, discretizing the observed values of the attributes; a state mapper, aggregating the discretized attributes into
a single key representing the state; a reward function, computing the reinforcement signal from the current values
of the k variables in G(s); and a RL engine, which both updates the current state model (represented as a Q–table) and
selects the next action. The mechanism is implemented and evaluated within a simulation model of a news web application,
originally developed by Salehie and Tahvildari [10] , and the results show that the system is able to learn to behave better than
choosing actions at random, with RL used for learning a policy in a preliminary testing/tuning phase.
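For reference, the update performed by such a SARSA–based RL engine, together with a goal–based reward, could be sketched as follows (an illustrative reconstruction: the module names follow the paper's description, but the code and the goal weighting are assumptions, not the authors' implementation).

```python
def sarsa_update(q_table, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy SARSA update: Q(s, a) moves towards r + gamma * Q(s', a'),
    where a' is the action actually selected in the next state."""
    old = q_table.get((s, a), 0.0)
    target = reward + gamma * q_table.get((s_next, a_next), 0.0)
    q_table[(s, a)] = old + alpha * (target - old)

def reward_from_goals(goal_flags, weights=None):
    """Reward computed from the k binary goal-satisfaction variables g_1..g_k;
    the symmetric +/- weighting is an assumption for illustration."""
    weights = weights or [1.0] * len(goal_flags)
    return sum(w if g else -w for g, w in zip(goal_flags, weights))
```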

C. Distributed and Collaborative Scenarios


One of the most interesting problems in autonomic computing research is the management of distributed adaptation policies;
this approach is useful in distributed contexts where keeping a consistent global state is too complex (e.g., in a cloud computing
environment). This problem has been tackled, at a theoretical level, by Dowling, Cunningham, Curran, and Cahill [2] , who
propose a reinforcement learning–based model called Collaborative Reinforcement Learning (CRL) to tackle the complex
time–varying problem of coordinating autonomic components (i.e., agents) distributed in a system with no notion of a global
state. CRL extends reinforcement learning with a coordination model describing how agents cooperate to solve a system–wide
optimization problem decomposed in a set of Discrete Optimization Problems (DOPs). Figure 4 shows a schema representing
the approach to the collaborative distributed solution of DOPs.

Figure 4. Schema of the Collaborative Reinforcement Learning (CRL) approach proposed by Dowling et al. [2]

Each DOP is modeled as an MDP and the CRL model solves
system–wide problems by specifying how individual agents (i.e., autonomic components) can either resolve a certain DOP
via reinforcement learning (i.e., learning a policy to maximize a certain function of the reinforcement signal) and share the
solution with the other agents or delegate the solution to a neighboring agent. Within this model, a DOP may be delegated several
times before eventually being handled by an agent; reasons for delegation may be the impossibility for an agent to solve
the problem, or the estimated cost of doing so being higher than that foreseen by a neighboring agent. Details on the CRL
algorithm are formalized in the paper [2] , where the authors also propose a probabilistic on–demand network routing protocol
based on CRL and called SAMPLE. This protocol has been implemented in a network simulator framework and simulation
results show that SAMPLE exhibits autonomic self–optimization properties by leveraging the CRL algorithm.
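The solve–or–delegate choice at the heart of CRL can be sketched as follows (an illustrative reconstruction based on the description above; the advertisement format and cost comparison are assumptions, not the authors' exact formulation).

```python
def handle_dop(dop, own_cost_estimate, neighbor_ads, can_solve, solve, delegate):
    """Either solve a DOP locally via RL or delegate it to the neighbor
    advertising the lowest estimated cost (their advertised V(s) values).

    neighbor_ads: dict mapping neighbor id -> advertised cost estimate.
    """
    best_neighbor, best_cost = None, float("inf")
    for neighbor, advertised_cost in neighbor_ads.items():
        if advertised_cost < best_cost:
            best_neighbor, best_cost = neighbor, advertised_cost

    # Delegate if the agent cannot solve the DOP itself, or if some neighbor
    # advertises a lower estimated cost than the local one.
    if best_neighbor is not None and (not can_solve(dop) or best_cost < own_cost_estimate):
        return delegate(dop, best_neighbor)
    return solve(dop)  # solve locally via reinforcement learning
```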
The application of RL in a distributed AC scenario is treated, at a more practical level, also by Rao, Bu, Wang, and Xu [8] ,
who present a distributed RL algorithm that facilitates the provisioning of virtualized resources in cloud computing in an
autonomic fashion. They use a reinforcement learning algorithm to manage the autonomic allocation of virtual resources to VMs upon
changes in the applications’ workload. By doing so, VM resources can be automatically provisioned to match the
applications’ current demand rather than their peak demand. The proposed approach is based on model–based RL and the states considered in
the learning algorithm are the possible VM resource allocations; the available changes to the allocations form the set of actions.
The reinforcement signal is fed to the RL decision mechanism whenever it decides to adjust the resource allocation for the
VMs and it consists of performance feedback from individual VMs. After a sufficient interaction time with the environment
(exploring the solution space by trying different configurations and receiving feedback), the controller is shown to be able to
obtain good estimates of the allocation decisions, given the state of the workload on the different VMs. Further results show
that, starting from an arbitrary initial setup, the controller is able to choose optimal resource allocations for the managed VMs.
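To give an idea of how such a state/action space could be encoded, the following sketch is a hypothetical illustration (the actual resource types, step sizes, and representation used by Rao et al. may differ).

```python
from itertools import product

# Hypothetical state: current (cpu_shares, memory_mb) allocation for each VM.
state = {"vm1": (2, 1024), "vm2": (4, 2048)}

# Actions: increase, decrease, or keep each resource of one VM per control step.
CPU_STEP, MEM_STEP = 1, 256

def actions_for(state):
    acts = [("noop", 0, 0)]
    for vm, (cpu, mem) in state.items():
        for d_cpu, d_mem in product((-CPU_STEP, 0, CPU_STEP), (-MEM_STEP, 0, MEM_STEP)):
            if (d_cpu, d_mem) != (0, 0) and cpu + d_cpu > 0 and mem + d_mem > 0:
                acts.append((vm, d_cpu, d_mem))
    return acts

# The reinforcement signal would come from per-VM performance feedback
# (e.g., response time versus a target) measured after an allocation change.
```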

D. Hybrid Approaches
Under some circumstances, reinforcement learning may be paired with different techniques to get better results in terms
of autonomic management. Vienne and Sourrouille [15] associate RL with a control mechanism to improve and adapt the
QoS management policy in a dynamic execution environment. They present a middleware consisting of a layer for resource
allocation with the goal of managing QoS in a computing system characterized by a dynamic runtime behavior. RL is used to
estimate the benefit of taking a certain action when the environment (i.e., the managed computing system) presents a certain
state. The main advantage brought by the proposed middleware is that the application designer does not have to worry too much
about the performance of the applications, dealing instead with high–level descriptions; moreover, the system is made capable of coping with
unexpected changes in the execution context. The use of RL along with a control mechanism reduces the amount of information
that must be fed into the system beforehand (e.g., a control theory based controller needs a precise model of the system in order to be effective).
Even though the RL controller provides fewer guarantees, it requires far less a–priori information about the controlled system.
Tesauro, Jong, Das, and Bennani [13] tackle the problem of avoiding wild exploration when bootstrapping the RL system in
scenarios where the cost of taking a badly wrong decision is higher than its learning benefit for the algorithm. The proposed
solution couples RL with a policy based on a queuing model: to bootstrap the RL controller, the managed system is ruled by
the queuing model–based policy and the RL algorithm is trained offline on the collected data. This hybrid approach
allows the RL controller to bootstrap from existing management policies, substantially reducing learning time and cost. The
effectiveness of the approach is tested in the context of a simple data–center prototype.
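The offline bootstrapping step can be sketched as follows (a simplified illustration assuming a tabular batch Q–learning update and a generic external policy; the authors' actual approach relies on a queuing–model policy and their own RL training procedure, which may differ).

```python
def collect_traces(env_step, external_policy, initial_state, episodes=100, horizon=50):
    """Run the managed system under the external (e.g., queuing-model-based)
    policy and record (s, a, r, s') transitions for offline RL training."""
    traces = []
    for _ in range(episodes):
        s = initial_state()
        for _ in range(horizon):
            a = external_policy(s)
            r, s_next = env_step(s, a)
            traces.append((s, a, r, s_next))
            s = s_next
    return traces

def train_offline(traces, actions, alpha=0.1, gamma=0.9, sweeps=20):
    """Batch Q-learning over the recorded traces; the resulting Q-table
    bootstraps the RL controller before it takes over online."""
    q = {}
    for _ in range(sweeps):
        for s, a, r, s_next in traces:
            best_next = max((q.get((s_next, b), 0.0) for b in actions), default=0.0)
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q
```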

III. CONCLUSIONS AND PERSPECTIVE


Within the field of autonomic management of computing systems and self-* system–management problems, reinforcement
learning was introduced as a novel and radically different approach with respect to the decision–making techniques classically
leveraged in such scenarios. The first applications of RL in this context are still relatively young, but different works in the
literature have explored applications of RL in various scenarios. The main strength of RL with respect to other decision–
making methods is that it requires less system–specific knowledge while still being able to synthesize reasonably near–optimal
policies. Certainly, there are still unresolved issues with RL, mainly related to excessively long training times, highly complex
state descriptions, and poor performance while learning due to random exploration, which may be too costly in some scenarios.
Some of these problems have been tackled, e.g., using a hybrid approach to dispense with the cost of online learning from
scratch through pure trial and error. These advances, together with applications of RL to real–world problems (e.g., resource
allocation for VM management), are showing that RL can be truly effective as a decision mechanism for AC, and that, with
more research, it will be possible to fulfill its promise of outperforming well–established autonomic control techniques.
REFERENCES
[1] M. Amoui, M. Salehie, S. Mirarab, and L. Tahvildari. Adaptive Action Selection in Autonomic Software Using Reinforcement Learning.
In Autonomic and Autonomous Systems, 2008. ICAS 2008. Fourth International Conference on, pages 175–181, march 2008.
[2] J. Dowling, R. Cunningham, E. Curran, and V. Cahill. Collaborative reinforcement learning of autonomic behaviour. In Database and
Expert Systems Applications, 2004. Proceedings. 15th International Workshop on.
[3] Paul Horn. Autonomic computing: IBM’s perspective on the state of information technology, Oct 2001. [Online] Available: http:
//www.research.ibm.com/autonomic/manifesto/autonomic_computing.pdf.
[4] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: a survey. J. Artif. Int. Res., 4(1):237–285,
May 1996.
[5] Jeffrey O Kephart and David M Chess. The Vision of Autonomic Computing. Computer, 36(January):41–50, 2003.
[6] J.O. Kephart. Research challenges of autonomic computing. In Proceedings of the 27th international conference on Software engineering,
ICSE ’05, pages 15–22, New York, NY, USA, 2005. IEEE.
[7] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York,
NY, USA, 1st edition, 1994. ISBN 0471619779.
[8] Jia Rao, Xiangping Bu, Kun Wang, and Cheng-Zhong Xu. Self-adaptive provisioning of virtualized resources in cloud computing.
SIGMETRICS Perform. Eval. Rev., 39(1):321–322, June 2011.
[9] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd
edition, 2009. ISBN 0136042597, 9780136042594.
[10] Mazeiar Salehie and Ladan Tahvildari. A weighted voting mechanism for action selection problem in self-adaptive software. In
Proceedings of the First International Conference on Self-Adaptive and Self-Organizing Systems, SASO ’07, pages 328–331, Washington,
DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2906-2. doi: 10.1109/SASO.2007.4. URL http://dx.doi.org/10.1109/SASO.
2007.4.
[11] Mazeiar Salehie and Ladan Tahvildari. Self-adaptive software: Landscape and research challenges. ACM Transactions on Autonomous
and Adaptive Systems, 4(2):1–42, May 2009.
[12] Filippo Sironi, Davide Basilio Bartolini, Simone Campanoni, Fabio Cancare, Henry Hoffmann, Donatella Sciuto, and Marco D.
Santambrogio. Metronome: operating system level performance management via self-adaptive computing. In Proceedings of the
49th Annual Design Automation Conference, DAC ’12, pages 856–865, New York, NY, USA, 2012. ACM.
[13] G. Tesauro, N.K. Jong, R. Das, and M.N. Bennani. A Hybrid Reinforcement Learning Approach to Autonomic Resource Allocation.
In Autonomic Computing, 2006. ICAC ’06. IEEE International Conference on, pages 65–73, june 2006.
[14] Gerald Tesauro. Reinforcement Learning in Autonomic Computing: A Manifesto and Case Studies. Internet Computing, IEEE, 11(1):
22–30, jan.-feb. 2007.
[15] Patrice Vienne and Jean-Louis Sourrouille. A middleware for autonomic QoS management based on learning. In Proceedings of the
5th international workshop on Software engineering and middleware, SEM ’05, pages 1–8, New York, NY, USA, 2005. ACM.
[16] Shimon Whiteson and Peter Stone. Towards autonomic computing: Adaptive network routing and scheduling. Autonomic Computing,
International Conference on, 0:286–287, 2004. doi: http://doi.ieeecomputersociety.org/10.1109/ICAC.2004.62.

June 29, 2012


Document produced with LaTeX.
