
Reinforcement Learning-Based Dynamic Adaptation Planning Method for Architecture-based Self-Managed Software∗

Dongsun Kim and Sooyong Park
Department of Computer Science and Engineering, Sogang University
Shinsoo-dong, Mapo-Gu, Seoul, Korea
{darkrsw,sypark}@sogang.ac.kr

Abstract

Recently, software systems face dynamically changing environments, and the users of these systems provide changing requirements at run-time. Self-management is emerging to deal with these problems. One of the key issues in achieving self-management is planning, i.e. selecting an appropriate structure or behavior of the self-managed software system. There are two types of planning in self-management: off-line and on-line planning. Recent discussion has focused on off-line planning, which provides static relationships between environmental changes and software configurations. In on-line planning, a software system can autonomously derive mappings between environmental changes and software configurations by learning its dynamic environment and using its prior experience. In this paper, we propose a reinforcement learning-based approach to on-line planning in architecture-based self-management. This approach enables a software system to improve its behavior by learning the results of its behavior and by dynamically changing its plans based on that learning in the presence of environmental changes. The paper presents a case study to illustrate the approach; its results show that reinforcement learning-based on-line planning is effective for architecture-based self-management.

1. Introduction

As software systems face dynamically changing environments and have to cope with various requirements at run-time, they need the ability to adapt to those environments and new requirements[12, 1]. Increasing demands for more adaptive software introduced the concept of architecture-based self-management[11], in which the software system can dynamically change its architectural configuration without human intervention. To achieve architecture-based self-management, many challenges must be addressed, such as identifying adaptable requirements, modeling adaptable software architectures, collecting environmental data, planning software architectures, and reconfiguring software dynamically. Among these challenges, planning the software architecture is one of the key issues in implementing autonomy in architecture-based self-management.

Planning in self-management denotes the ability to make a decision that represents behavioral or structural changes to a software system. In other words, planning is an activity in which a software system maps an appropriate structure or behavior to a new situation of the environment or to new user requirements. There are two different types of planning in self-management[11]: off-line and on-line planning. Off-line planning means that the mappings between situations (or states) and possible software configurations are made by the software construction process with human intervention. For example, mapping strategies to invariants as described in [5] is a form of off-line planning because those mappings are specified in an architectural description language during the construction process¹.

On-line planning means that a system can dynamically and autonomously make decisions about the mappings between situations and configurations. In on-line planning, the software system generally carries out its task based on the current configuration, accumulates its experience (quantitative results of the current configuration in the current situation), and then learns the effectiveness of the configuration from the previous experience. Using the learning data, the system can determine appropriate (generally best-so-far) mappings in the presence of situation changes. By repeating this process (execution, accumulation, learning, and decision making), the system can identify better mappings (i.e. improve its plans through repeated learning).

∗ This research was performed for the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Knowledge Economy (MKE).
¹ These plans can be injected at run-time (i.e. late binding), but they are eventually identified by an analysis and design process with human intervention.

SEAMS'09, May 18-19, 2009, Vancouver, Canada


Previous approaches to planning in self-management focus on an off-line planning process in which adaptation plans are designed or predefined at construction time[11] and are self-contained. Off-line planning is effective if developers can anticipate every possible mapping between situations and configurations. However, it is difficult to anticipate the interrelationships between situations and actions in actual execution before the software is deployed. On the other hand, an on-line planning process enables a software system to generate plans adaptively by evaluating and learning its behavior, and by exploiting the result at run-time in the presence of uncertain environments. Thus, if we can provide on-line planning capabilities to a system, the system can adapt its configuration to the environment more autonomously.

In this paper, we propose a reinforcement learning-based on-line planning approach for architecture-based self-managed systems. Specifically, our approach applies Q-learning[23]. To support reinforcement learning-based on-line planning in architecture-based software, we present several elements: (a) representations, which are discovered from software requirements denoted by goals and scenarios, (b) fitness functions to evaluate the behavior of a system, (c) operators to facilitate on-line planning, and (d) a process to apply the previous three elements to actual run-time execution.

The paper is organized as follows. Section 2 describes planning approaches for architecture-based self-management. Section 3 presents our approach, which consists of representation (Section 3.1), fitness (Section 3.2), operators (Section 3.3), and an on-line evolution process (Section 3.4) to achieve on-line planning in self-management. Section 4 describes a case study conducted to verify the effectiveness of our approach. Section 5 summarizes the contributions of the paper.

2. Planning Approaches in Architecture-based Self-Management

Planning in software systems can be represented as an interaction between an agent and an environment, as depicted in Figure 1[18]. This model shows the ideal interaction, in which the agent can immediately monitor states of the environment and observe rewards from the environment after the agent applies an action to the environment. In this interaction model, a software system plays the role of the agent. The system monitors the current state (st) of the environment (where the software system operates) and observes the current reward (rt), which is a special numerical value from the environment. Based on the state and the observed reward, the software system (agent) selects and takes an action (at). The taken action at influences the environment. After the action is applied, the environment transitions into the next state (st+1) and produces a reward. The software system (agent) then selects a new action (at+1) based on the current state (st+1) and reward (rt+1). This continual interaction is the basis for realizing the autonomy of self-management.

Figure 1. The agent-environment interaction in reinforcement learning.

However, the ideal model is not appropriate for actual software systems. State transitions and reward observations can be delayed because it takes time until the given action affects the environment. If this timing issue is ignored, the system can neither monitor the state transition properly nor observe an appropriate reward. Hence, the system needs a modified interaction model for self-managed software, as depicted in Figure 2. The difference from the model shown in Figure 1 is the existence of the situation it. The situation allows the system to know when it must monitor the state and reward, and when it must take the action corresponding to the state. In other words, it triggers the adaptation of self-managed systems.

Figure 2. The software system-environment interaction in self-management.
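To make the interaction model of Figure 2 more concrete, the following sketch shows one possible shape of the situation-triggered control interface in Java. The type and method names (Situation, Environment, SelfManagedSystem, onSituation) are our own illustration under the model's assumptions and are not prescribed by the approach.

    // Placeholder types for illustration only.
    interface Situation {}
    interface State {}
    interface Action {}

    // The environment as seen by the software system (Figure 2).
    interface Environment {
        State currentState();       // s_t: long-term state observable from the environment
        double currentReward();     // r_t: numerical reward produced by the environment
        void apply(Action action);  // a_t: the action (architectural reconfiguration) taken
    }

    // A detected situation i_t triggers adaptation: observe (s_t, r_t), then choose and apply a_t.
    interface SelfManagedSystem {
        void onSituation(Situation situation, Environment environment);
    }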
One remaining issue is how this interaction model can be implemented in architecture-based self-management. Suppose that we can observe a finite set of situations that a software system can encounter, and let S be that set. Let C be a finite set of architectural configurations that the system can take. When a situation (s ∈ S) occurs, if the system can always choose the best architectural configuration (c ∈ C, corresponding to the actions in Figure 2) among the possible configurations of the system, then we can say that the system has a set of the best plans for the environment. This can be formulated as:

P : S → C    (1)

V(si, P(si)) = max_{cj ∈ C} {V(si, cj)},  ∀si ∈ S    (2)

where P in equation (1) is a planner that maps each situation (s ∈ S) to an appropriate configuration (c ∈ C) during the system's execution, and V in equation (2) is a value function that calculates the value of a configuration for a given situation si ∈ S. Equation (2) states that if the configuration chosen by the planner P is always the best configuration, then P is the best planner for the environment. One of the primary goals of research in self-management is to find an optimal planner that can autonomously make plans even in dynamically changing environments. The rest of this section discusses two planning approaches for finding an optimal planner in architecture-based self-management.
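For illustration only, the planner P of equation (1) and the value function V of equation (2) can be rendered as the following Java sketch; the type names and the greedy argmax loop are our own, and the paper does not prescribe this implementation.

    import java.util.Set;

    // Placeholder types for illustration.
    interface Situation {}
    interface Configuration {}

    // V(s, c): the value of configuration c in situation s.
    interface ValueFunction {
        double value(Situation s, Configuration c);
    }

    // P : S -> C, as in equation (1).
    interface Planner {
        Configuration plan(Situation s);
    }

    // A planner satisfying equation (2): it returns a configuration maximizing V over the finite set C.
    final class BestPlanner implements Planner {
        private final ValueFunction v;
        private final Set<Configuration> configurations; // the finite set C

        BestPlanner(ValueFunction v, Set<Configuration> configurations) {
            this.v = v;
            this.configurations = configurations;
        }

        @Override
        public Configuration plan(Situation s) {
            Configuration best = null;
            double bestValue = Double.NEGATIVE_INFINITY;
            for (Configuration c : configurations) {
                double value = v.value(s, c);
                if (value > bestValue) {
                    bestValue = value;
                    best = c;
                }
            }
            return best; // null only if C is empty
        }
    }

The practical difficulty, of course, is that V is not known in advance in a dynamically changing environment; Section 3 describes how Q-learning approximates it at run-time.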
2.1. Off-line Planning

In architecture-based self-management, off-line planning means that the decisions defining the relationship between situations and plans (i.e. configurations) are made prior to run-time. In other words, whenever a system encounters a specific situation si from the environment, the system selects and executes exactly one identical configuration ci. Existing solutions for self-managed systems focus on off-line planning[11]. For example, plans are made by a system administrator through a console or hard-coded by a developer[14], architectural adaptation is described by mapping invariants and strategies in an ADL description[5], or architectural changes are triggered by utility functions[4]. These off-line approaches can be effective if developers can identify well-defined goals, states, configurations, rewards, and their relationships, along with test environments that exactly reflect the characteristics of the actual operating environments before deployment time. However, it is very difficult to identify all of them due to the nature of planning[9]. On-line planning, which provides more effective autonomy in architecture-based self-management, presents an alternative that overcomes this limitation of off-line planning.

2.2. On-line Planning

On-line planning in architecture-based self-management means that a software system can autonomously choose a configuration with respect to the current situation that it encounters. Generally, an on-line planning process has three major steps: selection, evaluation, and accumulation[9, 10]. In the selection step, the system autonomously chooses a configuration suitable for the current situation. Generally, the configuration is chosen by the greedy selection strategy, in which the best-so-far configuration is chosen. However, this strategy may lead to the problem of local optima. Hence, the system must balance its strategy between exploitation and exploration. Exploitation means that the system chooses a greedy configuration, while exploration means that the system intentionally chooses a suboptimal or random configuration to find a better solution. It is important to adjust the ratio of the two strategies because it determines the planning performance.

In the evaluation step, the system must estimate the effectiveness of the configuration taken in the selection step. The key issue in the evaluation step is to define how to determine the numerical values that represent the reward of the configuration, because a numerical representation enables the accumulation and comparison of rewards. In the accumulation step, the system stores the numerical values identified in the evaluation step. The system must adjust the accumulation ratio between already accumulated knowledge and newly incoming experience. If the system relies too heavily on accumulated knowledge in this step, it may slow down the accumulation; on the other hand, if it relies too heavily on new experience, it may cause ineffective accumulation. With the accumulated knowledge, the on-line planner can select an appropriate action in the next selection step.

With these steps, self-managed systems can take advantage of on-line planning in the presence of dynamically changing environments. The next section describes in detail how on-line planning can be applied to actual systems.

3. Q-learning-based Self-Management

This section presents an approach to designing architecture-based self-managed software by applying on-line planning based on Q-learning. Generally, three elements must be considered to apply metaheuristics such as reinforcement learning: 'representation', 'fitness', and 'operators'[7]. In reinforcement learning, the representations of states and actions (i.e. configurations) are critical to shaping the nature of the search problem. The fitness function is used to determine which solution is better. The operators enable a system to determine neighbor solutions, which can accelerate the learning process. Hence, the proposed approach provides state and action representations that the system can exploit, fitness function design for the system to determine better solutions, and operators to manipulate solutions. In addition to these three elements, the approach provides an on-line evolution process that describes the control loop of the system at run-time.
3.1. Representation

The representations of states and actions are crucial for designing self-managed software using on-line planning because they define the problem space and the solution space of the system. States and actions are usually represented either numerically or by symbolic codes. The former is more suitable for neural networks[16] and support vector machines[3, 22]. A numerical representation is not appropriate for shaping the problem and solution space of reinforcement learning, which uses discrete information, because it may cause a state explosion that increases the training time of the system. Hence, the approach uses symbolic codes to represent the problem (state) and solution (action) spaces.

Our approach provides a goal and scenario-based discovery process for more systematic state and action discovery. Goal-based[20, 13] and scenario-based approaches[21, 17] are widely used for the elicitation of software requirements, and goal and scenario discovery has been studied[15]. Goals denote the objectives that a system must achieve, for example, 'maintain high throughput in a communication system' or 'minimize response time'. Scenarios represent a sequence of events to achieve a specific goal, for example, 'the system is given user input and stores it into databases'. Goals are structured hierarchically to represent the relationship between supergoals and subgoals. A subgoal represents more specific objectives that help achieve its supergoal and, eventually, the root goal of the goal hierarchy. A goal has scenarios, and they are used for discovering new subgoals by describing the related goal in detail.

The proposed approach exploits goals and scenarios to discover states and actions. Once the goal and scenario structure is organized, it can be mapped to states and actions by reforming it. First, scenarios must be reformed into pairs of 'condition → behavior' or 'stimulus → reaction', e.g. 'when the battery of the system is low (condition or stimulus), turn off the additional backup storage (behavior or reaction)'. This reforming is depicted in Table 1. In this table, the discovered goals are listed in sequence (goals will be used to discover fitness functions, as described in Section 3.2). The scenarios of a specific goal are listed next to the goal. Each scenario is reformed into two additional columns: Condition (stimulus) and Behavior (reaction).

States are identified from the scenarios. Conditions in the scenarios are candidates for states. A condition represents a possible state of the system, e.g. 'the system's battery is low' implies 'low-battery' and 'the system's battery is full' implies 'full-battery'. These conditions depict physical states of the system, i.e. what the environment has or shows. Thus, a group of conditions represents a dimension (type) of states; for example, 'battery-level' is a dimension of state information and it can take the value 'low-battery' or 'full-battery'. While this information represents a long-term state of the environment, a situation represents a transient change of the environment, e.g. 'hit by wall' in an autonomous mobile robot system. Situations are the triggers that begin the adaptation of the system. Situations can also be identified from condition information: a condition that describes a transient event, such as 'when the system is hit by a bullet', can be transformed into a situation.

One more element that must be considered when talking about states is the current software architecture of the system. Although the architecture is not part of the environment, it can influence the planning of the system; for example, it can be used to find an admissible action set. An architecture can be denoted by a symbolic notation that consists of a vector of components and connectors, e.g. (component:A, component:B, component:C, connector:A-B, connector:B-C). To simplify this, it can be transformed into a unique architecture identifier, e.g. arch:1a.

These three pieces of information (situation, long-term state, architecture) compose the state information. An example of the elements of state information is depicted in Table 2. The state information of the system can be denoted in a vector form such as (situation, long-term state, architecture information), for example, (hit-by-wall, near, low, arch:1a) and (hit-by-wall, far, full, arch:1a), which are combinations of elements from Table 2.

Actions can be identified by extracting behaviors or reactions from scenarios. As depicted in Table 1, a set of behaviors (reactions) is identified together with the corresponding conditions (or stimuli). If these pairs between conditions and reactions are fixed before deployment time, this amounts to off-line planning, i.e. static plans. Since the goal of this approach is on-line planning in self-management, the set of actions should be discovered separately. Similar to state information, actions can be identified by discovering action elements and grouping the elements into types. For example, first identify action elements such as 'stop moving' or 'enhance precision'; then group the elements which have the same type.

Examples of action elements and their types are shown in Table 2. With these pieces of information, each action can be represented in a vector form such as (precise maneuver, rich) or (stop, text-only), where the first dimension is 'Movement' and the second one is 'GUI'. An action element such as 'rich' in GUI or 'stop' in Movement implies architectural changes, which include adding, removing, or replacing a component and changing the topology of an architecture. Hence, each action must be mapped to a set of architectural changes. For example, the action '(precise maneuver, rich)' can be mapped to the sequence '[add:(visionbased localizer), connect:(visionbased localizer)-(pathplanner), replace:(normal gui)-by-(rich gui)]', where (...) indicates a component name.
Table 1. Reformed goals and scenarios

Goal                 | Scenario                                  | Condition (stimulus)          | Behavior (reaction)
Goal 1 Maximize      | Sc. 1-1 when the battery of the system    | Cond. 1-1 the battery of the  | Beh. 1-1 turn off the additional
system availability  | is low, turn off the additional backup    | system is low                 | backup storage
                     | storage                                   |                               |
                     | Sc. 1-2 ...                               | Cond. 1-2 ...                 | Beh. 1-2 ...
Goal 2 ...           | Sc. 2-1 ...                               | Cond. 2-1 ...                 | Beh. 2-1 ...
...                  | ...                                       | ...                           | ...
Goal n.m ...         | Sc. n.m-1 ...                             | Cond. n.m-1 ...               | Beh. n.m-1 ...

Table 2. An example of state and action information

Situation    | State (long-term) Type | State (long-term) Range | Action Type | Action Range
hit-by-wall  | distance               | {near, far}             | Movement    | {precise maneuver, stop, quick maneuver}
hit-by-user  | battery                | {low, mid, full}        | GUI         | {rich, normal, text-only}

The discovery process shown in this section identifies states and actions by extracting data elements from goals and scenarios. Because goals and scenarios directly represent the objectives and user experiences, and also show how the system influences the environment, the number of states and actions can be kept limited. In other words, this helps the system prevent a state explosion.
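As a small illustration of this symbolic, vector-style representation, the state and action elements of Table 2 could be encoded as follows; the enum and record names are ours, and only the vector forms themselves come from the approach.

    // Symbolic state and action representation, sketched from the example in Table 2.
    enum SituationKind { HIT_BY_WALL, HIT_BY_USER }
    enum Distance { NEAR, FAR }                                // long-term state dimension 'distance'
    enum Battery { LOW, MID, FULL }                            // long-term state dimension 'battery'
    enum Movement { PRECISE_MANEUVER, STOP, QUICK_MANEUVER }   // action dimension 'Movement'
    enum Gui { RICH, NORMAL, TEXT_ONLY }                       // action dimension 'GUI'

    // State vector: (situation, long-term state, architecture identifier), e.g. (hit-by-wall, near, low, arch:1a).
    record RobotState(SituationKind situation, Distance distance, Battery battery, String archId) {}

    // Action vector: one element per action dimension, e.g. (precise maneuver, rich).
    // Each action value would additionally be mapped to a list of architectural changes such as
    // [add:(visionbased localizer), connect:(visionbased localizer)-(pathplanner), replace:(normal gui)-by-(rich gui)].
    record RobotAction(Movement movement, Gui gui) {}

For example, the state (hit-by-wall, near, low, arch:1a) mentioned above would be written as new RobotState(SituationKind.HIT_BY_WALL, Distance.NEAR, Battery.LOW, "arch:1a").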
3.2. Fitness

The fitness function of the system is crucial because it represents the reward of the action that the system chooses. The function provides a criterion for evaluating the system's behavior. Specifically, in autonomous systems, it is necessary to know which of two actions is better in the current state. Hence, the rewards that the fitness function generates are usually represented by numerical values that can be compared to each other. The fitness function is usually application specific, so it is important to discover the function from the system's requirements. Our approach exploits the goal and scenario structure discovered in Section 3.1; in particular, goals are the source of fitness discovery.

Generally, goals, especially higher ones including the root goal, are too abstract to define numerical functions. Thus, it is necessary to find an appropriate goal level at which to define the fitness function. It is difficult to define universal rules for choosing the goals that describe numerical functions of the system, but it is possible to propose a systematic process to identify the functions. The following shows the process for defining the fitness functions (a brief sketch of the search is given after the list); this process must be conducted manually by the system developers.

1. From the root goal, search for goals that can be numerically evaluated, using a top-down search strategy.

2. If an appropriate goal is found, define a function that represents the goal and mark all subgoals of that goal (i.e. its subtree). Then stop searching that subtree.

3. Repeat the search until all leaf nodes are marked.
discovery. is the i-th function of t, and i |wi | = 1. Equation (3) as-
Generally, goals, especially higher ones including the sumes that the objective of the system is to maximize the
root goal, are too abstract to define numerical functions. values of all functions fi (t). To minimize a certain value,
Thus, it is necessary to find an appropriate goal level to de- multiply −1 to the weight value of the function fi (t). Every
fine the fitness function. It is difficult to define universal function fi (t) corresponds to the observed data of the sys-
rules for choosing appropriate goals which describes nu- tem at run-time. For example, the observed network latency
merical functions of the system, but it is possible to propose 100ms at time t is transformed into 10 by the function fi (t)
a systematic process to identify the functions. The follow- and multiplied by -1 because it should be minimized.

80
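As an illustration of equation (3) only, an integrated fitness function can be sketched as a weighted sum of per-goal terms; the class and method names and the normalization check are our own assumptions.

    import java.util.List;

    // One term f_i(t) of equation (3): maps an observed quantity at time t to a numerical value.
    interface FitnessTerm {
        double evaluate(double observationAtT);
    }

    // Integrated fitness f(t) = sum_i w_i * f_i(t), with sum_i |w_i| = 1.
    // A negative weight is used for quantities that should be minimized (e.g. latency).
    class IntegratedFitness {
        private final double[] weights;
        private final List<FitnessTerm> terms; // weights[i] belongs to terms.get(i)

        IntegratedFitness(double[] weights, List<FitnessTerm> terms) {
            double sumAbs = 0.0;
            for (double w : weights) sumAbs += Math.abs(w);
            if (Math.abs(sumAbs - 1.0) > 1e-9) {
                throw new IllegalArgumentException("|w_1| + ... + |w_n| must equal 1");
            }
            this.weights = weights.clone();
            this.terms = terms;
        }

        // observations[i] is the run-time value feeding f_i at time t; the result is the reward r_t.
        double reward(double[] observations) {
            double rt = 0.0;
            for (int i = 0; i < terms.size(); i++) {
                rt += weights[i] * terms.get(i).evaluate(observations[i]);
            }
            return rt;
        }
    }

Following the latency example above, a latency term could map an observed 100 ms to 10 and carry a negative weight so that lower latency yields a higher reward.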
3.3. Operators

Different metaheuristics use different operators; for example, genetic algorithms use crossover and mutation. The reason algorithms use operators is to accelerate the search for better solutions. In reinforcement learning, operators are used to reduce the search space, i.e. the number of actions. The operator A(s) specifies an admissible action set for the observed state s. For example, when a mobile robot collides with a wall, actions related to motion control are more admissible than those related to arm manipulation. This operator is crucial because it can reduce the training time of the system.
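A minimal sketch of such an operator follows; for brevity it keys only on the situation part of the state, and the enum values and the particular filtering rule are our own, loosely mirroring the wall-collision example above.

    import java.util.EnumSet;
    import java.util.Set;

    // Illustrative situations and actions (motion-control actions vs. other actions).
    enum Situation { HIT_BY_WALL, HIT_BY_USER }
    enum Action { PRECISE_MANEUVER, STOP, QUICK_MANEUVER, RICH_GUI, TEXT_ONLY_GUI }

    // Operator A(s): restricts the action set that the planner may consider for the observed state.
    class AdmissibleActions {
        static Set<Action> of(Situation situation) {
            switch (situation) {
                case HIT_BY_WALL:
                    // After a wall collision, motion-control actions are the admissible ones.
                    return EnumSet.of(Action.PRECISE_MANEUVER, Action.STOP, Action.QUICK_MANEUVER);
                case HIT_BY_USER:
                default:
                    return EnumSet.allOf(Action.class);
            }
        }
    }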
3.4. On-line Evolution Process

With the three elements discussed in Sections 3.1-3.3, the system can apply on-line planning based on Q-learning. Q-learning[23] is a form of temporal-difference (TD) learning, which combines Monte Carlo ideas with dynamic programming ideas[18]. Specifically, Q-learning is an off-policy TD control algorithm. The reason we choose Q-learning is its model-free characteristic: in contrast to model-based planners[2, 19], model-free TD methods can learn directly from raw experience without a model of the environment. Also, TD learning updates observed data based in part on prior knowledge, without waiting for a final outcome, and TD methods assume that events of the environment recur. These characteristics satisfy the properties of on-line planning described in Section 2. This section presents the way to exploit those elements in the on-line evolution process. The process consists of five phases: detection, planning, execution, evaluation, and learning (a sketch of the overall loop is given below).
3.4.1 Detection Phase

In the detection phase, the system monitors the current state of the environment in which it operates. When detecting states, the system uses the notation presented in Section 3.1. Continual detection may cause performance degradation, so it is crucial to monitor only the changes that actually trigger the adaptation of the system. Architecture information is not suitable for this purpose because it does not reflect environmental changes and is changed only by the system itself. Long-term states are also not feasible because they are only the result of environmental changes rather than the cause. Situations are appropriate triggers because they describe the moments at which the system needs adaptation. If a situation is detected, the system observes the long-term states and the current architecture of the system, and denotes them in the representation presented in Section 3.1. These data are passed to the next phase: the planning phase.

3.4.2 Planning Phase

Using the state identified in the detection phase, the system chooses an action to adapt itself to the state. At this point, the system uses an action selection strategy. In general, Q-learning uses the 'ε-greedy' selection strategy as an off-policy strategy to choose an action from the admissible action set A(s) of the current state s. The strategy is a stochastic process in which the system either exploits prior knowledge or explores a new action. This is controlled by a value ε determined by the developer, where 0 ≤ ε < 1. In this phase, the system generates a random number r. If r < ε, the system chooses an action randomly from the admissible action set; otherwise (r ≥ ε), it chooses the best-so-far action by comparing the values of the actions accumulated in the learning phase (see Section 3.4.5). In this manner, the stochastic strategy prevents the system from falling into local optima. The chosen action is passed to the execution phase.
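A minimal sketch of this ε-greedy choice, assuming a tabular layout of Q-values keyed by state-action pairs (the table layout and names are our own illustration):

    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    // Epsilon-greedy selection over the admissible action set A(s); names are illustrative.
    class EpsilonGreedyPlanner<S, A> {
        private final double epsilon;                 // 0 <= epsilon < 1, set by the developer
        private final Map<S, Map<A, Double>> qValues; // Q(s, a), accumulated in the learning phase
        private final Random random = new Random();

        EpsilonGreedyPlanner(double epsilon, Map<S, Map<A, Double>> qValues) {
            this.epsilon = epsilon;
            this.qValues = qValues;
        }

        // admissibleActions is A(s) for the current state and is assumed to be non-empty.
        A selectAction(S state, List<A> admissibleActions) {
            if (random.nextDouble() < epsilon) {
                // Explore: pick a random admissible action.
                return admissibleActions.get(random.nextInt(admissibleActions.size()));
            }
            // Exploit: pick the best-so-far action according to the accumulated Q-values.
            Map<A, Double> values = qValues.getOrDefault(state, Map.of());
            A best = admissibleActions.get(0);
            double bestValue = values.getOrDefault(best, 0.0);
            for (A action : admissibleActions) {
                double v = values.getOrDefault(action, 0.0);
                if (v > bestValue) {
                    bestValue = v;
                    best = action;
                }
            }
            return best;
        }
    }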
3.4.3 Execution Phase

This phase applies the action chosen in the previous phase (the planning phase) to the system. As the action describes architectural changes such as adding, removing, or replacing components and reconfiguring the architectural topology, the system must have architecture manipulation facilities, i.e. a dynamic architecture. It must also support a way to reach a quiescent state[6]. A quiescent state² of the architecture means that the system (or the part of the system relevant to the changes described by the action) is safe for executing architectural reconfiguration. If the action is not executed in a quiescent state, it may cause unsafe changes that can lead to a malfunction.

Once reconfiguration is done, the system carries out its own functionality using the reconfigured architecture. The system keeps executing until it encounters a new situation or terminates. Although this may overlap with the detection phase, detection here indicates a phase transfer from this phase to the next one (the detection phase, of course, uses the detected situation).

² The term 'state' here is different from the term used in Section 3.1. In this context, the state indicates the execution state of the software architecture.

3.4.4 Evaluation Phase

After the execution phase, the system must evaluate its previous execution by observing a reward from the environment. As mentioned in Section 3.2, the system continuously observes the values previously defined by the fitness function. These values are used for calculating the reward of the action taken. The values must be retrieved immediately; if the system is delayed in retrieving reward values, it may degrade the performance or quality of the learning phase. Hence, the system must have facilities to observe the values immediately. The observed values are passed to the next phase: the learning phase.

3.4.5 Learning Phase

In this phase, the system accumulates the experience that represents the previous execution described in Section 3.4.3, using the reward observed in the evaluation phase. This phase directly uses Q-learning. The key activity of Q-learning is updating Q-values. This update is depicted in equation (4):

Q(st, at) = (1 - α)Q(st, at) + α[rt+1 + γ max_a Q(st+1, a)]    (4)

where 0 ≤ α ≤ 1 and 0 < γ ≤ 1; α is a constant step-size parameter and γ is a discount factor. This update equation accumulates the result of the previous execution based on the current state, the chosen action, and the observed reward. The technique composites the current experience onto the previous experience. The term (1 - α)Q(st, at) represents the value the system already knows and its weight in the accumulation, where α controls the weight. The term α[rt+1 + γ max_a Q(st+1, a)] represents the current reward and the expected value of the greedy action in the newly observed state, where α controls the term's weight and γ controls the weight of the greedy action. In this manner, the system accumulates its experience by updating Q(st, at) = v, where st is the detected state, at is the action taken by the system at st, and v is the value of the action at st. This knowledge is used in the planning phase (see Section 3.4.2) to choose the best-so-far action.
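A minimal sketch of this update rule, reusing the tabular Q-value layout assumed in the planning-phase sketch above (only equation (4) itself comes from the approach; the names are ours):

    import java.util.HashMap;
    import java.util.Map;

    // Tabular Q-learning update implementing equation (4); the table layout is illustrative.
    class QLearner<S, A> {
        private final double alpha;  // step-size parameter, 0 <= alpha <= 1
        private final double gamma;  // discount factor, 0 < gamma <= 1
        private final Map<S, Map<A, Double>> qValues = new HashMap<>();

        QLearner(double alpha, double gamma) {
            this.alpha = alpha;
            this.gamma = gamma;
        }

        // Q(s_t, a_t) = (1 - alpha) * Q(s_t, a_t) + alpha * (r_{t+1} + gamma * max_a Q(s_{t+1}, a))
        void update(S state, A action, double reward, S nextState) {
            double oldQ = qValues.getOrDefault(state, Map.of()).getOrDefault(action, 0.0);
            double maxNext = qValues.getOrDefault(nextState, Map.of()).values().stream()
                    .mapToDouble(Double::doubleValue).max().orElse(0.0);
            double newQ = (1 - alpha) * oldQ + alpha * (reward + gamma * maxNext);
            qValues.computeIfAbsent(state, s -> new HashMap<>()).put(action, newQ);
        }

        Map<S, Map<A, Double>> table() {
            return qValues;
        }
    }

With the parameter values used later in the case study (α = 0.3, γ = 0.7), new QLearner<>(0.3, 0.7) would perform this update after each evaluation phase.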
Figure 3. Examples of applying the on-line evolution process (Case 1: learning a new situation, e.g. <hit-by-bullet> with state (distance=near, angle=behind, energy=mid, enemy's energy=low) and an arbitrary reconfiguration action such as replace:(rammer)-by-(avoider); Case 2: exploiting learning data by choosing the best-so-far action argmax_a Q(s, a)).

Examples of a system using the above process are shown in Figure 3. (i) When the system encounters a new situation, the planner in the system chooses an arbitrary reconfiguration action because there is no learning data about the new situation. The system then reconfigures its architecture based on the chosen action. While the system executes its functionality, the evaluator evaluates the execution based on the predefined fitness functions and produces rewards. The learner accumulates the rewards with respect to the prior situation (planning model construction). (ii) When the system encounters an already experienced situation, the planner exploits the best-so-far action based on the learning data (exploitation based on the previously constructed planning model). After the execution of the architecture reconfigured according to the best-so-far action, the evaluator produces rewards. The rewards are used to reinforce the current Q-values (model reinforcement). When the environment changes, the current Q-values are no longer appropriate for the changed environment. Hence, the system must try actions with respect to situations, accumulate rewards, and reinforce Q-values to adapt to the changed environment (model updating).

With the facilities discussed throughout Section 3, the approach supports software developers in constructing self-managed software with on-line planning based on Q-learning. The next section shows the result of a case study that applies the approach to a robot simulator.

4. Case Study

This section reports on a case study that applies our approach to an autonomous system. The environment in this case study is Robocode[8], a robot battle simulator. Robocode provides a rectangular battlefield where user-programmed robots can fight each other. It also provides APIs (Application Programming Interfaces) for robot design. A robot consists of a radar, a gun, and a body. The radar is responsible for locating enemy robots on the battlefield; it indicates the location of an enemy robot by providing a distance and a relative angle. The gun fires bullets at an enemy robot; it can control the power of a bullet, but a more powerful bullet causes a longer cooling time. The body contains the radar, the gun, and two wheels. By giving speed values to the wheels, the robot can move around the battlefield. Basically, Robocode only provides templates for building a robot; robot developers must construct the software systems for their robots themselves. In general, developers build firing, targeting, and maneuvering functionalities for a robot.

The reason we chose Robocode is that it provides a dynamically changing environment and enough uncertainty, and it is well suited for testing self-managed software with on-line planning. In particular, it is hard to anticipate the behavior of an enemy robot prior to run-time.
Also, several communities provide diverse strategies for firing, targeting, and maneuvering. These offer opportunities to try several reconfigurations with respect to various situations.

The rest of this section presents a robot design that applies our approach. The design includes representations, fitness functions, and operators for the robot. This section also presents the result of the evaluation, which shows the effectiveness of our approach.

4.1. Robot Implementation

To verify the effectiveness of the proposed approach, we implemented a robot with a dynamic architecture. This robot can try many firing, targeting, and maneuvering strategies by changing its software architecture. We also identified situations, states, actions, fitness functions, and operators based on our approach. A detailed description of the robot implementation is provided in Appendix A.

4.2. Evaluation

This section shows the effectiveness of the approach by presenting the results of robot battles in Robocode. Two experiments were conducted in this research. Learning data were initialized at the beginning of every experiment except the second experiment.

The first experiment was conducted to compare the approach with random action selection (i.e. random adaptation), which is one of the generic evaluation criteria for metaheuristics[7], and to verify whether our planning approach can generate plans without an initial model of the environment. We chose a robot named 'AntiGravity 1.0' as the enemy. This robot has an anti-gravity algorithm that makes it move as if it had a reverse gravity engine: the algorithm supposes that the walls of the battlefield and the other robots exert gravity, so the robot keeps enough distance from the walls³ and the other robots. 'AntiGravity 1.0' is known in Robocode communities for its good performance in battles with many other robots. For convenience, the robot described in Section 4.1 will be denoted 'A Robot' and 'AntiGravity 1.0' will be denoted 'B Robot'.

³ When a robot hits a wall, its energy is reduced.

The result of the battle in which 'A Robot' with random action selection fought 'B Robot' is shown in Figure 4 (ε = 1.0 means the robot chooses actions randomly whenever it monitors any situation; that is, the robot repeats case (i) in Figure 3 without exploiting prior experience). We carried out one hundred rounds and observed the scores after every ten rounds. Each score consists of a survival bonus (70 points) and bullet damage points. Scores reflect the goals of the robot: 'Eliminate the enemy' and 'Maximize my energy' are satisfied by surviving in the battlefield (which earns the survival bonus), and 'Minimize the enemy's energy' is satisfied by hitting the enemy with bullets (which earns bullet damage points).

Figure 4. Scores of 'A Robot' with parameters α = 0.3, γ = 0.7, ε = 1.0. ε = 1.0 means the robot always chooses an action randomly in the presence of any situation (i.e. exploration only). The regression line for 'A Robot' is y = -19.0x + 2721.1.

The result shown in Figure 4 indicates that the robot with random action selection does not adapt to the environment (i.e. the enemy robot), even though it slightly outperforms 'B Robot'. The regression line representing the performance of 'A Robot' in Figure 4 does not increase because the robot cannot exploit prior experience to find better configurations (i.e. no planning model is constructed). Hence, the robot cannot make effective plans. This indicates that a planner is needed to adapt effectively to an environment for which no model is available.

On the other hand, the scores of 'A Robot' with the on-line planning capability (0.0 < ε < 1.0 means the planner in the robot can exploit prior experience, as depicted in case (ii) of Figure 3) gradually increase, as depicted in Figure 5, and the robot outperforms 'B Robot'. This indicates that the approach enables the robot to, at least, learn its environment and enhance its performance at run-time. In detail, the robot randomly chooses actions in the early stage of the battle, as depicted in case (i) of Figure 3 (exploration). As the rewards of prior actions are accumulated (model construction), the robot can exploit the accumulated knowledge to make a plan, as shown in case (ii) of Figure 3 (exploitation). Consequently, the result depicted in Figure 5 shows that the robot can actively explore better solutions by using the reinforcement learning-based planner in the presence of dynamically changing environments.
Figure 5. Scores of 'A Robot' with parameters α = 0.3, γ = 0.7, ε = 0.5. ε = 0.5 means the robot can make plans by using prior knowledge in the presence of each situation (i.e. the planner adjusts the proportion of exploration and exploitation). The regression line for 'A Robot' is y = 74.9x + 2361.5.

The second experiment was conducted to verify the effectiveness of on-line planning in self-management. In this experiment, we used the learning data (Q-values) learned by the robot in the previous experiment (with parameters α = 0.3, γ = 0.7, ε = 0.5). Using these data, 'A Robot' with greedy selection (ε = 0.0 means the robot always exploits best-so-far actions in the presence of any situation) constantly outperforms 'B Robot', as depicted in Figure 6(a). However, when we introduced a new robot ('C Robot', which has a 'Dodge' strategy), meaning that the environment changed, the robot with the same parameters cannot outperform the enemy, as depicted in Figure 6(b). This indicates that the robot is effectively using off-line planning and cannot adapt to dynamic environment changes, which shows the limitation of off-line planning with its fixed planning model. To enable on-line planning, we changed ε to 0.5; the result of the battle is shown in Figure 6(c). The result indicates that 'A Robot' can learn the behavior of 'C Robot' and gradually outperform the enemy. This indicates that software systems applying our approach can search for better solutions when they encounter new situations in dynamically changing environments. The two experiments described in this section demonstrate the effectiveness of the approach described in Section 3 by showing that a robot implemented with the approach can learn a dynamically changing environment and outperform off-line planning.

Figure 6. (a) Exploitation only (ε = 0.0): the robot uses the previously accumulated Q-values to fight the already experienced robot ('B Robot') without further learning (regression line: y = -4.7x + 2813.3). (b) Exploitation only against a new robot ('C Robot'): the robot uses the same Q-values, i.e. off-line planning, and cannot adapt to the environment change (regression line: y = 4.8x + 2234.7). (c) Exploration and exploitation (ε = 0.5): in the early stages, 'A Robot' randomly explores actions and accumulates rewards from the exploration, then exploits the learned knowledge to fight the new robot (regression line: y = 12.4x + 2534.0).
5. Conclusions

Planning in architecture-based self-management represents how a system can find appropriate relationships between the situations it encounters and the possible configurations it can take in dynamically changing environments. The paper has discussed two types of planning: off-line and on-line planning. On-line planning must be considered to deal with dynamically changing environments because off-line planning is limited by its use of fixed relationships between situations and configurations for adaptation. On-line planning, in contrast, enables a system to autonomously find better relationships between them in dynamic environments. This paper has described an approach to designing architecture-based self-managed systems with on-line planning based on Q-learning. The approach provides a discovery process for representations, fitness functions, and operators to support on-line planning. The discovered elements are organized by an on-line evolution process. A case study has been conducted to evaluate the approach; its result shows that our approach is effective for architecture-based self-management.

A. Robot Implementation

Due to space limitations, we provide additional material about the robot implementation at the following URL: http://seapp.sogang.ac.kr/robotimpl.pdf

References

[1] T. Bäck. Adaptive business intelligence based on evolution strategies: some application examples of self-adaptive software. Inf. Sci., 148(1-4):113-121, 2002.
[2] P. Bertoli, A. Cimatti, M. Pistore, M. Roveri, and P. Traverso. MBP: a model based planner. In IJCAI'01 Workshop on Planning Under Uncertainty and Incomplete Information, 2001.
[3] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov., 2(2):121-167, 1998.
[4] J. Floch, S. O. Hallsteinsen, E. Stav, F. Eliassen, K. Lund, and E. Gjørven. Using architecture models for runtime adaptability. IEEE Software, 23(2):62-70, 2006.
[5] D. Garlan, S.-W. Cheng, A.-C. Huang, B. R. Schmerl, and P. Steenkiste. Rainbow: Architecture-based self-adaptation with reusable infrastructure. IEEE Computer, 37(10):46-54, 2004.
[6] H. Gomaa and M. Hussein. Software reconfiguration patterns for dynamic evolution of software architectures. In 4th Working IEEE/IFIP Conference on Software Architecture (WICSA 2004), 12-15 June 2004, Oslo, Norway, pages 79-88, 2004.
[7] M. Harman and B. F. Jones. Search-based software engineering. Information & Software Technology, 43(14):833-839, 2001.
[8] IBM. Robocode, http://robocode.alphaworks.ibm.com/home/home.html, 2004.
[9] G. Klein. Flexecution as a paradigm for replanning, part 1. IEEE Intelligent Systems, 22(5):79-83, 2007.
[10] G. Klein. Flexecution, part 2: Understanding and supporting flexible execution. IEEE Intelligent Systems, 22(6):108-112, 2007.
[11] J. Kramer and J. Magee. Self-managed systems: an architectural challenge. In FOSE '07: 2007 Future of Software Engineering, pages 259-268, Washington, DC, USA, 2007. IEEE Computer Society.
[12] R. Laddaga. Active software. In P. Robertson, H. E. Shrobe, and R. Laddaga, editors, Proceedings of the First International Workshop on Self-Adaptive Software, IWSAS 2000, volume 1936 of Lecture Notes in Computer Science, pages 11-26. Springer, 2000.
[13] J. Mylopoulos. Goal-oriented requirements engineering. In 12th Asia-Pacific Software Engineering Conference (APSEC 2005), 15-17 December 2005, Taipei, Taiwan, page 3, 2005.
[14] P. Oreizy, N. Medvidovic, and R. N. Taylor. Architecture-based runtime software evolution. In ICSE '98: Proceedings of the 20th International Conference on Software Engineering, pages 177-186, Washington, DC, USA, 1998. IEEE Computer Society.
[15] C. Rolland, C. Souveyet, and C. B. Achour. Guiding goal modeling using scenarios. IEEE Trans. Software Eng., 24(12):1055-1071, 1998.
[16] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. In Neurocomputing: foundations of research, pages 89-114, 1988.
[17] A. G. Sutcliffe, N. A. M. Maiden, S. Minocha, and D. Manuel. Supporting scenario-based requirements engineering. IEEE Trans. Software Eng., 24(12):1072-1088, 1998.
[18] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts, 1998.
[19] D. Sykes, W. Heaven, J. Magee, and J. Kramer. From goals to components: a combined approach to self-management. In SEAMS '08: Proceedings of the 2008 International Workshop on Software Engineering for Adaptive and Self-Managing Systems, pages 1-8, New York, NY, USA, 2008. ACM.
[20] A. van Lamsweerde. Goal-oriented requirements engineering: A guided tour. In 5th IEEE International Symposium on Requirements Engineering (RE 2001), 27-31 August 2001, Toronto, Canada, page 249, 2001.
[21] A. van Lamsweerde and L. Willemet. Inferring declarative requirements specifications from operational scenarios. IEEE Trans. Software Eng., 24(12):1089-1114, 1998.
[22] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[23] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.