
Enabling rollback support in IT change management systems

Conference Paper · January 2008


DOI: 10.1109/NOMS.2008.4575154 · Source: DBLP

Enabling Rollback Support in IT Change Management Systems

Guilherme Sperb Machado, Fábio Fabian Daitx, Weverton Luis da Costa Cordeiro, Cristiano Bonato Both, Luciano Paschoal Gaspary, Lisandro Zambenedetti Granville
Institute of Informatics, UFRGS - Brazil
{gsmachado, ffdaitx, weverton.cordeiro, cbboth, paschoal, granville}@inf.ufrgs.br

Claudio Bartolini, Akhil Sahai, David Trastour, Katia Saikoski
HP Laboratories Palo Alto, USA; HP Laboratories Bristol, UK; HP Brazil R&D, Brazil
{claudio.bartolini, akhil.sahai, david.trastour, katia.saikoski}@hp.com

978-1-4244-2066-7/08/$25.00 ©2008 IEEE

Abstract: The current research on IT change management has been exploring several aspects of this new discipline, but it usually assumes that changes expressed in Requests for Change (RFC) documents will be successfully executed over the managed IT infrastructure. This assumption, however, is not realistic in actual IT systems because failures during the execution of changes do happen and cannot be ignored. In order to address this issue, we propose a solution where tightly-related change activities are grouped together, forming atomic groups of activities. These groups are atomic in the sense that if one activity fails, all other already executed activities of the same group must roll back to move the system backwards to the previous state. The automation of change rollback is especially convenient because it relieves the human IT operator of manually undoing the activities of a change group that has failed. To prove concept and technical feasibility, we have materialized our solution in a prototype system that, using elements of the Business Process Execution Language (BPEL), is able to control how atomic groups of activities must be handled in IT change management systems.

I. INTRODUCTION

Currently, modern companies and organizations are often unable to deliver high quality services without employing sophisticated IT infrastructures to support their final businesses. Sophisticated IT infrastructures, in turn, are usually accompanied by complex management challenges that often lead to increasing maintenance costs. A rational management of IT infrastructures then becomes a critical issue for any organization that aims at keeping a good financial health. In order to provide a more systematic IT infrastructure management, and thus reduce management costs, the widely recognized Information Technology Infrastructure Library (ITIL) [1] presents a set of best practices and processes that help organizations to properly maintain their IT infrastructures.

Among the ITIL processes, change management [2] is the one that defines how changes in IT infrastructures should be planned, scheduled, implemented, and assessed. The importance of change management resides in the fact that changes in IT infrastructures must be executed in a way that does not lead the managed systems to unknown or inconsistent states. To address this issue, changes required over the IT infrastructures are first expressed in Requests for Change (RFC) documents that define which changes are needed but not how they must be performed. The definition of an RFC is the first step of a process that will generate a final change plan, which is essentially a workflow of low level activities that, when executed, will evolve the managed system to a new state consistent with the changes expressed in the original RFC.

Although change management is a relatively new discipline, several important challenges have already been investigated in research projects [3] [4] [5]. Given the complexity of the subject, these investigations have naturally made some assumptions that enabled the investigations to progress. One of these assumptions is that once an RFC is approved and ready to be deployed, the activities of the associated change plan will always succeed and lead the IT infrastructure to the next consistent state. This assumption does not in fact represent a realistic and practical situation in actual IT environments because failures during the execution of changes do happen and cannot be ignored.

In this paper we focus our attention on the need to handle failures during change plan execution, in order to prevent managed IT infrastructures from ending up in undesired states. We first define that, after the deployment of an RFC, the managed infrastructure must have either successfully evolved to a new state or returned to the state previous to the change request. In other words, RFC deployment is treated as a single atomic transaction. To support this behavior, whenever a failure occurs during a change plan execution, a rollback procedure is invoked to undo changes executed so far and abort the ongoing change plan. In fact, we go further and observe that for some IT scenarios it would be too restrictive to consider an RFC the only possible atomic element. We thus propose that additional atomic elements can be complementarily defined at a granularity finer than that of an RFC, for example, at the change plan level.

In order to materialize atomic transactions in IT management, we have employed a set of techniques in a prototype system developed to evaluate our proposed solution. In particular, we have explored some exception-related mechanisms (e.g., fault handlers, asynchronous notifications, compensation activities) present in the Business Process Execution Language
(BPEL) [6]. In our prototype, atomic transactions defined by human system operators at the RFC and change plan levels are translated into BPEL constructions via a translating algorithm also presented in this paper. A set of experiments has additionally been carried out to observe the impact of our proposal on the whole management system as well as on the managed IT infrastructure.

The remainder of this paper is organized as follows. In Section II we briefly review rollback support for computing systems. The proposed solution to incorporate rollback support in change plans is presented in Section III, while the associated prototype is described in Section IV. The results of the evaluation carried out in this research are then presented in Section V. Finally, we close this paper in Section VI, where conclusions and future work are discussed.

II. BACKGROUND

Rollback support is a complex subject in diverse computer science disciplines. Several aspects of computing systems (e.g., faulty underlying communications, dependencies among distributed components, services unavailability) make saving consistent states and subsequently rolling back to them a task that cannot always be properly or successfully accomplished. Nevertheless, some mechanisms to support rollback have already been proposed, investigated, and implemented. In this section we review rollback-related work that has inspired the design of our proposed solution.

At the device level, a common way to implement rollback support is to download a device's configuration file to a configuration server, deploy a new configuration, and use the previous one again if the new configuration makes the device behavior unstable. An evolution of this solution can be seen in devices where candidate configurations are stored inside the managed devices themselves, dispensing with an exterior configuration server. Recently, the NETCONF protocol [7], proposed by the Internet Engineering Task Force (IETF), has incorporated the notion of transactions in a configuration task, which prevents the managed devices from evolving to states unknown to the network operator.

Rollback at the device level alone is not sufficient for complex IT scenarios because different devices and services often depend on one another. For example, if the installation of a new Web server that requires additional configuration of the border firewall fails, not only the server installation itself needs to be undone, but also the configurations on the border firewall must be returned to the previous state. Rollback at the network level (above the device level) is then required. In an earlier work, we have proposed a Policy-Based Network Management (PBNM) system [8] where failures in the deployment of QoS policies return the managed devices to the previous state using an adapted version of the two-phase commit protocol.

Andrzejak et al. [9] have investigated automatic workflow generation that can adapt the underlying IT system in reaction to failures. However, this initial work does not present many details about automatic correction of graphs in response to partial failures. In addition, the authors recognize that the proposed solution has some bottlenecks, such as a complexity limit related to the number of objects and existing operators, an upper bound on the sum of costs, and on the specification of the actions.

In change management, as far as the authors of this paper are aware and as mentioned in the introduction section, there is no work that addresses the question of employing rollback as a mechanism to maintain the managed IT infrastructure in a consistent state. The importance of the subject can even be directly observed in the ITIL documents themselves, where back-out plans are explicitly mentioned as a requirement of change management. However, no proper support is found in the current systems. In the next sections we will introduce our proposed solution for rollback in change management.

III. ROLLBACK SOLUTION

In this section we present our solution for rollback support in IT change management systems. We first describe a general IT management architecture where the rollback support is introduced. We then discuss how atomic actions take place and how system administrators are able to group critical activities in atomic groups. Finally, we present the modeling of IT infrastructure classes to support the proposed approach.

A. IT Change Management Architecture and Rollback Support Components

Although there is not a single, widely employed architecture to support IT change management, it is possible to identify a set of basic functional components that, grouped together, form a general architecture. We introduce in such an architecture complementary elements to explicitly support rollback in change plans. Figure 1 depicts the general IT change management architecture, highlighting the components required for the rollback support.

[Figure: components shown are the change requester and the operator; the change management system with the change designer, change planner, and rollback planner; the Configuration Management Database, the Software Configuration Repository, and the Definitive Software Library; the deployment system with the rollback support generator, change deployer, and rollback engine; and the managed IT infrastructure with its CIs.]
Fig. 1. IT change management architecture

The specification of a new RFC begins when the change requester describes, in a high-level document, his/her change necessities. This is achieved by interacting with the change
designer, which is a tool that helps the change requester to complete the RFC document in a clear and consistent way. It is important to remember that an RFC expresses what is required, but not how to achieve that. This is initially defined when the operator, also interacting with the change designer, comes up with a preliminary change plan. For example, consider the statement “install a new Web-based project management system on server A”, which is typical of an RFC. Here, nothing defines whether a new Web server is needed or if an additional database must be created prior to the installation of the Web application; this information is, however, provided by the operator, who, by consulting the Configuration Management Database (CMDB), is able to check whether Web servers and required databases are available.

The output of the change designer is then a preliminary change plan that needs to be further complemented. The change planner component is the one responsible for automatically computing an actionable workflow that defines the final change plan. The algorithm for such computation is out of the scope of this paper, but it has already been addressed by other authors [3] and is the subject of complementary research by our group. The change planner, in its process of complementing the preliminary change plan, needs to consult two databases: the Configuration Management Database and the Software Configuration Repository. CMDB stores updated information about the managed IT infrastructure, enabling the change planner to discover which elements must be manipulated in order to fulfill the original RFC. The Software Configuration Repository, in its turn, maintains information about software dependencies, consumed during the installation/uninstallation process. For example, CMDB lists the software already available on server A, but it is the software configuration repository that informs, during the installation process, that the Web-based management system requires a pair of Web server and database to work properly.

In a change system with no rollback support, the workflow computed by the change planner would be ready to be submitted, upon the operator's order, to the deployment system. It then executes, using off-the-shelf solutions, the changes over the IT infrastructure (according to the change plan). One central point of this paper is that, in this case, if any fault occurs during the change plan deployment, the system will possibly enter an inconsistent state, because no reactions to faults are usually defined. In our approach, we address this issue by complementing the change plan with rollback information. That is accomplished through the rollback planner component. As can be seen in Figure 1, it takes as input an actionable workflow as well as additional marks provided by the operator. These additional marks allow the rollback planner to create an enhanced version of the original change plan, which now includes marks to support rollback actions if failures occur.

Internally, the deployment system is composed of three components: rollback support generator, change deployer, and rollback engine. The rollback-enabled change plan is submitted to the rollback support generator, which creates internal structures to support eventual rollback actions, while the change deployer performs changes over Configuration Items (CIs) of the IT environment. If any failure occurs in the change process, the rollback engine is invoked and executes the rollback procedure for the failed action, following the marks provided in the rollback-enabled change plan.

Whether a change plan has been successfully deployed or a failure triggered a rollback procedure, the deployment system must update CMDB at the end of the process. This ensures that the configuration database will provide an updated view of the IT infrastructure, which is required by the change planner to compute future change plans.

B. Marking Rollback-enabled Change Plans

As mentioned before, the operator is responsible, in a rollback-enabled IT change management system, for marking the original change plan in order to complement it with rollback information. In order to present how these marks are defined, we first need to observe how RFCs and change plans are internally organized.

A single RFC is composed of one or more operations. Each operation is an independent element that must be executed to accomplish the change requested in the RFC. Since operations are independent from each other, two different operations of the same RFC can be executed in parallel. Each operation contains a single change plan, which means that one change plan is associated with each operation. In fact, all change plans from the same RFC could be merged into a single one to optimize the deployment process, but for the sake of simplicity we assume in this paper that RFCs with more than one operation will present one change plan for each operation. Finally, each change plan is composed of a set of activities that are chained to form the final actionable workflow.

In order to support rollback-enabled RFCs, we define that some elements (RFCs, operations, or activities) can be marked as atomic elements using atomicity marks. The simplest case is the one where the whole RFC is marked as atomic. It means that, if any problem happens during the execution of any of its change plans, all activities must go backward, moving the IT infrastructure to the state previous to the RFC deployment. If no mark is defined at all, any failure will abort all change plans in execution, but no rollback action will be performed, leaving to the system administrator the responsibility of leading the infrastructure back to a consistent state.

Limiting the atomicity marks to the RFC level may be too restrictive in some scenarios. Consider, for example, that one does not want to uninstall a Web server in the event of a failure in the process of installing an associated database. We thus further define that besides marking an RFC as atomic, one can mark each operation of an RFC as individually atomic as well. In this case, only the change plan of a failing operation rolls back, leaving the change plans of other operations untouched. Finally, atomicity can be defined at the activity level. Here, if an activity defined as atomic fails, it will only roll back itself, but the subsequent activities will be executed. If a failing activity is not atomic, no rollback is executed, but the associated change plan is aborted altogether. An additional
mechanism is incorporated at the activity level: we define that a group of activities that are closely related to each other may form an atomic group. Once any activity of such a group fails, all other activities in the same group must roll back. Using atomic groups, one can define an atomic operation in two different ways: (a) by marking, at the operation level, the operation as atomic, or (b) by grouping all activities of an operation in a single atomic group. Option (a) is obviously easier to choose, but the fact that both (a) and (b) are possible indicates that, in fact, an atomic operation is a particular case of an atomic group composed of all activities from a single change plan.

In order to exemplify the use of atomicity marks, Figure 2 depicts how atomicity is defined at the aforementioned three levels, i.e., at RFC, operation, and activity levels.

[Figure: panels (a) and (b) show an RFC with operations OP1, OP2, and OP3; panels (c) and (d) show a change plan with activities 1 to 6.]
Fig. 2. Examples of atomicity marks and atomic groups

Figure 2-a shows an RFC composed of three operations: OP1, OP2, and OP3. In this case, the whole RFC is marked as atomic, implying that any problem in any operation will revert all actions executed. Figure 2-b presents the same RFC and operations, but atomicity is defined differently now. If OP1 fails, its internal actions will be undone, but it will not affect other operations of the RFC. The same happens with OP2. Since OP3 is not marked, any failure in its actions will abort the associated change plan without executing any rollback procedure. Figure 2-c presents marks at the activity level. In this case, a change plan composed of activities numbered from 1 to 6 has two atomic groups: one formed by activities 1 and 2, and another formed by activities 4 and 5. Note that, in the first group, the activities are executed sequentially. In this case, if activity 2 fails, activity 2 itself and activity 1 are reversed. After that, the workflow continues evolving to the execution, in parallel, of activities 3, 4, and 5. In the second atomic group, if activity 4 or 5 fails, both will be undone. In this case, activity 3 is not affected at all. Once activity 3 finishes, and activities 4 and 5 roll back or complete successfully, activity 6 will be ready to be executed. In the event of a failure in activity 3, it is important to emphasize that the whole change plan is aborted, skipping the execution of activity 6. Finally, Figure 2-d shows the case where the whole change plan is made atomic. The same effect can be achieved by defining that the operation that generated this change plan is atomic, in this case without requiring the definition of an atomic group of all the operation's activities.

C. Rollback Model

In order to enable change plans to include rollback support, it is necessary to model change plan information with rollback in mind. As mentioned before, a change plan consists of a workflow of activities that, when executed, lead the managed infrastructure to a new state. Therefore, for our solution, a model for change plan information must express actionable workflows that include rollback support. We design our solution by defining a Requests for Change and Change Plan Model. Our model is strongly based on the change management guidelines presented in the ITIL Service Support book [1], and on the approach to specify workflows defined by the Workflow Management Coalition (WfMC) [10]. It is not our goal here to stress what information is required to fully express workflows. Actually, that would be a whole research topic in itself. Rather, we are interested in modeling the information required for rollback support in a change management system. Considering these assumptions, Figure 3 presents a partial view of the defined model, highlighting the elements required for the rollback support, while omitting others that are not related to rollback.

[Figure: class diagram with the classes RFC, Operation, ChangePlan, ActivitySet, Activity, TransitionInformation, BlockActivity, SubProcess, AtomicRFC, AtomicOperation, AtomicActivity, and ActivityAtomicGroup (groupName: String). Legend: CPM: Change Plan Model; RFCM: Request for Change Model; RM: Rollback Model.]
Fig. 3. RFC and change plan model with rollback support

An RFC is composed of Operations that, in turn, are composed of one or more ChangePlans. RFC and Operation hold more abstract information of a change request, and thus form the Requests for Change part of our model. An AtomicRFC is a specialized RFC whose final actions must be treated as a single transaction by the deployment system, i.e., one marks an RFC as atomic by using the AtomicRFC class. In the same way, an AtomicOperation
is an Operation whose associated actions will roll back in the event of a problem throughout their deployment.

Each operation in an RFC has a change plan composed of ActivitySets, which are groups of one or more activities intended to implement a change plan. An Activity can be either a low-level, non-refinable activity (LeafActivity), which is the lowest level of granularity that an action can represent, or grouped activities (SubProcessDefinition and BlockActivity), which are activities composed of another activity set or of a new change plan. The TransitionInformation class models how activities are chained in the final workflow. The classes ChangePlan, ActivitySet, Activity, LeafActivity, TransitionInformation, SubProcessDefinition, and BlockActivity form the Change Plan part of our model.

In order to define an atomic activity, one should mark it using the AtomicActivity class. Since several atomic activities can be grouped together in atomic groups, each atomic activity needs to provide the identification, as a string, of the atomic group it belongs to. If no name is provided, the activity belongs to a group formed solely by itself. Notice that it is not possible for a single activity to be part of more than one atomic group. If that were possible, the activity would work as a mechanism to merge the rollback behavior of those groups. That is so because if one atomic group rolls back, the common activity of both groups should roll back too, leading the activities of the second atomic group to roll back as well.

D. Translating Marked Change Plans to Actionable Workflows with Rollback

In order to produce actionable workflows to deploy an RFC with rollback support, an automatic mechanism to generate such rollback plans is necessary. For each marked element in the same atomic group, the rollback plan is automatically generated by following two general steps: (1) reversing the entire change plan, and (2) not including those activities that the marked element depends on. For example, if an activity A is marked to participate in an atomic group AG1, the first step to generate the rollback plan is to reverse the change plan. After that, it is possible to identify which activities activity A depends on, and then not include them in the rollback plan. Note that this proposed method reproduces, in essence, the idea of a common stack: in a normal execution, activities are pushed onto the stack. Otherwise, if any failure happens, the system will pop each element in order to perform the rollback.

IV. PROTOTYPE IMPLEMENTATION

In order to prove concept, we have developed a prototype system that implements our proposed rollback solution. Our implementation is based on Web services technologies and standards, mainly due to (1) the interprocess communication over the Internet and (2) Web services composition such as the Business Process Execution Language (BPEL), used to coordinate distributed actions over an IT infrastructure. In this context, our implementation is based on the following Web services solutions:
• At the final Configuration Items (CIs) side, we assume that target elements that need to be managed (e.g., hosts, servers, clusters, storage, etc.) implement a management interface as a Web service. It can, for example, be materialized following the Configuration Description, Deployment, and Lifecycle Management (CDDLM) specification [11]. Associated with the Web service management interface, we assume that a Web Service Description Language (WSDL) [12] document is also available, describing the management interface itself;
• In order to deploy the required changes over an IT infrastructure, actionable workflows are described in BPEL documents that can be read and executed by a BPEL engine, such as ActiveBPEL [13]. The BPEL engine operates as the deployment system of the previously presented architecture. The communications with the BPEL engine are accomplished using Web services as well;
• The change management system is implemented as a simple Web application accessed by both change requester and operator, and works as the Web service client of the deployment system.

A. BPEL Constructions to Support Rollback

While executing a change plan, the deployment system must keep track of the organization in which actions will be performed. To orchestrate the system as expressed in a given workflow, four BPEL constructs were used: sequence, flow, if, and links, which is a basic BPEL construction to create links between activities.

The BPEL invoke activity was used to represent a workflow activity, allowing a remote Web service to be called to perform the given task. It can thus be used to perform remote operations using two-way (request-response) or one-way messages. In a one-way communication, the invoker sends a message and does not wait for any response. In the request-response communication style, a message is sent by the BPEL engine and the processing of the workflow remainder is blocked until a response arrives. In this case, the communication is synchronous and timeout-related issues must be taken into account. Our prototype supports both synchronous and asynchronous communication, combining sequence, flow, if, and links to guide the execution of workflows for an effective change plan execution.

Finally, in order to detect configuration problems and thus trigger rollback actions, additional BPEL constructions have been employed. For example, an invoke activity can include fault handlers to deal with errors associated with the invoked service. In this case, the invoke activity must be added to a scope and errors can be caught by a catchAll activity at execution time, deviating the normal execution flow to a different flow which handles the failure. As we assume in the solution that rollback actions do not fail, no additional fault handler constructions are attached to a catchAll activity.
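
To make the interplay between fault handling and stack-like rollback more concrete, the following Python sketch simulates it outside of any BPEL engine. It is only an illustration under stated assumptions: the names Activity, run_plan, and make_activity are hypothetical and do not come from the prototype, activities run strictly in sequence (parallel flows and dependency filtering are not modeled), the failing activity is undone together with the already executed members of its atomic group, and undo actions are assumed never to fail. The try block stands in for a scope around an invoke, and the except branch stands in for the catchAll handler.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Activity:
    """Hypothetical stand-in for one workflow activity (a BPEL invoke)."""
    name: str
    do: Callable[[], None]               # forward action
    undo: Callable[[], None]             # reverse action, assumed never to fail
    atomic_group: Optional[str] = None   # None means "not marked as atomic"


def run_plan(plan: List[Activity]) -> List[str]:
    """Run activities in sequence and return a log of what happened."""
    log: List[str] = []
    executed: List[Activity] = []        # stack of completed activities
    for activity in plan:
        try:                             # plays the role of a scope
            activity.do()
            executed.append(activity)
            log.append(f"do {activity.name}")
        except Exception:                # plays the role of catchAll
            if activity.atomic_group is None:
                # Unmarked activity: abort the plan without any rollback.
                log.append(f"abort at {activity.name}")
                return log
            log.append(f"fail {activity.name}")
            activity.undo()              # the failing activity reverts itself
            log.append(f"undo {activity.name}")
            # Pop the stack, undoing only members of the failed group.
            kept: List[Activity] = []
            while executed:
                done = executed.pop()
                if done.atomic_group == activity.atomic_group:
                    done.undo()
                    log.append(f"undo {done.name}")
                else:
                    kept.append(done)
            executed = list(reversed(kept))
    return log


def make_activity(name: str, group: Optional[str] = None,
                  fails: bool = False) -> Activity:
    """Build a toy activity whose forward action optionally raises."""
    def do() -> None:
        if fails:
            raise RuntimeError(f"{name} failed")
    return Activity(name, do, lambda: None, group)


if __name__ == "__main__":
    # Change plan in the spirit of Figure 2-c, with activity 5 failing.
    plan = [make_activity("1", "AG1"), make_activity("2", "AG1"),
            make_activity("3"), make_activity("4", "AG2"),
            make_activity("5", "AG2", fails=True), make_activity("6")]
    print(run_plan(plan))
```

In this run the log records activities 1 to 4 being executed, activity 5 failing, activities 5 and 4 being undone, and activity 6 still being executed, which mirrors the behavior described for the second atomic group of Figure 2-c. If an unmarked activity fails instead, the sketch stops at that point and undoes nothing.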

B. Deployment System

The deployment system is implemented in Java and organized internally in three blocks already introduced in Figure 1: the rollback support generator, the change deployer, and the rollback engine. The complete diagram for the deployment system is presented in Figure 4.

[Figure: the rollback support generator (import WSDL, validate WSDL, convert to BPEL, add rollback support, create deployment descriptor), the change deployer (build deployment file, execute change plan), and the rollback engine (order undo actions, rollback), followed by update CMDB.]
Fig. 4. Deployment system

First, the rollback support generator receives a marked change plan and, after reading the internal information, imports the set of WSDL files from all the endpoints that will be affected by the change plan. The WSDL files are then validated to guarantee that all required resources and operations are available in the managed elements. In this step, a verification is also done in order to determine what kind of communication will be used to perform each activity (synchronous or asynchronous). Next, the workflow conversion is used to convert the original marked workflow into a BPEL workflow. In our implementation, the change plan file is already a BPEL document that contains no rollback support but only atomicity marks. In order to transform these marks into BPEL rollback structures, the add rollback support component is invoked. Finally, with all information ready to be delivered for execution, the create deployment descriptor component creates a complementary file called Process Deployment Descriptor (PDD), required by the ActiveBPEL engine (which we also used in our implementation) to execute a whole actionable workflow. The complete set of files is then forwarded to the change deployer.

The change deployer expands the BPEL engine functionality by adapting its resources to the purpose of the deployment system. First, it builds a file packing all previously generated files (WSDL, BPEL, and PDD). Finally, ActiveBPEL is called to execute the change plan. In fact, ActiveBPEL is responsible

terms of rollback latency. Optimized orders of activities in rollback procedures are the subject of future work.

After computing the order of activities, the rollback itself is executed by invoking reversing operations at remote endpoints to undo the previous activities. For example, if an original activity was an install instruction, the reverse of it will be an uninstall instruction. Besides, assuming that the configuration item interfaces are implemented as Web services, we also assume that for each remote action there will be another action able to undo the first one. If that does not hold, however, the rollback procedure itself may lead the managed system to another unpredicted state. It is thus crucial that not only the rollback support works properly, but that the endpoints present reversible actions as well. These reversible actions are aligned with Recovery-Oriented Computing (ROC), where undo/redo is one proposed recovery technique that is able to cure a high percentage of failures [14].

After a rollback, the execution flow may return to the original change plan, or evolve directly to its end, depending on how the atomic activities and groups have been defined by the system operator. The last action, as already mentioned, is to update CMDB (Configuration Management Database) to reflect the new state of the managed IT infrastructure.

V. CASE STUDY & ANALYSIS

To prove concept and technical feasibility of our proposal, we have conducted an experimental deployment considering a real-life scenario. In subsection V-A we provide a detailed view of the studied scenario, while in subsection V-B we discuss the results achieved.

A. Case Scenario

Our case study is based on a company that provides services to its customers using the Internet. In order to cope with the heavy demands of the provided services, the company employs a high-performance cluster composed of 10 nodes. Each node is equipped with a Dual Core Xeon processor and 2GB of RAM. In addition to the cluster, an authentication server responsible for customer authentication is present as well. Finally, an HP 9000 server, configured using optimal performance options, hosts a MySQL database used to persist the information manipulated by the provided services. This IT environment is responsible for handling user requests considering appropriate performance thresholds, such as the Emergency Load Threshold (ELT). ELT is calculated and measured monthly, taking into account the average number of customer accesses, and typically varies when the company releases a new service. Once ELT is exceeded, the availability
for executing the change plan and handling the faults. Once of the provided services may be severely compromised.
a fail is detected, the normal flow of the change plan is Once the company intends to release a new service, an
intercepted and the rollback engine is called. The first step increase in the access load is expected. That is calculated
is to order the activities that must be undone to accomplish through a poll estimative. In order to support this load increase
an atomic behavior. In the current version of our prototype, and maintain the system health, the company decides to
this order is computed just following the reverse order of upgrade its IT infrastructure to have ELT in 55%. Let’s assume
atomic activities executed so far. Possibly different order to that, according to the poll, increasing ELT to 45% would be
rollback the system may present, improving performance in sufficient to a new service release, but the company wants to




have a comfort edge with an additional 10%. Considering this IT upgrade strategy, the previous
decisions are materialized in the IT infrastructure according to the RFC document presented
in Figure 5.

[Figure 5: RFC (name: Improve the ELT and Company's Resources; description: Renew software
and hardware for a better performance...; priority: High) with two "has operation" links:
 - Atomic Operation (name: Update Database Service; description: Update the database
   version...; priority: High; target: Database Server), which has a change plan;
 - Operation (name: Access Capability Upgrade; description: Install/Configure new
   components...; priority: High; target: Access Cluster), which has a change plan.]

Fig. 5. Example of an RFC

In this RFC, two operations are defined: a) a database service update, and b) an access
capability upgrade. The first operation intends to improve the database reliability by
updating MySQL to the Enterprise Edition (EE). The second operation installs a new machine in
the access cluster, and tunes the configuration of the older machines in order to increase
their performance. After being processed by the change planner, the operations generate the
change plans depicted in Figure 6.

[Figure 6: two change plans shown side by side.
 - Update Database Service Operation, a sequence of activities on the Database Server: Stop
   service, Backup Data, Install MySQL, Restore backup, Start service.
 - Access Capability Upgrade Operation: Activity id 1 (Connect/configure new machine and RAM;
   target: HP 9000 rp7470 and Dual Core Xeon machines), followed in parallel by Activity
   Atomic Group id 2 (Install and configure the software; target: HP 9000 rp7470; groupName:
   G1) and Activity Atomic Groups id 3 to id 12 (Configure performance options; target: Dual
   Core Xeon; groupName: G2).]

Fig. 6. Parallel change plans

Considering the first operation as a critical one, the change manager marks this operation as
an atomic operation. The second operation, however, is not marked; in fact, the change
manager delegates that to the operator, who must define, among the operation's internal
activities, those that must be grouped in atomic groups.

The change plan for the second operation is composed of the installation of a new machine
with better hardware resources, and performance improvements on each old machine by
installing additional memory. The first action of the change plan is to physically install
and configure the hardware for the new access cluster, including a RAM installation for each
old machine. This action is performed by humans, and we assume, in this example, that no
failure occurs. As the change plan evolves, it executes the software installation and
configuration in parallel. The technical conditions are an important factor in defining which
activities will participate in the atomic groups, which are set in our case study in a way
that prevents the system from becoming unavailable.

Considering this scenario, the operator concluded that, to guarantee the access cluster
availability and the new user access demands, the change plan must consider the following
condition: the activity related to the new machine (install and configure software) must
succeed, and the activities related to the old machines (i.e., a configuration of performance
options) may either succeed or fail. The result of the performance options configuration is
not critical because, even with a failure, the access cluster will remain available, whether
or not a rollback action is performed. In this case, the performance configuration will lead
to an increase of 10% in ELT, which is not sufficient for the new service release, but does
not affect the company's current services. Therefore, to justify the operator's conclusions,
some conditions are described as follows:

• If one activity related to an old machine fails, the system must roll back all other old
  machines to the previous consistent and working configuration. In this case, one machine
  configured differently from the others may make the customer's services unavailable.
  Performing the rollback just for the old machines, in case of one failure, guarantees at
  least the cluster functionality with a better memory specification;

• If the new machine installation/configuration fails, the system must identify and roll back
  its actions. In this case, the change plan will not be interrupted and the access cluster
  will work as before. The success of this activity guarantees the company's new service
  release, since the new machine configuration can elevate ELT to 45%. Otherwise, the support
  for customer access demands, which is required by the new service release, cannot be
  assured.

For these reasons, expressing atomicity at the activity level guarantees at least the access
cluster availability. In this case, the single activity related to the new machine is set as
a participant of "Atomic Group 1" (AG1), and all activities related to the old machines, in
turn, are set as "Atomic Group 2" (AG2).
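
The grouping into AG1 and AG2 can be illustrated with a minimal Python sketch. It is not the
prototype's BPEL implementation: the names Activity, ChangePlan and build_plan, the
sequential execution, and the in-memory state dictionary are assumptions made only for
illustration. The sketch undoes, in reverse order, the activities attempted so far in the
group where a failure occurred, and leaves the other atomic group untouched.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Activity:
    """One change-plan activity with a hypothetical do/undo pair."""
    name: str
    group: str                  # atomic group, e.g. "AG1" or "AG2"
    do: Callable[[], bool]      # returns False to signal a failure
    undo: Callable[[], None]    # assumed to reverse whatever do() changed


@dataclass
class ChangePlan:
    activities: List[Activity]
    log: List[str] = field(default_factory=list)

    def run(self) -> Dict[str, str]:
        """Run all activities; roll back only the atomic group that failed."""
        attempted: Dict[str, List[Activity]] = {}
        failed: Set[str] = set()
        for act in self.activities:
            if act.group in failed:
                continue  # group already rolled back, skip the rest of it
            attempted.setdefault(act.group, []).append(act)
            if act.do():
                self.log.append(f"do {act.name}")
                continue
            self.log.append(f"fail {act.name}")
            failed.add(act.group)
            # Undo in the reverse order of the activities attempted so far.
            for prev in reversed(attempted[act.group]):
                prev.undo()
                self.log.append(f"undo {prev.name}")
        return {g: "rolled back" if g in failed else "committed"
                for g in sorted(attempted)}


def build_plan(failing: Set[str]) -> Tuple[ChangePlan, Dict[str, str]]:
    """Build the case-study plan: one new machine (AG1), ten old ones (AG2)."""
    state: Dict[str, str] = {}

    def make(name: str, group: str) -> Activity:
        state[name] = "original"

        def do() -> bool:
            state[name] = "changed"      # the change is (partially) applied
            return name not in failing   # simulated failure

        def undo() -> None:
            state[name] = "original"

        return Activity(name, group, do, undo)

    acts = [make("new-machine", "AG1")]
    acts += [make(f"old-{i}", "AG2") for i in range(1, 11)]
    return ChangePlan(acts), state


if __name__ == "__main__":
    plan, state = build_plan({"old-3"})
    print(plan.run())
    print(plan.log)
```

In this sketch a failure on one old machine rolls back only AG2, so the simulated new machine
keeps its change while every old machine returns to its original configuration.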


B. Failure Cases & Analysis

In the first operation performed by the change deployer, which is a database service update
(Figure 6), suppose a failure occurs in the MySQL Enterprise Edition installation. The change
deployer then invokes the rollback engine, which immediately executes the following actions:

1) Request the database server to roll back the installation: wipe out all installation files
   related to the MySQL Enterprise Edition, resulting in the previous untouched MySQL server;
2) Request the database server to undo the backup process;
3) Request the database server to start MySQL again.

Note that the backup data activity (Figure 6) does not affect the system for the rollback
procedure, meaning that undoing the backup or keeping it makes no difference.

The second change plan operation, which has the goal of increasing the access cluster
capability, has activities marked in atomic groups as presented in Figure 6. Supposing a
situation where the configuration of the first old machine fails (activity with id number 3),
the change deployer identifies the failure inside the BPEL constructs and invokes the rollback
engine, which then orders the atomic group to roll back by remotely invoking rollback
operations on all 10 old machines. The rollback requests happen in parallel, since the change
plan activities were also originally expressed in parallel.

Supposing that both the first old machine and the new machine installation/configuration
fail, the change plan will not succeed at all and the final result will not be sufficient to
allow the release of the new service. Despite that, the failed requested change does not lead
to the unavailability of the other services previously available, because the access cluster
is still up. The same case occurs if the new machine installation/configuration fails but the
old machines' performance configuration succeeds: the access cluster is still up, but ELT was
not increased enough to release the new service. Finally, supposing that the new machine
installation/configuration succeeds, the result of the performance options configuration on
the old machines will not affect the new service release, since the new machine alone can
raise ELT to 45%.

In terms of scalability, the prototype generated the BPEL document in less than 1 second for
the described case. The deployment execution time, on the other hand, always depends on the
actions of the activities and on how the change plan was created.

VI. CONCLUSION AND FUTURE WORK

In this paper we have discussed how organizations implement their changes, and the importance
of having a change management system able to identify failures and roll back the managed
system. Most organizations have a complex IT infrastructure where changes are deployed by
humans, increasing the failure probability. For this reason, we have proposed a solution to
express atomic activities in a change plan, informing the system which activities must be
rolled back to a previous consistent state in case of failure. Also, in our solution, there
are three ways to express atomicity: at the RFC, operation, and activity levels. This can
make the system more specialized, allowing not only the operator to express atomicity at a
low level, but also the change manager to define which actions of a given RFC must be rolled
back.

The obtained results proved coherent in the sense that they guarantee what the change manager
or operator expressed. The use of atomic groups showed that activities can be handled as a
single transaction, without affecting activities of other atomic groups. Moreover, the case
study showed that the rollback solution helps change managers and operators define atomicity
in a more consistent way.

Since our main objective was to provide rollback support in a change plan, making it possible
to return to a previous consistent state through a rollback engine, we have not focused on
expressing compensation activities or predicting atomicity restrictions. In future work, we
intend to improve the rollback model to express and perform compensation activities, which
can be very useful to achieve the RFC goal even with failures at different levels (i.e.,
failures in rollback actions), and to predict atomicity restrictions due to rollback planner
misuse.

REFERENCES

[1] ITIL, "Information Technology Infrastructure Library (ITIL)," Office of Government
    Commerce (OGC), 2006. [Online]. Available: http://www.itil.co.uk/
[2] IT Infrastructure Library, ITIL Service Support Version 2.3. Office of Government
    Commerce, 2000.
[3] A. Keller, J. L. Hellerstein, J. L. Wolf, K.-L. Wu, and V. Krishnan, "The CHAMPS system:
    Change management with planning and scheduling," in 9th IEEE/IFIP Network Operations and
    Management Symposium (NOMS 2004), Seoul, Korea, April 2004, pp. 395–408.
[4] C. Bartolini, J. Sauvé, and D. Trastour, "IT service management driven by business
    objectives - an application to incident management," in 11th IEEE/IFIP Network Operations
    and Management Symposium (NOMS 2006), Vancouver, Canada, April 2006, pp. 45–55.
[5] R. Rebouças, J. Sauvé, A. Moura, C. Bartolini, and D. Trastour, "A decision support tool
    to optimize scheduling of IT changes," in 10th IFIP/IEEE International Symposium on
    Integrated Network Management (IM 2007), Munich, Germany, May 2007, pp. 343–352.
[6] OASIS Standards, "Business process execution language version 2.0," Apr. 2007,
    http://docs.oasis-open.org/wsbpel/2.0/.
[7] R. Enns, "NETCONF Configuration Protocol," RFC 4741, Internet Engineering Task Force,
    Dec. 2006.
[8] R. S. Alves, L. Z. Granville, M. J. B. Almeida, and L. M. R. Tarouco, "A Protocol for
    Atomic Deployment of Management Policies in QoS-Enabled Networks," in 6th IEEE
    International Workshop on IP Operations and Management (IPOM 2006), ser. Lecture Notes in
    Computer Science, G. Parr, D. Malone, and M. Foghlú, Eds., vol. 4268. Springer, 2006,
    pp. 132–143.
[9] A. Andrzejak, U. Hermann, and A. Sahai, "FEEDBACKFLOW-An Adaptive Workflow Generator for
    Systems Management," in 2nd International Conference on Autonomic Computing (ICAC 2006).
    IEEE Computer Society, 2005, pp. 335–336.
[10] The Workflow Management Coalition Specification, "Workflow Process Definition Interface -
    XML Process Definition Language." [Online]. Available:
    http://www.wfmc.org/standards/docs/TC-1025_10_xpdl_102502.pdf.
[11] The GGF CDDLM working group, "Configuration Description, Deployment, and Lifecycle
    Management." [Online]. Available: https://forge.gridforum.org/projects/cddlm-wg.
[12] W3C Note, "Web Services Description Language 1.1 (WSDL)." [Online]. Available:
    http://www.w3.org/TR/wsdl
[13] Active Endpoints, "ActiveBPEL Open Source Engine," http://www.activebpel.org.
[14] G. Candea, A. B. Brown, A. Fox, and D. A. Patterson, "Recovery-oriented computing:
    Building multitier dependability." IEEE Computer, vol. 37, no. 11, pp. 60–67, 2004.
