
An Optumis White Paper

A Paradigm for Integrated IT Systems Management

Sanjay Raina

December, 2010
Contents
Introduction
Problem Statement
Current Practice
Optumis Concerto
Implementation
Business Benefits
Summary
References

Introduction
IT Systems Management tools and technologies continue to be crucial to efficient delivery of IT services. The IT landscape is evolving all the time and there are increasing demands placed on the management of IT. Systems Management tools have failed to keep pace with these developments and often fail to deliver the potential value as promised by the vendors. The main reason behind this is that IT Systems Management tends to be disjointed and silo based. This white paper presents a holistic approach to IT Systems and Service Management. The approach advocates a declarative, data-driven framework for specifying management structures and policy. This, combined with the notion of abstraction of management data, results in an integrated paradigm that allows stakeholders to make effective decisions about the complex IT infrastructure and applications in a coherent and consistent way.

Problem Statement
Today businesses rely heavily on IT to keep the
revenue streams flowing and to run the day to
day back office functions. It is therefore more
critical than ever that the tools and technologies
that manage the IT infrastructure are effective in
delivering high levels of availability and
productivity whilst keeping the costs down.

The Systems Management tools in use today
have their origins in the days of distributed, client
server technology. Since then the IT landscape
has undergone considerable evolution to N-tier,
Service Oriented Application architectures and
more recently towards Cloud based utility
computing delivery models. However, IT Systems
Management tools and practices have not kept
pace and there have not been corresponding
improvements in Systems Management
Copyright ©Optumis Ltd.
technologies. While the management tools were adequate in the distributed computing era of twenty years ago, today's infrastructure is considerably more complex and more dynamic.

The Systems Management tools marketplace is dominated by the Big Four vendors, each with an arsenal of products [1][2][3][4] aimed at covering the entire Enterprise Management space and purported to solve the challenging problems of end to end Systems Management. In practice, however, these tools fail to deliver the promised value and have proved to be difficult and costly to implement and integrate. Often, tools from the same vendor have poor integration capabilities and the only thing they share with each other is the brand name!

When it comes to deploying Enterprise Management tools, organizations typically implement these as silo functions [7] managed by several specialized teams. There may be separate monitoring teams for servers, applications and databases. Other teams might be responsible for managing desktops, network and security. Yet more teams may be responsible for service desk and operations functions. That there are so many different teams performing IT management functions is not a problem in itself, but the fact that there is no coherent framework or mechanism to orchestrate and coordinate the activities of these disjointed functions creates a number of problems. Firstly, there is the problem of different teams and tools using different proprietary databases and data formats to store management data, with the result that integration, although not impossible, is extremely laborious. Often, organizations have to embark upon costly integration projects to solve particular problems that span multiple functional areas. Secondly, each team or management function is driven by a narrow goal of delivering their specific piece without any investment of thought on how their piece impacts other functions. Thirdly, the disjointed approach prevents the fostering of higher level abstractions that can create value simply by cross-pollination of management information from different sources. IT organizations spend a disproportionate amount of time just to keep things ticking over and fire fighting, leaving very little resource to spend on adding value to their service offering.

Current Practice
In the past, vendors have responded with more tightly integrated Framework based products, such as the older generation of IBM Tivoli [5] and HP OpenView [3]. These products use an underlying layer built out of technologies such as CORBA to improve the integration capabilities. The problem with such an approach is the lack of flexibility and vendor lock-in that prevents the use of best of breed tools. Consequently, the Framework tools have nowadays lost favor and most vendors now espouse a best of breed approach.

More recently, organizations have sought to make use of process frameworks such as ITIL [10] to improve efficiencies. Such frameworks provide the ability to organize people, processes and technology so that IT management is optimized. However, while some organizations have benefited from the adoption of such practices, the benefits have been limited because, in most cases, it is seen as an overlay of a governance framework and doesn't fundamentally alter the way various technical components of IT Systems Management interact with each other. What is required is a fundamental rethink of the
structure and organization of Enterprise Management tools and techniques.

Optumis Concerto
The problems and challenges faced by IT Systems Management are no different from those faced in the past by other domains in IT. The complexity of the resources to be managed and the tools that manage them is no different from the complexity of the hardware and devices used in the platforms for developing and running applications. Likewise, the problem of interoperability between Systems Management tools and functions is no more severe than the problem of communication in a distributed computing environment.

We have taken some of the most common techniques in computing that are in widespread use and applied them to Enterprise Management. The intent is to develop holistic solutions by taking a broader view of the Enterprise Management problem rather than point solutions for specific problems.

Enterprise Management as an abstract machine
The principle of abstraction has long been used in computing as well as other fields to solve the problem of complexity. Operating systems use abstraction to hide details of the underlying hardware and provide a collection of common interfaces and services that work on a variety of hardware through the use of device drivers. Compilers take abstraction further by using an intermediate target abstract machine (e.g. Java Virtual Machine). An interpreter or assembler then converts the intermediate code to machine language. As a result, an application developer can build increasingly rich applications without worrying about the specifics and compatibility of underlying hardware or software. Before the advent of operating systems and compilers, life was tough for the application developer, as programming involved getting to grips with the complexities and workings of the underlying hardware and devices. Abstraction is also used in network communication, where successive layers of network protocols provide a more application centric view of communicating endpoints than offered by the underlying network components. Without networking protocols, developing internetworked applications would mean having to know the details of all components of the underlying network between two communication endpoints - an impossible task.

Enterprise Management can benefit from abstraction in the same way that application development and network programming have. An Enterprise Management Abstract Machine (EMAM) allows higher level management applications and tools to be built quickly, without intimate knowledge of the underlying domain specific tools. Fig. 1 below shows layered application and network stacks along with a comparable Enterprise Management stack.

FIGURE 1. ENTERPRISE MANAGEMENT STACK COMPARED WITH APPLICATION DEVELOPMENT AND NETWORK PROTOCOL STACKS

Like the counterpart layers in the application and network stacks, each layer in the Enterprise Management stack supports the use of
constructs that use appropriate abstraction mechanisms to ensure they are independent of the specific characteristics of the layers below.

The lowest layer consists of element managers and can be compared to the device driver layer in application and network stacks. It comprises agents that produce raw event data from instrumented applications and infrastructure components, as well as agents that perform command and control of managed resources. This layer abstracts the specifics of element managers into common constructs that offer a tool-agnostic view of management data. For instance, one could specify that a disk be monitored in abstract terms as:

monitor(DISK, PctSpaceUsed, 95, Critical, 90, Warning)

This construct is then translated by an interpreter to a tool specific instruction using the API of the specific tool. Management operations can now be specified in common terms without worrying about the representation in underlying tools. The tools can now be replaced or augmented without affecting the higher level applications; all that is required is a change to the interpreter.

The same principle is applied to the subsequent layers shown above. The next layer provides an aggregated view of the management data to applications. At this layer, management data from multiple element managers can be combined to provide a more analytical interpretation of the data. Note that the context is still technology related. The next layer, titled Business Service Management, consists of abstraction that provides a business and service centric view of management data. The topmost layer provides high level reports and dashboard views as well as interfaces to data marts, primarily for the consumption of business analysts and senior management.

Cooperating state machines
Contemporary Systems Management tools are programmed using policies and rules to model the various activities within a management function. Unlike data processing or determinate programming, Enterprise Management is essentially reactive in nature and is best represented as an event-driven system. State machines are an ideal way to implement complex event driven functions and are characterized by a collection of states and state transitions that occur in response to events. State machines are typically represented by a state transition diagram as shown in Fig. 2 below.

FIGURE 2. A STATE TRANSITION DIAGRAM

Due to the large number of states and state transitions, representing a complex end to end Enterprise Management system as a single state machine is an impossible task. In the past, techniques such as State Charts [6] have been developed to overcome the state explosion problem. We have used the concept of cooperating state machines. Each state machine represents one Enterprise Management function or process and they link together to form an end to end model. Fig. 3 below shows a chain of state
machines representing the detection, reporting and resolution of a fault.

Data manipulation
Efficient manipulation of data is an important aspect of any abstract or physical machine. Operands are commonly used in an instruction set (of an abstract or real machine) and allow instructions to perform operations efficiently. Management data tends to be passed around quite frequently amongst various components in a Systems Management solution and it is essential that the data is optimally formatted and structured. This is addressed by employing two other concepts widely used in general computing: normalization and encapsulation. Normalization refers to the transformation of the structure of the management data into a canonical form. Without normalization, considerable effort has to be expended in interpreting the data emanating from various management tools.

Encapsulation is most prominently used by the TCP/IP protocol suite to provide abstraction of network protocols and services. As shown in Fig. 4 below, data packets are encapsulated with headers at each layer. The TCP layer adds a TCP/UDP header to identify the source and destination access point. The IP header adds routing information and the data link layer adds information about the physical media. Each layer acts as a provider of service that is consumed by the layer above.

[Figure 4: data encapsulated with headers at the transport, internet and link layers]

A similar approach can be applied to Enterprise Management data that is successively enriched by layers in the stack. Each consuming layer enriches the data further before providing it to the next layer.

FIGURE 5. EVENT MANAGEMENT DATA BEING

Enrichment involves filling in the missing information in the normalized event format as described above. The information can be supplied by external information sources such as the CMDB, an operational data store or the Incident Management database.

The standardization and enrichment of management data provides a number of benefits:
• Management Systems tend to generate large volumes of data, much of which is noise. This data needs to be aggregated and correlated to pin-point the root cause. Standardization of management
data formats plays a crucial role in this regard. The standardized format makes matching of events efficient. The rules for duplicate detection and suppression become simplified. Detection and prevention of event storms is also simplified. It is also possible to apply more granular rules, e.g. one can put very specific alerts from a particular resource or from a whole datacenter into maintenance.
• Due to the added context information available, it is easier to perform business impact management. Enrichment of management data enables more accurate and automated processing of events within a management system. New, service impacting events can be generated based on location or service information from the CMDB or Incident information from the Incident database.
• Management data often traverses a number of boundaries when various functions are performed. The information conveyed by the data is often interpreted by a multitude of systems and personnel. It helps a great deal if the information being passed around is consistently

Declarative, data-driven programming
Most programs are written in an imperative paradigm where the developer instructs the computer how to get a certain task done. In declarative programming, on the other hand, the developer simply states what is to be achieved, and leaves it up to the system to get the job done. XML and related markup languages are examples of the declarative paradigm. Specialized configuration files can also be considered declarative, and even though they are not programming languages, they do enable computation based on what rather than how. Another example of a declarative language is Prolog, where programs are specified as facts and rules in a knowledge base. An inference engine then attempts to find solutions based on the rules and facts.

Enterprise Management tools are generally programmed in an imperative manner using proprietary rule bases and databases. When implementing Enterprise Management solutions, a significant amount of effort is spent on encoding the control flow, i.e. specifying how particular tasks are to be accomplished. We advocate a declarative approach where the emphasis is on what rather than how. So, rather than specifying how to monitor a disk in a particular tool, using tool specific data structures, we can specify the monitoring parameters in an abstract form, as shown below.

<DiskThresh>
<Hostname>Ferrari</Hostname>
<Diskname>C:</Diskname>
<PctUsedWarn>90</PctUsedWarn>
<PctUsedCrit>95</PctUsedCrit>
</DiskThresh>

A driver component then takes this abstract notation and converts it into tool specific instructions. The declarative approach can be applied right across the board. The fragment below shows how an alert matching certain criteria can be specified to be routed to a resolver group.

<IncidentProfile>
<hostname> Ferrari </hostname>
<resource> DISK </resource>
<threshname> PctUsed </threshname>
<threshop> LessThan </threshop>
<threshval> 95 </threshval>
<resolver_group>GTI_GB_WENG</resolver_group>
<priority>P2</priority>
<scim>Server OS, EMEA Intel</scim>
</IncidentProfile>

Similarly, the fragment below shows how an alert matching certain criteria can be specified to be suppressed during a maintenance window.

<MaintenanceMode>
<hostname> Ferrari </hostname>
<resource> DISK </resource>
<threshname> PctUsed </threshname>
<threshop> LessThan </threshop>
<threshval> 95 </threshval>
<suppressstart> 3-Aug-2010 11:00:00 </suppressstart>
<suppressend> 27-Dec-2010 12:00:00 </suppressend>
<suppressday>Sunday</suppressday>
<suppresshour>05:00--11:00</suppresshour>
</MaintenanceMode>

This has a major advantage in that the policies and rules for management have to be specified just once. The underlying tools can then be replaced at any time without having to rewrite the policies and rules for the new tool. Integration to various tools is done via SOAP/WSDL or tool specific APIs.

Another advantage is that the management data in declarative form is easier to understand and you don't need specialists in the different tools to manage and maintain the management data. The data can be managed by a wider section of the IT service delivery organization rather than just the specialists. Finally, there is the advantage that all data can now be made available to personnel, based on their role, for configuration and reporting purposes. A user can update monitoring, maintenance windows, enrichment data, Incident resolver group information, notifications calendar and calling tree, all in one place.

Implementation
Although the Enterprise Management Abstract Machine covers a wide range of functions, its implementation is expected to be a veneer of software that runs on top of existing tools and systems. We do not intend to reinvent well established functions of Systems Management and most of the heavy lifting is expected to be done by existing tools and systems. This section describes how the key aspects described in the previous section can be realized. Two scenarios are outlined to demonstrate the use of the concepts discussed.

In common with other abstract (and indeed physical) machines, the operation of EMAM is characterized by:
• A workflow component that executes the logic of the computation being performed. This may take the form of a program of instructions compiled and processed by a CPU or, in the case of an operating system, a sequence of processes being scheduled from a work queue. In the case of EMAM, program execution takes the form of a sequence of state machines.
• Operands used by the instructions in a program. These take the form of local storage (registers or stack) in conventional machines. In the EMAM these operands are typically alert data, incident records, change records etc.
• Reference data is used by the workflow component. This is usually general purpose storage in a conventional machine where the results of computation are stored. In the EMAM, the reference data is typically operational configuration data stored in a configuration/operational data store.

State-machine Workflows
A program in the EMAM is expressed in the form of a workflow of state machines. The control of the program propagates through state machines, with one state machine triggering another. The programming of the EMAM takes place by specifying the sequence in which these state machines are triggered. The program is specified declaratively, in the form of a database table or an XML based markup. Table 1 below shows a sample workflow specification.

State Machine  Current State  Event Condition  Next State  Next State Machine
Monitor        Idle           Check            Checking    Monitor
Monitor        Checking       Breach           Alerted     Monitor
Monitor        Checking       NotBreached      Idle        Monitor
Monitor        Alerted        DupDetected      Duplicate   Monitor
Monitor        Alerted        Not Dup          Unique      Normalize
Monitor        Duplicate      Drop             Idle        Monitor
Normalize      "              "                "           "
"              "              "                "           "

The above specification can be represented in an XML markup such as XAML. XAML is used by the Microsoft Workflow Foundation product [9]. The fragment below shows a XAML representation of a state machine.

<StateMachineWorkflowActivity
x:Class="EMAMWorkflow.Monitor"
Name="Monitor"
InitialStateName="Idle"
xmlns="
6/xaml/workflow"
xmlns:x="
006/xaml">
<StateActivity x:Name="Idle">
<EventDrivenActivity
x:Name="CheckThreshold">
<HandleExternalEventActivity
x:Name="HandleCheckThreshold"
<CodeActivity
x:Name="DoCheckCode"
ExecuteCode="DoCheckCode_ExecuteCode">
</CodeActivity>
<SetStateActivity
x:Name="SetChecking"
TargetStateName="Checking">
</SetStateActivity>
</EventDrivenActivity>
</StateActivity>
…………

The above XAML code can be loaded directly into Microsoft Workflow Foundation to appear as in Fig. 6 below.
FIGURE 6. XAML BASED WORKFLOW IN MICROSOFT

Operands
Management data that is processed by state machines comes in various types. Examples include:
• Alert
• Incident
• Change record
• Service request
• Provisioning
As the state machine workflow progresses, the operands are transformed into more specific and more relevant management data.

Reference data
In addition to the operands, the state machines also use more permanent, reference data. Such data may include IT Service Management data in a CMDB or more transient data in another operational data store. The operational management data store contains the declarative data previously mentioned and meta data that may be mapped on, directly or indirectly, to tool specific data as shown in Figure 7 below.

Scenario 1: Fault detection and

The following scenario describes a fault being detected by the monitoring system, a problem ticket being cut and then the problem remediated following a change management procedure. Each step corresponds to a state machine and describes the operation performed along with the operand and the reference data used.

Table 2 below outlines the sequence of state machines that constitute this scenario, followed by the description of activities performed in each step. Note that the activities within each workflow may be automated or manual.

State Machine Operand Ref Data
Detect fault Alert Thresholds, Alert history
Normalize Alert Normalization map
Correlate Alert Alert history
Enrichment Alert CMDB, OMDB
Problem ticket Alert, Problem Alert Ticket mapping
Escalate Problem notification Escalation table
Notification Problem Ticket, Page, Email, SMS Notification data, Call tree
Change Request Change Ticket CMDB
Update Monitoring
Close Change, Problem Tickets Change, Problem Tickets
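To make the first row of Table 2 concrete, the sketch below shows how fault detection could be driven purely by declarative threshold data, reusing the DiskThresh fragment shown earlier in this paper. It is only an illustration under stated assumptions: `load_threshold` and `detect_fault` are hypothetical names, the alert field names are invented for the example, and treating a reading equal to a threshold as a breach is an assumption, since the paper does not say whether the comparison is inclusive.

```python
import xml.etree.ElementTree as ET

# The DiskThresh fragment from the paper, used here as reference data.
DISK_THRESH_XML = """
<DiskThresh>
  <Hostname>Ferrari</Hostname>
  <Diskname>C:</Diskname>
  <PctUsedWarn>90</PctUsedWarn>
  <PctUsedCrit>95</PctUsedCrit>
</DiskThresh>
"""


def load_threshold(xml_text):
    """Parse a DiskThresh element into a plain dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        "hostname": root.findtext("Hostname").strip(),
        "diskname": root.findtext("Diskname").strip(),
        "warn": float(root.findtext("PctUsedWarn")),
        "crit": float(root.findtext("PctUsedCrit")),
    }


def detect_fault(threshold, pct_used):
    """Return an alert dict if a threshold is breached, otherwise None.

    Assumption: a reading equal to the threshold counts as a breach.
    """
    if pct_used >= threshold["crit"]:
        severity = "Critical"
    elif pct_used >= threshold["warn"]:
        severity = "Warning"
    else:
        return None
    return {
        "hostname": threshold["hostname"],
        "resource": "DISK",
        "instance": threshold["diskname"],
        "threshname": "PctUsed",
        "value": pct_used,
        "severity": severity,
    }


if __name__ == "__main__":
    threshold = load_threshold(DISK_THRESH_XML)
    print(detect_fault(threshold, 96.5))
```

In this sketch, changing a threshold is a data edit to the XML and requires no change to `detect_fault`, which mirrors the point made later about threshold updates being a simple data change.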
Detect fault
A monitoring tool samples metrics from a resource and compares them against the thresholds table. An alert is generated if a threshold condition is breached. Thresholds are specified in a database and may be made available synchronously within the tool, for instance via a configuration file or programmatically through tool specific APIs. The generated event follows a tool specific format. The figure below shows fields from an event generated by an agent and those from an SNMP trap, both depicting the same fault.

Normalize
In this step, the event fields are normalized into a canonical form. The idea is that no matter what tool or method is used to detect the fault, its representation is the same, in an abstract form and not dependent on the underlying tool. Table 3 below shows alert data in normalized form.

The fields can now be used consistently to perform matching and analytics at various levels. These fields serve as a key in matching against the different types of management data.

Correlation and root cause analysis
In this step, alerts are consolidated and the root cause determined. Existing alerts are searched to determine if this is a repeat symptom of an underlying problem. Lookup tables are used to perform correlation. Fields defined previously are used to perform the match against the lookup tables. Also, maintenance windows may be consulted at this stage using the same matching criteria and the alert dropped if the matched resource is under maintenance.

Enrichment
In this step, additional fields are added to the alert. These fields can be technology related to assist operations or business related to add business context. Table 4 below shows the location and service fields enriched.

Create Problem Ticket
Based on the mapping table, and using the alert data, a new problem ticket is created. The problem ticket follows a standard form just like the alert, to ensure consistency. Whether an alert was generated automatically as above or the ticket created manually by a Service Desk operator, the representation should be the same. The problem ticket now forms the basis for tracking the alert and is used when performing escalation etc.
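The normalize, correlate, enrich and ticket creation steps above can be sketched as a chain of small functions that each take the alert operand plus some reference data. This is a hedged illustration only: every function name, every tool-specific field name (`node`, `agentAddr` and so on) and the CMDB contents are invented for the example, because the paper's event field figure and Tables 3 and 4 are not available here. Only the resolver group, priority and host values are taken from the IncidentProfile fragment shown earlier.

```python
# Hypothetical normalization maps: tool-specific field -> canonical field.
AGENT_MAP = {"node": "hostname", "object": "resource", "sev": "severity"}
TRAP_MAP = {
    "agentAddr": "hostname",
    "varbindResource": "resource",
    "trapSeverity": "severity",
}


def normalize(raw_event, field_map):
    """Rename tool-specific fields to canonical names; drop unmapped fields."""
    return {
        canon: raw_event[tool]
        for tool, canon in field_map.items()
        if tool in raw_event
    }


def match_key(alert):
    """Canonical fields used as the lookup key in every later step."""
    return (alert["hostname"], alert["resource"])


def correlate(alert, history, maintenance):
    """Drop duplicates and alerts under maintenance; otherwise record it."""
    key = match_key(alert)
    if key in maintenance or key in history:
        return None
    history.add(key)
    return alert


def enrich(alert, cmdb):
    """Add location and service context from a CMDB-like lookup table."""
    return {**alert, **cmdb.get(alert["hostname"], {})}


def create_ticket(alert, ticket_map):
    """Create a problem ticket from the alert using a mapping table."""
    profile = ticket_map.get(match_key(alert))
    if profile is None:
        return None
    return {"summary": f"{alert['resource']} fault on {alert['hostname']}",
            "alert": alert, **profile}


if __name__ == "__main__":
    agent_event = {"node": "Ferrari", "object": "DISK", "sev": "Critical"}
    history, maintenance = set(), set()
    cmdb = {"Ferrari": {"location": "London", "service": "Payroll"}}  # invented
    ticket_map = {("Ferrari", "DISK"): {"resolver_group": "GTI_GB_WENG",
                                        "priority": "P2"}}
    alert = correlate(normalize(agent_event, AGENT_MAP), history, maintenance)
    print(create_ticket(enrich(alert, cmdb), ticket_map))
```

Because both the agent event and the SNMP trap are reduced to the same canonical dictionary, the later steps in this sketch never need to know which tool raised the alert.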
Escalation
Escalation is a core function of the Incident Management process. The escalation function can be performed on the problem ticket using escalation data in a table such as below.

TABLE 5. ESCALATION TABLE
Priority: P1; P2; P3; P4
Level of Impact: High, Critical, Fatal; Production severely impacted; Degraded operations; Minimal impact
2 hrs: First response
4 hrs: Work around; First response
24 hrs: Mgmt notification; Work around; First response
48 hrs: Mgmt notification; Work around; First response
1 wk: Resolution; Mgmt notif; Work around
2 wks: Resolution
3 wks: Resolution
Release: Resolution

Notification
Based on a calling tree and calendar information, the problem ticket can generate notifications.

Create Change Ticket
Once the right personnel have been notified and the resolution identified, a change record is created to perform the change. In our example here, the change involves a change to the monitoring thresholds as it was deemed to be a spurious alert. The change request follows the change management process, including appropriate reviews, approvals and assignment of change implementers.

Update monitoring threshold
Once the change has been approved and the implementer notified, the monitoring threshold is updated in the database. Note that no change has been made to the monitoring tool or any rule sets, and such a change can be performed by a non-specialist since it is a simple data change.

Close change, incident and alert
The final step in this sequence is to mark the Change as implemented and close the Incident. The workflow will automatically close or clear the alert in the monitoring tools.

Scenario 2: VM server provisioning
This scenario depicts another common situation of requesting and provisioning a virtual server. As with the previous scenario, the management solution consists of a series of state machines.

The state machine workflow is outlined in Table 6 below with the associated operand and reference data.

State Machine                  Operand                          Ref Data
Create Request                 Service Request
Check Service Catalog          Service Request                  CMDB, Service Catalog
Check capacity                 Service Request                  CMDB
Create Change Request          Service Request, Change Ticket   CMDB
Provision VM                   Change Ticket                    CMDB
Close Change, Service Request  Change Ticket, Service Request

Create request
A Service Request is created manually by a requester. As before, the request is turned into a standardized form so as to make its processing easier. This state machine workflow routes the request to individuals in the organization for action, alerts the manager as necessary when the current owner does not respond to the request, and escalates or transfers the request to the next level of support. At this stage only a few fields such as request number and request owner are populated in the service request.

Supplement information from Service Catalog
This step looks up the Service Catalog to fill in the details about the servers. This step is comparable to the Enrichment step in the previous scenario. Additional attributes include response deadline, server asset data etc.
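As a small sketch of the two steps just described, the code below creates a sparsely populated service request and then supplements it from a Service Catalog lookup. Everything here is hypothetical: the `ServiceRequest` fields, the catalog item name `vm-small`, its attributes and the request number are invented for illustration and are not taken from any Optumis product.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRequest:
    """Standardized service request; only a few fields are set at creation."""
    number: str
    owner: str
    catalog_item: str
    attributes: dict = field(default_factory=dict)


# Hypothetical Service Catalog reference data, keyed by catalog item.
SERVICE_CATALOG = {
    "vm-small": {
        "cpu_cores": 2,
        "memory_gb": 4,
        "storage_gb": 50,
        "response_deadline_hrs": 24,
    },
}


def supplement_from_catalog(request, catalog):
    """Fill in server details for the requested catalog item.

    Raises KeyError if the item is not in the catalog, so that an
    unqualified request is not passed on to the capacity check.
    """
    entry = catalog.get(request.catalog_item)
    if entry is None:
        raise KeyError(f"unknown catalog item: {request.catalog_item}")
    request.attributes.update(entry)
    return request


if __name__ == "__main__":
    req = ServiceRequest(number="SR-0001", owner="requester",
                         catalog_item="vm-small")
    print(supplement_from_catalog(req, SERVICE_CATALOG).attributes)
```

As with the alert in Scenario 1, the request operand becomes progressively richer as it moves along the workflow, while the catalog itself stays as editable reference data.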
Check capacity
Once the service request is sufficiently qualified, the next step checks there is adequate capacity on the physical infrastructure. Checks are performed to determine CPU, Memory and Storage capacity and appropriate personnel notified, if necessary.

Create and manage Change Ticket
Once the right personnel have been notified and the checks performed, a Change Ticket is created to perform the change.

Provision VM
This is essentially a manual step, in which the implementer creates the Virtual Machine.

Close change and service request
The final step in this sequence is to mark the Change as implemented and close the corresponding Service Request.

Business Benefits
The integrated approach to Enterprise Systems Management provides a number of key related benefits to the business.
• The solution enables optimal use of technology and human resources to deliver significant cost reduction in managing IT systems.
• Standardisation and systematic reuse of processes and procedures leads to increased automation and efficient practice.
• The solution significantly improves productivity, allowing support staff to improve service delivery and add value rather than constantly fire fighting.

Summary
The approach described in this white paper is based on ideas and principles widely used in general computing to overcome the problem of complexity and inter-operability. The approach results in a more holistic solution to the problem of Enterprise Management. A concept of Enterprise Management Abstract Machine is presented that utilizes state machine workflows and declarative, data-driven programming to decouple management procedures and data from the underlying tools. Such an approach results in a federated management model that enables optimal use of people, processes and technology. Management applications and processes can be implemented quickly and efficiently, without getting bogged down by the mechanics of the tools.

References
[1] BMC Patrol, listing/ProactiveNet-Performance-Management.html
[2] CA,
[3] HP OpenView Operations,
[4] IBM Tivoli,
[5] IBM, Tivoli Management Framework, http://www-
[6] D. Harel. Statecharts: a visual formalism for complex systems. Science of Computer Programming 8:231-274. North-Holland 1987.
[7] Macehiter Ward-Dutton, The New Face of IT Service Management, 2007.
[8] Microsoft System Center Operations Manager, manager.aspx
[9] Microsoft Windows Workflow Foundation, us/library/ms735921(VS.90).aspx
[10] Office of Government Commerce: Best Management Practice, IT Service Management,