
Introduction

Introduction to Realtime Publishers


by Don Jones, Series Editor

For several years now, Realtime has produced dozens and dozens of high-quality books
that just happen to be delivered in electronic format—at no cost to you, the reader. We’ve
made this unique publishing model work through the generous support and cooperation of
our sponsors, who agree to bear each book’s production expenses for the benefit of our
readers.

Although we’ve always offered our publications to you for free, don’t think for a moment
that quality is anything less than our top priority. My job is to make sure that our books are
as good as—and in most cases better than—any book that would cost you $40 or more.

I want to point out that our books are by no means paid advertisements or white papers.
We’re an independent publishing company, and an important aspect of my job is to make
sure that our authors are free to voice their expertise and opinions without reservation or
restriction. We maintain complete editorial control of our publications, and I’m proud that
we’ve produced so many quality books over the past years.

I want to extend an invitation to visit us at http://nexus.realtimepublishers.com, especially
if you’ve received this publication from a friend or colleague. We have a wide variety of
additional books on a range of topics, and you’re sure to find something that’s of interest to
you—and it won’t cost you a thing. We hope you’ll continue to come to Realtime for your
educational needs far into the future.

Until then, enjoy.

Don Jones

Table of Contents

Introduction to Realtime Publishers

Chapter 1: What is Application Performance Management?
More than Just On vs. Off
What Defines Functionality?
What Is Application Performance Management?
The Goal of this Guide
APM by Example
Why Measure Performance?
Measuring Across Domains
Measuring Transactions
APM Optimizes the Application Life Cycle
So, How Does APM Work?
The Role of Visualizations
What Benefits Does APM Provide?
Root Cause Identification
Characterization of Problems
Prioritization of Problem Resolution
Situational Awareness
Better Planning
Critical Applications Require Critical Monitoring

Chapter 2: How APM Aligns IT with the Business
IT and the Business Have Different Goals
Different Responsibilities
Mismatched Priorities and Metrics
No Common Vocabulary
Technologists Rarely See Budgets
Reactive IT Is a Drain on Agility
Understanding IT Maturity
Survival
Awareness
Committed
Proactive
Service-Aligned
Business-Partnership
Why Is Maturity Important?
APM Requires Maturity. APM Creates Maturity.
IT Changes with Each Stage
IT’s Tools Grow More Predictive
IT/Business Alignment Benefits Everyone
…But Isn’t APM Really About the Technology?

Chapter 3: Understanding APM Monitoring
The Evolution of Systems Monitoring
Early Network Management
Simple Availability with ICMP
Richer Information with SNMP
Device Details with the Agent-Based Approach
Situational Awareness with the Agentless Approach
The Impact of Externalities
Direct and Indirect Monitoring
Transaction-Based Monitoring
Transaction Monitoring in Action
Application Runtime Analysis
End User Experience
EUE on the Enterprise WAN
APM = ∑ the History of Monitoring

Chapter 4: Integrating APM into Your Infrastructure
Implementing APM Isn’t Trivial, Nor Is Its Resulting Data
Finding Meaning in Charts and Graphs
The Tiering of Business Applications
Business Applications and Monitoring Integrations
Installing System Agents
Augmenting Agents with Application Analytics
Configuring Devices for Network Analytics
Installing Probes
Measuring Transactions
Overall Service Quality
APM’s “Magic” Is in Its Metrics

Chapter 5: Understanding the End User’s Perspective
Why the End User’s Perspective?
What Is Perspective?
Why the End User?
The Use Cases for EUE
Customer and Multi-Site Perspective
The Impacts of Geography
Internal & External Robot Perspective
Internal User Perspective
Service Provider Perspective
The Role of Transactions in EUE
The C-N-S Spread
The C-N-S Spread Illuminates Environment Behaviors
A Use Case of the C-N-S Spread as Troubleshooting Tool
The Impact of Users Themselves
Leveraging EUE for Improved Application Quality

Chapter 6: APM’s Service-Centric Monitoring Approach
What Is the Service-Centric Monitoring Approach?
But What Is Service “Quality”?
Understanding the Service Model
Component and Service Health
The Service Model
Service Quality
Creating Your Service Model and Implementing APM
Step 1: Selection
Step 2: Definition
Step 3: Modeling
Step 4: Measurement
Step 5: Data Analysis
Step 6: Improvement
Step 7: Reporting
Using Your Service Model
Tuning Monitor States
Eliminating Rapid State Change
The Service Calendar
The Service-Centric Approach Quantifies Quality

Chapter 7: Developing and Building APM Visualizations
Visualizations Are the Core of APM
Useful Visualizations for Every Data Consumer
Service Desk Employees
Administrators
IT Management
Business Executives
Code Developers
End Users
APM Visualizations Bring Quantitative Analysis to Operations

Chapter 8: Seeing APM in Action
APM Helps Avoid the “War Room”
Awareness
Assessment
Assignment
Handoff
Transaction-Level Triage
Infrastructure-Level Triage
Characterization
Handoff
Solution
APM Streamlines the Solutions Process
Visibility
Prioritization
Problem & Fault Domain Isolation
Troubleshooting, Root Cause Identification, & Resolution
Communication with the Business
Improvement
TicketsRus.com—A Day in the Life
Everyone Benefits by Seeing APM in Action

Chapter 9: APM Enables Business Service Management
What Is Business Service Management?
It Starts with the Service Model
The Measurement of “Quality”
How Does One Measure Quality?
Service Levels
KPIs
User Impacts
Revenue Impacts
Real-Time Monitoring Means Real-Time Metrics
The Cost of Poor Quality
The Impact of the Business Calendar
BSM’s Impact on ITIL and Six Sigma Activities
ITIL Integration
Six Sigma Improvement Activities
BSM: The Bottom Line

Chapter 10: Your APM Cheat Sheet
Part 1—What Is APM?
Defining APM, More than “On vs. Off”
Part 2—How APM Aligns with the Business
What Changes with APM?
Part 3—Understanding APM Monitoring
Part 4—Integrating APM into your Infrastructure
Part 5—Understanding the End User’s Perspective
Monitoring from the End User’s Perspective
Where Does EUE Fit?
Part 6—APM’s Service-Centric Monitoring Approach
Flow Up, Drill Down
Part 7—Developing & Building APM Visualizations
Part 8—Seeing APM in Action
Visibility
Prioritization
Problem & Fault Domain Isolation
Troubleshooting, Root Cause Identification, & Resolution
Communication with the Business
Improvement
Part 9—APM Enables Business Service Management
Linking BSM to APM
APM Is Required Monitoring for Business Services

Copyright Statement
© 2010 Realtime Publishers. All rights reserved. This site contains materials that have
been created, developed, or commissioned by, and published with the permission of,
Realtime Publishers (the “Materials”) and this site and any such Materials are protected
by international copyright and trademark laws.
THE MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE,
TITLE AND NON-INFRINGEMENT. The Materials are subject to change without notice
and do not represent a commitment on the part of Realtime Publishers or its web site
sponsors. In no event shall Realtime Publishers or its web site sponsors be held liable for
technical or editorial errors or omissions contained in the Materials, including without
limitation, for any direct, indirect, incidental, special, exemplary or consequential
damages whatsoever resulting from the use of any information contained in the Materials.
The Materials (including but not limited to the text, images, audio, and/or video) may not
be copied, reproduced, republished, uploaded, posted, transmitted, or distributed in any
way, in whole or in part, except that one copy may be downloaded for your personal, non-
commercial use on a single computer. In connection with such use, you may not modify
or obscure any copyright or other proprietary notice.
The Materials may contain trademarks, service marks and logos that are the property of
third parties. You are not permitted to use these trademarks, service marks or logos
without prior written consent of such third parties.
Realtime Publishers and the Realtime Publishers logo are registered in the US Patent &
Trademark Office. All other product or service names are the property of their respective
owners.
If you have any questions about these terms, or if you would like information about
licensing materials from Realtime Publishers, please contact us via e-mail at
info@realtimepublishers.com.


Chapter 1: What is Application Performance Management?
This Web site is experiencing unexpectedly high volume.
Please try again later.

You’ve seen these words before. Perhaps you were buying a just-released book or video
from an online store and the message popped up in the middle of checkout. Maybe you were
trying to get tickets to that important sporting event or that one-night-only concert. What
about when the winter storm of the century hits your airport and thousands of people
scramble at once to find a new flight or a hotel room?

Each of these scenarios is strikingly similar to the others. An IT service struggles to keep up
with the load of its users, until that load finally overwhelms its capabilities. You, the end
consumer, are greeted with a pleasant message that effectively tells you…nothing. You
don’t know what happened. You don’t know the status of the problem or its resolution. You
don’t even know when that suggested “later” may be for you to try again. So you—and
everyone else—find yourself hitting the Refresh button over and over, impatiently waiting
for a better response.

Or, in extreme situations, you stop doing business with that site entirely.

Each of these scenarios is also remarkable in how often it’s seen by the end customers
of Web and other IT-based services. When they work, the IT services used by businesses
are fantastically efficient at serving customers. Yet when they don’t, the result is the
online equivalent of a “Closed for Business” sign hanging on the front door.

You might have experienced situations like this in other places. Perhaps the problem isn’t
in an online e-commerce system. Maybe a similar service outage occurs within an internal
business application whose functionality is critical to getting your job done.
Maybe an underlying IT infrastructure component such as name resolution or the network
itself experiences a problem. The result of that low-level issue manifests itself in ways that
are seemingly unrelated to the actual problem.

The central problem in all of these situations is an inability to properly manage application
performance.


More than Just On vs. Off


If you’re an IT professional reading this guide, you’ve heard these stories many times
before. You know about the host of potential problems that an IT infrastructure can and
does experience on any particular day. You’ve experienced the nightmare situation where a
critical service goes down and no one can track down exactly why. You’ve sat in the “war
room” where highly skilled individuals from every IT domain—network engineers,
systems administrators, application analysts—sit around the conference table for hours
attempting to prove that the problem isn’t theirs. Whether you’re an IT professional or
someone who directs teams of them, you know that any downed service immediately
signals the beginning of a bad day.

The problem is that the idea of a “service that is down” is often much more than a simple
binary answer: on versus off, working versus not working. As you can see in Figure 1.1, IT
services are made up of many components that must work in concert. Servers require the
network for communication. Web servers get their information from application servers
and databases. Data and workflow integrations with legacy systems such as mainframes
must occur. These days, even data storage must be accessible over that same network.

Figure 1.1: An IT service comprises numerous components that rely on each other for successful operation.

If any of those pieces experiences an unacceptable condition—an outage, a reduction in
performance, an inappropriate or untimely response, and so on—the functionality of the
entire service is affected. This can happen in any number of ways:

• The service or hardware hosting the service is non-functional
• A server or service that is relied on is non-functional
• One or more servers or services that make up the service are not performing at an
acceptable level
• An individual component or function of the service is non-functional or is not
performing at an acceptable level
All of these are situations that can and will impact the ability of your critical IT services to
complete their stated mission. No matter whether the actual service itself is down or the
cause is some component that feeds into the functionality of that service, the ultimate
result to the end customer is a degradation in service. The ultimate result to your business
is a loss of revenue, a loss of productivity, and the inability to fulfill the regular needs of
business.

What Defines Functionality?


With all these components in play, actually defining what you consider “a fully functional
service” grows much more complex. Consider some of the questions that you and your
users are forced to ponder when non-nominal behaviors occur:

• If I can’t access the service’s Web site, is it functional?
• If I can access the service’s Web site but am unable to log in, is it functional?
• If I can access and log in but am unable to complete a transaction, is it functional?
• If I complete a transaction but am unable to verify its completion, is it functional?
• If I experience an excessive delay in completing a transaction, is it functional?
• If I can accomplish my tasks but my experience with the service is unsatisfactory, is
it functional?
All of these are valid questions, because the mission of any IT service is to provide an
expected level of value to its customers. That value comes from its ability to functionally
complete a user request. It also comes from the ability to do so within a time frame that is
acceptable to the user. The user must be able to interact with that service with a level of
trust that their transactions have been successful and that they aren’t wasting their time.

Unfortunately, the cultural history of many IT organizations hasn’t been so proactive in
identifying and resolving performance-based issues. The outwardly subjective nature of
these questions is why many IT services have had a tumultuous history with their
customers. Over that short history, IT organizations have been notoriously simplistic in
their view of service functionality: Is the service on today? If yes, then move on to the
next problem.

For many years, this binary view of the IT environment was sufficient for most businesses.
As long as services were available, users could complete their goals. However, as
businesses have grown more and more reliant on IT services as a critical function of the
business, this immature “is it on?” approach to service management is no longer
acceptable.

Cross-Reference
Chapter 2 will explore this history in greater detail and discuss how the
maturity level of an IT organization bears heavily on how it goes about
preparing for and solving problems.

What Is Application Performance Management?


Moving IT beyond its former “on versus off” approach to service management is therefore a
critical step. Smart organizations are accomplishing this through a more comprehensive
approach to defining their services, the quality of those services, and their ability to
meet the needs of users. Application Performance Management (APM) is one systems
management discipline that attempts to provide that perspective. Consider the following
definition:

APM is an IT service discipline that encompasses the identification, prioritization, and
resolution of performance and availability problems that affect business applications.

Organizations that want to take advantage of APM must put in place a workflow and
technology infrastructure (see Figure 1.2) that enables the monitoring of hardware,
software, business applications, and, most importantly, the end users’ experience. These
monitoring integrations must be exceptionally deep in the level of detail they elevate to the
attention of an administrator. They must watch for and analyze behaviors across a wide
swath of technology devices and applications, including networks, databases, servers,
applications, mainframes, and even the users themselves as they interact with the system.

Figure 1.2 shows an example of how such a system might look. There, you can see how the
major classes of an IT system—users, networks, servers, applications, and mainframes—
are centered under the umbrella of a unified monitoring system. That system gathers data
from each element into a centralized database. Also housed within that database is a logical
model of the underlying system itself, which is used to power visualizations, suggest
solutions to problems, and assist with the prioritization of responses.

Figure 1.2: An APM solution leverages monitoring integrations and service model
logic to drive visualizations, prioritize problems, and suggest solutions.

With its monitoring integrations spread across the network, such a system can then assist
troubleshooting administrators with finding and resolving the problem’s root cause. In
situations in which multiple problems occur at once—not unheard of in IT environments—
an APM system can assist in the prioritization of problems. In short, an effective APM
system will drive administrators first to those problems that have the highest level of
impact on users.

So, with this in mind, why are you here? Why read this guide?

The Goal of This Guide


The goal of this guide is to assist you with understanding the concepts and the promise that
such a system can bring. A successfully implemented APM solution can bring that rich level
of monitoring to your business-critical services. With monitoring integrations across every
portion of your IT environment, APM can root out issues as, or even before, they become
problems. An effective APM solution provides your administrators and your business with
a situational awareness deep into the otherwise “black box” IT services you provide for
your users.

These ten chapters are designed to help you understand what APM is and how it helps IT
better align with the needs of its business. They will help you recognize the different types
of APM integrations, including the all-important end users’ perspective, and how the
different integration types tie into that overall picture of service quality. They will give you
an end-to-end understanding of APM in action, including the creation and use of interactive
dashboards for users, administrators, and executives. You’ll also come to understand how
APM’s information gathering brings data that can tie directly into business metrics through
Business Service Management (BSM).

In the next ten chapters, you’ll gain a detailed level of knowledge about the requirements
and the power of APM, centered around the following topics:

• Chapter 1: What Is APM? This first chapter will document the problem with
today’s traditional monitoring solutions and explain why APM is an effective
answer. It will introduce APM and show where it fits into the environment.
• Chapter 2: How APM Aligns IT with the Business. APM provides great value, but
only when an organization’s culture supports what APM has to offer. IT
organizations must have a certain “maturity” if they are to get the highest levels of
value from APM. As those organizations mature, they will at the same time find a
greater alignment between their goals and the goals of their business.
• Chapter 3: Understanding APM Monitoring. As you’ve already learned, APM’s
monitoring integrations hook into vastly different areas within your IT
infrastructure. This chapter will discuss those integrations, how they work, and how
your existing monitoring infrastructure can augment an APM solution.
• Chapter 4: Integrating APM into Your Infrastructure. Following on Chapter 3’s
functional descriptions is a further explanation of APM’s logical integrations into
infrastructure components, network analytics, and application analytics. To build its
holistic awareness of a business service, APM must peer into every facet of that
service, from clients to mainframes and everything in between. This chapter
discusses the nuts and bolts of how that will happen.
• Chapter 5: Understanding the End User’s Perspective. You’ll notice in Figure 1.2
that one critical component to be monitored is actually the end user. End User
Experience (EUE) monitoring adds the users’ perspective of the system, illuminating
when users see problems that other areas of monitoring can’t see.
• Chapter 6: APM’s Service-Centric Monitoring Approach. With each of the
necessary monitoring integrations in place, it is now possible to build a model of the
service itself. This model creates a logical diagram of the service and its
components, providing a map for monitors to display and update their information.
The service model is also the structure that drives what kinds of data administrators
will see within their assigned visualizations.

• Chapter 7: Developing &amp; Building APM Visualizations. An APM solution shows its
power within its visualizations. These provide a heads-up display that notifies
administrators when conditions are acceptable or unacceptable across the landscape
of your IT infrastructure. This chapter will discuss ways to create useful
dashboards for users, administrators, and executives.
• Chapter 8: Seeing APM in Action. With the information gained in Chapter 7, it is
possible to see an APM solution fully in operation. This chapter will discuss how
such a system works, how it enhances the processes for monitoring and
troubleshooting, and how a fully-realized APM solution streamlines your steps to a
problem’s resolution.
• Chapter 9: APM Enables Business Service Management. BSM takes the
technology focus of APM and relates its metrics to business goals. It adds a sense of
financial logic to APM’s availability, performance, and end user data, enabling
business leaders to find and measure the value in IT projects and services. This
chapter will discuss the linkages between APM and BSM and how the two work
together for better IT alignment.
• Chapter 10: The Shortcut Guide to APM. Ten full chapters is a lot of reading
material. Once you’ve read through the detail in these chapters, sharing that
knowledge with others is key. That’s the reason for Chapter 10. For those in your
organization who need to know APM’s concepts and its promise in short form, this
chapter quickly summarizes the key takeaways from the entire guide.
To crystallize your learning around real-world problems and solutions, each of the
chapters in this guide (excepting this and the final one) will also include a chapter story.
That story will discuss a situation that you’ve probably experienced as an end
consumer, administrator, or director of an IT service. In it, you’ll meet a set of
characters who recognize the need to mature their IT processes, decide to implement
APM with its monitoring integrations, and ultimately build a successful system. As the
story plays out over a series of chapters, you’ll come to understand how APM can be
hooked into an actual IT environment.

APM by Example
Now that you understand the path this guide will take, it is best to continue with a
short example. This example is designed to prime your understanding of APM and prepare
you for the chapters to come. Later chapters will continue with larger, more detailed
examples like this one to assist your learning.

Consider a business service that is similar to the one discussed in Figure 1.1. This system
provides some level of service for customers on the Internet. It is made up of six major
components: the network, network-attached storage (NAS), a firewall, and three servers
that communicate with each other to fulfill the display, logic, and data storage needs of the
system.

If this system is monitored through traditional device-centric management tools, certain
problem situations are relatively easy to troubleshoot. If the Web server spontaneously
powers down, a traditional management solution will quickly identify that the server no
longer responds to ICMP (“ping”) requests. Identifying the loss of an entire server due to
a loss of power can be categorized as a “simple” problem, the resolution to which is easy to
locate and apply. Power on the Web server, and you’re back in business.
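The binary up/down check described above can be sketched in a few lines. Because raw ICMP sockets require elevated privileges, this sketch substitutes a TCP connection attempt for a true ping; the host and port are placeholders, not part of the example system:

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Binary up/down probe: can a TCP connection be opened at all?

    A stand-in for the ICMP echo a traditional monitor would send
    (raw ICMP requires elevated privileges, so TCP connect is used here).
    Answers only "on versus off" -- nothing about performance or quality.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A traditional monitor simply polls a check like this in a loop and alerts on the transition from up to down, which is exactly why it catches a powered-off server but misses a degraded one.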

Figure 1.3: A problem deep within the mainframe can impact the operations of the
entire system.

However, consider the situation in which the problem lies deeper within the application
itself. In this example, the problem is not the loss of an entire server or device. Here, a much
deeper problem exists. Rather than a simple server loss, the response time between the
application server and the mainframe instead slows down. This occurs due to a problem
within the mainframe. The decrease in performance between these two components
eventually grows poor enough that it impacts the system’s ability to complete transactions
with the mainframe. As a result, the upstream reliant servers such as the application
server, database server, and Web server can no longer fulfill their missions.

A problem like this is particularly difficult to troubleshoot because:

• This situation doesn’t involve the loss of a server or network device.
• It cannot be measured through onboard system counters that track traditional
metrics such as processor performance or RAM usage.
• It cannot be seen through network traffic analysis because the problem involves
a reduction in traffic rather than bandwidth contention or latency.

Figure 1.3 shows a pictorial example of how this problem might manifest itself to the end
user. In the picture, you can see how the information exiting the mainframe does not make
its way to the application server in a timely fashion. Because of this slowdown in
performance, transaction timeout values and thresholds are eventually exceeded. This
causes the application server to no longer be able to serve the needs of the Web server,
which itself can no longer serve the user. In the end, the user experiences a loss of service
of the entire Web site, one that is difficult to trace back to the initial problem.

Why Measure Performance?


Why measure performance? Simply put, because you must. Measuring the performance of
your servers gives you information about their internal workings. Measuring the
performance of the network enlightens you to the levels of traffic going across the wire.
Measuring database activity helps you understand the levels of resources that your
applications require. Most importantly, measuring performance is a proxy for
understanding business performance—the ultimate capacity of your business systems to
service the needs of your customers.

In the short run, measuring application performance is a powerful troubleshooting tool,
helping you identify where applications and infrastructures are not behaving properly.
Long-term measurements are also useful, assisting continuous-improvement teams with
expanding existing systems. Looking at performance over the long haul helps you
understand when your existing infrastructure needs to scale.
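That long-haul scaling question can be made concrete with a deliberately simplified sketch: fit a least-squares line to historical utilization samples and project when it crosses a capacity ceiling. The sample values below are hypothetical, and real capacity planning would also account for seasonality and confidence intervals:

```python
def projected_capacity_week(history, capacity):
    """Fit a least-squares line to evenly spaced samples (e.g. weekly peak
    CPU %) and return the sample index at which the line reaches capacity.

    Returns None if the trend is flat or falling. Illustrative only.
    """
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    # Standard least-squares slope and intercept
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) / \
            sum((x - mean_x) ** 2 for x in xs)
    if slope <= 0:
        return None  # no growth trend, no projected exhaustion
    intercept = mean_y - slope * mean_x
    return (capacity - intercept) / slope  # index where the line hits capacity
```

For example, `projected_capacity_week([40, 44, 48, 52], 100)` returns 15.0: with peaks growing four points per week, the fitted line reaches the 100% ceiling at week index 15, twelve weeks past the last sample.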

Measuring Across Domains


The problem with most traditional monitoring solutions (see Figure 1.4) is that they’re
designed to work exclusively within a single problem domain. Network monitoring
solutions give you information about network utilization but can provide no detail about
processes on a server. Windows PerfMon counters can tell you the level of processor
utilization and memory consumption but have no awareness of underlying network
conditions beyond the data going in and out of the server’s local network card.

Figure 1.4: Stovepiped monitoring solutions don’t provide a holistic view of the
entire system.

What’s needed are solutions that integrate monitoring information from every component.
Such a system, like the one previously shown in Figure 1.2, leverages monitoring
integrations across multiple domains, feeding a single, centralized database for
processing. That database houses a model of the business service itself, providing the
logic that enables the system to understand and report on the data it collects.
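The consolidation just described can be sketched as a store keyed by component and metric, into which integrations from every domain deposit their readings. This is a toy stand-in for Figure 1.2’s unified database (real APM products use purpose-built time-series stores), and the component and metric names are invented for illustration:

```python
from collections import defaultdict
import time

class MetricStore:
    """Central sink for monitoring integrations: one place where data
    from networks, servers, applications, and mainframes lands."""

    def __init__(self):
        # (component, metric) -> list of (timestamp, value) samples
        self._series = defaultdict(list)

    def record(self, component, metric, value, ts=None):
        self._series[(component, metric)].append((ts or time.time(), value))

    def latest(self, component, metric):
        series = self._series[(component, metric)]
        return series[-1][1] if series else None

# Integrations from different domains deposit into the same store:
store = MetricStore()
store.record("web-server", "cpu_pct", 35)
store.record("network", "latency_ms", 12)
store.record("mainframe", "queue_depth", 480)
```

The point of the single store is cross-domain correlation: once the mainframe’s queue depth and the Web server’s response time live in one place, the system can relate them instead of leaving each in its own stovepipe.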

Measuring Transactions
Yet this alone still isn’t enough. Truly measuring performance is more than just enabling
PerfMon counters on a Windows server or logging NetFlow statistics from a Cisco network
device. Even comparing one device’s set of information with another gives you only a
limited perspective on the environment as a whole. Counters like these give you
information about devices and interactions between devices. Nowhere in their data do they
provide information about the applications that are installed atop those devices. Their
aggregate data cannot illuminate the individual communications between, for example, two
servers that comprise a service. To collect this data, it is additionally necessary to look at
the individual transactions that occur between service components.

Figure 1.5: Transaction monitoring can watch the sequence of events between
system components and alert when problems occur.

You’ve already seen an example of how transaction monitoring assists in troubleshooting
a problem. In the previous section, the communication between two servers, the
application server and the mainframe, was impacted due to a problem within the
mainframe. Although that specific problem on the
mainframe might have been caught using traditional monitoring—perhaps because a
daemon crashed or a socket stopped responding—traditional server monitoring solutions
don’t typically look at the communication that enters and exits a server. Thus, traditional
monitoring solutions could never have correlated how the mainframe’s problem impacts
the other servers.

Figure 1.5 shows an example of how transaction monitoring might have recognized that a
problem was occurring. The sequence diagram there shows the interactions between the
different components that make up our example service. There, the user attempts to
place an order on the Web site. The Web site then attempts to create a
shopping basket for the user. This action requires a verification of inventory levels prior to
completion, an action that isn’t completed in time. Consequently, the sequence ends with
the application server timing out its request and an overall failure in the system.
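The sequence just described can be sketched as timing a transaction’s steps against an overall deadline, the way a transaction monitor would. The step names and simulated delays are hypothetical, and a real APM agent would instrument the running code rather than wrap calls like this:

```python
import time

def run_transaction(steps, timeout_s):
    """Run named transaction steps in order, recording per-step latency,
    and fail the transaction when total elapsed time exceeds timeout_s --
    mirroring the application server timing out its mainframe request."""
    timings, start = [], time.monotonic()
    for name, fn in steps:
        t0 = time.monotonic()
        fn()
        timings.append((name, time.monotonic() - t0))
        if time.monotonic() - start > timeout_s:
            return {"ok": False, "failed_at": name, "timings": timings}
    return {"ok": True, "failed_at": None, "timings": timings}

# A sluggish inventory check (the slow mainframe call) trips the timeout:
result = run_transaction(
    [("create_basket", lambda: time.sleep(0.01)),
     ("verify_inventory", lambda: time.sleep(0.2))],  # simulated slow call
    timeout_s=0.1,
)
```

Because the per-step timings are preserved, the monitor can report not just that the transaction failed but which step consumed the time, which is exactly the correlation device-centric monitoring cannot make.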

Chapter 3 will go into more detail about this idea of transaction monitoring, followed by an
in-depth discussion on its relation to the end users’ perspective in Chapter 5. But for now,
recognize that multiple and simultaneous mechanisms for monitoring business services are
required if you are to obtain that desired level of situational awareness.

APM Optimizes the Application Life Cycle


Like projects, servers, and in fact the entire IT infrastructure, applications themselves
experience a sort of life cycle all their own. With applications, however, the trendline
associated with costs and benefits appears quite different than that for hardware and
project resources. Applications, especially large-scale complex applications that are likely
to be monitored through an APM solution, tend to have a life cycle that divides time
between a cost period and a benefit period.

Figure 1.6: The application life cycle’s impact on the business in both perfect and
poor implementations.

Figure 1.6 shows an example of how the life cycle of an application plays out.
Identifying the need, scoping the project, developing the application, designing its
architecture, and eventually implementing the solution are all required cost elements that
must occur to set the application in place. Once in production, “perfect” applications tend to
require comparatively little marginal cost period over period.

For a perfect implementation, the impact to the business is displayed by Figure 1.6’s green
line. There, the application begins with all cost and zero benefit to the organization. Prior
to its production deployment, obviously, no one within the organization is making use of
the application, while organizational resources are consumed by the aforementioned
development activities. Once the application is brought into production, its benefits begin
to increasingly outweigh the marginal costs required to keep it running. If the application
is an internal tool for business employees, the benefit curve rises as employees find value
in its features. For external applications, the benefit curve rises as potential customers
begin using the application and creating income for the business.

However, few applications experience that “perfect” curve between development and
production, cost and benefit. The red curve also seen in Figure 1.6 represents the fits and
starts that occur with applications that are poorly brought into production. Perhaps the
project wasn’t scoped properly, and more users are found to need the application than
previously thought. Maybe the application creates an impact on the network that causes
outages or performance issues with other services. Sometimes success itself becomes
problematic: newly deployed applications can be so successful that the initial customer
excitement over their release sends them down, crashing under the weight of their own
exuberant users.

All of these represent areas in which issues with application delivery are caused by poor
management of anticipated application performance. These issues occur both within the
application and external to it. Too often, performance is not considered a primary
requirement during application development, forcing performance management
to occur late in the development life cycle. This omission of performance management
early in application development adds cost to applications and extends their “cost” period.
An added goal of APM is to provide the data foundation where performance impacts from
the environment are understood at a macro scale. Here, monitoring integrations from all
across the network environment are consolidated to provide the data necessary to create
good designs with new applications right from the get-go.

Consider again the example from earlier in this chapter. In that example, some unknown
condition on the mainframe was eventually found to be the root cause for a multiple-server
failure. If fixing this problem requires a redevelopment of the core application, an outage
like this represents a cost to the business. It means the trendline for the application
switches from the “green” line to the “red” line, extending the time required to declare
success with the application’s deployment.

However, leveraging data gathered through an APM solution could have prevented such a
problem from occurring in the first place. Perhaps the mainframe was already serving
other customers and approaching the limit of its processing capabilities. Perhaps the piece
of code built for the mainframe was not optimized in its processing requests, causing the
mainframe to work particularly hard in processing every request. Any of these
performance-related issues could potentially have been tracked down before the failure
actually occurred.

So, How Does APM Work?


This guide is built to take you through the entire workflow of an APM
implementation, from start to finish. But at a high level, an APM implementation starts with
the deployment of a unified database into which information from each of the monitoring
integrations will go. It serves as the location where metrics are fed based on behaviors that
are monitored within the system. Individual monitoring integrations deposit their data into
this location where it is processed and the useful pieces are stored. With so much data
across so many devices and applications that can be collected, a major function of that
database and its application logic is to identify what information is useful and what can be
discarded.

How does the system identify what is useful and what isn’t? Part of your implementation
project will be to define those parameters. This is done first through the creation of a
logical representation of your services and infrastructure called the service model.
Because it is an extremely detailed picture, creating this logical representation can be one
of the most challenging parts of an APM implementation.

Figure 1.7: Mapping physical components into the service model’s logical
representation.

Installing an APM framework and its software results in what amounts to a blank canvas.
On that canvas, you will sketch out your environment’s architecture including all the
elements that make up the systems to be managed. The completed service model becomes
that logical representation of the components in your infrastructure. Figure 1.7 shows a
simplistic example of how this might work. Here, each of the application’s physical
components and geographical locations has been mapped to a dependency diagram. The
arrows in this dependency diagram show where elements within the model rely on other
elements for their processing. At the top level, the service itself requires information or
processing provided by the three servers in the environment. Each of those three servers
requires the support of both network and storage components. Also shown is the user’s
experience, divided among multiple locations.
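The dependency diagram in Figure 1.7 can be represented as a simple adjacency mapping from each element to the elements it relies on. The component names below are illustrative, loosely mirroring the example service rather than any real product’s model format:

```python
# A minimal service model: each element lists its direct dependencies.
SERVICE_MODEL = {
    "order-service": ["web-server", "app-server", "db-server"],
    "web-server": ["network", "storage"],
    "app-server": ["network", "storage", "mainframe"],
    "db-server": ["network", "storage"],
    "network": [],
    "storage": [],
    "mainframe": [],
}

def all_dependencies(model, element, seen=None):
    """Transitively resolve everything an element relies on."""
    if seen is None:
        seen = set()
    for dep in model.get(element, []):
        if dep not in seen:
            seen.add(dep)
            all_dependencies(model, dep, seen)
    return seen
```

Resolving `all_dependencies(SERVICE_MODEL, "order-service")` yields every element the service transitively depends on, including the mainframe, which is the map a monitor walks when tracing an outage back to its source.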

In this model, coloring is typically used to identify the health and status of each of its
individual elements. You’ll see that each of the dots representing an element is colored
green. This easy-to-understand heads-up display provides a way to identify whether that
component is functioning to desired levels. Obviously, when problems occur with a
component, its green color will shift to yellow or red, denoting a caution or warning
condition. Not shown in Figure 1.7 are the individual rules used in the background to
determine the “greenness” or “redness” of each element. For example, underneath the
network element may be custom-built rules that identify bandwidth or latency
conditions that are unacceptable or may indicate a pre-failure condition. When any of those
rules is tripped by the monitor, the element’s color changes to signal the problem condition.
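Such a rule can be sketched as a pair of thresholds mapped onto the stoplight colors. The threshold values here are hypothetical; in practice each element’s rules would be tuned to that component:

```python
def evaluate(value, caution, warning, higher_is_worse=True):
    """Map a raw metric reading onto the stoplight status described above.

    `caution` marks the yellow boundary and `warning` the red one.
    Set higher_is_worse=False for metrics where low values are bad
    (e.g. throughput). Thresholds are illustrative, not prescriptive.
    """
    if not higher_is_worse:
        value, caution, warning = -value, -caution, -warning
    if value >= warning:
        return "red"
    if value >= caution:
        return "yellow"
    return "green"

# e.g. a network latency rule: caution at 100 ms, warning at 250 ms
```

With that rule in place, a latency sample of 150 ms shifts the network element’s dot from green to yellow, and 300 ms shifts it to red.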

A service model for any APM implementation is intended to be dynamic, changing
organically over time as services or service components come and go. As the model grows
in detail, it at the same time grows more powerful. Once defined, it graphically displays
each of the dependencies that make up your application infrastructure. Should a
dependent component experience a problem condition, the status of that component as
well as those that depend on it can change. For example, the service’s reliance on the
mainframe means that any problem on the mainframe immediately rolls up to become a
problem with the entire service itself. A problem with the network automatically impacts
each server’s status, and ultimately the service as well. This roll-up and drill-down
concept provides a mechanism to quickly trace which components of the service are
experiencing a problem, and to drill down to the specific reasons underneath each
element.
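The roll-up behavior can be sketched as worst-of propagation over the dependency map: an element’s effective status is the worst of its own local status and the rolled-up status of everything it depends on. The three-element model below is a hypothetical miniature of the example service:

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def rolled_up_status(model, own_status, element):
    """Worst-of propagation: a red mainframe turns every element that
    transitively depends on it red. `model` maps each element to its
    direct dependencies; `own_status` holds each element's local color."""
    worst = own_status[element]
    for dep in model.get(element, []):
        dep_status = rolled_up_status(model, own_status, dep)
        if SEVERITY[dep_status] > SEVERITY[worst]:
            worst = dep_status
    return worst

# With only the mainframe locally red, the whole service rolls up red:
model = {"service": ["app-server"], "app-server": ["mainframe"], "mainframe": []}
own = {"service": "green", "app-server": "green", "mainframe": "red"}
```

Drill-down is the same walk in reverse: starting from the red service dot, an administrator follows the dependency chain until reaching the element whose local status is red.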

To obtain this data, multiple types of installed monitoring integrations are required. These
integrations provide the hooks into the various components that make up the service, such
as code frameworks, databases, applications, and Web services. Installing and managing
these monitors is the next big step in an APM implementation. Depending on the
architecture of your applications and the components that make up the infrastructure,
many monitors may be required to gather the right amount of data. Table 1.1 lists a few
common monitors and integration points.

Web Sites: HTTP / HTTPS, SMTP, DNS, OracleForms, Siebel, SAP

Framework / Analysis: Message Queue / Microsoft MQ, XML/SOAP, Tuxedo, Citrix, Windows Terminal Server

Applications &amp; Databases: J2EE, .NET, Oracle, SQL, Sybase, Informix, DB2

Network Protocols: NetBIOS, SMB, DCOM, RPC, SSL

Table 1.1: A sample list of potential integration points for an APM solution.

Once the monitors are installed, a major component of the day-to-day administration of an
APM implementation involves creating and tuning the rules underneath each green dot.
These rules identify the specific behaviors that your organization considers healthy
versus unhealthy for each component. They determine the state of each monitored system
and when to alert that problem behaviors are occurring. Like the organic adjustments to
the model itself, the creation and tuning of the model’s underlying rules is an ongoing
activity.

The Role of Visualizations


The final component of an APM solution is the actual display of information. Although
the service model provides one view into the health and status of system components and
their linkages, it isn’t necessarily useful for everyone. What’s needed is another layer of
visualizations that leverages the structure and data of the service model but displays it in
a way that is more useful to the ultimate consumer.

This concept was first described in the book The Definitive Guide to Business Service
Management. There the term digestibility is used to help the reader understand how
different views make sense for different readers. This concept of digestible data relates to
presenting data that is interesting, usable, and useful to its targeted class of individual.

Consider three types of individuals who may be interested in data that is generated by an
APM solution. The first individual may fill the role of systems administrator. That
individual may be interested in understanding large-scale conditions that occur across
system components. They may have an interest in network conditions and the status of
services in operation. When problems occur, they want to know specifically where the
problem occurred so that they can drive toward a fix.

The second class of individuals who stand to gain through an APM solution are the
developers of the system itself. A developer may be unconcerned with the day-to-day
operations of a system once that system is in full production. Knowing which elements of
the system are up versus down is not part of the job role of that system’s developer.
However, they do get involved when an administrator needs deep troubleshooting
assistance with a problem, or when issues with the system require code updates or fixes.
Application developers are likely to want to view more detailed information about the
individual transactions between service components. They might want to see which
transactions succeeded and which did not. Information about the performance
of page refreshes on an application’s Web site is of much greater use to its developer, as
that individual can find and fix the specific problem at the code level.

The third class of individuals who might have an interest is end users themselves. End
users of systems, especially those large-scale systems that are likely to be monitored with
an APM solution, want to know when problems occur. This chapter started with a
discussion of how users don't like seeing unhelpful messages like "This Web site is
experiencing unexpectedly high volume. Please try again later." during an outage. An APM
solution can simultaneously be used to notify end users when problems occur. It can help
them understand when performance is below normal and when they can expect a return
to full operations.

Figure 1.8 shows a collage of potential visualizations that can be of interest to each of these
classes of users. You’ll see there the high-level stoplight charts that describe when system
components are behaving at or below expectations. This view helps the administrator
know when system components experience problems. There is also a detailed view of a
set of individual transactions, which provides the developer the information needed to
trace issues with performance or functionality. The
third image shows an extremely high-level view of a global system, detailing areas where
that system may be experiencing problems. This view provides the right level of detail for
the end user, giving them the knowledge that problems exist.

Figure 1.8: Three examples of visualizations that are tuned to the needs of their user
class: administrator, developer, and end user.


What Benefits Does APM Provide?


Many of APM's direct benefits are easy to see from this first chapter's explanation alone.
APM provides a level of awareness about IT services that few other approaches can match.
It hooks into virtually every component of your organization's application infrastructure,
bringing useful telemetry from each component back to a single location for processing
and later visualization. Breaking these benefits down a little further, consider a few of the
additional benefits you can expect to see from a successful implementation.

Root Cause Identification


When an application or service experiences a problem today in your infrastructure, what is
your first step in identifying that problem? Do you “circle the wagons,” gather everyone in a
room, and begin working through potential solutions? Do you reboot the servers and hope
for the best? Or, do you leverage specific and actionable data that explains where on-
system behaviors have changed?

All problems in an IT environment stem from some kind of change within the environment
itself. This is the case because computers are deterministic. Some action or change must
occur within the environment that drives the problem to occur. With the right monitoring
integrations in place, it is possible to characterize the performance of your applications
over time. It is then possible to use that characterization to track when an unacceptable
behavior occurs, and immediately point the finger at where the root cause lies.
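For the technically minded, the core of this idea can be sketched in a few lines of Python. The numbers, window size, and function name below are invented for illustration, not taken from any particular APM product: a baseline is characterized from recent history, and the first sample that deviates from that baseline by an unusual margin marks where behavior changed.

```python
# Hypothetical sketch: characterize a metric's baseline, then flag the
# first sample whose deviation from that baseline is statistically unusual.
from statistics import mean, stdev

def find_behavior_change(samples, baseline_window=10, sigmas=3.0):
    """Return the index of the first sample that deviates from the
    rolling baseline by more than `sigmas` standard deviations,
    or None if behavior never changes."""
    for i in range(baseline_window, len(samples)):
        window = samples[i - baseline_window:i]
        mu, sd = mean(window), stdev(window)
        if sd > 0 and abs(samples[i] - mu) > sigmas * sd:
            return i
    return None

# Response times (ms) that are steady, then suddenly degrade.
history = [102, 99, 101, 100, 98, 103, 100, 99, 101, 100, 240]
print(find_behavior_change(history))  # index of the degraded sample
```

A real APM solution applies far richer statistics than this, but the principle is the same: a characterized baseline turns "something feels slow" into "behavior changed here."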

Cross-Reference
Chapter 6 will discuss this root cause analysis process in greater detail.

Characterization of Problems
Returning once again to the example problem in this chapter, characterizing the
mainframe’s problem was possible because its nominal behaviors were encoded into the
APM solution’s service model. Those nominal behaviors, such as acceptable processor
utilization and acceptable delay in responding to inventory requests, provided a basis by
which its later unacceptable behaviors could be alerted on.

When you’ve gone through the exercises necessary to characterize the acceptable
behaviors of the elements in your IT environment, you provide that basis for alerting when
unacceptable ones exist. Leveraging this ability with the dependency linking that makes up
the service model provides a way to show how one component’s behavior impacts others.
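A minimal sketch of such a service model follows; the component names, metrics, and thresholds are invented for illustration. Each component carries its characterized nominal limits, and the dependency links show what else a breach impacts:

```python
# Hypothetical service model: nominal limits per component, plus
# dependency links used to propagate a breach to impacted components.

NOMINAL = {  # characterized acceptable behavior per component
    "mainframe":  {"cpu_pct": 85, "inventory_reply_ms": 500},
    "app_server": {"cpu_pct": 90, "page_render_ms": 2000},
}

DEPENDS_ON = {  # the web site depends on the app server, which depends on the mainframe
    "web_site": ["app_server"],
    "app_server": ["mainframe"],
}

def breaches(component, observed):
    """List the metrics whose observed values exceed the nominal limits."""
    limits = NOMINAL.get(component, {})
    return [m for m, v in observed.items() if m in limits and v > limits[m]]

def impacted_by(component):
    """Walk the dependency links upward: who is affected by this breach?"""
    affected = [c for c, deps in DEPENDS_ON.items() if component in deps]
    for parent in list(affected):
        affected += impacted_by(parent)
    return affected

bad = breaches("mainframe", {"cpu_pct": 97, "inventory_reply_ms": 2500})
print(bad)                       # which nominal limits were breached
print(impacted_by("mainframe"))  # everything the breach cascades to
```

The dependency walk is what turns an isolated alert into the cascade view described above: a mainframe breach is reported alongside the app server and Web site it drags down.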

Prioritization of Problem Resolution


Each of the components that make up your application infrastructure is also likely to have a
defined number of users. These users interact with the service to accomplish whatever
tasks are automated by the system. Thus, when that system goes down, those users are
prevented from accomplishing their tasks. When greater numbers of users are impacted by
a particular component’s failure, this increases the priority of fixing that component’s
problem.


A problem in many IT organizations today is simply not knowing how many users
are affected by each business application or component. APM’s structured approach to
defining the overall architecture provides a way to easily roll up the number of affected
users by component. This ultimately provides a way to prioritize which problems should be
resolved first and which can be left for later resolution.
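The roll-up itself is straightforward once the architecture is defined. In this illustrative sketch, the components and their user counts are invented:

```python
# Hypothetical sketch: prioritize open problems by how many users each
# failed component prevents from working.

USERS_PER_COMPONENT = {"order_db": 12000, "report_server": 150, "hr_portal": 900}

def prioritize(failed_components):
    """Order failed components so the highest-impact fix comes first."""
    return sorted(failed_components,
                  key=lambda c: USERS_PER_COMPONENT.get(c, 0),
                  reverse=True)

print(prioritize(["report_server", "order_db", "hr_portal"]))
# The database outage comes first: it blocks the most users.
```

The hard part in practice is not this sort; it is maintaining the affected-user counts per component, which is exactly what APM's structured architecture definition provides.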

Situational Awareness
The data that passes along your network’s wires is not something that can be directly
looked at with the naked eye. Using traditional network sniffing tools to watch this data is
also problematic due to the sheer quantity of data that flies by during any discernible
period of time. Thus, better approaches that look at data from the network's perspective in
combination with each server or application's perspective give administrators better
situational awareness of what's going on in their networks. Combining this information
with the rich monitoring support through any of APM’s integrations means that the
business can know the status of its applications at all times.

Better Planning
Lastly is the critical need for future planning. Too often, IT organizations go through
planning exercises using a subjective approach, assigning augmentation dollars based on
gut feelings or one-time problems rather than historical behaviors. Taking the long-term
approach with APM data brings IT a mechanism for identifying where system elements
need augmentation or wholesale upgrades. The data provided by APM integrations enables
budgetary decisions to be made based on objective data.

Critical Applications Require Critical Monitoring


The goal of this chapter—and indeed this guide—is to help you understand the critical
need for managing your applications’ performance and behaviors. You’ve already learned
the very basics of what is possible with APM solutions today. The rest of this guide will
continue the discussion, with each chapter building on the information from the last.

However, before you can truly start your APM implementation, an important first step is
understanding your organization's level of process maturity. That level of maturity drives
how you solve problems, how you react to situations, and what level of structure you have
in place. You'll come to recognize that organizations that operate with a relatively low level
of process maturity suffer under the weight of overwork, waste, and the lack of automation.
Chapter 2 will discuss just those topics.


Chapter 2: How APM Aligns IT with the Business
“It’s simply not fair,” says the executive to his IT manager, “every time this happens, it’s their
fault and not ours, right? I know it’s our responsibility to plan around their popularity, but
this is getting ridiculous.”

“Well, most of the time,” responds the IT manager. “Sometimes it really is our fault. This is a
complicated system, one that we’ve been incrementally upgrading and expanding over the
years. I mean, by this version, some of that code has got to be spaghettied together so poorly
that it’ll be impossible to figure it out.”

The executive continues, “But isn’t there anything we can do to predict this sort of excessive
activity, to plan for it?”

“No. At least not with what we’ve got today. I’ll argue that the site is technically still up and
running. We’re not ‘technically’ down.”

“I find that hard to believe…”

Frustrated, the executive sits back in his chair and swings to look out the window as the IT
manager leaves his office. It is just this sort of conversation that irritates him about the
technology focus of this business. But what can he do about it?

---

Dan Bishop is COO of TicketsRus.com, a national retailer of tickets for concerts and sporting
events. TicketsRus.com is one of the largest online retailers of tickets, selling tens of thousands
each and every day for everything from the smallest backwater rock band to the hottest pro
basketball teams during their end-season games. Although TicketsRus.com sells a portion of
tickets through their phone operations facility, a massive phone bank located in Waco, Texas,
the vast majority of tickets today are sold through their online Web site, currently hosted in
Denver, Colorado.

That Web site is the single greatest piece of TicketsRus.com's intellectual property, and a
primary reason behind their "convenience fees," which serve as the profit for the business. The
TicketsRus.com Web site is a massive, custom-developed online service that tracks incoming
requests, sells tickets, and enables "gatekeeper" functions to prevent overload during periods
of high traffic. It even prints and mails the tickets once customers check out. An almost fully-
automated system, the Web site, you could argue, is TicketsRus.com.

And, today, that Web site is having a problem.

---


You’ve experienced situations like this before. Dan’s problem with his company’s primary
mechanism for making money is a situation felt all across the IT landscape. In today’s
online e-commerce climate, more and more companies are leveraging the Internet as a
primary—if not singular—location for hosting their wares. With the Internet, inventory
and labor costs are dramatically reduced, as are the costs of the brick-and-mortar
storefronts that now no longer must be leased and maintained. With the automation
brought about by a computing infrastructure, far more productive work can be done with
far less manual intervention.

Yet moving one’s operations to an online facility incurs its own set of risks. In the case of
brick-and-mortar, the loss of a store due to a power outage, a massive snowstorm, or a run
on available products means that customers can still go elsewhere for their needs. In the
case of online, the Internet presence is singular. It must operate at all times, with an
acceptable degree of performance, and in such a way that it gives confidence to its
customers that they’re getting value out of the experience.

---

This is exactly today’s problem with TicketsRus.com, and it’s not their fault. In today’s
problem, an extremely popular artist has come out of retirement for a new tour, and that
artist’s adoring fans have completely overloaded the system to look for tickets. The
simultaneous inrush of new business has effectively shut down the site, turning what should
have been a profit success into Dan’s current operational nightmare.

And he’s not sure if his IT Manager really understands the gravity of the problem…

IT and the Business Have Different Goals


There’s a central problem intrinsic to many IT organizations. This problem relates to IT’s
ability to consider itself an integral part of the business, and ultimately the profitability of
that business. The problem doesn't necessarily originate with IT itself. In its relatively short
history, “the people who fix the computers” have long served as a secondary function of
business. For a very long period of time, the only time IT professionals were needed—or
even seen—on the business floor was when something broke. Having a problem with your
computer today? Call the IT Help desk line (see Figure 2.1) and someone will magically
appear at your desk in a few hours.


Figure 2.1: IT is seen as “the people who fix,” a common sight in many businesses.

When the business didn’t need IT, these groups of people usually found themselves
shuffled away to other parts of the building. Taking over closets and storage rooms behind
locked doors, there this group awaited the next problem to be fixed.

Over time, this break/fix mentality becomes deeply ingrained in both the members
of IT and the rest of the business who rely on them for services. When IT operates in a
break/fix mode, they usually find themselves reacting to problems. A critical server is down
today? Here come the IT “white knights”, riding in to work through the night and ultimately
save the day.

But at the same time, the break/fix mentality’s “hero effect” actually becomes a liability to
the business. IT organizations that see themselves as the heroes to be called when
problems occur probably aren’t spending the right amount of time preventing those
problems from occurring in the first place. If that critical server was actually reporting a
problem for weeks before it finally crashed, IT is no hero in getting it running again—
they’re actually the problem.

Why this disconnect between IT and the business? Other than a historical position inside
the company’s locked storage closets, what are the causes behind IT’s reactive mindset?
Differing responsibilities and mismatched priorities with the rest of the business, a lack of
common vocabulary, and a missing vision into the business’ dollars and cents are all
common factors.


Different Responsibilities
In the story at the beginning of this chapter, the business of TicketsRus.com is brokering
tickets between artists and sports teams and their end consumers. TicketsRus.com makes
its business by providing a convenient service to its customers, making it easy for them to
find and purchase the tickets they want for the events that interest them.

To this end, TicketsRus.com likely has a massive marketing department. The job of that
team is to make potential customers aware that their service exists. They probably have a
sales department who find new events, artists, and teams to sell on their Web site. Their
executive management team’s primary responsibility is to ensure that the company runs
optimally with good profit and expected return. Each of these groups has a primary mission
that aligns with creating and maintaining the flow of TicketsRus.com’s business.

In contrast, TicketsRus.com's IT department has a different goal entirely. Its stated goals
are quite different in scope. TicketsRus.com's IT department is responsible for and
charged with maintaining the operations of the computer systems for the rest of the
company. That charge includes the massive online presence where the company makes
most of its profit. When TicketsRus.com makes a profit, the IT department continues to
keep the computers running. When TicketsRus.com doesn’t make a profit, the IT
department continues to keep the computers running.

Mismatched Priorities and Metrics


It is in this mismatch of responsibilities that many problems with prioritization occur
between IT and the business. When break/fix trumps profitability, the business ultimately
loses in the long run. In its current state, the priority of TicketsRus.com’s IT department is
to ensure that their computing infrastructure is up and operational. As a major part of that
infrastructure, maintenance of the online presence is a primary responsibility as well.

However, there’s a problem when the metrics associated with what is considered “up and
running” are not well defined. Whereas the IT Manager sees the current situation as a
temporary hiccup in the otherwise smooth running of the online system, this individual
likely isn’t aware that this short hiccup could become the source of major revenue loss for
the business. Because he hasn’t planned for such a contingency, he truly isn’t aware of the
gravity of the situation.


Figure 2.2: At a high level, APM can measure when user load negatively impacts
overall system performance.

This isn’t necessarily to say that the IT Manager’s lack of planning is entirely his fault. When
the IT Manager hasn’t been handed down the correct kinds of metrics to use in measuring
success, he won’t be looking in the correct places to find it. As you’ll find in this guide, one
of the tenets of Application Performance Management (APM) is to provide a mechanism for
defining just those metrics. Lacking a system in place that can look at system performance
as the sum of its parts (see Figure 2.2), it is difficult or impossible to accurately measure the
success of that system. APM and the solutions that enable it provide just those
measurements.
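The kind of measurement Figure 2.2 depicts can be sketched as follows. The sample data and the acceptable limit are invented for illustration; the idea is simply to pair observed user load with observed response time and find the load level at which performance first falls below what is acceptable:

```python
# Hypothetical sketch: from paired (user_load, response_ms) observations,
# find the load level at which response time first exceeds the limit.

SAMPLES = [(100, 180), (500, 220), (1000, 310), (2000, 650),
           (4000, 1900), (8000, 7400)]
ACCEPTABLE_MS = 1000

def load_breaking_point(samples, limit_ms):
    """Return the lowest observed load whose response time breaks the limit."""
    for load, response_ms in sorted(samples):
        if response_ms > limit_ms:
            return load
    return None  # performance acceptable at every observed load

print(load_breaking_point(SAMPLES, ACCEPTABLE_MS))
```

A metric like this gives the IT Manager an objective answer to "how much load can the site take?" rather than discovering the answer during a ticket-buying frenzy.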

No Common Vocabulary
IT also suffers from a highly technical vocabulary that isolates it from other members of
the business. The graph in Figure 2.2 makes sense when it is defined within
the scope of metrics that make sense to IT: % Processor Use, Transactions/Second, Java JRE
Method Timeout, and so on. Metrics such as these, however, are useless when attempting
to provide information to the non-technical members of the business. This breakdown in
communication further illustrates the chasm between IT and the business because business
leaders cannot relate their desired goals to IT in subjective terms that translate to objective
metrics.


This lack of a common vocabulary isn't necessarily limited to technical versus non-technical
members of the business. Intrinsic to the IT organization itself are various disciplines, each
of which has its own vocabulary for describing the system as they see it:

• Individuals within network teams see metrics from the perspective of data crossing
the wire.
• Systems administrators are primarily interested in, and have the greatest vision
into, whole-server metrics.
• Developers need to peer into runtime environment metrics to see whether their
code is optimized for the environment.
A major problem with this stratification of IT personnel is that no one group can alone
comprehensively describe the behaviors across every component of a system. If a system
problem spans multiple domains, teams must work particularly hard towards finding a
resolution.

Figure 2.3: APM provides a type of Rosetta Stone, aligning each IT discipline’s focus
under a unified solution.

An APM solution assists with this language problem by providing what could be considered
a Rosetta Stone between each IT discipline, their individual focus, and their own
vocabulary. Although individual integrations within an APM solution are likely to be
managed by their responsible discipline—network integrations by network teams, code
optimization integrations by developers, and so on—the unified system provides a central
gathering point for all metrics. This centralization provides a single location where an
application can be measured across each of its IT disciplines at once. Such an analysis can
be further correlated across all disciplines as well.

The end result is that a fully-realized APM solution enables IT to operate as a single unit,
with system problems being quickly directed to the teams that have the greatest capability
to fix the problem.
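One way to picture this Rosetta Stone role is as a common record format that every discipline's integration normalizes into. The field names here are illustrative rather than any real product's schema:

```python
# Hypothetical sketch: normalize discipline-specific metrics into one
# common record so they can be correlated by application and time.

def normalize(discipline, raw):
    """Map a discipline's native reading onto a shared schema."""
    return {
        "app": raw["app"],
        "timestamp": raw["ts"],
        "discipline": discipline,
        "metric": raw["name"],
        "value": raw["value"],
    }

records = [
    normalize("network", {"app": "web", "ts": 1200, "name": "retransmits", "value": 42}),
    normalize("server",  {"app": "web", "ts": 1200, "name": "cpu_pct",     "value": 96}),
    normalize("code",    {"app": "web", "ts": 1200, "name": "method_ms",   "value": 870}),
]

# With one schema, a single query answers "what did every discipline
# see for this application at this moment?"
snapshot = [r for r in records if r["app"] == "web" and r["timestamp"] == 1200]
print(len(snapshot))
```

Once the network team's retransmit counts, the server team's CPU figures, and the developers' method timings sit in one schema, cross-discipline correlation becomes a query rather than a meeting.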


Technologists Rarely See Budgets


Technologists are called thus because of their focus on technology. And while a technology
focus and a budgetary focus needn’t necessarily be mutually exclusive, they often tend to
become so as an individual’s depth in technology increases. Unlike virtually every other
function of business, the individual members of IT are often not privy to the specifics that
make up the business’ or their department’s budget.

In the long run, this lack of financial information removes IT’s empowerment to solve
problems based on their budgetary impact. When IT is incentivized towards resolving
broken system components, they’ll fill their day with accomplishing just that task. Those
repair operations, however, might not be the best thing for the system over the long haul:

• Today’s band-aid repair actually clouds the troubleshooting process for tomorrow’s
outage.
• Today’s quick fix masks the much larger recognition that a wholesale system
upgrade is needed to keep up with the load.

Figure 2.4: APM and its integrations provide the raw data that feed Business Service
Management’s financial view of the system.

The relation of a service’s quality to the IT and business budget technically falls within the
purview of Business Service Management (BSM), a topic that will be discussed in Chapter 9.
However, there is a very important relation between BSM and APM in that BSM requires
the metrics gained from an APM solution to populate its business-centric view of the
system. You’ll find that although APM provides the technology metrics, its combination
with business financial logic is what powers BSM’s view of the world. Figure 2.4 shows an
example of the linkage between these key components.


Reactive IT Is a Drain on Agility


Lastly, and most importantly, a reactive approach to maintaining a system is ultimately a
drain on the business’ ability to get the job done. When IT looks at problems and solutions
from the limited perspective of up/down or functioning/non-functioning, they’re not
looking into the deeper issues. These deeper issues may not necessarily manifest
themselves as actual visible outages, but they are a drain on the application's ability to
complete its mission.

Consider again the problem situation first explained in Chapter 1. There, a slow response
time between the mainframe and the application server eventually grows to impact the
system as a whole. By nature, these kinds of system events often occur over a period of
time, growing in scope—and delay—until a minor situation becomes a major problem.

Figure 2.5: An APM solution’s high-level client network server view can illustrate
where areas of delay may soon cause a problem.

Figure 2.5 shows an APM system’s view of aggregate transaction performance between the
application server and the mainframe. With this view and others in place, it is possible to
draw a trend line towards a future failure before the failure actually occurs. This capacity
for defining possible pre-failure states enables IT to resolve problems before they actually
happen and before users notice. As you'll learn in the next section, this proactive approach
to operations is representative of an IT organization at a high level of maturity, one that
drives value back to the business rather than reacting to it.
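Drawing that trend line is, at its heart, a simple regression. In this sketch, the delay figures and the failure threshold are invented: a straight line is fitted to recent measurements and projected forward to the time at which the delay would cross the threshold.

```python
# Hypothetical sketch: fit a linear trend to recent delay measurements
# and project the time at which the delay will cross a failure threshold.

def project_threshold_crossing(times, delays, threshold):
    """Least-squares line through (time, delay); returns the projected
    time at which delay reaches `threshold`, or None if not trending up."""
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(delays) / n
    slope = (sum((t - mean_t) * (d - mean_d) for t, d in zip(times, delays))
             / sum((t - mean_t) ** 2 for t in times))
    if slope <= 0:
        return None
    intercept = mean_d - slope * mean_t
    return (threshold - intercept) / slope

# Delay (ms) between the app server and the mainframe, sampled hourly.
hours  = [0, 1, 2, 3, 4]
delays = [100, 150, 200, 250, 300]   # growing roughly 50 ms per hour
print(project_threshold_crossing(hours, delays, 1000))
```

Real APM tools use more sophisticated forecasting, but even a linear projection like this turns a slowly worsening delay into a dated warning IT can act on before users feel it.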


Understanding IT Maturity
For an organization to efficiently make use of the kinds of information that an APM solution
can provide, it must operate with a measure of process maturity. IT organizations that lack
configuration control over their infrastructure don’t have the basic capability to maintain
an environment baseline. Without a baseline for your applications, the quality of the
information you gather out of your monitoring solution will be poor at best and wrong at
worst.

But how does an IT organization know when they’ve got that right level of process in place
to best use such a solution? Or, alternatively, if an organization recognizes that they don’t
have the right level, how can an APM solution help them get there?

One way to evaluate and measure the "maturity" of IT is through a model developed as
part of a Gartner research note titled Introducing the Gartner IT Infrastructure and
Operations Maturity Model (Scott, Pultz, Holub, Bittman, and McGuckin, 2007).
This groundbreaking research note defined IT across a spectrum of capabilities, each
relating to the way in which IT actually goes about accomplishing its assigned tasks. An IT
culture with a higher level of process maturity will have the infrastructure frameworks in
place to make better use of technology solutions, solve problems faster, plan better for
expansions, and ultimately align better with the needs and wants of the business they
serve.

Process maturity within an organization is defined as quite a bit more than simply having
the ability to solve problems. Within Gartner’s maturity model, the capacity of IT to solve—
and prevent—ever more complex problems was defined largely by its level of process
maturity.

An Example of Immaturity
It is perhaps best to explain this concept of immaturity through the use of an
example. Consider an organization that completely lacks any documentation
of its internal systems. Such an organization is also likely to lack formal
change control processes by which others are notified about changes to those
systems. In such an organization, a system can be configured and later
reconfigured at the whim of a single administrator. If an administrator or
developer finds a problem on the system, they resolve the problem as they
see it, notify no one, and continue about their day.
At first blush, the rate at which problems can be identified and resolved in
such an environment can seem extremely beneficial. Administrators or
developers who find issues can quickly resolve them as they see fit, without the need
for complex and time-consuming paperwork, workflow, approvals, and
documentation. Such an organization can run exceptionally “lean and mean”
with their infrastructure, as the overhead associated with the process itself is
nonexistent.


However, such an organization also lacks accountability. It lacks cross-
communication between members. It also lacks the basic infrastructure
necessary to validate the configurations on each component in the IT
environment. If one administrator is working on a problem and a second
finds the same problem, time is wasted as the two individuals enact
simultaneous change.
Often, the lack of cross-communication causes further problems down the
road. Perhaps the problem condition was actually necessary for the
troubleshooting of a completely separate problem. Perhaps the problem
wasn’t a problem at all, but a symptom of a much larger problem. In the
worst of cases, the lack of configuration control inhibits IT's ability to see
the signs of problems before they impact the user population.
In short, although an immature IT organization might be more agile in actual
problem resolution (for example, clicking the right button), they’ll achieve
those gains at the cost of dramatically less agility in preventing the problem in
the first place.
Gartner defines six stages in which an IT organization can exist: Survival, Awareness,
Committed, Proactive, Service-Aligned, and Business Partnership. Figure 2.6 depicts these
six levels of I&O maturity, with a high-level description of each of the four dimensions of
assessment: people, process, technology, and business management.

As organizations move from one stage to the next, they will find more documentation of
processes with less replication of work, greater and more advanced levels of configuration
control, different incentives for determining what is considered success, greater maturity
in monitoring, and the implementation of toolsets that enable richer planning and more
effective budgeting. With Figure 2.6 in mind, let’s take a more detailed look at the phases,
how organizations behave, and what benefits they get from each.


Figure 2.6: The Levels of Gartner’s I&O Maturity Model.


Source: Gartner, Inc.
Survival
Gartner defines Level 0, Survival as “little to no focus on IT infrastructure and operations.”
The Survival stage is arguably the best defined by its name alone. Organizations as well as
IT infrastructures in the Survival stage experience a level of (barely) controlled chaos.
Servers, desktops, and network infrastructures are all individually managed with no
documentation of their configuration or areas in which changes can be announced to
others inside and outside the organization. Although Survival-stage environments are
relatively common in smaller businesses, they are by no means defined by size.

In the Survival stage, IT organizations tend to focus on the use of native or freeware tools
for managing their infrastructure. They are constantly putting out fires within technology
they don’t understand. Monitoring and management elements are not in place, which
generally means that IT is notified about problems when users call to complain. IT
organizations in this phase tend to lack the basic understanding of the systems they
manage, let alone the deep understanding necessary to do well with an APM solution. Due
to the break/fix approach to problems, the rare APM implementation here often goes
unused, as no time exists to actually employ its capabilities.


Awareness
Gartner defines Level 1, Awareness as “Realization that infrastructure and operations are
critical to the business; beginning to take actions (in people/organization, process and
technologies) to gain operational control and visibility.” While the Survival stage is typified
by simply making it through from day to day, many organizations eventually develop the
cultural enlightenment that “there must be a better way”. This awareness is manifested
through a realization that their IT infrastructure and its operations are a function of the
business. They may further realize that their organization will need to take action to
formalize processes, standardize on technologies, and control the people and culture of IT
if they wish to mature.

This phase, called the Awareness phase, can in many ways be considered a bridge between
the fully chaotic activities of the Survival phase and the beginnings of structure in the
Committed phase. Here, processes and technologies still remain ad hoc; however, the
culture surrounding IT operations recognizes and begins to embrace the need for better
ways of accomplishing their daily tasks.

Committed
Gartner defines Level 2, Committed as “Moving to a managed environment, for example,
for day-to-day IT support processes and improved success in project management to
become more customer-centric and increase customer satisfaction.” At some point, IT
organizations and the processes that bind them eventually begin to grow the very basics of
structure. In the Committed stage, organizations begin to actually implement tools for
assisting them with the management workload. Problem resolution in this phase is still
accomplished through a break/fix mentality; however, the level of consideration for
environment-wide solutions begins to grow beyond zero.

In the Committed stage, simplistic problem management solutions such as work order
tracking systems may be incorporated. Yet in this stage, the specifics of their use are often
not enforced through an agreed-upon set of rules. Work order tracking systems here are
used for the individual technician workflow, not necessarily for the tracking of
configuration changes. In the Committed stage, monitoring may be implemented, but in this
stage, that monitoring is limited to the core availability of the device itself.

In this stage APM solutions will not necessarily drive a direct benefit to application
performance, as performance is not yet valued over simple availability. Environments here
are still focused on managing the inflow of problems as they come in, and as such, don't
have the time to actually focus on analytics and problem prevention. Smart organizations
can leverage the implementation of more basic APM integrations during this period as a
mechanism to quickly drive the organization to a higher level of maturity. Such an
implementation in this stage will require the corresponding process and workflow
necessary to turn APM data into a useful product.


Proactive
Gartner defines Level 3, Proactive as “Gaining efficiencies and service quality through
standardization, policy development, governance structures and implementation of
proactive, cross-departmental processes, such as change and release management.” Once
an IT organization’s culture makes the conscious decision to move away from firefighting
as a way of life, it can be considered on the path to the Proactive stage. It can be argued that
most IT organizations today exist somewhere between the Committed and the Proactive
stages, with varying levels of process and workflow in place.

A major determinant between these two stages is related to the number of individuals who
have successfully removed themselves from the direct resolution of problems. These
individuals' time is freed up for looking at rational, automated, and environment-wide
solutions for preventing problems before they impact the user population. Here, the proper
levels of monitoring are likely in place to validate more than simple up/down availability.
Usage trends are monitored and analyzed, with thresholds for alerts in place to notify
responsible individuals. Automation tools are additionally used to enable repeatable
actions to occur on systems when conditions occur. Automated remediation capabilities
may be introduced in this stage as well. Found also in this stage are mature processes for
problem management as well as asset, change, and configuration control.

For organizations here, the implementation of an APM solution can arguably have the
greatest benefit to their business. Once fully in this stage, the IT organization understands
the wholesale changeover from the “break/fix” to the “keep it running” mentality. Lacking
in this stage are the real linkages between individual devices and components of the
greater system. As such, a system-wide view of applications and business services is still
lacking in maturity. Implementing APM here can quickly move an organization to the next
level of maturity.

Service-Aligned
Gartner defines Level 4, Service-Aligned as “Managing IT like a business; customer-
focused; proven, competitive and trusted IT service provider.” A major determinant
between the Proactive stage and the Service-Aligned stage is related to the organization’s
primary focus. When an organization continues its focus on individual technologies as
opposed to how those technologies integrate into a deliverable whole, that organization
remains in the Proactive stage. There, they are proactively resolving problems, but they are
still focusing on the problems and problem prevention. When that organization makes the
leap to managing the services it delivers to the business as a whole, it has successfully
arrived in the Service-Aligned stage. In this phase, you’ll often see IT delivering its own
customized services to the business with unique names and focuses rather than merely
referring to product names they acquire from vendors.


Getting to the Service-Aligned stage can be critically important to today’s businesses,
especially those who have a large stake in e-commerce operations. Service-oriented
thinking and the service-oriented approach to monitoring and management means that the
loss of individual elements has less of an impact. Redundancy and compensating
mechanisms are usually in place to ensure service reliability in the case of a single failure.
In this stage, IT also finally begins to understand not only their costs but also their
quantitative benefits back to the business. They can value their services appropriately and
back up those valuations with analytic data arriving from their own monitoring solutions.
Although Service Level Agreements (SLAs) are often seen in previous stages, it is only here
where their quantitative fulfillment can be truly recognized, often in real time.

In this stage, an APM solution—or one that functionally resembles it—is likely in place.
Solutions like APM are necessary in order to gain the situational awareness IT needs to
best manage its environment as an overarching system. At the same time, IT finds itself
using that system with the goal of continual improvement, looking for and resolving
non-optimized areas before users are impacted.

Business-Partnership
Gartner defines Level 5, Business Partnership as “Trusted partner to the business for
increasing the value and competitiveness of business processes, as well as the business as a
whole.” Once IT fully loses its identity as a separate function of the business, it can be
considered a partner with that business as opposed to merely servicing its interests. In the
Business-Partnership stage, IT metrics are business metrics, and vice versa. The role of IT
is as an enabler of business processes, and as such, those processes are not considered
without IT as a primary stakeholder. IT is also a co-equal in business planning, as new
endeavors invariably include a technology component.

Most organizations never achieve the Business-Partnership stage of IT, as recognition is
required both from IT as well as the surrounding business for elevation to this stage to
occur. However, those organizations that make it to this stage find their solutions—in this
case, both APM and BSM—provide them with real-time validation of success or failure. In
the Business-Partnership stage, IT can be considered no longer just the “utility” but as the
business itself.

Why Is Maturity Important?


All things considered, an organization at higher levels of maturity will tend to have a
greater capacity for understanding and using its APM solution data. Thus, an APM solution
can be useful for both validating an organization’s existing maturity as well as assisting in
the rapid movement from one stage to the next.

For an example of this, take a look at Figure 2.7. There you’ll see an example visualization
from an APM solution. The information in the figure displays the expected response time
for a specific Web service call, broken down among the amount of time consumed by the
client, network, and server components of the request.


Figure 2.7: An APM solution’s Response Time Predictor visualization.

Completing a request of this type will require an amount of time from each of these three
elements of the IT environment:

• A client will need to process the request internally, preparing it for transmission on
the network.
• The network will need to transfer the request from the client to the server, over any
number of hops, with each transfer and hop incurring an additional time cost.
• The server itself will need time to ingest, process, and prepare a response to the
request.
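The arithmetic behind such a breakdown is simple: total response time is the sum of the three contributions listed above, and the largest term points at the problem domain. The following sketch is illustrative only (the function name and sample timings are invented, not taken from any APM product):

```python
def break_down_response_time(client_ms, network_ms, server_ms):
    """Sum the three timing components of a single request and flag
    the dominant contributor, i.e. the likely problem domain."""
    parts = {"client": client_ms, "network": network_ms, "server": server_ms}
    total = sum(parts.values())
    dominant = max(parts, key=parts.get)
    return total, dominant

# A request dominated by network transfer time.
total, dominant = break_down_response_time(12.0, 85.0, 40.0)
```

An APM visualization such as Figure 2.7 performs this same decomposition continuously, from measured rather than hand-supplied values.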
Intrinsic to this request are a number of variables; interpreting them requires an IT
organization with a high level of maturity if the information is to be of value. To gain the
greatest amount of value from this information, such an organization must have:

• A mature level of configuration control such that the exact configuration of the
environment is known.
• A mature level of process control such that the interfaces between client and server
components can be traced to known threads.
• A mature level of environment control such that additional environment behaviors,
such as network bandwidth and latency as well as network-consuming external
forces can be ruled out.


APM Requires Maturity. APM Creates Maturity.


The chapters of this guide that follow will discuss in greater detail how these
visualizations can be used. For now, know that an APM solution provides a data-driven
benefit to the business in two ways. First, an APM solution provides the necessary level of
monitoring to enable IT to better facilitate the needs of the business. This reason is what this chapter is all
about. By implementing an APM solution, you very quickly gain the ability to drill deep into
the individual components of your business applications towards fixing problems or
finding areas of improvement.

Secondly, and arguably more importantly, smart organizations can leverage an APM
solution itself to rapidly develop process maturity in an otherwise immature organization. By
reorganizing your IT operations around a data-driven approach with comprehensive
monitoring integrations, you will find that you quickly begin making IT decisions based on
their impact on your business’ applications. You will better plan for augmentations based
on actual data rather than the contrived anticipation of need. You will better budget your
available resources based on actual responses you get out of your existing systems.

In the end, leveraging an APM solution for your business services and applications will
make you a better IT organization.

IT Changes with Each Stage


With IT’s movement from one stage to the next, the entire culture of the organization
changes as well. IT at higher levels of maturity has the capacity to accomplish bigger and
better projects. But IT at higher levels of maturity also thinks entirely differently about the
tasks that are required. Figure 2.6 does a good job of explaining how that thought process
evolves.

• The ways IT looks at itself. In earlier stages of maturity, IT sees itself as a fully-
segregated entity from the business. In many cases, IT can see itself as a different
business entirely! Individuals in IT find themselves concerned with the daily
processing of the servers and the network, to the exclusion of the data that passes
through those systems. As IT matures, the natural culture of IT is to begin thinking
of itself as a partner of the business, and ultimately as the business itself.
• The ways IT looks at data & applications. Data and applications in the immature
IT organization are its bread and butter. These are the elements that make up the
infrastructure, and are worked on as individual and atomic elements. IT in earlier
stages will find itself leveraging manual activities and shunning automation out of
distrust for how it interacts with system components. Applications in early-stage IT
are most often those that can be purchased off the shelf, with customization often very
limited or non-existent. Later-stage IT organizations needn’t necessarily build their
own applications; however, they do see applications as solutions that serve
business processes as opposed to fitting the process around the available
application.


• The ways IT looks at the business. Immature IT organizations are incapable of
understanding how their activities impact the business as a whole. Lacking a holistic
view of their systems, they focus on availability as their primary measure of success.
Yet business applications require more than a ping response for them to be truly
available to users. More mature IT organizations find themselves implementing
tools to measure the end user’s experience. When that level of experience is better
understood, IT gains a greater insight into how their operations impact business
operations.
• The tools IT uses. The tools of IT also get more mature as the culture grows in
maturity. IT organizations with low levels of maturity are hesitant to incorporate
holistic solutions often because they can’t see themselves actually using or getting
benefit from those solutions. As such, immature IT organizations lean on point
solutions as stopgap resolutions for their problems. The result is that collections of
tools are brought to bear while unified toolsets are ignored. Mature IT organizations
have a better capability to understand the operational expense of an expanding
toolset, while being more capable—both technically and culturally—of leveraging
the information gained from unified solutions.

IT’s Tools Grow More Predictive


As the maturity of IT’s tools grows, so does the predictive capacity of those tools. It was
discussed in Chapter 1 that solution platforms such as those that fulfill APM’s goals extend
their monitoring integrations throughout the technology infrastructure of a business.
Because APM’s reach is so far into each of a business application’s components, it grows
more capable than point solutions for finding the real root cause behind problems or
reductions in performance.

Consider again the situation in Figure 2.7. A complex performance issue in a business
application can occur across client, server, and network components in the environment.
The client can experience delay due to other processing or issues with the underlying client
infrastructure. The network can be oversubscribed with traffic, or client network
requirements can be greater than existing network components can handle. Servers can be
non-optimized in their processing or simply be overloaded.

Independent point solutions generally monitor only one of these three components at a
time, making the consolidation of data across separate systems with separate databases,
consoles, and formats extremely difficult. The resulting graphic in Figure 2.7 that breaks
down such an issue by its impact in each problem domain presents a way to quickly
identify the location of the problem.

Focusing further into that graphic presents the new picture that is Figure 2.8. This image
shows how such a graphic might be constructed through monitoring integrations installed
to network devices, servers, and even to the clients themselves. The result is a holistic
picture of transaction time itself, broken into its disparate elements. For more information,
drill-down capabilities intrinsic to the interface provide a way to discover more details
about each portion as necessary to resolve the situation.


Figure 2.8: Transaction timing occurs across client, network, and server components.

IT/Business Alignment Benefits Everyone


When IT and the business are aligned in their goals and expectations, everyone benefits. IT
finds itself assisting the business in actually creating and maintaining business rather than
simply focusing on the health of systems. The business gains because a set of highly-skilled
technologists now participates equally in identifying business opportunities, optimizing
processes, and sharing in the goals of everyone else. An aligned IT organization thinks less
about which device to purchase or fix and more about how that device integrates with the
rest of the business infrastructure.

This alignment happens along a number of technology axes. Alignment enables IT to better
scope projects for greater success, deprioritizing projects or technologies that enhance IT
but stand in the way of business workflow. This business focus in IT projects ensures that
those projects are visible to business leaders. Such visibility enables those leaders to be
greater stakeholders in IT projects, further ensuring that their incorporation makes sense
for the future. Lastly, alignment provides a way to convert a reactive IT organization to a
proactive one.


It has been said that 70% of the average IT budget is earmarked for existing projects
(Source: Budgeting for Information Technology,
http://www.501cio.com/articles/200709_ITBudgeting.html), leaving only 30% for new
projects on an annual basis. For the new projects, roughly 60% fail to meet their original
goals or schedule (Source: Two Reasons Why IT Projects Continue to Fail,
http://advice.cio.com/remi/two_reasons_why_it_projects_continue_to_fail). Primary
reasons for failure include cost overruns, missing schedule goals, and end solutions that are
“riddled with defects and don’t accomplish the business goals for which they were
designed.”

A major source of the problem occurs when IT isn’t capable of scoping projects in a way
that makes sense for the rest of the business. This scoping problem can relate to:

• Unnecessary technology expansions or implementations that don’t positively impact
business operations. In effect, adding technology for technology’s sake.
• Inserting technology that actually reduces the performance or efficiency of users.
• Incorrectly planning for anticipated user load, network impact, and impact on other
business or infrastructure services.
• Insufficient regression testing of upgrades against the system as a whole.

APM solutions provide a metrics-based approach for capturing the before and after states
of entire systems. Individual database transactions can be traced back to specific requests.
The performance of newly-inserted lines of code can be compared with those from
previous versions to ascertain their efficacy. Mainframe and server processing can be
measured over extended timeframes to validate improvements. The aggregation of
individual monitoring integrations provides a platform for validating the success of IT
projects and preventing their impact on others.
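The before/after comparison described above can be sketched as follows; the function, sample values, and 5 percent tolerance are illustrative assumptions rather than the behavior of any specific APM product:

```python
from statistics import mean

def compare_versions(before_ms, after_ms, tolerance_pct=5.0):
    """Compare mean response times sampled before and after a change;
    report whether performance improved, regressed, or held steady."""
    baseline, current = mean(before_ms), mean(after_ms)
    change_pct = (current - baseline) / baseline * 100.0
    if change_pct < -tolerance_pct:
        verdict = "improved"
    elif change_pct > tolerance_pct:
        verdict = "regressed"
    else:
        verdict = "unchanged"
    return round(change_pct, 1), verdict

# Newly-deployed code cut mean response time from 125 ms to 100 ms.
result = compare_versions([120, 130, 125], [95, 100, 105])
```

In practice, an APM platform gathers these samples automatically from its monitoring integrations rather than from hand-collected lists.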

…But Isn’t APM Really About the Technology?


It is. This chapter has focused heavily on the maturity gains that can be achieved through
the implementation of an APM solution. Yet if you look through the various APM solutions
on the market today, you’ll quickly see a heavy technology focus. As future chapters will
discuss, APM’s technology focus is its bread and butter. You’ll find yourself using an APM
solution for tasks such as determining transaction rates and isolating network
performance issues.


At the same time, you’ll find that APM’s expanded situational awareness enables the smart
IT organization to become more business-aligned. Nowhere is this more pronounced than
in businesses that rely heavily or exclusively on their technology infrastructures.
E-commerce businesses are particularly impacted. This is the case because in businesses like
e-commerce, the technology is the business. As such, having that enhanced vision into your
business’ technology underpinnings means knowing your customers—and your
storefront—that much better.

Chapter 3 will continue this introduction with a more technical look at the underpinnings
that comprise APM’s monitoring integrations themselves. You’ll understand the history of
monitoring as well as how monitoring has evolved over time to become what it is today.
Chapter 3 will discuss how multiple levels and types of monitoring are necessary to gain
that holistic awareness you want out of an APM solution.


Chapter 3: Understanding APM Monitoring


It’s Memorial Day weekend, and the volleyball game is about to start. Coals in the barbeque
grills are coming up to temperature, and the weather is cooperating to make it a perfect
late-spring day. Prepping for the BBQ, friends and family are milling around, swapping stories, and
relaxing in the sun.

Then it happens, always just as the rest of life is smooth-sailing and work is the last thing on
the mind: BEEP, BEEP.

“Uh-oh, there goes your vacation-ending device,” says a friend to John Brown, IT manager for
TicketsRus.com. John looks down to read the text now displayed on his pager, “I can see it in
your face. Something’s down, you probably don’t know what it is, and the only way to figure it
out is to set down that cold one and march right into work.”

John shakes his head at the pager, “This thing is killing me. Since we installed the new
monitoring system, I swear I’m getting pages like this every couple of days. Half the time it’s
nothing. The other half the time it’s something completely different than what shows up on
this stupid thing. You know, monitoring is great, but this kind of monitoring is taking my life
away from me.”

“You going in?”

“Yep. Got to. If this is correct, the problem could be a big one, and you know what happens
when fans can’t get their tickets...”

John’s friend jokes, “We don’t want that to happen! I still remember that day when I and
everyone else couldn’t get tickets to the big game through your site. That problem was so bad,
it made the news!”

“Don’t remind me,” grumbles John, remembering that painful event in his past. A bug in the
code between the inventory and e-commerce subsystems for TicketsRus.com decided to rear
its ugly head just as tickets were released for the Finals. The bug, which for some reason only
caused problems at high loads and for certain types of events, had been introduced earlier in
the year with a software update. Because user loads had been light for the following months,
it took literally days to track down the error. TicketsRus.com, this team’s sole source for Finals
tickets, was criticized by its suppliers and even the press. It nearly lost a major source of
income as a result.


To rectify the problem, John’s boss Dan mandated that a monitoring system be put into place.
Since then, John has come to regret his selection of monitoring system, an inexpensive but
limited solution that delivered alerts on server outages and not much else. The net result was
that John’s nights got a lot more sleepless and an e-commerce system that “felt” fine before
now alerted him and his teams on a near-constant basis.

John tells his friend as he heads for his car, “I’ve gotta’ run. Take care of the burgers for me.
Who knows when I’ll be back…”

---

John’s problem is not uncommon. There’s a fundamental problem intrinsic to most
traditional monitoring solutions. Namely, these types of solutions are almost completely
reactive in nature. Using a traditional network monitoring solution, the system alerts on
the problem only after the problem occurs. Such a system looks constantly at your network
to identify where a change has occurred that signals a problem—a device stops responding
to “ping” requests, a network connection slows down, a server’s processors become
overused by a particular process. When that change occurs, it sends an alert to
administrators, notifying them of the problem.
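At its core, this reactive pattern reduces to watching for state transitions between polling cycles. A minimal sketch (the device names and message format are invented for illustration):

```python
def detect_transitions(samples):
    """Given an ordered stream of (device, is_up) poll results, emit an
    alert each time a device's state differs from its last observed
    state. The first observation of a device only seeds its state."""
    last_state = {}
    alerts = []
    for device, is_up in samples:
        previous = last_state.get(device)
        if previous is not None and previous != is_up:
            alerts.append(f"{device} is now {'UP' if is_up else 'DOWN'}")
        last_state[device] = is_up
    return alerts

# Four polling cycles for one device: up, up, down, up.
alerts = detect_transitions(
    [("web01", True), ("web01", True), ("web01", False), ("web01", True)]
)
```

Note what is missing: the alert says only that the state changed, nothing about why, which is exactly the limitation discussed next.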

This information is excellent for knowing when something isn’t right with your IT
infrastructure. It gives you the information you need to know that a problem exists. But this
kind of information is solely limited to answering the question, “What happened?” Knowing
that a particular server, service, or device appears down is one thing. Understanding
exactly why it went down is quite another.

That’s not to say that reactive monitoring isn’t useful in an IT environment. In fact, nothing
could be further from the truth. Consider the story that started out this chapter: Without
some form of monitoring in place, John would never have known that something was amiss
in his data center. Prior to the Memorial Day incident, not having that monitoring in place
would have easily turned a small problem into a big one. A simple outage could have gone
unnoticed for minutes or hours while TicketsRus.com’s customers were unable to purchase
the products they needed at the time they needed them.

Traditional network monitoring is an excellent solution for organizations that operate in
the early phases of IT maturity. This technology gives them a basic understanding of the 1s
and 0s passing back and forth across their network and within its connected systems. Yet
traditional network monitoring can only go so far. As businesses and their IT organizations
mature in capabilities, their philosophy of service delivery must mature as well. Although
up/down monitoring might work well for a simple environment, its level of granular detail
is wholly inadequate to keep an online, 24 × 7, highly-available storefront open for
customers.


In effect, added success begets added due diligence. As the Finals incident illustrates,
when your suppliers and customers rely on you for greater things, they expect greater
things as well. Application Performance Management (APM) and its advanced monitoring
integrations enable you to provide that greater level of service.

The Evolution of Systems Monitoring


The previous chapter of this book spent a lot of time discussing the concepts of IT
organizational maturity. Although that conversation has little to do with monitoring
integrations and their technological bits and bytes, it serves to illuminate how IT
organizations themselves must grow as the systems they manage grow in complexity. As an
example, a Chaotic or Reactive IT organization will simply not be successful when tasked to
manage a highly-critical, customer-focused application. The processes, the mindset, and the
technology simply aren’t in place to ensure good things happen.

To that end, IT has seen a similar evolution in the approaches used for monitoring its
infrastructure. IT’s early efforts towards understanding its systems’ “under the covers”
behaviors have evolved in many ways similar to Gartner’s depiction of organizational
maturity. Early attempts were exceptionally coarse in the data they provided, with each
new approach involving richer integrations at deeper levels within the system.

IT organizations that manage complex and customer-facing systems are held to a greater
level of due diligence than those that manage a simple infrastructure. As such, the tools
used to watch those systems must also meet a higher standard.
technologies have evolved over time, new approaches have been developed that extend the
reach of monitoring, enhance data resolution, and enable rich visualizations to assist
administrative and troubleshooting teams. This chapter discusses how this evolution has
occurred and where monitoring is today. As you’ll find, APM aggregates the lessons learned
from each previous generation to create a unified system that leverages every approach
simultaneously.

Early Network Management


In the beginning, there were only a few computers. Each of those computers accomplished
all the tasks necessary for its stated mission. Then came the “network,” which brought about
an explosion of interconnections among computers. These computers worked together and
communicated with each other to distribute their processing load. The network brought
about a systemic shift in mindset from centralized to distributed processing. It also
dramatically increased the number of moving parts required for an application or service
to function. With data processing now occurring across more than one piece of hardware,
the health of each individual component directly impacts the success of the entire system.


Simple Availability with ICMP


The earliest network management solutions were singularly focused on the availability of
computer systems and the network that connected them. As Figure 3.1 shows, the earliest
attempts at measuring the availability of a component were coarsely focused on whether
that server responded to an ICMP “ping” packet.

Figure 3.1: Early network monitoring was singularly concerned with system and
device availability.

This solution works well for low-criticality environments because it is elegantly simple. If I
ping the server every 2 minutes, I’ll know that the server has gone down no more than 2
minutes after the outage occurs. Implementing basic availability monitoring is a key step
for organizations that want to move from Gartner’s Survival phase (“little to no focus on IT
infrastructure and operations”) to its Committed phase (“moving to a managed
environment, for example, for day-to-day IT support processes and improved success in
project management to become more customer-centric and increase customer
satisfaction”).
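One polling sweep of that kind can be sketched as follows, with the probe function left pluggable (in a real deployment it would wrap an ICMP ping; the host names here are invented):

```python
def poll_once(hosts, probe):
    """One availability sweep: call `probe` for each host and return
    the hosts that failed to respond."""
    return [host for host in hosts if not probe(host)]

# With a sweep every 2 minutes, the worst-case detection delay for an
# outage is one full polling interval: 2 minutes.
down = poll_once(["web01", "db01"], lambda host: host != "db01")
```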

But basic availability metrics can only go so far. Servers that are experiencing a processor
spike condition may be wholly incapable of processing useful data. Responding to an ICMP
“ping” request is an extremely low-level interrupt that requires virtually zero processing
power. Thus, even an unhealthy server can usually successfully respond to a ping request.
As such, basic availability metrics generally cannot identify when a server is not down but
merely hung.

Richer Information with SNMP


Over time, more information was deemed necessary to identify an environment’s health.
Due to the complexities of network traffic management, networks were one of the first
parts of the IT environment to gain more granular instrumentation. One early solution
began with the development of the Simple Network Management Protocol (SNMP) back in
the late 1980s.


Figure 3.2: With SNMP, any SNMP-aware information can be gathered through a
request/response interaction.

With SNMP, a framework was established for requesting and receiving detailed
information from networked devices. That framework operates on a request/response
basis, with a network manager requesting information from an onboard SNMP agent
through a network call (see Figure 3.2). The network manager identifies the category of
information requested by its Management Information Base (MIB) Object Identifier (OID).
This OID is a unique identifier for the specific piece of information being stored by the
client. For example, in Figure 3.2, the network management server is attempting to GET the
information located at OID cpsModuleModel.3562.3. Each OID is globally unique across all
devices, and the contents stored at an OID can be almost anything:

• Network statistics
• Device configuration information
• Sensor information
• System or device performance metrics
As configured by an administrator, it is the job of the network manager to determine which
information is interesting and should be polled. That information is stored within the
network management system’s database for later review by an administrator. The network
management system is configured to alert administrators when received information
indicates inappropriate behaviors. Similarly, clients can alert the network management
system unilaterally through an SNMP trap when special conditions occur that require more
immediate attention.
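The exchange can be modeled in miniature. The sketch below is a toy illustration of the request/response shape, not a real SNMP implementation (a real deployment would use an SNMP library); sysDescr.0 and ifInOctets are genuine MIB-II object identifiers, but the stored values are invented:

```python
class ToyAgent:
    """Toy stand-in for a device's onboard SNMP agent: its MIB is
    modeled as a plain mapping from OID string to stored value."""
    def __init__(self, mib):
        self.mib = mib

    def get(self, oid):
        """Answer a manager's GET request for a single OID."""
        return self.mib.get(oid)

agent = ToyAgent({
    "1.3.6.1.2.1.1.1.0": "edge-router-03",  # sysDescr.0 (device description)
    "1.3.6.1.2.1.2.2.1.10.1": 48291042,     # ifInOctets for interface 1
})
description = agent.get("1.3.6.1.2.1.1.1.0")
```

The manager's whole job, in this simplified view, is deciding which OIDs are worth asking for and how often to ask.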


Device Details with the Agent-Based Approach


SNMP was an excellent solution for determining information about devices all across a
network, and is still heavily used today. However, SNMP’s design wasn’t without its own
architectural limitations. Although SNMP remains in use today for the monitoring of
network hardware, that original network focus limited its acceptance in the realms of
servers, server OSs, and applications.

There are numerous reasons for SNMP’s limited scope and reach here. In order for an
SNMP-enabled network manager to gather information from any element on the network,
that element must have SNMP awareness. Thus, every device, operating system (OS),
application, and service must internally convert its own instrumentation data into the
format that SNMP understands. Also problematic is the poll-based nature of SNMP. SNMP is
configured to poll devices for their information on a regular basis, with the network
management server the source of those polls. This design can create a bottleneck as the
number of monitored devices scales, and it limits the resolution of the data. Finally, until only recently,
SNMP lacked key security features.

To combat SNMP’s intrinsic limitations within non-network devices, OS and application
manufacturers designed agent-based solutions. These solutions gathered otherwise
unavailable availability and performance data from servers. In order to successfully access
this data on systems, these agent-based solutions required the installation of client
software. Here, the client’s job was to leverage its on-system privileges to gather necessary
data and eventually transfer it to a centralized monitoring solution (see Figure 3.3).

Figure 3.3: Agent-based solutions leverage on-board clients to gather and transfer
monitoring information to centralized monitoring servers.


Agent-based solutions are predominantly device-specific, focusing on server and
application metrics for elements that are available on the system. Their inline installation
means they have a deep level of access to on-system event and performance metrics. Agent-
based solutions can provide details about the activities on a system as well as their
resource use.

As with SNMP solutions, agent-based solutions are in widespread use today. For servers,
services, and applications, these solutions enjoy benefits over and above SNMP-based
solutions because their information does not need translation into a format that is
understood by SNMP. For example, if an Oracle database decides to store performance
information one way, while a Siebel installation elects another method, both can be easily
encoded into the agent software. The agent can collect this information irrespective of
original vendor, source, or format, and translate it into a format that is useable by the
central monitoring solution.

Agent-based solutions also enable a much greater resolution of data, enabling monitoring
to scale with the needs of high-performance and high-criticality systems. The result is a
high degree of data resolution associated with metrics on the individual system itself.
Figure 3.4 shows an example of a graph that can be created out of such data, showing how
the health of the server and an installed database are graphed over time. As later chapters
will discuss, visualizations like this roll up low-level metrics to provide a high-level
understanding of a component’s health and service quality.

Figure 3.4: Multiple agent-gathered metrics can be consolidated to create an overall picture of a component’s health.


Situational Awareness with the Agentless Approach


Agent-based monitoring solutions filled a critical gap in early methods. They provided a
system-centric approach to gathering performance information. This system-centric
approach enabled administrators to be alerted when systems experienced problems other
than core availability, such as when performance behaviors degraded beyond acceptable
thresholds. The agent-based approach also provided a more system-friendly framework for
measuring performance across homegrown applications, with some solutions tying into
code frameworks for additional exposure.

However, the strengths of the agent-based approach also give rise to its primary weakness. As a naturally system-centric solution, the agent-based approach only looks at information on the system itself, and only from that system’s perspective. Relating instrumentation data across multiple systems wasn’t natively possible with an agent-based approach unless that relation was done at the central monitoring solution. This proved problematic: for many environments, the limitations of the agent-based approach grew more obvious as the number of hardware instances required to build an application or service increased. Most specifically, using only an agent-based, server-centric approach, it was impossible to visualize impacts coming from the rest of the environment.

The Impact of Externalities


Environment-based impacts aren’t a new problem. If you’ve ever attempted to use the free
broadband Internet at your local coffee shop on a busy day, you’re familiar with the impact
of external forces. Although your computer might be running perfectly, its performance on
the network is greatly impacted by the Internet surfing of everyone else on that network. In
this case, if your monitoring is limited to the agent on your computer, that agent has no
capability to understand the problem because its scope is limited to just your system.

Relating this situation to a business application scenario, let’s take another look at the
simplistic system discussed back in Chapter 1. In that system, shown again in Figure 3.5, a
number of elements integrate to provide a service for end users. However, in this second
scenario, a network-based tape backup device is also on the network. Due to a
misconfiguration by an administrator, a large backup has been initiated against the
mainframe during the workday.


[Figure 3.5 diagram. Problem: tape drive initiates over-the-network backup. Result: backup traffic saturates available storage bandwidth. Components shown: User, Firewall, Web Server, Application Server, Database Server, Mainframe.]

Figure 3.5: The actions of an unrelated device can cause performance problems on
the monitored system.

Large-scale tape backups are usually scheduled to occur during times of low processing requirements. One reason is that the backup process can require an incredible amount of network bandwidth if not properly tuned or segregated onto alternative networks.
In this case, the entire customer-facing system is affected by a mistake made on a
completely unrelated device. Information gathered by agents on each of the devices cannot
easily show that a problem is occurring. Yet, the net result is a substantial reduction in
performance across the entire system.

It is for situations like this that an agentless approach is additionally necessary. Figure 3.6
shows an example report from an agentless monitoring solution. This report shows that the
primary consumer of network bandwidth is related to VPN traffic. The results from this
report can and should be cross-referenced with information from agent-based
visualizations to get a better situational awareness of network conditions.


Figure 3.6: An agentless monitoring approach reports on aggregate traffic across the
environment.

Although situations like this tape backup mistake are unlikely to happen in a well-managed
production network, other environment behaviors can and do have impact on application
performance. Perhaps an e-commerce system experiences a flood of requests, limiting the
data it can successfully process in a particular period of time. Or, the overuse of a separate
and unrelated system on a shared network impacts the performance of a customer-facing
application. Only through the simultaneous use of both agent and agentless monitoring can
an IT organization get a complete understanding of its environment’s behaviors.


That’s a Lot of Traffic!


As you can imagine, the level of traffic that such a system must monitor is
enormous. With dozens or hundreds of devices communicating at extremely
high speeds, the primary responsibility of an agentless solution is usually to
determine what traffic can safely be ignored.
One of the ways that network monitors like the ones explained in this section limit that traffic is through the use of rules. These rules allow the device to watch only for traffic matching certain characteristics. An effective agentless solution should include a large number of filters that can be used to create such rules. Those filters can relate to, but are not limited to:
• IP address (unicast or multicast), or IPX socket
• TCP/UDP characteristics
• MAC address
• Microsoft RPC/DCOM ID
• Novell NetWare SAP message
• Database ID
• SNA Application ID
• SOAPAction string within an HTTP request
• Sun RPC Program Number
• URL path
• Identified protocol
The more options available for filtering, the more precisely monitoring can be focused on the traffic that matters as it goes by. A primary step in implementing this kind of monitoring is therefore the characterization and isolation of the types of traffic you want to monitor.
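To make the idea concrete, here is a minimal sketch of rule-based filtering. The rule fields, addresses, and ports are illustrative and not drawn from any particular product:

```python
# Hypothetical sketch: rule-based filtering decides which traffic an
# agentless monitor keeps. A packet is kept when ANY rule matches, and a
# rule matches when ALL of its fields equal the packet's fields.

RULES = [
    {"proto": "tcp", "dst_port": 80},    # keep plain HTTP
    {"proto": "tcp", "dst_port": 1521},  # keep Oracle SQL*Net traffic
    {"src_ip": "192.0.2.10"},            # keep anything from one host
]

def matches(packet, rule):
    # Every field named by the rule must be present and equal in the packet.
    return all(packet.get(field) == value for field, value in rule.items())

def keep(packet):
    """Return True if any configured rule says this packet is worth keeping."""
    return any(matches(packet, rule) for rule in RULES)
```

Everything that fails every rule is safely ignored, which is how the monitor copes with the sheer volume of traffic.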

Direct and Indirect Monitoring


The agentless approach leverages both direct and indirect network integrations in order to
see this data as it crosses the network. Direct network integrations gather their
information through the use of network probes. These probes are attached between
devices at various parts of the network. They passively watch traffic as it goes by, reporting
their findings back to the centralized network monitoring solution.

Indirect network integrations operate in a much different way. Rather than installing
devices directly on the network, physically bridging connections between devices, indirect
network integrations interface with specially-enabled network devices to directly gather
their statistics. Common protocols such as NetFlow, J-Flow, sFlow, and IPFIX operate across different network components to gather flow-based traffic information from devices.
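As a small illustration of what a flow-based integration consumes, the sketch below unpacks the 24-byte header of a NetFlow version 5 export datagram, following Cisco’s published v5 layout. A real collector would go on to parse the flow records that follow the header:

```python
# Minimal sketch: reading a NetFlow v5 export header, the kind of datagram
# an indirect integration receives from a router. Field layout per the
# documented v5 format: version, count, sys_uptime, unix_secs, unix_nsecs,
# flow_sequence, engine_type, engine_id, sampling_interval (24 bytes total).
import struct

V5_HEADER = struct.Struct("!HHIIIIBBH")  # network byte order

def parse_v5_header(datagram):
    (version, count, sys_uptime, unix_secs, _unix_nsecs,
     flow_sequence, _engine_type, _engine_id, _sampling) = \
        V5_HEADER.unpack_from(datagram)
    if version != 5:
        raise ValueError("not a NetFlow v5 datagram")
    return {"version": version, "count": count,
            "sys_uptime_ms": sys_uptime, "exported_at": unix_secs,
            "flow_sequence": flow_sequence}

# Build a synthetic datagram to exercise the parser.
fake = V5_HEADER.pack(5, 2, 123456, 1700000000, 0, 42, 0, 0, 0)
header = parse_v5_header(fake)
```

The `count` field tells the collector how many 48-byte flow records follow, each describing one unidirectional conversation on the wire.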


[Figure 3.7 diagram labels: Monitoring Server, Storage, User, Firewall, Web Server, Application Server, Database Server, Mainframe.]

Figure 3.7: Network probes and on-device monitoring protocols such as NetFlow
enable agentless monitoring across the entire network.

A singular difference between the direct and indirect methods is whether an actual
device—the “probe” itself—must be physically installed between connections. Obviously,
the installation and maintenance associated with physical probe devices adds an
administrative burden to their use. However, probes can be installed virtually anywhere,
making them highly flexible in heterogeneous networks. This is in contrast to the easy
administration associated with indirect integrations. Devices must natively support such
integrations, so not all areas of the network may be accessible. Figure 3.7 shows an
example of both types.

Transaction-Based Monitoring
And yet even with these two types of monitoring integrations in place, mature
environments still found themselves lacking in the depth of visibility into applications.
Although agent-based monitoring provides information about individual systems and
agentless monitoring fills out the picture with network statistics, a much deeper level of
understanding is still necessary.

That “deeper level of understanding” arrives with a type of monitoring that digs past
aggregate network statistics to peer into the individual transactions themselves between
elements of a system. By looking at the individual transactions that occur between system elements, it is possible to identify areas where code performance or inter-server communication is at fault.


To give you an idea of how transactions work, think about the last time you clicked on a
link for an image in your favorite browser. Clicking that link directed your browser to
request the download of the image. Within a few seconds, that image was rendered in your browser for you to view.

Figure 3.8: Multiple conversations (“transactions”) must occur for a single image to
be downloaded from server to browser.

But what goes on in the background when such a request is made? What kinds of
conversations are required between your local computer and the remote server for that
image to successfully make its way across the Internet to your laptop? In actuality, the
conversation between client and server can be amazingly complex. Figure 3.8 shows an
example of the communications that must occur for the image euromap.gif to be
successfully downloaded off the Internet. Requests, replies, and acknowledgements are all
required steps for what seems simple on the surface.


Transaction Monitoring in Action


In Figure 3.8, you can easily see the layers of complexity involved with downloading just a
single image. Now expand that necessity across all the servers—database, business logic,
mainframe, and presentation—that make up the application or service you want to
monitor. A simple user request to add an item to their shopping cart can immediately
launch a chain of events across multiple servers in the environment:

• A Web server queries a business logic server to process the request.
• A business logic server queries a mainframe for product inventory and price.
• A mainframe verifies product inventory and provides a response to a business logic
server.
• A business logic server increments that user’s shopping cart token by one.
• A business logic server instructs a Web server to refresh the user’s profile with the
updated count.
• A Web server then refreshes the page, reporting the successful addition back to the
user.
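A rough sketch of how an agent might time each hop in a chain like the one above follows. The hop names are taken from the steps; the `time.sleep` calls are stand-ins for real inter-server calls, and the decorator pattern is one hypothetical way to capture the timings:

```python
# Hypothetical sketch: wrap each inter-server call so its duration is
# recorded as a named "span", then rank spans to find the slowest hop.
import time

def traced(name, spans):
    def decorator(fn):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Record (hop name, elapsed seconds) even if the call fails.
                spans.append((name, time.perf_counter() - start))
        return wrapper
    return decorator

spans = []

@traced("web->logic", spans)
def query_business_logic():
    time.sleep(0.001)   # stand-in for the real call

@traced("logic->mainframe", spans)
def check_inventory():
    time.sleep(0.05)    # stand-in: deliberately the slow hop

query_business_logic()
check_inventory()
slowest = max(spans, key=lambda s: s[1])  # the hop to investigate first
```

Ranking the spans immediately points the troubleshooting administrator at the hop where time is actually being spent.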
Consolidating all those transactions into a visualization that makes sense for the
administrator is no easy task. The system must collect the right information; it must also
present that information in a way that is digestible for its user. Effective APM solutions
enable multiple mechanisms for visualizing the communication between components,
including Thread Analysis Views like what is shown in Figure 3.9 and Conversation Maps
similar to Figure 3.10.

Figure 3.9: A Thread Analysis View provides a look at transaction details by time.


The information in a Thread Analysis View is gathered through the analysis of particular
types of traffic occurring between identified application components. In Figure 3.9, an
HTTP GET command is being analyzed along with backend SQL queries related to the
request. Identifying the source and destination of the transaction along with its payload
and description assists the troubleshooting administrator with deconstructing the
transaction into its disparate components. This process is similar in function to the analysis
of a network trace done with network management tools. However, unlike the network
trace’s focus on individual packets, here the analysis is elevated to the level of the
transaction.

Figure 3.10: A Conversation Map illuminates which components are talking with
whom.

Elevating the analysis even further creates a Conversation Map view. This high-level view
assists the administrator with a look at the components involved in a transaction as well as
characteristics about the communication. This information is useful for identifying which
participants might be the cause of a performance issue or other problem.

These graphics are obviously only a small portion of the visualizations that can be created
through the use of an effective APM solution. Chapter 7 will focus exclusively on
visualizations like these, while Chapter 8 will walk through a specific troubleshooting scenario by way of an extended example.


Application Runtime Analysis


Yet another area of integration relates to the applications themselves. Most business applications are composed of numerous programs, code frameworks, and middleware elements that work in concert to provide a service. These separate but integrated components are necessary, as each provides some function for the end solution.

Figure 3.11: Product-specific hooks provide deep insight into their behaviors.

Let’s re-imagine the simplistic environment once again, this time attaching some well-known products to the otherwise generalized terms “Web Server,” “Application Server,” and so on. Figure 3.11 shows a Cisco firewall, an Apache Web server running custom Java code, an Oracle database, and Siebel middleware, all connecting back to a zSeries mainframe. Although the actual product names themselves are unimportant for this discussion, the fact that shrink-wrapped products are components of this environment is.

APM’s transaction-level monitoring enables the capacity to peer into the individual
conversations that occur between servers and applications in your environment. Yet
today’s enterprise applications themselves also support pluggable mechanisms for
gathering instrumentation data directly from the application itself. This instrumentation
data can provide additional insight into the inner workings of the applications in your
infrastructure.

Consider the situation where a custom-built Web site is created atop an Apache Web
engine and built in part using the Java language. In this case, determining the inner
performance characteristics of the Web server and language might be best served by
querying directly to Apache’s and Java’s internal metrics frameworks. This internal
information can be merged with transaction statistics to gain an understanding of where
processing delay is occurring—at the client, on the server, or within the network.
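For instance, Apache’s mod_status module (when enabled) publishes a machine-readable report at /server-status?auto. A minimal sketch of parsing that report follows; the sample text is invented for illustration, though the key names match mod_status’s documented output:

```python
# Sketch: pull internal Web-server metrics from Apache's mod_status
# ?auto report, which is a simple "Key: value" text format. In a real
# deployment the text would come from an HTTP request to the server.

def parse_server_status(text):
    metrics = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            metrics[key.strip()] = value.strip()
    return metrics

SAMPLE = """\
Total Accesses: 50320
Total kBytes: 84123
ReqPerSec: 2.17
BusyWorkers: 8
IdleWorkers: 12"""

status = parse_server_status(SAMPLE)
busy = int(status["BusyWorkers"])
```

Metrics like worker saturation can then be merged with transaction timings to decide whether delay originates inside the server itself.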


End User Experience


Concluding this discussion on APM’s monitoring integrations is a discussion on
monitoring’s “last mile.” The behaviors of your application that are seen at the client itself
are in many ways the most important facet of any APM solution. Think for a minute about
this chapter’s evolving conversation on monitoring. When looking at a large-scale business
application, it is, for example, possible to

• Measure the memory use on a server using on-board agents
• Measure the transaction rate between middleware systems and databases
leveraging agentless network monitoring
• Measure the processing rate within a language framework or application
However, by itself, none of this data directly tells you what the user experiences when they click the “Add to Cart” button on your Web site.

It can be argued that End User Experience (EUE) monitoring encompasses some of today’s
most advanced monitoring technologies. It is, after all, one of the most recent of the
available enterprise monitoring approaches. EUE monitoring provides its value by
measuring the performance of the application from the perspective of its ultimate
customer—the end user.

EUE functions in three very different but very important ways. The first of these is through
the introduction of client-based monitoring at the user’s location itself. This can occur
by installing an agent at the end user’s location or through the use of special probes. By co-locating an agent with the end user, that agent can monitor the
known behaviors of your system. It can then look for situations in which end-state
performance back to the user has degraded past acceptable thresholds. By locating
monitoring agents at the client itself, the individual transactions associated with end user
behaviors can be mapped out and timed to validate that your end users are experiencing
the right level of service.

A second way to measure the end user performance of your applications is through the use
of automated “robots.” Also located in areas where end users make use of your application,
these robots run a set of predefined scripts against your application. These scripts leverage
synthetic and actual transactions that are very similar to the types of actions a typical user
would perform against the system. For example, if users click through a Web site, attempt
to add items to a shopping cart, and ultimately check out, these types of actions should be
simulated by the automation robot.
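A hypothetical robot script might look like the sketch below: run the same predefined steps repeatedly, time each one, and flag any step slower than its acceptable threshold. The step functions and thresholds here are illustrative stand-ins for real browse/add/checkout code:

```python
# Sketch of a synthetic-monitoring "robot": execute scripted steps in
# order, timing each, and report the steps that breach their thresholds.
import time

def run_script(steps, thresholds):
    """steps: {name: callable}; thresholds: {name: max acceptable seconds}."""
    alerts = []
    for name, action in steps.items():
        start = time.perf_counter()
        action()
        elapsed = time.perf_counter() - start
        if elapsed > thresholds[name]:
            alerts.append((name, elapsed))
    return alerts

steps = {
    "browse":      lambda: time.sleep(0.001),
    "add_to_cart": lambda: time.sleep(0.05),   # deliberately slow stand-in
    "checkout":    lambda: time.sleep(0.001),
}
thresholds = {"browse": 0.5, "add_to_cart": 0.01, "checkout": 0.5}
alerts = run_script(steps, thresholds)
```

Because the robot runs identical steps every cycle, any drift in their timings is immediately meaningful.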


Figure 3.12: EUE integrations can be co-located with the users themselves or run
through robots for consistent automation.

Since these actions and their scripts are well-defined, the timing associated with their
processing is also a known quantity. As the robot runs the same scripts over and over, it is
possible to quickly determine when service quality diminishes for the end user. EUE is a
powerful component of APM monitoring, one that is arguably its greatest value proposition
for your business applications.

The third way to measure the end user experience involves the use of physical probe devices that
reside in-line with network endpoints. These special hardware units are particularly useful
in situations where metrics cannot be gathered directly from other network components.
This may be due to ownership or other reasons. More on this third method as well as on
probes will be discussed in Chapter 4.

EUE on the Enterprise WAN


Although EUE is an obvious play for Internet-based applications, a similar set of benefits can be found with enterprise applications that operate across a Wide-Area Network (WAN). The enterprise WAN monitoring functions that EUE provides are similar in function to the agentless approach but different in scope: they are geographic rather than operational in nature. They are no less useful, however, for organizations that span multiple sites, countries, and continents.

Incorporating EUE into the enterprise WAN may require the installation of agents or robots across many or all of the individual sites within the WAN. By installing these
components to multiple locations in the environment, the end users’ perspective can be
measured based on geographic location and any network behaviors that are experienced in
that locale.


For example, consider the situation where a business application is homed in Denver, Colorado but is used by employees throughout the United States, EMEA, and Australia. That same application might function with no problems for the users in the United States. The low latency of those connections might mean that even low-bandwidth links support the application’s use with no issue for the end user.

However, EMEA and Australia users might see a different situation entirely. Due to the
realities of physics, even the fastest connection between the United States and other
continents adds a specific quantity of network latency. If your business application is
latency intolerant but bandwidth insensitive, the users in those locations could be on the
fastest connection possible but still see a low-quality experience with your application.
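A back-of-the-envelope calculation shows why: light in optical fiber travels at roughly two-thirds the speed of light in a vacuum, about 200 km per millisecond, so distance alone sets a floor under round-trip time that no amount of bandwidth can remove. The distances below are rough great-circle figures used purely for illustration:

```python
# Illustrative physics check: the minimum theoretical round-trip time
# over a direct fiber path is set by distance and propagation speed alone.

FIBER_SPEED_KM_PER_MS = 200.0  # ~2/3 the speed of light in a vacuum

def min_rtt_ms(distance_km):
    """Floor on round-trip time (ms) over a direct fiber path."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_MS

denver_to_sydney = min_rtt_ms(13400)   # ~134 ms before any processing at all
denver_to_chicago = min_rtt_ms(1480)   # ~15 ms over the same physics
```

Real routes are longer than great-circle paths and add queuing and processing delay on top, so observed latencies will only ever exceed these floors.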

In short, EUE’s ability to quantify the user’s experience is crucial to maintaining your
customer satisfaction.

APM = ∑ the History of Monitoring


This chapter began by explaining how a history of monitoring is necessary to truly
understand APM. This is the case because APM’s expansive monitoring capabilities
encompass the summation of that history. Businesses and their network environments
have evolved their network, system, and application monitoring over the years to include
integrations at virtually every layer of the application infrastructure. Server and network metrics are augmented by transaction data, and transaction data is further augmented by a focus on the user’s perspective. Centralizing that data into a unified
solution across every integration point brings power to what would otherwise be a
collection of point systems.

With monitoring’s potential now established, the next topic is how APM can be implemented into your existing network environment. Chapter 4 discusses the processes and practices associated with integrating APM into your existing application infrastructure. Tiering its monitoring integrations across users, applications, databases, and mainframes requires coordination at multiple points, and Chapter 4 will assist you with understanding how to accomplish those tasks correctly.


Chapter 4: Integrating APM into Your Infrastructure
“You want me to enable this on all the equipment? Switches, routers, everything?”

“Yep. That’s the plan. Point everything back to 192.168.0.55. That’s the server we’ve set up to
collect all this information.”

The network engineer sits back in his chair, patently annoyed with the request. “You realize how much effort this will involve, connecting to every network device all across the network to do this one change? When do you need this done?”

“How about…now…?” responds John Brown, IT manager for TicketsRus.com. Now it’s John’s
turn to get annoyed at how this conversation is going. He’ll admit that this is a big request,
but that’s why he pays his network engineer a salary, to make these sorts of things happen.

John thinks a bit as he sits across the desk from his network engineer. He’s been reading more
of late about Gartner’s concept of IT maturity, most especially since that last incident on
Memorial Day. In learning more about the differences between a reactive IT culture and one
that proactively solves problems, he’s come to realize how survival-oriented his organization
really is. That Memorial Day incident should have never happened…

“So tell me again about this ‘magical’ monitoring solution you’re dreaming up, John. You
know we’re already collecting MRTG statistics off all the network equipment. You look at the
same graphs I do. We know when we’re seeing problems with bandwidth or when our ISP isn’t
meeting their agreements,” continues the network engineer, obviously irritated about this intrusion into his team’s traditional boundaries. “I hate to be blunt, but why should my team care about this new technology?”

John fires back, “Because it’s this ‘new technology’ that’s going to save this company. You too
got called in for the Memorial Day incident. You remember how long it took us to track down
the solution.”

“It was the developer’s fault!”

“True,” responds John, mentally noting this conversation for the engineer’s performance
appraisal coming up later this year, “but the extra 2 hours we spent sitting around the table
pointing fingers at each other didn’t get us up and running any faster. That’s what the APM
solution is for, finger-pointing prevention.”


The engineer chuckles, “…and emergency Memorial Day quit-the-barbeque-and-drive-in prevention too!”

That breaks the ice, giving them both a laugh out of their situation. From the outside looking
in, TicketsRus.com appears like a single entity to its customers, providing a unified storefront
for selling tickets. As a technology company, however, the reality of its internal struggles
between teams is a constant battle.

John leaves the engineer’s office and reflects a bit on the conversation as well as all the other
similar conversations he’s had with server administrators, developers, and even fellow
members of management. This APM installation is more than just a technology insertion. Just
getting this solution installed has been a lesson in professional growth as much as clicking
Next, Next, Finish. John realizes that the sheer process of fitting his “magical” monitoring
solution into TicketsRus.com’s culture is in and of itself a maturing activity.

He just can’t wait to see what the thing will look like as a finished product.

Implementing APM Isn’t Trivial, Nor Is Its Resulting Data


As you can see in this chapter’s story, integrating an APM solution into your environment is
no trivial task. Although best-in-class APM software comes equipped with predefined
templates and automated deployment mechanisms that ease its connection to IT
components, its widespread coverage means that the initial setup and configuration are
quite a bit more than any “Next, Next, Finish.”

That statement isn’t written to scare away any business from a potential APM installation.
Although a solution’s installation will require the development of a project plan and
coordination across multiple teams, the benefits gained are tremendous for assuring quality service to customers. Any APM solution requires the involvement of each of IT’s
traditional silos. Each technology domain—networks, servers, applications, clients, and
mainframes—will have some involvement in the project. That involvement can span from
installing an APM’s agents to servers and clients to configuring SNMP and/or NetFlow
settings on network hardware to integrating APM monitoring into off-the-shelf or
homegrown applications.

In this chapter’s story, the network engineer refers to John’s APM solution as “magical,” yet the situational awareness it produces couldn’t be further from magic. Rather than a source of subjective mysticism, an APM solution enables a level of objective analysis heretofore unseen in traditional monitoring.


Figure 4.1: APM’s integrations enable real-time and historical monitoring across a range of IT components, aggregating their data into a single location for analysis.

The realities of that objective data are best exemplified through APM’s mechanisms to
chart and plot its data. Figure 4.1 shows a sample of the types of simultaneous reports that
are possible when each component of an application infrastructure is consolidated beneath
an APM platform. In Figure 4.1, a set of statistics for a monitored application is provided
across a range of elements. Take a look at the varied ways in which that application’s
behaviors can be charted over the same period of time. Measuring performance over the
time period from 10:00 AM to 7:00 PM, these charts enable the reconstruction of that
application’s behaviors across each of its points of monitoring.

Chapter 8 will analyze graphs like this in dramatically more detail, running through a full use case associated with a problem’s resolution. But, for now, let’s take a look at some of
the information that can be immediately gleaned through the types of visualizations seen in
Figure 4.1.

Finding Meaning in Charts and Graphs


Starting in the upper-right graph of Figure 4.1, at approximately 3:00 PM, the application
experienced a dramatic increase in its number of HTTP Errors over a baseline of zero. This
information in and of itself assists the troubleshooting administrator with recognizing that
a problem has occurred. However, all by itself, it doesn’t necessarily point towards that
problem’s solution.


With this information in hand, it is possible to cross-reference the problem’s timeframe with other metrics that were collected at the same time:

• During and immediately prior to that same period—identified by the red vertical
bar in each graphic—you can quickly determine that the problem was preceded by a
spike in Web server processor use. This spike is found from the data in the upper-left
graph.
• During and immediately after the problem, a spike in Web server response time was
also experienced. This data can be distilled from the information presented in the
top-middle graph.
• Perhaps as a result of the problem, or as one of its root causes, simultaneous drops
were experienced in network link utilization (third row, middle), percentage of slow
transactions (second row, left), and accounting transactions (fourth row, left).
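The cross-referencing described above can be sketched as a simple comparison of each metric series against its baseline. The metric names, data points, baselines, and tolerances below are invented purely for illustration:

```python
# Hypothetical sketch: given several metric series sampled on the same
# timeline, find which ones deviate sharply inside the problem window.

def spikes(series, baseline, tolerance):
    """Indices where a series strays beyond baseline +/- tolerance."""
    return {i for i, v in enumerate(series) if abs(v - baseline) > tolerance}

metrics = {
    "http_errors":   [0, 0, 0, 40, 38, 1, 0],
    "web_cpu_pct":   [20, 22, 70, 75, 30, 21, 20],
    "link_util_pct": [55, 54, 53, 20, 22, 52, 55],
    "disk_free_gb":  [80, 80, 80, 80, 80, 80, 80],
}
baselines = {"http_errors": (0, 5), "web_cpu_pct": (21, 20),
             "link_util_pct": (54, 15), "disk_free_gb": (80, 5)}

# The problem window is where the triggering metric (HTTP errors) spiked.
problem_window = spikes(metrics["http_errors"], *baselines["http_errors"])

# Any other metric that also deviated inside that window is a candidate
# for correlation with the problem.
correlated = [name for name, series in metrics.items()
              if name != "http_errors"
              and spikes(series, *baselines[name]) & problem_window]
```

Metrics that deviate inside the same window become the short list for root-cause analysis; those that stay flat, like free disk here, can be ruled out quickly.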
With this information in hand, one can theorize that a natural correlation between these
events has occurred. Some situation on the Web server appeared to cause the short spike in
the HTTP error rate. That spike in error rate caused a simultaneous drop in processing, one
that may have been noticeable by the application’s users. The net result is a behavioral
change on the part of the distributed application.

At this point, you should be thinking, “This analysis comes to a pretty obvious conclusion
about the problem’s culprit.” Yet that’s exactly the point that’s important to recognize: Your
initial assumption of a system problem is often not the problem itself but merely a symptom of
the problem.

This example started out by looking at a recognizable spike in the HTTP error rate. Yet
without a monitoring system in place to notify on this problem, your actual starting point
for a problem like this might instead be with the accounting group and a phone call to the
service desk. Perhaps that group noticed a short slowdown in their data processing.
Perhaps the network’s MRTG statistics showed a slight and unexplained dip in link
utilization. Maybe a user called in with a concern that the application “was working slow
today.”

In all these cases, quickly identifying the correlation between the root cause of a problem
and the down-level fallout from that problem is only possible using statistics across the
gamut of that application’s elements. To access this data, however, requires the
involvement of individuals across the IT organization. Implementing APM’s integrations
into infrastructure components, network devices, and applications requires concerted
effort. This chapter will discuss some ways in which those integrations are commonly
inserted into your existing technology infrastructure.


The Tiering of Business Applications


Chapter 3 discussed how today’s applications are much larger in scope and dramatically
more complex than those of yesteryear. Today’s business environment requires processing
that is at the same time extremely data-driven, highly-available, and interconnected with
multiple systems both within and outside the control of that application’s administrators. This
complexity is necessary due to the distributed processing requirements of many customer-
facing businesses today.

Consider for a moment the architectures that are required to build such systems today.
Massive levels of redundancy are required; that redundancy brings higher availability but also adds complexity. Tracing down a system problem
grows significantly more difficult when that problem could have occurred on any one of
many servers in a cluster.

The data needs of applications are similarly more challenging. Customer-facing applications require massive amounts of data along with corresponding levels of analysis
during each click of the mouse. Today’s applications may require the analysis of user usage
patterns in real time, enabling the dynamic presentation of data based on the expected
needs of users. This added processing can slow the user’s experience if not properly
architected.

Customer-facing systems also require real-time updates while leveraging third-party “cloud-based” solutions such as credit card processing. These external connections require
additional care to protect the environment while ensuring the right levels of service from
external service providers.

To give you an example of how a complex system like this might look, see Figure 4.2. This
example represents a large-scale system not unlike what you might expect out of an e-
commerce company like TicketsRus.com. This system contains multiple services and
interconnections, including some that are out of the direct control of local administrators.


Figure 4.2: An example customer-facing e-commerce system with linkages to inventory
mainframes, external credit card processing, and internal state/basket management.


In this system are a number of elements, each with a specific role to fill:

• In the first tier sits an externally-facing Web cluster that provides the front-end
servicing of business clients. That cluster handles the presentation load from
incoming clients, resting atop a second-tier Kerberos-based authentication system
for the processing of user logins and passwords. This cluster is the primary point of
entry for users to connect into the environment.
• Servicing the cluster is a set of second-tier systems. User data such as state and
shopping basket information is stored within a separate second-tier ERP system.
Inventory-processing functions have been offloaded onto a set of Java-based
application servers.
• Mainframe and order management systems are in the third tier. Here, product
inventory information is stored as well as the processing of business logic
associated with orders. This includes functions such as inventory, management of
user shopping baskets, suggestions for alternate and/or complementary products,
and so on.
• A final system in this third tier, the 3rd Party Credit Card Proxy, manages the local
processing for credit card orders. The job of this proxy is to work with the external
credit card processing service, forwarding credit card information and receiving
approvals.
• In the fourth tier is the routing equipment necessary for external connections to
suppliers—used for real-time inventory updates and other supplier
communication—as well as the third-party credit card processing facility.
This system is obviously highly comprehensive in the services it provides for its users and
its business. It can display a list of inventory on a Web page. It can process user mouse
clicks and present alternate and complementary products to users based on their click
habits. It can aggregate user-desired products into shopping baskets, and process their
credit cards for the completion of purchases. It can even tie into supplier extranets for the
automated updating of product information in real time. Whether for a company like
TicketsRus.com or any company requiring a customer-facing e-commerce system, this
example architecture is designed to provide the necessary kinds of functionality.

This tiering of clients to Web servers, Web servers to application servers, and application
servers to databases is a common architecture in many of today’s complex systems.
Interconnecting these systems are networks, firewalls, and security devices that ensure the
secure connectivity of data. That data is stored on centralized storage in multiple places.

It is these types of systems that make excellent starting points for an APM solution. As
revenue drivers, such systems must remain highly available while at the same time
providing their customers with an acceptable level of performance during each interaction.
This is critically important in the case of customer-facing solutions, because when your
system cannot perform to your customers' demands, you may find yourself losing business.


Business Applications and Monitoring Integrations


With the structure of Figure 4.2's business application in mind, consider the points of
integration where you might want monitors set into place. You will definitely want to
watch server processing. You'll need to record your network bandwidth utilization and
throughput. You'll need to know transaction rates between mainframes and inventory
processing.

All these monitors illuminate different behaviors associated with the greater system at
large, and all provide another set of data that fills out the picture you first saw in Figure
4.1’s charts and graphs. Now take a look at Figure 4.3, where some of these monitoring
integrations have been laid into place.


Figure 4.3: Overlaying potential monitoring integrations onto a complex system shows the
multiple areas where measurement is necessary.


Installing System Agents


But what exactly makes up that picture of a system’s health? How does the system
integrate all this data to create a much larger picture of the application’s quality of service?
That information is first gathered by the individual monitoring integration. Consider first
the data gathered through the agent-based approach.

Using agents that are installed directly onto individual servers, it is possible to gather
metrics directly off those servers. Each server OS has its own mechanism for gathering and
reporting on server-specific performance characteristics: The Microsoft Windows OS uses
the Windows Management Instrumentation (WMI) service for gathering such information,
storing it in a special area of the Windows registry, and presenting it to external servers
through either external WMI queries or its WS-Management Web service. Event log data is
stored in proprietary logs and presented through similar interfaces. Linux and UNIX
servers leverage a combination of tools—vmstat, iostat, netstat, nfsstat, and others—for the
gathering and dissemination of data. Event log data can be gathered and distributed
through the Syslog daemon.
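As a rough sketch of this pattern, consider the following Python fragment. The class name, payload format, and stand-in sampler functions here are hypothetical illustrations rather than any particular APM product's design; a real agent would plug WMI queries or vmstat/iostat parsers into the sampler slots and transmit the payload to its central server.

```python
import json
import time


class MonitoringAgent:
    """Minimal sketch of an APM agent: samples named metric
    callables and packages the readings for a central APM server."""

    def __init__(self, hostname, samplers):
        self.hostname = hostname   # server this agent runs on
        self.samplers = samplers   # {metric_name: zero-arg callable}

    def sample(self):
        """Take one reading from every registered sampler."""
        return {name: fn() for name, fn in self.samplers.items()}

    def build_payload(self):
        """Package a timestamped reading as JSON for reporting
        back to the central APM server."""
        return json.dumps({
            "host": self.hostname,
            "timestamp": time.time(),
            "metrics": self.sample(),
        })


# On Windows the samplers might wrap WMI queries; on Linux/UNIX
# they might parse vmstat or iostat output. Stand-ins used here:
agent = MonitoringAgent("web01", {
    "cpu_percent": lambda: 42.0,
    "disk_queue_len": lambda: 3,
})
payload = json.loads(agent.build_payload())
```

The central server's job is then simply to collect, store, and graph these timestamped payloads, which is how charts such as Figure 4.4 are built.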

Figure 4.4: % Processor Time for a Web server, as gathered by an agent on the Web
server itself.

In all these cases, the agents installed on each server are responsible for gathering the
right kinds of data and reporting it back to central APM servers. The net result is the
creation of graphs similar to Figure 4.4, where % Processor Time for the Web server is
shown over a 24-hour period of time. In that graph, it should be apparent that the Web
server’s utilization grew dramatically beginning at 10:00 AM during the day of
measurement.


The process of actually installing agents onto the correct servers in your environment is one
important facet of your APM installation and should be included as an action in your
project plans. Depending on the type of APM solution selected, that agent may require a
manual installation or may be automated through its console. Obviously, each agent added
to a system consumes slightly more of its resources for management functions. To reduce
the number of agents on a system, some APM solutions can leverage data from other
monitoring solutions. Using this "monitor-of-monitors" approach, the APM platform
gathers its data instead from another in-place monitoring solution. In the end, your
selection of an APM solution should consider its ability to integrate with the existing
management platforms that may already be present in your environment.

Beware Resource Use by Agents Themselves


Agents themselves should also be carefully screened to ensure that their
processing doesn’t negatively affect the overall processing on the server. The
actions completed by the agent in collecting performance and log data will
require a measure of processing power on the computers where they are
installed; however, the level of resource use by any APM agent should be
minimal. Effective APM agents should reside in the background and not have
a dramatic impact on server performance. Prior to deploying any APM
solution, test the functionality of its agents and verify that those agents
themselves will not cause a negative impact to your servers’ performance.

Augmenting Agents with Application Analytics


Agents on the server needn’t gather only system-focused performance metrics. Smart
agents are those that have been augmented with the capacity to gather performance and
other behavior characteristics from common middleware applications and databases as
well.

For example, an environment’s Order Management System might run atop an Oracle
database. It is entirely possible that server-centric statistics such as % Processor Time
cannot discretely capture the internal behaviors of the Oracle database. Perhaps that
database is processing a large number of “bad” records that impede its ability to correctly
work with the good ones. Maybe the application’s individual queries are not correctly
optimized. In both of these situations, it is feasible that an Oracle-specific behavior doesn't
directly manifest in server-centric metrics. What's needed are deeper integrations, built
into the installed agent, that can query Oracle's native performance statistics for additional
data.


Figure 4.5: Agent integrations into the Oracle database itself display more specific
data about the database’s behaviors.

Figure 4.5 shows a graph of information gathered from just that Oracle database. Here, the
same time period is shown as in Figure 4.4 but the gathered information instead shows the
percentage of “slow” transactions as defined by the administrator and experienced within
the Oracle database. As with Figure 4.4, the area of concern has been highlighted between
the red bars immediately after 10:00 AM.
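A "percentage of slow transactions" metric of this sort is straightforward to compute once the administrator has defined the threshold. The following Python sketch is a hypothetical illustration of the idea, not any vendor's actual calculation:

```python
def slow_transaction_pct(durations_ms, slow_threshold_ms):
    """Percentage of sampled transactions slower than the
    administrator-defined threshold, as plotted in Figure 4.5."""
    if not durations_ms:
        return 0.0
    slow = sum(1 for d in durations_ms if d > slow_threshold_ms)
    return 100.0 * slow / len(durations_ms)


# Example: 2 of 8 sampled query durations exceed a 500 ms threshold.
sample = [120, 90, 640, 210, 75, 880, 150, 300]
pct = slow_transaction_pct(sample, slow_threshold_ms=500)  # 25.0
```

In a real deployment, the agent would recompute this percentage over each measurement interval to produce the time series shown in the figure.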

In this second graph, it should be obvious that an increase in the rate of slow transactions is
highly correlated with Figure 4.4’s increase in processor utilization. The combination of
these two graphs illuminates a greater level of detail about how one application’s behavior
can have an impact on overall system performance. In this case, perhaps there is extra
processor use required to deal with slow transactions. Or, the individual transactions being
processed by the database are particularly complex or are not properly optimized for
performance.
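This notion of "highly correlated" can itself be quantified. A simple Pearson coefficient over the two time series is one way a tool might score how strongly two metrics move together; the sketch and the sample numbers below are hypothetical, not drawn from any product:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length
    metric series; values near +1.0 mean the series rise and
    fall together, values near -1.0 mean they move oppositely."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hourly % Processor Time vs. % slow transactions around 10:00 AM:
cpu = [20, 22, 21, 85, 90, 88]
slow_pct = [2, 3, 2, 40, 45, 43]
r = pearson(cpu, slow_pct)  # close to 1.0: the two spikes coincide
```

A high coefficient doesn't prove causation, but it tells the administrator which pairs of charts are worth examining side by side.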

When looking at an APM solution, look for one whose agents are enabled to support your
needed middleware applications and databases. By incorporating application- and
database-specific integrations directly into agents, it is possible to discover more detailed
information about the inner workings of these otherwise-opaque systems.

Although this capability is useful when the number of supporting applications and
databases is small, it grows substantially more useful as their count increases in the
environment. Figure 4.6 shows a rollup visualization that details the level of performance
across a set of applications.


Figure 4.6: Viewing the performance across a set of applications in a single visualization.

The graphic here details the behaviors across ten different applications that may be
involved in a particular infrastructure or business service. For each application, it shows
the number of users who may use that application, the total amount of data used by the
application, its aggregate performance, as well as the number of users that may be affected
should that application experience a problem.

The percentages displayed in the column marked “Application Performance” relate to that
application’s instantaneous performance. To create these percentages, an application
administrator or predefined template must identify which performance metrics make
sense for that application’s measurement. Also needed are the threshold values for
identifying when an application is not performing to expected levels. Your APM solution
should provide the internal mechanisms for identifying these thresholds.

The net benefit of these calculations arrives during normal operations. When one or more
applications’ performance levels degrade past acceptable levels, administrators can use the
information in the “affected users” column to prioritize their resolution effort to those with
the highest impact on users.
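That triage step can be expressed as a simple sort. The sketch below uses hypothetical rollup rows of the sort shown in Figure 4.6; the field names and the 90% performance floor are illustrative assumptions, not a standard:

```python
def prioritize_incidents(apps, performance_floor=90.0):
    """Return degraded applications (performance below the floor)
    ordered so the largest affected user populations come first."""
    degraded = [a for a in apps if a["performance_pct"] < performance_floor]
    return sorted(degraded, key=lambda a: a["affected_users"], reverse=True)


# Hypothetical rollup rows like those in Figure 4.6:
apps = [
    {"name": "Order Mgmt", "performance_pct": 72.0, "affected_users": 4100},
    {"name": "Inventory",  "performance_pct": 95.5, "affected_users": 900},
    {"name": "Web Store",  "performance_pct": 81.0, "affected_users": 12500},
]
queue = prioritize_incidents(apps)  # Web Store first, then Order Mgmt
```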

Configuring Devices for Network Analytics


As you’ve already learned, servers and their applications are only one source of a system’s
overall behaviors. You simply can’t get a comprehensive view of the network without
additional integrations into the network itself. Those integrations enable a look at
conditions such as bandwidth utilization, latency, and the all-important Web site
performance statistics, among others.

The actual gathering of these types of network statistics can be enabled through a number
of solutions, many of which are likely already available on your network hardware today.
Virtually all of today's business-class network hardware natively includes support for
SNMP integration. With this support enabled, a centralized network monitor can pull
statistics directly from networking equipment through SNMP polls.
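For example, the standard interface octet counters exposed over SNMP are cumulative, so utilization is computed from the delta between two successive polls. The sketch below assumes the 32-bit Counter32 semantics of the classic ifInOctets counter, which wraps back to zero at 2^32:

```python
def link_utilization_pct(octets_t0, octets_t1, interval_s, link_bps,
                         counter_bits=32):
    """Bandwidth utilization from two successive SNMP octet-counter
    polls, allowing for the counter wrapping at 2**counter_bits."""
    wrap = 2 ** counter_bits
    delta = (octets_t1 - octets_t0) % wrap   # tolerates one wrap
    bits_per_s = delta * 8 / interval_s
    return 100.0 * bits_per_s / link_bps


# A 30-second poll on a 100 Mb/s link that moved 75,000,000 octets:
util = link_utilization_pct(1_000_000, 76_000_000, 30, 100_000_000)
# 75e6 octets * 8 bits / 30 s = 20 Mb/s -> 20.0% utilization
```

Fast links are typically polled more often than the counter can wrap twice, which is why a single modulo correction is usually considered sufficient.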


Yet SNMP is only one solution. SNMP alone cannot provide the right kind of information
associated with network traffic "flows," meaning the overarching conversations between
elements on the network. Flow monitoring aggregates what would otherwise be seen only
as individual packets: it provides a view of network traffic at a higher level than the
individual packet, yet not quite at the level of inter-server transactions.

To provide this level of data, additional protocols, such as Cisco's NetFlow, have been
developed that report on high-level "flows" rather than packet-level inspection. Although
still not looking at this data from the level of the server-to-server transaction, flow
information gives the troubleshooting administrator a better sense of their network’s high-
level traffic patterns.
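Conceptually, a flow record is just the rollup of every packet sharing the same source, destination, ports, and protocol. The following sketch illustrates that aggregation in miniature; real NetFlow-style export involves timeouts, sampling, and many more fields than shown here:

```python
from collections import defaultdict


def packets_to_flows(packets):
    """Roll individual packets up into flow records keyed by the
    classic 5-tuple, roughly what flow-based export reports."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        key = (pkt["src_ip"], pkt["dst_ip"],
               pkt["src_port"], pkt["dst_port"], pkt["proto"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt["length"]
    return dict(flows)


# Three packets, two of which belong to the same conversation:
pkts = [
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.1.9",
     "src_port": 51512, "dst_port": 443, "proto": "TCP", "length": 1500},
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.1.9",
     "src_port": 51512, "dst_port": 443, "proto": "TCP", "length": 400},
    {"src_ip": "10.0.0.7", "dst_ip": "10.0.1.9",
     "src_port": 40022, "dst_port": 443, "proto": "TCP", "length": 900},
]
flows = packets_to_flows(pkts)  # two flows; the first has 2 packets, 1900 bytes
```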

Figure 4.7: Another view of the system from the network’s perspective shows that
network performance is nominal during the period of measurement.

An example of this kind of data is displayed in Figure 4.7, which shows the network's
impact during a slightly offset period of time. In this graph, it is possible to
quickly see that the network experienced a slight dip in performance between 2:00 AM and
5:00 AM. That performance returned back to baseline by 10:00 AM, when the database
began experiencing problems. This information is critically useful for the troubleshooting
administrator, as it quickly shows that high-level network performance doesn’t appear to
have a direct impact on the system’s problem.

As with application monitoring, aggregate views of the network are also possible once
network integrations are laid into place. One such view is shown in Figure 4.8. Here,
aggregate network statistics across multiple sites are shown in a single view, with
associated trending arrows attached to each.


Figure 4.8: Aggregate network statistics across multiple sites are shown, with
trending arrows showing prior behaviors.

Graphs such as this are only possible when network statistics from across the environment
are gathered into a single location. By integrating each individual network device’s
statistics into an APM framework, an administrator can quickly pinpoint network errors
and transfer rates, further isolating a problem to particular network segments.

Installing Probes
Chapter 3 introduced the concept of probes and the efficacy of their use in network
monitoring. Your decision to use network probes will first depend on the capabilities
of your networking equipment. Networking equipment that cannot support built-in
integrations (rare these days) or network security policies that prohibit the
passing of monitoring data are both situations where probes may be necessary.

Probes are by nature operationally expensive to use in a production environment due to
their in-line installation. A network probe by definition is a separate physical device that
watches the traffic that passes by a particular network segment. Figure 4.9 shows a
network diagram of how one can be installed between an internal LAN and an external
WAN point of demarcation. As such, their installation requires a manual change to the
network. Thus, limiting their use to situations where they are specifically required is
considered a best practice.


Figure 4.9: A passive network probe is installed between an internal LAN and its
connection to an external WAN.

One situation where probes can provide special data not otherwise possible through in-
device integrations is in measuring traffic across network links that are not within local IT
control. For example, Figure 4.2 showed a connection between a third-party credit card
proxy and a router to the extranet that is shared with the credit card processing service.
Often, these kinds of connections are not within the direct control of the local IT
organization, which makes the installation of on-device monitoring integrations
problematic. In these cases, network probes can be physically installed between otherwise-
uncontrollable connections to monitor their traffic for inconsistencies.

Figure 4.3 showed a particularly useful example of such an installation. Here, a probe can
be installed with the intent of monitoring and alerting on Service Level Agreement (SLA)
breaches between the business and the external processing network. This adds a level of
due diligence during SLA breach negotiations as well as added protection against the
impact of external actions such as outages or performance losses.

Measuring Transactions
Yet another perspective on system behaviors occurs when the monitoring solution is
pointed towards the connection between the Web site and its Inventory Processing System.
The view enabled through this integration combines otherwise packet-based data to look
at the individual transactions between these two servers. Although the packet-based data is
useful for recognizing the overall behaviors of the network in relation to that individual
connection, transaction measurements provide more details about the specific
“conversations” between servers and services on two different hardware components.


Take a look at Figure 4.10 for another chart that details this view of individual transaction
rates. Here, you can see that immediately prior to the time in question, a spike in
transactions began to occur between the two systems. Perhaps the timing of this
transaction spike gives some added details about the original processor utilization spike
from Figure 4.4. Perhaps a spike in user activity was the cause of the problem rather
than an internal issue.

Transaction monitoring is critically necessary in distributed systems, most especially those
that use the Web for the display of data. Web-based data is highly transaction-oriented, so
great levels of detail can be gathered by watching those transactions as they go by on the
server. As you’ll discover later on, this aggregate transaction monitoring can be expanded
even further by looking at the individual conversations as they pass on the wire.

Figure 4.10: Elevating the view from individual network packets shows additional
information about the conversations between servers.

When thinking about this concept of transactions, it is important to consider the individual
conversations between the different elements on a system. Consider, for example, the types
of conversations that could potentially occur between a generic Web server and its
Inventory Processing System. Those conversations have both a source and destination IP
address, but they also operate over a known set of TCP or UDP ports. Within a single set of
ports, individual Web services at both ends can transfer multiple types of information, with
the ultimate end consumer of this information being different services on each server.

Your APM solution should include a set of network filters that works with both agent and
agentless monitoring to identify these conversations. Only by leveraging the right network
filters can a system gather meaning through the combination of individual packets of
information. Those filters must understand the protocols being used by both sides of the
communication as well as the well-formed data used in their conversations.


Continuing the example from Figure 4.10, let’s assume an administrator wishes to see the
actual “words” in the transactional communication between the Web cluster and the
Inventory Processing System. To do so, they must drill down even further. By correlating
the network probe’s “external” view of a transaction’s performance with the “internal”
application analytics perspective from a server-resident agent, it is feasible that a view
similar to Figure 4.11 can be created.

Figure 4.11: Zooming in on a particular time slice of transactions provides detail about the
"words" in their communication.

Figure 4.11 shows a truncated view of the details of that conversation. Here, the discussion
between the two servers is drawn out in detail. The Web server attempts to contact the
Inventory Processing System, with the processing system eventually responding. Statistics
associated with delay timing are displayed in addition to the conversational details. The
result is a kind of time-oriented log of the conversation within and between the two
servers, showing exactly where areas of delay are found.
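One way to picture how such a time-oriented log surfaces delay hotspots is to pair each request with its response and rank the resulting latencies. The event format and call names below are invented for illustration; a real APM tool reconstructs this from captured wire data:

```python
def slowest_steps(events, top_n=3):
    """From a time-ordered request/response log of a conversation,
    compute per-call delays and surface the worst offenders first."""
    pending = {}   # call id -> request timestamp
    delays = []
    for ev in events:
        if ev["kind"] == "request":
            pending[ev["id"]] = ev["ts"]
        elif ev["kind"] == "response" and ev["id"] in pending:
            delays.append((ev["name"], ev["ts"] - pending.pop(ev["id"])))
    return sorted(delays, key=lambda d: d[1], reverse=True)[:top_n]


# Hypothetical Web server <-> Inventory Processing System calls:
log = [
    {"id": 1, "kind": "request",  "name": "getInventory", "ts": 0.000},
    {"id": 1, "kind": "response", "name": "getInventory", "ts": 0.180},
    {"id": 2, "kind": "request",  "name": "checkPrice",   "ts": 0.200},
    {"id": 2, "kind": "response", "name": "checkPrice",   "ts": 1.450},
]
worst = slowest_steps(log)  # checkPrice (1.25 s) ranks above getInventory
```

Once the slowest call is isolated this way, and the network itself is ruled out, the finding can be handed to developers as described next.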

A result of this level of detail is that conversational elements and areas of unacceptable
time lag can be identified within the individual code elements of each server’s Web
services. For example, if a particular callback from one server to another shows a high rate
of delay, while the network itself can be eliminated as a source of lag, it grows very easy to
flag the situation to developers. Ultimately, the end goal is to quickly identify the problem
and come to a resolution that can be patched into the system.


The actual implementation of transaction measurements happens in parallel with the
installation of agents and the incorporation of agentless monitoring across the network.
When network monitoring is configured to watch for specific conversations between
selected servers, this information can be captured in detail. At the same time, onboard
agents have the ability to monitor data as it passes in and out of the network interface
cards (NICs) on the servers themselves. In your APM solution, transaction-based
monitoring should be a function of both agent collection as well as administrator console
configuration. Another task in your implementation will be the identification of the types of
traffic to collect as well as the devices between which to collect it.

Overall Service Quality


One end goal of all this monitoring is the ability to create an overall sense of system
"health." As should be obvious from this chapter, an APM solution has a far-reaching
capability to measure essentially every behavior in your environment. That's a lot of data.
The resulting problem with this sheer mass of data is ultimately finding meaning in it.
Essentially, you can gather lots of data, but it isn't valuable if you don't use it to improve
the management of your systems.

As a result, APM solutions include a number of mechanisms to roll up this massive quantity
of data into something that is useable by a human operator. This process for most APM
solutions is relatively automatic, yet requires definition by the IT organization that
manages it.

The concept of "service quality" is used to explain overarching environment health. The
idea is quite simple: Essentially, the "quality" of a service is a single metric—like a
stoplight—that tells you how well your system is performing. In effect, if you roll up every
system-centric counter, every application metric, every network behavior, and every
transaction characteristic into a single number, that number goes far in explaining the
highest-level quality of the service’s ability to meet the needs of its users.


Figure 4.12: The quality of a set of services is displayed, showing a highest-level
approximation of their abilities to serve user needs.

This guide will talk about the concept of service quality in much greater detail over the next
chapters, but here it is important to recognize that implementing service quality metrics
also requires the involvement of the APM solution’s implementation team. Consider the
graphic shown in Figure 4.12. Here, a number of services in different locations are
displayed, all with a health of “Normal.” This single stoplight chart very quickly enables the
IT organization to understand when a service is working to demands and when it isn’t. The
graph also shows the duration the service has operated in the “normal” state, as well as a
monthly trend. This single view provides a heads-up display for administrators.

Yet actually getting to a graph like this requires each of the monitoring integrations
explained to this point in this chapter. The numerical analysis that goes into identifying a
service’s “quality” requires inputs from network monitors, on-board agents, transactions,
and essentially each of the monitoring types provided by APM.

Also required is the configuration of threshold values by administrators at each level of
monitoring. Figure 4.12's view of system health comes through the aggregation of
individual down-level health monitors, each of which must be implemented and configured
by administrators. Once implemented, administrators must determine their applications'
baseline performance values in comparison with those that they consider "acceptable."

For example, if processor use on servers becomes unacceptable when it goes over 80% for
a 5-minute period of time, this behavior must be specifically ingested into the APM
platform. The same holds true for network behaviors: If network bandwidth utilization
over 85% is considered an unacceptable situation, it too must be configured into the
system. The aggregation of these thresholds ultimately combines to create the
service quality metric shown in Figure 4.12.
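A simplified sketch of such a rollup follows. The threshold values match the examples in the text, but the breach rule (every sample in the window over its threshold) and the stoplight states are illustrative assumptions; commercial APM products use far richer scoring models:

```python
def service_quality(metric_windows, thresholds):
    """Roll per-metric sample windows up into a single stoplight
    state. A metric breaches when every sample in its window
    exceeds its configured threshold (e.g. CPU > 80% for a full
    5-minute window)."""
    breaches = [name for name, samples in metric_windows.items()
                if all(s > thresholds[name] for s in samples)]
    if not breaches:
        return "Normal"
    return "Critical" if len(breaches) > 1 else "Degraded"


# Five 1-minute samples per metric, thresholds as in the text:
windows = {"cpu_pct": [83, 85, 91, 88, 84],
           "net_util_pct": [60, 72, 55, 61, 58]}
state = service_quality(windows, {"cpu_pct": 80, "net_util_pct": 85})
# CPU breached for the whole window, the network did not -> "Degraded"
```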


Although this process can seem extremely time-intensive, effective APM solutions speed
the process by incorporating industry-standard templates for common thresholds. The
inclusion of these templates assists administrators with a starting point for later
customizing the unique characteristics of their environment. More on this concept of
service-centric monitoring will be covered in Chapter 6.

APM’s “Magic” Is in Its Metrics


This chapter started out with a conversation between TicketsRus.com’s IT manager John
Brown and his network engineer. In that conversation, the engineer referred to John’s APM
solution as some form of “magical” monitoring. Yet the engineer couldn’t be further from
the truth in his representation of APM’s efficacy. An APM solution is by nature extremely
objective. It enables the centralized gathering of vast quantities of data for numerical
analysis. When users call to complain that “the server is slow today,” an APM solution
enables IT teams to understand why or, in many cases, to already be working on the
problem.

You may have noticed that there is one major omission in this chapter’s discussion on
implementing APM. End-User Experience (EUE) monitoring is one topic not found in this
chapter. Due to its relatively new entrance into the market as well as its potential for wide-
sweeping changes in the way services are monitored, this technology gets an entire chapter
of its own. The next chapter will discuss EUE in detail, talking about its technology
underpinnings, where it fits into the environment, and the types of data that can be
gathered as a result of its implementation.


Chapter 5: Understanding the End User's Perspective
TicketsRus.com IT Director John Brown is feeling a bit…overwhelmed…with charts today.
Maybe it was the late night last night, or the early rise this morning. Or, maybe someone
secretly switched the black lid with the red one on the coffee pot again, which, he thinks to
himself, is not a very funny joke.

In any case, John finds himself staring blankly at a set of charts from his recently-implemented
APM solution, finding little in their meaning this morning. He looks through charts of
individual server performance. He flips through those relating to his networking behaviors
and finds nothing there of interest either. He even peeks into the transaction breakdowns
between his servers, ridiculously comprehensive in their level of captured data, but ultimately
too low-level for his management-oriented mind to comprehend. Those charts are for a
developer’s brain, not his.

Leaning forward in his chair, he fixates on one in particular:

In looking at this graphic, he finds that he cares little about what that chart actually
represents. “Some measurement of the ticketing system had this little bump during the
workday hours,” he thinks to himself, “that’s interesting, I guess.”

“These charts are giving us the information we want,” he thinks to himself, “They help my
admins find and fix problems. They help my network engineers track down bottlenecks. Heck,
they even found the piece of code that caused that big problem a few months ago. We’d have
never tracked that down without the transactional views. Yet something still nags me…

“…I want to know how my users are doing.”


John’s problem today has nothing to do with lack of monitoring. With his new APM solution in
place, quite the opposite is true. Fully implemented, John’s APM solution now gathers metrics
from servers and their individual applications. The network itself is represented, both from
the perspective of individual servers as well as the WAN as a whole. Yet in all of these metrics
he’s gathering, he’s missing the one piece that ultimately represents success: His users’
experience.

Just then the phone rings. It's Dan Bishop, the COO and John's direct superior. Things are never
good when Dan calls. "I'm hearing scattered reports that our wait time on the system has
spiked to over 20 minutes per purchase. What's up?"

“Nothing that I can see here, Dan,” John reports.

“Well, your numbers might not show it,” Dan continues, “but an old golfing buddy of mine just
called in and reported the same. Track it down and let me know. Oh, and put two tickets for
next week’s concert at the arena on hold for him, will you?”

Why the End User’s Perspective?


Chapter 3 of this guide walked you through the entire history of IT monitoring as we know
it. Starting with the basics of “ping” responses, through SNMP polls, agent and agentless
perspectives, and concluding with application analytics and transaction gathering, the
history of monitoring has evolved dramatically over time. With each evolution, the areas in
which monitoring integrates with your systems grow richer while their data grows more
useful to the business. As continued in Chapter 4, each successive approach adds yet
another layer to the overall view into a computing environment.

Yet Chapters 3 and 4's discussion concluded at the very point where experience-based
monitoring actually starts to get interesting. With the development of End-User Experience
(EUE) monitoring, automated solutions for watching your business systems get their first
looks into the actual behaviors experienced by an application’s users. Gathering metrics
from the perspective of the user themselves brings a level of objective analysis to what has
traditionally been a subjective problem. If you’ve ever dealt with the dreaded “the servers
are slow today” phone call, you understand this problem.

What Is Perspective?
This guide has used the term “perspective” over and over in relation to the types of data
that can be provided by a particular monitoring integration. But what really is perspective,
and what does it mean to the monitoring environment?

It is perhaps easiest to consider the idea of perspective as relating to the orientation of a
monitor's view, which determines the kinds of data that it can see and report on. Although
the computing environment is the same no matter where a monitor is positioned, different
monitors in different positions will "see" different parts of the environment.

82
Chapter 5

Consider, for example, a set of fans watching a baseball game. If you and a friend are both
watching the game but sitting in different parts of the stadium, you’re sure to capture
different things in your view. Your friend who is sitting in the good seats down by the
batter is likely to pick up on more subtle non-verbal conversations between pitcher and
catcher. In contrast, your seats deep in the outfield are far more likely to see the big picture
of the game—the positioning of outfielders, the impact of wind speed on the ball, the
emotion and effects of the crowd on the players—than is possible through your friend’s
close-in view.

Relating this back to applications and performance, it is for this reason that multiple
perspectives are necessary. Their combination assists the business with truly
understanding application behaviors across the entire environment. An agent that is
installed on an individual server will report in great detail about that server's gross
processing utilization. That same agent, however, is fully incapable of measuring the level
of communication between two completely separate servers elsewhere in the system.

Why the End User?


Thus far, this guide has discussed how many different kinds of monitors enable metrics
from a vast number of perspectives: Server-focused counters are gathered by agents,
network statistics are gathered through probes and device integrations such as Cisco
NetFlow, transactions and application-focused metrics are gathered through application
analytics; the list goes on. Yet, it should be obvious that this guide’s conversation on
monitoring remains incomplete without a look at what the end users see in their
interactions with the system.

This view is critically necessary because it is not possible—or, at the very least,
exceptionally difficult—to construct this experience using the data from other metrics.
Relating this back to the baseball example, no matter how much data you gather from your
seat in the outfield, it remains very unlikely that you’ll extrapolate from it what the pitcher
is likely to throw next.

For the needs of the business application, end user experience (EUE) enables
administrators, developers, and even management to understand how an application’s
users are faring. First and foremost, this data is critical for discovering how successful that
application is in servicing its customers. Applications whose users experience excessive
delays, drop off before accomplishing tasks, or never complete the use case model aren't
meeting their users' needs. And applications that don't meet user needs ultimately
represent a failure for the business.


The Use Cases for EUE


Though failure is a strong word, the reality is that EUE is especially important for
applications that service customers outside the business. As with the ongoing story of
TicketsRus.com, when an outward-facing application stops performing to expectations,
customers go elsewhere, with results that are often disastrous to the business.

This line of thinking introduces a number of potential use cases where EUE monitoring can
benefit an application's quality of service. EUE monitoring works for evaluating the
experience of the absolute end user as well as in other ways:

• Quantifying the performance characteristics of connected users as well as
differences in performance between users in different geographic locales
• Simulating user behaviors through the use of robots for the purpose of predicting
service quality degradations
• Identifying where internal users, as opposed to the absolute end user, are seeing a
loss of service
• Keeping external service providers honest through independent measurements of
their services
Figure 5.1 shows a reproduction of Chapter 4’s example e-commerce system. In this
version of the image, however, four areas where EUE monitoring can potentially be
integrated are highlighted. Those four areas correspond to the four bullets previously
mentioned, and are explained in greater detail in the next sections.


(Figure 5.1 depicts the example e-commerce system: end users in Asia-Pac, the United
States, and EMEA, plus a robot user, connecting to the External Web Cluster; behind it, the
Kerberos Authentication System, Java-based Inventory Processing System, and ERP System;
an Internal Accounting User working against the Order Management System; the Inventory
Mainframe; and a 3rd Party Credit Card Proxy System reached through extranet routers.)

Figure 5.1: Multiple use cases exist for targeting EUE monitoring.

Customer and Multi-Site Perspective


The first and most obvious target for EUE monitoring relates to the actual end users
themselves. As you can see in Figure 5.2, this integration occurs with users and across the
various geographic locations where users may exist. Here, user behaviors are captured
by a suite of special monitors that watch for and report on those behaviors.


Figure 5.2: EUE can measure user behaviors across multiple connections in multiple
geographic locations.

How does this watching and reporting occur? In short, by creating a log of each user’s
activities. Consider for a moment how an Internet-facing application works. In the example,
the application’s user interface (UI) is Web-based, served through a front-end Web cluster.
For a user to work with that Web-based application, the cluster must generate and present
Web pages to the user. The user interacts with those Web pages by clicking in specified
locations, with each click resulting in some response returned back to the user.

A benefit of working with Web-based applications is that each click can be encapsulated
into its own transaction. When the user clicks on a Web page link, that click begins a long
chain of events. The Web server interacts with down-level services to gather necessary
data. Those down-level servers may then work with others even further down the
application’s stack. Eventually, through some combination of effort, the right data is
gathered. That data is then passed back to the front-end Web servers, which render new
content for the user.

By measuring which links the user clicks on, as well as the response time in receiving and
rendering resulting data back to the user, it is possible to identify the quantity of time
consumed by each step in the process. Later, this chapter will talk more about the spread of
time between the different system elements—client, network, and server—but for now,
recognize that EUE monitoring for end users works because the action of each user is
encapsulated into a Web transaction that can be measured.
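Because each click maps to one measurable Web transaction, the core of the technique can be sketched in a few lines. The `TransactionLog` and `timed_transaction` names below are hypothetical illustrations, not the API of any particular APM product:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TransactionLog:
    """Collects one timing record per user click (user, URL, elapsed time)."""
    records: list = field(default_factory=list)

    def record(self, user_id: str, url: str, elapsed_ms: float) -> None:
        self.records.append({"user": user_id, "url": url, "ms": elapsed_ms})

def timed_transaction(log: TransactionLog, user_id: str, url: str, handler):
    """Wrap a request handler so each click is measured end to end."""
    start = time.perf_counter()
    response = handler()  # the down-level back-end work happens here
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    log.record(user_id, url, elapsed_ms)
    return response
```

In a real deployment the timing hook would live at the Web tier (or in a network probe) rather than in application code, but the principle is the same: every click produces a timestamped, attributable measurement.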


The Impacts of Geography


Although it is commonly assumed that the Internet is equally accessible from every
connection, the quality of each connection is in actuality quite different. For example, a user
in the United States may find that their experience with an Internet-facing application is
acceptable. This may occur due to the high quality of Internet connections in the US as well
as the geographic locality between user and application. When an application and user are
in a well-connected location of Internet service, such as the same country, their connection
tends to be of a higher quality.

Conversely, users that connect to a US-based application from the Asia-Pacific or EMEA
regions must route their communication through transcontinental connections and over a
much longer distance. The quality of those connections as well as their length of travel can
impact the overall experience of the user. By measuring an application's performance from
a series of different geographic locations, it is possible to recognize when network
conditions are degrading the user experience.

As with the earlier example, because measurements here are made at the Web server, all
inbound connections can be measured against each other. Determining the time required
for a full user action to be completed illuminates much about the quality of the connection,
and thus the user’s experience.
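Comparing inbound connections against each other amounts to grouping measured transaction times by the caller's region and summarizing each group. A minimal sketch, where the `(region, milliseconds)` record format is an assumption made for illustration:

```python
from collections import defaultdict
from statistics import mean

def response_times_by_region(records):
    """Average measured transaction times (ms), grouped by caller region.

    records: an iterable of (region, elapsed_ms) tuples, one per completed
    user action as measured at the Web server.
    """
    by_region = defaultdict(list)
    for region, ms in records:
        by_region[region].append(ms)
    return {region: mean(times) for region, times in by_region.items()}
```

A large, persistent gap between regions (say, APAC averages four times the US figure) points at connection quality rather than the application itself.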

Internal & External Robot Perspective


Yet even with this tracking at the Web server, not all users behave in the same way, and no
user behaves with complete predictability. A user’s interaction with an Internet-facing
application tends to change throughout their use: They walk away from their computers or
work on something else for a period of time. They take longer to read through one Web
page as opposed to another. They cease their interaction with the application altogether
without walking through a full use case.

With Internet-based applications, these non-standard behaviors are more the norm than
the exception. Users are used to the “always-on” nature of the Internet, electing to work
with its applications as if they too were always on—logging in, logging out, stepping away
mid-transaction, moving on to another task, and so on. Because of these erratic behaviors,
another, more predictable "end user" perspective is necessary. That perspective is provided
through internally- and externally-placed robots (see Figure 5.3).

Figure 5.3: Robot users can repeat the same action to derive a baseline of
performance and watch for deviations.


The primary job of a robot is to simulate the behavior of a standard user. Such a robot is
programmed with a series of actions, the completion of which is accomplished in a known
period of time. Repeatedly running through that set of actions creates a baseline response
profile for the application. The actions are known and are run over and over, so
administrators can then be alerted when performance deviates from the baseline.

For example, if a robot is preconfigured to click through a series of pages on the External
Web Cluster with the goal of finding and eventually purchasing an item, it is possible to
determine the average period of time required to complete the actions. When that period of
time deviates from the baseline at some point later on, it can be assumed that an issue or
problem is occurring somewhere in the application.
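The baseline-and-deviation logic described above can be sketched simply: summarize repeated scripted runs into a mean and spread, then flag any run that falls outside the expected band. Real APM products use richer baselining models; the three-sigma rule here is an illustrative stand-in:

```python
from statistics import mean, stdev

def build_baseline(run_times_ms):
    """Summarize repeated robot runs into a mean and a sample spread."""
    return mean(run_times_ms), stdev(run_times_ms)

def deviates(latest_ms, baseline_mean, baseline_stdev, n_sigmas=3.0):
    """Alert when the newest scripted run falls outside the expected band."""
    return abs(latest_ms - baseline_mean) > n_sigmas * baseline_stdev
```

Because the robot repeats the identical click sequence, any deviation it reports reflects a change in the application or its infrastructure rather than a change in user behavior.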

Although robots alone are not likely to assist in locating the problem, they can operate as a
bellwether for downstream problems. Identifying a change in the overall performance back
to the user often means that a problem or other issue should be reviewed using other
monitoring metrics.

Internal User Perspective


Ultimate end users, however, aren't the only individuals who interact with an
application. An entirely different set of users, usually internal to the business, has the job
of maintaining that application. These users tend not to be involved in the IT management
of the application. Instead, they are associated with managing the workflow related to the
products or services being sold by the business organization.

Consider the example of TicketsRus.com, whose primary mechanism for ticket sales is
its Internet-based application. Many individuals in the IT organization—
administrators, engineers, developers, and so on—are responsible for maintaining the
technology that powers that application. Yet a completely different force of individuals is
also necessary for ensuring that the right tickets are brokered for the right events and to
the right people. These accounting, sales, and management individuals require their own
interface into the application that has nothing to do with its ultimate rendering of Web
pages for customers.

Figure 5.4 shows how EUE can assist these individuals. Here, an Internal Accounting User
interacts with the Order Management System to ensure that the right tickets are always
available for purchase. The actor in this figure may be one person or an entire department
of individuals that are spread across a country or the globe. Targeting EUE monitoring in
this location gives the troubleshooting administrator another set of data to identify user-
visible behaviors on the system.


Figure 5.4: EUE can be incorporated into other parts of the application to monitor the
behaviors of internal users as well as external.

EUE monitoring at the level of the Order Management System can identify when that down-
level system experiences a loss of performance. It is possible to compare the information
gathered at this level with other information at the External Web Cluster to isolate
potential problems. Such EUE monitoring for internal users can occur at any level in the
application’s stack where performance is a concern. Each addition of monitoring at
different user endpoints provides yet another set of performance measurements that are
useful in measuring overall quality of service.

Service Provider Perspective


Today's business applications are rarely atomic in their architecture. One business's
application often needs to communicate with others for data, processing, or orchestration
of customer orders and requests. Further, all Internet-facing applications depend on their
Internet connection for service, and that service must come from somewhere. In all these
situations, an external party is responsible for providing a service for the application.

Whenever an outside party is contracted for those services, a binding agreement is usually
put into place to define their quality. An Internet or Cloud Computing service provider
guarantees certain levels of bandwidth and latency as part of the price paid for its
services. A credit card processing facility guarantees a prescribed level of uptime. Suppliers
with direct application connections must meet minimum requirements.

Yet in many cases, the organization doing the monitoring of that service level is the
provider itself. For example, an Internet provider guarantees a particular service level but
does so based on the metrics that it measures itself. You can imagine the conflict of
interest that occurs when the business providing a service is also in the
business of measuring its own success with that service.


In this case, another form of EUE integration becomes useful. Targeting EUE here provides
a secondary set of measurements when it becomes necessary to independently assess
external-party service levels. Figure 5.5 shows another example of an external service that
could be monitored by such an EUE integration. There, the third-party Credit Card Proxy
along with its Extranet Router is contracted to handle payment card services for the e-
commerce system. This is a common service that is contracted out to external parties
because of the complexities of payment card handling.

Figure 5.5: Validating service provider connections is another effective use of EUE
monitoring.

In this situation, payments for any services or items on the e-commerce Web site route
through that external system. Yet this external system often lies outside the direct control
of the local IT organization. Since the business derives all of its income through this
interface, it is considered mission critical to the business. As such, it is a best practice to
implement independent monitoring to verify its service quality.

This kind of monitoring can and likely should occur in-line with any external system that
participates in the application. The resulting metrics can be used in independently
identifying any violations to Service Level Agreements (SLAs) as well as in negotiating
chargebacks when vendors don’t meet their agreed obligations.

The Use of Probes in Service Provider Monitoring


The use of service providers for portions of an application’s infrastructure is
commonplace in business today. What is not common are direct monitoring
integrations into that service provider's network or server equipment. If, for
example, you want to measure the bandwidth and latency across your
Internet connection, your ISP is not likely to give you their internal
passwords to gather data directly from their hardware.


In cases like this, the use of in-line probes is useful in gathering the right
information. Probes were first introduced back in Chapter 4 as one
mechanism for integrating APM monitoring into otherwise-unavailable
infrastructure components. There, Figure 4.9 (provided below) showed an
example of how a probe can be installed between a supplier extranet and an
internal LAN to monitor traffic.

(Figure 4.9, reproduced: an in-line probe positioned for connection monitoring
between the supplier extranet and the internal LAN.)

If you leverage external service providers for elements of your application's
functionality and are unable to gather statistics directly from their
equipment, consider the use of probes with your APM solution to gather this
data.

The Role of Transactions in EUE


It should be obvious at this point that there are a number of areas where EUE provides
benefit to the business and its applications. Yet this chapter hasn’t yet discussed how EUE
goes about gathering its data. If end users are scattered around the region or the planet,
how can an EUE monitoring solution actually come to understand their behaviors? Simply
put, the metrics are right at the front door.

EUE Feeds BSM


A much larger conversation on the role of end-user performance in meeting
(or breaching) business goals will be discussed in Chapter 9. There, you will
find a discussion on how APM links to the ideals of Business Service
Management (BSM). Even more information about BSM’s focus on
applications can be found in the book The Definitive Guide to Business Service
Management, downloadable from http://www.realtimepublishers.com.


Think for a moment about a typical Internet-based application such as the one being
discussed in this chapter. Multiple systems combine to enable the various functions of that
application. Yet there is one set of servers that interfaces directly with the users
themselves: the External Web Cluster. Every interaction between the end user and the
application must proxy in some way through that Web-based system. This centralization
means that every interaction with users can also be measured from that single location.

EUE leverages transaction monitoring between users and Web servers as a primary
mechanism for defining the users’ experience. Every time a user clicks on a Web page, the
time required to complete that transaction can be measured. The more clicks, the more
timing measurements. As users click through pages, an overall sense of that user’s
experience can be gathered by the system and compared with known baselines. These
timing measurements create a quantitative representation of the user’s overall experience
with the Web page, and can be used to validate the quality of service provided by the
application as a whole.

It is perhaps easiest to explain this through the use of an example. Consider the typical
series of steps that a user might undergo to browse an e-commerce Web site, identify an
item of interest, add that item to their basket, and then complete the transaction through a
check out and purchase. Each of these tasks can be quantified into a series of actions. Each
action starts with the Web server, but each action also requires the participation of other
services in the stack for its completion:

• Browse an e-commerce Web site. The External Web Cluster requests potential
items from the Java-based Inventory Processing System, which gathers those items
from the Inventory Mainframe. Resulting items are presented back to the External
Web Cluster, where they are rendered via a Web page or other interface.
• Identify an item of interest. This step requires the user to look through a series of
items, potentially clicking through them for more information. Here, the same
thread of communication between External Web Cluster, Inventory Processing
System, and Inventory Mainframe are leveraged during each click. Further
assistance from the ERP system can be used in identifying additional or alternative
items of interest to the user based on the user’s shopping habits.
• Add that item to the basket. Creating a basket often requires an active account by
the user, handled by the ERP system with its security handled by the Kerberos
Authentication System. The actual process of moving a desired item to a basket can
also require temporarily adjusting its status on the Inventory Mainframe to ensure
that item remains available for the user while the user continues shopping.
Information about the successful addition of the item must be rendered back to the
user by the External Web Cluster.
• Complete the transaction through a check out and purchase. This final phase
leverages each of the aforementioned systems but adds the support of the Credit
Card Proxy System and Order Management System.


In all these conversations, the External Web Cluster remains the central locus for
transferring information back to the user. Every action is initiated through some click by
the user, and every transaction completes once the resulting information is rendered for
the user in the user’s browser. Thus, a monitor at the level of the External Web Cluster can
gather experiential data about user interactions as they occur. Further, as the monitor sits
in parallel with the user, any delay in receiving information from down-level systems is
recognized and logged.

A resulting visualization of this data might look similar to Figure 5.6. In this figure, a top-
level EUE monitor identifies the users who are currently connected into the system.
Information about the click patterns of each user is also represented at a high level by
showing the number of pages rendered, the number of slow pages, the time associated with
each page load, and the numbers of errors seen in producing those pages for the user.

Figure 5.6: User statistics help to identify when an entire application fails to meet
established thresholds for user performance.

Adding a bit of preprogrammed threshold math into the equation, each user is then given
a metric associated with their overall application experience. In Figure 5.6, you can see how
some users are experiencing a yellow condition. This means that their effective
performance is below the threshold for quality service. Although this information exists at
a very high level, and as such doesn’t identify why performance is lower than expectations,
it does alert administrators that degraded service is being experienced by some users.
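The green, yellow, and red conditions of Figure 5.6 amount to mapping each user's measured page performance onto thresholds. A minimal sketch, with the threshold values chosen arbitrarily for illustration:

```python
def experience_status(avg_page_ms, yellow_ms=2000.0, red_ms=5000.0):
    """Map a user's average page-load time onto a traffic-light status.

    The yellow/red cutoffs here are illustrative defaults; in practice
    they come from the quality-of-service targets set for the application.
    """
    if avg_page_ms >= red_ms:
        return "red"
    if avg_page_ms >= yellow_ms:
        return "yellow"
    return "green"
```

Rolling individual page timings up to one status per user is what makes the dashboard glanceable: administrators see degraded users first, then drill down for causes.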

An effective APM solution should enable administrators to drill down through high-level
information like what is seen in Figure 5.6 toward more detailed statistics. Those statistics
may illuminate more information about why certain users are experiencing delays while
others are not. Perhaps one server in a cluster of servers further down in the application’s
stack is experiencing a problem. Maybe the items being requested by some users are not
being located quickly enough by inventory systems. Troubleshooting administrators can
drill through EUE information to server and network statistics, network analytics, or even
individual transaction measurements to find the root cause of the problem.


Browsers Aren’t the Only Client Application


This chapter has talked at length about how Internet-facing e-commerce
systems are a good target for EUE monitoring. These types of applications
tend to use a standard Internet browser as their client interface. However,
EUE needn’t necessarily be limited to browser-based client interactions. EUE
integrations can be incorporated into remote application infrastructures
such as Windows Terminal Services or Citrix XenApp, or other client-based
solutions in much the same way. Such solutions leverage different
mechanisms for gathering data, but the result is the same: The user’s
experience is measured and quantified.

The C-N-S Spread


Transactions provide the basis by which experience is measured in applications; however,
the process of analyzing the transactions themselves can be a challenging activity. The
prototypical systems administrator often doesn’t have the development background or the
experience with an application’s codebase to convert individual transaction information
into something that is actionable. Further, some problems don’t necessarily exist at the
transaction level. If a loss of application performance has more to do with a server failure,
analyzing individual transactions is too close a perspective to be useful.

It is for these reasons that an effective APM solution will create numerous visualizations
out of collections of transaction information. These visualizations roll up the
communication behaviors between user and server or server and server into an easy-to-
use graphical form. One particularly useful visualization that is commonly used in
troubleshooting is the C-N-S Spread (see Figure 5.7).

Figure 5.7: The C-N-S Spread.


The C-N-S Spread measures the amount of time required to complete a transaction
between two elements. In general terms, it breaks that quantity of time down into the
amounts consumed by the Client, Network, and Server components. You can see in Figure
5.7 that these three components are broken down even further to include the network
overhead components of Latency, Congestion, and the TCP Effect. This spread of
information illuminates a number of interesting behaviors associated with the
communication:

• Client. The quantity of time spent at the client. This can include the amount of time
required for the client to process and render incoming data for the user.
• Server. This relates to the amount of time a server is processing an inbound
request. This can include locating records in a database, processing the business
logic surrounding those records, or completing essentially any activity associated
with the request.
• Bandwidth time. This represents the Network link speed component of the Spread.
Here, the amount of time required to clock data onto the network is measured; the
faster the link speed, the faster the clocking rate.
• Latency. This represents the Network distance component of the Spread—the
amount of time required for requests and replies to traverse the network. The
greater the distance, the higher the latency.
• Congestion. Similar to Latency, congestion measures the delay associated with too
much data attempting to pass across the network. When congestion is high, data is
delayed or even discarded if the network is oversubscribed.
• TCP Effect. Any network communication also has a certain level of flow-control
overhead associated with reliably getting packets from one location to another. This
TCP Effect can be broken out separately as well to identify when TCP-based errors
or other issues are having an impact on communication.
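The six components above can be modeled as a simple per-transaction time budget. The class below is an illustrative sketch of that breakdown, not the data model of any particular APM tool:

```python
from dataclasses import dataclass

@dataclass
class CnsSpread:
    """One transaction's time budget, broken out per the C-N-S components (ms)."""
    client_ms: float      # processing and rendering at the client
    server_ms: float      # request processing at the server
    bandwidth_ms: float   # time to clock data onto the network link
    latency_ms: float     # round-trip distance cost
    congestion_ms: float  # queueing delay from competing traffic
    tcp_effect_ms: float  # flow-control and retransmission overhead

    def total_ms(self) -> float:
        return (self.client_ms + self.server_ms + self.bandwidth_ms
                + self.latency_ms + self.congestion_ms + self.tcp_effect_ms)

    def dominant(self) -> str:
        """Name the component consuming the largest share of the transaction."""
        parts = {
            "client": self.client_ms, "server": self.server_ms,
            "bandwidth": self.bandwidth_ms, "latency": self.latency_ms,
            "congestion": self.congestion_ms, "tcp_effect": self.tcp_effect_ms,
        }
        return max(parts, key=parts.get)
```

Identifying the dominant component is the whole point of the visualization: a transaction dominated by `server` time sends troubleshooters down a very different path than one dominated by `latency` or `congestion`.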

The C-N-S Spread Illuminates Environment Behaviors


Graphs such as the C-N-S Spread bring high-level detail to what would otherwise be
individual packets of information crossing the network. They are particularly interesting
because the creation of such visualizations is usually not possible with traditional
monitoring solutions alone. At the same time, their creation enables a look at application
processing that is more holistic than with traditional monitoring point solutions.


Figure 5.8: Graphs like the C-N-S Spread leverage integrations across the suite of an
application’s components.

Consider the areas in which monitoring must be in place to create such a visualization.
Clients must be monitored to understand their behaviors. The network must be monitored
to watch for transaction traversal. The processing of requests on servers themselves must
additionally be watched. Even more complex is the logic involved with tracking this
information across the various components of an environment, and ultimately converting it
to a useable form. Only a mature APM solution with its integrations across the suite of
application components has the reach necessary to create such a visualization.

A Use Case of the C-N-S Spread as Troubleshooting Tool


Once these components are in place, the resulting information provides a starting point for
tracking down problems with an application. For example, assume that a problem has
occurred in our chapter’s application. That problem lies deep within the application’s
processing, making it difficult to “see” with the naked eye or through any one component’s
monitoring integrations. Perhaps this problem has to do with a recently updated piece of
code in the Inventory Processing System. This update changed a series of methods within
the system’s home-grown Java codebase.

Coded into one updated method was a change that removed optimizations in inventory
processing. Removing this optimization forced the server to slow its processing of
inventory requests. Such a problem can be commonplace, especially with home-grown
code, and can be quite difficult to track down once implemented.


In this case, however, the application’s administrators were quickly able to determine that
a problem existed with the updated code. Looking at a visualization similar to Figure 5.6,
the application’s administrators immediately noticed that the metrics associated with user
experience were dropping from the green state to the yellow, and occasionally the red
state. This high-level monitor immediately indicated that users were “experiencing” the
problem. Although it is likely that no one had called in to complain—perhaps users
considered it a momentary hiccup rather than an endemic problem—administrators were
immediately aware that something was wrong.

With this information in hand, administrators could quickly pull up another visualization
similar to Figure 5.9 to trace the problem at a slightly lower-level perspective. There, they
identified that the time consumed by the Server component was far larger than its
established baseline.

Figure 5.9: Drilling into the server's Java codebase enables developers to locate
unoptimized code.

Clicking further into the details, administrators began peering into the individual servers to
find areas of delay, eventually focusing their attention on the update to the Inventory
Processing System. Recognizing that the problem was likely related to the update,
administrators enlisted the support of the development team. That team dug deeper to find
the visualization shown in Figure 5.9. There, you can see how a particular Thread and Main
Class is highlighted along with its rate of delay. This timing information related to specific
threads—and ultimately the methods being processed by those threads—enabled the
development team to quickly find and fix the offending code.


The rate at which a problem like this can be resolved is attributable directly to the depth of
monitoring provided by an APM solution. Without its “everything and everywhere”
approach to watching for environment behaviors, such a rapid resolution would not be
possible.

The Impact of Users Themselves


Yet another area where an application’s performance can be impacted relates to the sheer
number of users on the system. Even in the best of architectures, there occasionally comes
a time when a product announcement or a large-scale run on services draws customers to
your services all at once.

This can be a particular problem when services are available for a short period of time or
on a limited basis. The TicketsRus.com story provides another metaphor for this situation
in relation to its selling of concert and sporting event tickets. These types of items are
generally available starting at a particular date and time, with a finite number of tickets
available. While for many events this is not a problem as the level of ticket supply meets the
level of customer demand, there occasionally comes the time when demand far exceeds
supply: sporting event finals, major concerts, and so on.

The way in which these situations manifest into customer-facing systems is through a
widespread slowdown in application performance. Figure 5.10 shows an example set of
graphs that can explain such a situation. Here, an application’s Application Performance
Index (Apdex) is related to the rate of unique users attempting to use the application. You
can see here that the performance index falls dramatically with the inbound spike of users.
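The Apdex calculation behind that index is public and straightforward. In this Python sketch (the response-time samples are invented for illustration), samples at or below a target time T count as satisfied, those up to 4T as tolerating, and the score is the satisfied count plus half the tolerating count, divided by the total:

```python
def apdex(response_times, t=0.5):
    """Apdex score for a list of response times in seconds.

    Per the Apdex standard: samples at or below the target T are
    "satisfied," samples between T and 4T are "tolerating," and
    anything slower is "frustrated."
    """
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)

# As a spike of users drives response times up, the index falls:
normal_load = [0.2, 0.3, 0.4, 0.3, 0.5]   # all satisfied
spike_load = [0.4, 1.1, 1.8, 2.5, 3.0]    # tolerating and frustrated mix
print(apdex(normal_load))  # 1.0
print(apdex(spike_load))   # 0.4
```

Plotting this score against the count of unique users yields exactly the kind of relationship shown in Figure 5.10.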

Figure 5.10: Relating an application’s performance index to the number of unique users.


In the case of a massive user influx, metrics such as these help identify when performance
problems have little or nothing to do with the application’s architecture itself. Consider the
types and rate of alerts that could be triggered by such an inrush of users. Processor
utilization on servers goes above thresholds. Network bandwidth becomes saturated,
causing latency to spike dramatically. Application analytics on down-level services begin
notifying that they cannot keep up with the load. Web pages are unable to refresh, causing
errors at the client level.

Lacking a holistic perspective on the environment, such a situation could cause an alert
storm to the pagers of unsuspecting administrators. Administrators might find themselves
struggling to bring meaning to such a situation, tracking down symptoms of a much greater
problem.

Applications that have the right kinds of APM monitoring in place might be able to
encapsulate overall application performance into an index such as the Apdex metrics noted
in Figure 5.10. Relating this to the rapid rise in incoming users provides a focus for
understanding why the alert storm is occurring. It also provides ways to recognize where
system bottlenecks can be later eliminated to reduce the effect of massive popularity the
next time demand overwhelms supply.

Lastly, there is the ability to notify the end users themselves when their activities cause a
reduction in overall performance. Although such performance problems aren’t often the
result of bad decisions made by the business, the business itself is usually the one that is
blamed. One very useful way to eliminate finger-pointing and remind users of the site’s
overwhelming short-term popularity is to notify the users of the problem. In this case,
users can be automatically alerted by the system that a high volume of requests is currently
being received and that their requests may take longer than expected. In the end, an
educated customer population is less likely to blame your business when the problem is
related to their use.

Leveraging EUE for Improved Application Quality


This chapter has attempted to show how an EUE monitoring solution adds great value to
the management of a large-scale application. EUE is, in fact, one of the pillars of an effective
APM platform. It integrates monitoring across numerous components to bring a
quantitative perspective to the otherwise-subjective impression of user behaviors. Because
user behaviors can be quantified into specific actions that require specific amounts of time
to complete, administrators can understand in discrete, measurable terms how successfully
their application meets user needs.

EUE further improves this recognition of success by enabling a greater vision into the
environment. That vision speeds root cause analysis, enabling visualizations that very
quickly drill down to problems. Because user behaviors are specifically tracked, even the
most challenging of code-oriented problems can be isolated for a quick fix.


Finally, EUE improves a business’s capability to refine its application infrastructure over
time. Its data shows where bottlenecks require hardware expansion or software updates
while providing a real-world justification for short-term and long-term planning
activities.

To this point, however, EUE is still but one piece of the larger model of service created by
an APM solution. That service model is the central whiteboard upon which an application’s
components are laid out and connected. Through the process of creating and refining an
application’s service model, the linkages between components become well-defined. This
creates a web of dependencies upon which alerting and status information can be based.
Chapter 6 discusses how this model is created, and how the concepts surrounding the
service-centric monitoring approach enable a complete representation of an application’s
entire set of resources.


Chapter 6: APM’s Service-Centric Monitoring Approach
It’s 7:30 AM on a Monday morning, and TicketsRus.com IT Director John Brown has called a
staff meeting for his Help desk personnel. Operating as Help desk staff for internal operations
as well as Web support for end customers, this group embodies the company’s first line of
support for nearly every TicketsRus.com technical issue.

In fact, there aren’t many problems that don’t get initially triaged by John’s Help desk team.
Back in the old days, internal problems were often discovered only when a user called them in
on this phone line. It was this very Help desk that first found out about that major Web site
problem during last year’s Finals. That day was painful in the extreme, and John was
determined never to see one like it again.

“OK, everyone, settle down,” John says to the room. “We’re about to get started.”

John called the meeting today to introduce the Help desk staff to his new monitoring system. A
project he’s been working on for months, John and his implementation team were finally
ready to bring it into full production. Part of that rollout involved creating dashboards that
were specifically designed for this team. These dynamic visualizations brought high-level
information about each subsystem’s status to the eyes of the Help desk staff.

Although he’d admit this group wasn’t his most technically adept, its members were highly
capable of handling phone calls and watching for green lights on a dashboard to turn red. In
fact, that
simple capacity was one of their greatest benefits to the organization. With part of their job
being triage of any technical problem, they could very quickly identify if a perceived problem
was indeed “perceived” or actually real. By inserting this 24x7x365 group into the problem
resolution process right at its start, John managed to keep his top-level engineers on call
without burning them out. In short, it was the job of this team to identify why a light went red,
and subsequently alert the right people when necessary.

In John’s meeting today, he was ready to unveil his greatly improved set of lights.

“Everyone. We’re here today to unveil a project that’s going to improve how you identify
problems,” John begins, “Our APM monitoring solution, which you see here, gives us a high-
level heads-up display about each of our production systems. You can see in this graphic that
each component is given a stoplight showing green or red. It’s ultimately your job to identify
when something goes red, track it down, and alert the right personnel if necessary.


“Everything you see in this visualization is clickable. You’ll notice that if I click on any of these
links, I can drill down into even more detail about that item. If you want, you can even click
right down into the specific details about the problem.

“Now, I recognize that that level of detail is probably too much for most of us, but the idea
here is that everyone from Help desk to engineer to manager to code developer can access this
same set of visualizations. This means that you can very quickly and easily walk an engineer
or a developer through what you’ve learned once you engage them. Now, we’re all on the
same page, and we’re all looking at the same data!”

“Any questions?” asks John of the crowd. One hand shoots up.

“Mr. Brown, this is great information,” offers John’s newest employee, one who John feels has
real potential. “I understand that you’re collecting these metrics from essentially everywhere
these days. But how are you crunching this data into something that makes sense?”

Real potential, John thinks. He smiles as he looks back to the audience. “Ahhh, great question.
There’s the real magic with this system. You see, it’s all about the model…”

What Is the Service-Centric Monitoring Approach?


This guide has spent a lot of time talking about monitoring and monitoring integrations. It
discussed the history of monitoring. It explained where and how monitoring can be
integrated into your existing environment. It outlined in great detail how end user
experience (EUE) monitoring layers over the top of traditional monitoring approaches. Yet
in all these discussions, there has been little talk so far about how that monitoring is
actually manifested into an APM solution’s end result.

The new employee in this chapter’s episode of our ongoing TicketsRUs.com story asks a
critical question about just those kinds of calculations. This individual recognizes that
widespread monitoring integrations enable an IT infrastructure to view behaviors all
across the environment. They understand that you have to plug into each component if
you’re to gather a holistic set of data. But actually fitting those pieces together is something
that remains cloudy.

It is this process that requires attention at this point in our discussion. In reading through
the first five chapters of this guide, you’ve made yourself aware of where monitoring fits
into your environment. The next step is in creating meaning out of its raw data. As John
mentioned earlier and as you’ll discover shortly, the real magic in an APM solution comes
through the creation and use of its Service Model.


Figure 6.1: High-level monitoring of services and their Service Quality.

But first, think for a minute about the high-level stoplight charts that were unveiled to
TicketsRUs.com’s Help desk during that meeting. Perhaps one of those charts looked
similar to Figure 6.1. In this visualization, each individual system element is given three
fairly simple metrics:

• What is its current status? Here the top-level stoplight charts are presented to
the viewer. Systems that are performing normally get a green light, while those
that are currently down show red. A yellow light can be shown for systems that
aren’t technically down but are experiencing some failure or pre-failure
condition.
• Is it available? Availability is a binary value: if a system is non-functional for its
users, it is unavailable. This second column adds data to the first by illuminating
which services are indeed not serving their customers.
• What is its medium-term trend of availability? Lastly, the third column expands
this visualization’s usefulness backwards in time. Here, the monthly trend of service
availability is provided with enough granularity that viewers can see if and where
problems occur over the medium term. Services that go down together may be
related. Those that go down more often may need extra support. The medium-term
trending of a service’s quality helps administrators identify where expansion or
augmentation may be necessary.
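As a rough illustration of that third column, the medium-term trend can be reduced to the fraction of polling intervals in which the service answered. This Python sketch uses invented sample data, not measurements from any real system:

```python
def monthly_availability(samples):
    """Percentage of polling intervals in which the service was up.

    samples: one boolean per interval (True = available).
    """
    return 100.0 * sum(samples) / len(samples)

# 2,880 fifteen-minute intervals in a 30-day month; 10 were down.
october = [True] * 2870 + [False] * 10
print(f"{monthly_availability(october):.2f}%")  # 99.65%
```

Charting one such number per month produces the backward-looking trend described above, making it easy to spot services that fail together or fail often.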
Ultimately, one of the primary goals of an APM implementation is to define in real time the
measurement of a service’s quality. Services that accomplish their mission with a high level
of quality are likely to have high levels of customer satisfaction, repeat customers, and net
profit for the business. Conversely, when services are of low quality, customers are likely to
go elsewhere with their business. If you’ve ever experienced a problem with a Web-based
service that simply didn’t function to your needs, you know how important this
measurement can be.


But What Is Service “Quality?”


This concept of Service Quality has been touched upon a few times in this guide. For a
monitoring solution whose reason for being is predicated on quantitative information,
what is really meant by the term “Quality?” How does one apply an objective approach to
what at first blush appears to be a subjective concept?

Figure 6.2: A spectrum of service functionality.

The idea of Service Quality is not new, nor does its apparent subjectivity hold true when
incorporated into an APM solution. When you think about a particular service, what are the
possible states that that service can operate in? It can be functional or non-functional, but
its level of actual functionality to its users can also lie across a spectrum, as Figure 6.2
shows. Each of these conditions represents a potential state that that service can be
operating in:

• Fully functional, servicing the needs of its customers


• Functional, but not meeting the performance needs of its customers
• Functional, but with some non-functional actions; for example, its customer-facing
system could be working but its payment card systems are non-functional
• Functional, but with low or no assurance that actions are being accomplished per
the needs of its users; for example, users can be interacting with the system but not
seeing expected levels of assurance that their actions are being completed
• Functional in appearance, but fully non-functional in the services it provides; this
state is common with Web-based systems where initial screens appear functional
but no deep interaction is possible
• Fully non-functional, incapable of accomplishing its stated mission
This guide’s companion book, The Definitive Guide to Business Service Management, dives
into this explanation a little further:

A loss in a sub-system to a business service feeds into the total quality of that
service. A reduction in the performance of a system reduces its quality. And, most
importantly, an increase in response time for a customer-facing system reduces its
service quality.


Thus, it can be argued that any reduction in a service’s capacity to accomplish its stated
mission represents a reduction in that service’s quality. This idea should make sense to the
casual observer; lower service quality means a reduced user experience. Yet, this
explanation still hasn’t answered the fundamental question of, “How does one apply an
objective approach to defining quality?” The short answer is, “The right model, and a whole
lot of math.”

Understanding the Service Model


To fully understand the quantitative approach to Service Quality, one must understand how
the different types of monitoring are aggregated into what is termed a Service Model. This
Service Model is the logical representation of the business service, and is the structure and
hierarchy into which each monitoring integration’s data resides. The Service Model is
functionally little more than “boxes on a whiteboard,” with each box representing a
component of the business service and each connection representing a dependency. It
resides within your APM solution, with the sum total of its elements and interconnections
representing the overall system that the solution is charged with monitoring.

But before actually delving into a conversation of the Service Model, it is important to first
understand its components. Think about all the elements that can make up a business
service. There are various networking elements. Numerous servers process data over that
network. Installed to each server may be one or more applications that house the service’s
business logic. All these reside atop name services, file services, directory services, and
other infrastructure elements that provide core necessities to bind each component.

Take the concepts that surround each of these and abstract them to create an element on
that proverbial whiteboard. This guide’s External Web Cluster becomes a box on a piece of
paper marked “External Web Cluster.” The same happens with the Inventory Processing
System and the Intranet Router, and eventually every other component.

By encapsulating the idea of each service component, it is now possible to connect those
boxes and design the logical structure of the system. This step generally comes after
implementation: the implemented service’s architecture defines the model’s structure,
and not necessarily the opposite. Figure 6.3 shows a simple example of how this
might occur. There, the External Web Cluster relies on the Inventory Processing System for
some portion of its total processing. Both the External Web Cluster and the Inventory
Processing System rely on the Intranet Router for their networking support. As such, their
boxes are connected to denote the dependency.
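Conceptually, those boxes and connections amount to a small dependency graph. The following Python sketch models the three components just described; the class and field names are illustrative, not any APM product’s API:

```python
class Component:
    """One box on the Service Model whiteboard."""
    def __init__(self, name):
        self.name = name
        self.depends_on = []  # components this one relies upon

    def add_dependency(self, other):
        self.depends_on.append(other)

web_cluster = Component("External Web Cluster")
inventory = Component("Inventory Processing System")
router = Component("Intranet Router")

# The web cluster relies on inventory processing; both rely on the
# intranet router for networking, as in Figure 6.3.
web_cluster.add_dependency(inventory)
web_cluster.add_dependency(router)
inventory.add_dependency(router)
```

Each `add_dependency` call corresponds to drawing one connecting line between two boxes on the whiteboard.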


Figure 6.3: Abstracting each individual component to create connected elements on a


whiteboard.

Note
Your APM solution will be equipped with a “designer” tool that enables this
creation and connection of components directly within its management
interface. You’ll also find that best-in-breed APM solutions arrive with
support for the automatic discovery and creation of Service Models. This
automation can dramatically speed the process of creating your initial
Service Model’s structure over manual efforts alone.

This abstraction and encapsulation of components can grow as complex or as simple as
your business service (and your level of monitoring granularity) requires. One simple
system might have only a few boxes that connect. An exceptionally complex one that
services numerous external customers—such as the one used by TicketsRUs.com—might
require dozens or hundreds of individual elements. Each element relies on others and must
work together for the success of the overall system.


Component and Service Health


This abstraction and connection of service components only creates the logical structure
for your overall business service. Internal to each individual component are metrics that
evaluate the internal behaviors of that component. As you already saw back in Chapter 4’s
Figure 4.3, those metrics for a network device might be Link Utilization, Network Latency,
or Network Performance. An inventory processing database might have metrics such as
Database Performance or Database Transactions per Second. Each individual server might
have its own server-specific metrics, such as Processor Utilization, Memory Utilization, or
Disk I/O. Even the installed applications present their own metrics, illuminating the
behaviors occurring within the application.

With this in mind, let’s redraw Figure 6.3 and map a few of these potential points of
monitoring into the abstraction model. Figure 6.4 shows how some sample metrics can be
associated with the Inventory Processing System. Here, the Database Performance and
Transactions per Second statistics arrive from application analytics integrations plugged
directly into the installed database. Agent-based integrations are also used to gather whole
server metrics such as Memory Utilization and Processor Utilization.


Figure 6.4: Individual monitors for each element are mapped on top of each
abstraction.

You’ll also notice that the colors of each element are changed as well. At the moment Figure
6.4 is drawn, the Inventory Processing System’s box is colored red. This indicates that it is
experiencing a problem. Drilling down into that Inventory Processing System, one can
identify from its associated metrics that the server’s Processor Utilization has gone above
its acceptable level and has switched to red.


Each of the metrics assigned to the Inventory Processing System’s box is itself part
of a hierarchy. The four assigned metrics fall under a fifth that represents the overall
Component Health. This illustrates the concept of rolling up individual metrics to those that
represent larger and less granular areas of the system. It enables the failure of a down-level
metric to quickly rise to the level of the entire system.

Drilling down in this model highlights the individual failure that is currently impacting the
system, but that specific problem is only one piece of data found in this illustration. As you
drill upwards from the individual metrics and back to the model as a whole, you’ll notice
that the individual boxes associated with each component are also active participants in the
model. Because the overall Component Health monitor associated with the Inventory
Processing System has changed to red, so does the representation of the Inventory
Processing System itself.

Going a step further, this model propagates individual failures up to the greater system
through the linkages between components that rely on each other. In this example, the
External Web Cluster relies on the failed Inventory Processing System. Therefore, when the
Inventory Processing System experiences a problem, it is also a problem for the External
Web Cluster. The model as a whole is impacted by the singular problem associated with
Processor Utilization in the Inventory Processing System.
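That upward flow can be sketched as a simple recursive check over the dependency links. The rule below (a component shows red if it, or anything it depends on, is red) is a deliberate simplification of the richer rollup logic real APM engines apply:

```python
# Dependency links between the chapter's example components.
DEPENDS_ON = {
    "External Web Cluster": ["Inventory Processing System", "Intranet Router"],
    "Inventory Processing System": ["Intranet Router"],
    "Intranet Router": [],
}

def effective_status(component, own_status):
    """A component shows red if it, or any dependency, is red."""
    if own_status.get(component) == "red":
        return "red"
    if any(effective_status(dep, own_status) == "red"
           for dep in DEPENDS_ON[component]):
        return "red"
    return "green"

# Processor Utilization turned the Inventory Processing System red,
# and the External Web Cluster inherits the failure:
statuses = {"Inventory Processing System": "red"}
print(effective_status("External Web Cluster", statuses))  # red
print(effective_status("Intranet Router", statuses))       # green
```

Note that the router stays green: failures flow upward to the components that rely on the failed element, never downward.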

The Service Model


Chapter 1 of this guide first introduced these linkages when it suggested how a Mainframe
problem could impact the entire system as a whole:

However, consider the situation in which the problem lies deeper within the
application itself. In this example, the problem is not the loss of an entire server or
device. Here, a much deeper problem exists. Rather than a simple server loss, the
response time between the application server and the mainframe instead slows
down. This occurs due to a problem within the mainframe. The decrease in
performance between these two components eventually grows poor enough that it
impacts the system’s ability to complete transactions with the mainframe. As a
result, the upstream reliant servers such as the application server, database server,
and Web server can no longer fulfill their missions.

During that explanation, it was suggested that a properly-implemented APM solution could
quickly identify the problem and trace troubleshooting administrators down to the correct
solution. That process is realized through the implementation of abstractions such as that
shown in Figure 6.4.

In the use case associated with Figure 6.4’s model, perhaps a Help desk employee notices
that the system has switched from a healthy to an unhealthy state. Perhaps the stoplight for
the system as a whole has changed from green to red or yellow. The Help desk employee
now knows that something is amiss within the system.


With this information in hand, that Help desk employee can then drill down within the APM
solution’s interface to find the component that is experiencing the problem. Once the
employee finds the offending component, they can drill down even further to discover
which aspect of that component is the source of the problem. In this case, that
source relates to a processor overuse condition within the Inventory Processing System.

This information greatly improves the ability to initially triage system issues. With this
information in hand, for example, the triaging Help desk employee knows that the server
team is a likely candidate for resolving the problem. The problem probably doesn’t relate to
a network issue. Its initial troubleshooting is likely not within the realm of the application’s
development team. Determining the right resources to engage reduces the level of effort
required to solve the problem while improving the speed to resolution.

Obviously, Figure 6.4 is only one part of that overall Service Model. A completed Service
Model for our example system is shown in Figure 6.5. Here, the previous three components
are shown in relation to the rest of the model. Also present is the earlier physical structure
for comparison. Arrows are also drawn to illustrate where dependencies might lie between
each individual component.

Note
In a fully-realized Service Model implementation, an additional element titled
“The E-Commerce System” would be created at the top-most level and
connected with each individual component as a dependency. This connection
ensures that the loss of any component has an impact on the functionality
(and, thus, Service Quality) of the entire system. This element is not present
in Figure 6.5 only for readability reasons, but is an important part of a full
Service Model implementation.


[Figure 6.5 content: the connected Service Model elements (User, External Web Cluster,
Kerberos Auth. System, Inventory Processing System, Java-based ERP System, Inventory
Mainframe, Order Management System, Credit Proxy System, 3rd Party Credit System,
Extranet Router, and Credit Card Extranet Router), shown alongside the physical structure.]
Figure 6.5: The complete Service Model for our example system, along with the
physical structure for comparison.

It is critical to understand that the Service Model is a construct that lives entirely within the
APM solution. Your APM solution will provide a whiteboard-like designing utility that
enables the creation and connection of individual elements. It is within this design utility
where you will input the specific data and metrics that are tagged to each element. In effect,
this design activity is the next step after installing monitoring integrations into the various
aspects of your business service.


Service Quality
With this, the entire conversation swings back to the initial question: how does one
quantitatively represent Service Quality? At this point, the business service has been
deconstructed into its disparate components. Each of those components has been laid out
into a logical structure, with dependencies highlighted using connectors. Individual
monitoring integrations have also been assigned into each component as they make sense
(for example, network monitors into network components, server monitors into server
components, and so on).

The final step in this process is the piecewise labeling of each component and its
monitoring with the behaviors that are considered acceptable. This process effectively
creates the mathematical model that the APM engine uses to define a service’s quality, and
is another task that is commonly done within the APM solution’s designer utility.

Each behavior is commonly quantified through a numerical representation of threshold
values beyond which the service is no longer considered “healthy.” You can see in Figure
6.6 that as long as Processor Utilization remains below 90%, the system is considered
healthy. The same holds true as long as Memory Utilization remains below 85% and the
count of Transactions per Second stays below 23,700.

Figure 6.6: Providing quantitative thresholds of acceptable behavior for each configured metric.
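In code form, such threshold checks are little more than comparisons against configured limits. This Python sketch reuses the values from Figure 6.6; the function and dictionary names are illustrative only:

```python
# Threshold values taken from Figure 6.6.
THRESHOLDS = {
    "Processor Utilization": 90.0,      # percent
    "Memory Utilization": 85.0,         # percent
    "Transactions per Second": 23_700,  # count
}

def metric_state(name, value):
    """Green while the observed value stays below its threshold."""
    return "green" if value < THRESHOLDS[name] else "red"

print(metric_state("Processor Utilization", 72.5))     # green
print(metric_state("Transactions per Second", 24000))  # red
```

In a real deployment these limits live in the APM solution’s designer utility rather than in code, but the evaluation the engine performs is conceptually this simple.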


These metrics are part of a hierarchy, so a set of rollup values must also be included to
identify when a higher-level value crosses a threshold. In the case of Figure 6.6, the rollup
value for Component Health remains green until one of its dependent monitors crosses into
the Failed state. Although this is a simple example of such a service, it is easy to see how
added logic and more layers of hierarchy enable administrators to add substantial levels of
granularity into the system. For example, a rollup value can remain healthy as long as a
percentage of its dependent values remain healthy. Or, that rollup value can change its
state based on the moving average of individual dependent states. Enriching the data even
further, individual values can be given time limits, whereby a healthy state will not switch
to an unhealthy state unless the threshold behavior occurs over a predetermined period of
time.
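A few of those rollup variations can be sketched as simple functions. The Python below is illustrative only; actual APM solutions configure equivalent rules in their designer utilities:

```python
def rollup_any_failed(states):
    """Simplest rule: red as soon as any dependent monitor fails."""
    return "red" if "red" in states else "green"

def rollup_by_fraction(states, healthy_fraction=0.75):
    """Stay green while a given fraction of dependents stays green."""
    green = states.count("green") / len(states)
    return "green" if green >= healthy_fraction else "red"

def rollup_sustained(history, limit=3):
    """Red only after `limit` consecutive red samples, filtering spikes."""
    if len(history) >= limit and all(s == "red" for s in history[-limit:]):
        return "red"
    return "green"

print(rollup_any_failed(["green", "green", "red"]))            # red
print(rollup_by_fraction(["green", "green", "green", "red"]))  # green
print(rollup_sustained(["red", "red"]))                        # green
```

Layering rules like these through the hierarchy is what gives administrators such fine-grained control over when a single failing monitor is allowed to turn an entire service red.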

It is the summation of all these individual threshold values that ultimately drives the
numerical determination of Service Quality. A business service operates with high quality
when its configured thresholds remain in the green. That same service operates with low
quality when certain values flip from green to red and is no longer available when other
critical values become unhealthy. The levels of functionality between these states, as
introduced in Figure 6.2’s spectrum, become mathematical products of each calculation.

In effect, one of APM’s greatest strengths is in its capacity to mathematically calculate the
functionality of your service. Taking this approach one step further, IT organizations can
add data to each element that describes the number of potential users of that component.
Combining this user impact data with the level of Service Quality enables the system to
report on which and how many users are impacted by any particular problem.
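That user-impact weighting can be sketched in a few lines; the per-component user counts below are invented for illustration:

```python
# Invented counts of potential users behind each component.
POTENTIAL_USERS = {
    "Inventory Processing System": 12_000,
    "External Web Cluster": 50_000,
    "Extranet Router": 4_500,
}

def impacted_users(failed_components):
    """Total users potentially affected by the listed failures."""
    return sum(POTENTIAL_USERS.get(c, 0) for c in failed_components)

print(impacted_users(["Inventory Processing System"]))  # 12000
```

Reporting "12,000 users affected" rather than "one monitor red" is precisely what turns raw monitoring data into a business-relevant measure.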

The Value in Templates


You can see that a huge amount of data must be plugged into a Service Model
before it becomes effective in evaluating a service’s quality. Not only must the
model be constructed and its elements interconnected, but the correct
monitors must be linked into each element. Within each monitor, the right
threshold values must be configured. Even a single incorrectly-configured
threshold value can cause an inappropriate representation of the system’s
health.
Actually knowing whether your database server should operate at 90%
processor utilization versus 95%, or whether 24,000 transactions per second
is bad while 23,000 is good obviously involves a period of trial and error.
Your business service is quite different from anyone else’s, and the
components that comprise it differ from those of any other
implementation. Thus, your thresholds are probably not the same as
someone else’s.
Yet it is handy to start out with a commonly-accepted set of values.


It is for this reason that many APM solutions include a set of templates for
known server, network, and application types. If it is considered an industry
standard for database servers to maintain an acceptable processor
utilization threshold of 90%, then having that as a starting value helps
you bring your Service Model into operation much more quickly.
You should also recognize that your Service Model is intended to be an
organic construct. Over time, your service will change and evolve, and with it
so will its Service Model. For example, you may find down the road that
processor utilization over 80% actually causes a down-level problem with
your system, or that another service component needs to be added into the
model to support a new business endeavor.
An effective APM solution’s Service Model designer will allow you to
reposition and reconfigure Service Model components and configurations at
any time to mirror your ever-changing infrastructure.

Creating Your Service Model and Implementing APM


Going from an empty whiteboard to a completed Service Model is going to take effort.
Isolating the right service components, identifying their expected behaviors, and getting
them into a format that your APM solution’s designer will accept are all steps in a project
that will involve no small amount of management. You’ll find as well that your first
attempts at creating that model will usually require a measure of adjustment as you learn
more about your services.

The process of actually creating that model and implementing your APM solution tends to
require formal project planning with the appropriate stakeholders if it is to be successful.
To assist, consider the following seven-step process as a common methodology in creating
a service model and ultimately implementing your APM infrastructure.

Step 1: Selection
The first step in any APM implementation project involves identifying its ultimate
outer boundary. Although an APM solution can monitor virtually everything in your IT
environment, it may not make sense to do so. Monitoring of environments that aren’t in
production can create alert storms as those environments go through development and
testing. Integrating APM into unrelated IT infrastructure elements can create islands of
monitoring that don’t mesh into your Service Model. And, you may find that some IT
components just don’t have a measurable impact on your business’ bottom line.

When thinking about where you should place your outer boundary of monitoring, consider
first those elements that you define as business services. A business service is one whose
operation can be quantified in terms of dollars and cents. By limiting your APM solution (at
least, in the beginning) to the revenue-impacting portions of your IT infrastructure, you
will have a much greater capability of getting your arms around its initial implementation.
Once in place, you may consider adding services as necessary.

113
Chapter 6

Note
It is crucial in the selection process to recognize that common infrastructure
services such as name services, directory services, and the like can also be
critical components of a business service. When a business service relies on
these common infrastructure services for its support, they automatically
become a part of your Service Model's hierarchy. Any outage of an
infrastructure service can lead to an outage of the business service, which
incurs a revenue impact.

Be aware also that the implementation of an APM solution is more than just a technology
insertion. APM’s quantitative data has a tendency to change entire business processes.
Thus, gathering the right stakeholders in your organization—those who can impact
business process maturity—will ensure that your business ultimately recognizes the total
benefit from an APM implementation.

Step 2: Definition
Once the proper boundary has been selected and agreed upon for the initial APM rollout,
the next step involves formally defining each of the components that make up that
service. This process can be quite exhausting, as it requires a precise decomposition of each
element that makes up the service. A number of data points must then be defined for each
component that has been isolated. Consider the following data points as the minimal set
that are critical to creating that component’s abstraction:

• Component Name and Description. For each isolated component, a unique name
and description of that component is necessary for later tracking.
• Users. Understanding the type of users as well as the expected count of users that
will make use of each isolated component is useful for identifying the level of
affected users during an outage.
• Outage Impact. Quantifying the revenue impact associated with a loss of this service
helps derive data for executive dashboards.
• Baseline and Desired Behaviors. Technical metrics and their desired threshold
levels are necessary. These metrics will link back to observed behaviors through an
APM solution’s monitoring integrations. It is likely that the definition of these
behaviors will consume the largest part of your definition activity.
• Dependencies. Lastly, identifying the dependency chain among components helps
to later draw the interconnections between elements in the Service Model.
Remember that dependencies needn’t necessarily be limited to directly required
services. Dotted-line dependencies are also important for capturing the impact
associated with any loss of service.
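For illustration, the data points above might be captured as a small record structure before being entered into a spreadsheet or designer tool. Every field name and value below is an assumption made for this sketch, not a schema from any particular APM product:

```python
from dataclasses import dataclass, field

@dataclass
class ServiceComponent:
    """One row of the definition spreadsheet: a single isolated component."""
    name: str                     # unique component name
    description: str              # what the component does
    expected_users: int           # count of users relying on the component
    outage_cost_per_hour: float   # revenue impact, for executive dashboards
    thresholds: dict = field(default_factory=dict)       # metric -> desired limit
    depends_on: list = field(default_factory=list)       # directly required components
    soft_depends_on: list = field(default_factory=list)  # "dotted-line" dependencies

web_tier = ServiceComponent(
    name="OnlineBanking-Web",
    description="Front-end web servers for the online banking service",
    expected_users=25_000,
    outage_cost_per_hour=40_000.0,
    thresholds={"cpu_pct": 80, "response_ms": 1500},
    depends_on=["OnlineBanking-DB", "DNS"],
    soft_depends_on=["ReportingService"],
)
```

Capturing each component in a uniform shape like this makes the later translation into a designer tool largely mechanical.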


Data gathered during this phase is most often captured into an external spreadsheet or
database. That external document can be used by the project team in the next phase as the
Service Model itself is built. It is worth mentioning here that this definition phase also has
the benefit of documenting a system in ways that may not have been previously
accomplished. Thus, your level of process maturity naturally increases as a function of
completing this documentation activity.

Step 3: Modeling
Once finished with the formal definition of each component, your next step will be to use
this data to construct the model itself. This process translates the externally-captured
information in your spreadsheet into a form that works within your APM solution’s design
tool.

The process of actually creating the model should be relatively trivial if the right level of
effort was placed into the previous step. With the right level of detail in your definition
spreadsheet, you should already know which elements should be entered as well as their
dependencies and initial metric thresholds. The net result of this activity will be a
completed Service Model that is ready to accept monitoring data once integrations are in
place.
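To make the idea of the model's hierarchy concrete, here is a minimal sketch: a dependency map whose parent status is derived from its children. The "worst child wins" rollup rule shown is one common convention rather than a requirement, and all component names are hypothetical:

```python
GREEN, YELLOW, RED = 0, 1, 2  # ordered so that max() picks the worst state

# Parent -> children dependencies, as captured in the definition spreadsheet
model = {
    "OnlineBanking": ["Web", "Database"],
    "Web": ["DNS"],
    "Database": [],
    "DNS": [],
}

# Current states reported by monitoring integrations for each component
observed = {"DNS": GREEN, "Database": YELLOW, "Web": GREEN}

def rollup(component):
    """A component is as unhealthy as its own state or its worst dependency."""
    own = observed.get(component, GREEN)
    return max([own] + [rollup(dep) for dep in model[component]])

print(rollup("OnlineBanking"))  # 1: yellow, dragged down by the database
```

Once monitoring data flows into a structure like this, the top-level service status updates automatically as any dependency changes state.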

Step 4: Measurement
The fourth task in this process begins with the installation of monitors into your
environment. This process ties the Service Model’s empty framework into the actual data
that it will use in generating its calculations. The actual installation of monitors can take an
extended period of time: APM’s wide swath of monitors hooks into every part of your
environment, requiring a correspondingly large amount of change control.

Properly managing this installation will require the support of change control as well as
configuration control stakeholders in your environment. It will also require the technical
support of each technology’s administrators, an activity that can involve personnel and
project management. Further, achieving the buy-in of each component’s administrative
stakeholder can require assistance from business management due to the politics of
systems administration.

Note
For this reason, be aware that this step can be disruptive to your business
service if not completed with care. The installation of agents, incorporation of
agentless monitoring, and initial rollout of EUE monitoring can impact the
normal production of the environment for an initial period until monitors are
correctly tuned.

Once implemented, your APM environment will need a period of steady state to allow those
incorporated monitors to gather their data and begin filling out the picture of service
behavior. This period can be relatively short or rather extended as you ensure your
monitors are gathering the right kinds and volumes of data.


APM and Third-Party Monitoring Solutions


In implementing APM’s monitoring integrations, you do not need to duplicate
the effort of existing third-party monitoring solutions—nor is it necessary to
rip existing solutions from the environment to replace them with those
native to an APM solution. The types of data gathered through an APM
solution are often very similar to those collected by others.
Effective APM solutions will include the ability to tie into existing monitoring
infrastructures. As these infrastructures are already in place, this process can
significantly reduce the amount of time and effort necessary to complete step
four.
When implementing a monitor-of-monitors configuration, care must be
taken to tie in the third-party monitoring solution as a dependency for the
entire environment. Thus, ensuring the continued functionality of the
external monitoring platform is a critical success factor in maintaining the
functionality of your APM solution.

Step 5: Data Analysis


Environments that implement an APM solution must be cautious not to roll that solution
into production until a period of evaluation is completed. It is not uncommon for the initial
abstraction of the business service to include errors in its design or development. Further,
the actual monitor thresholds may not be properly tuned for production use. Rolling such a
system into production too early can result in alert storms, which tend to have a negative
effect on users’ opinions of the system.

During the data analysis phase, installed monitors should be evaluated for their
effectiveness in gathering the right data to fill out the Service Model. As with the previous
phase, this phase tends to require the involvement of each component’s administrative
stakeholders. Those stakeholders can assist in quantitatively identifying the behaviors that
should be captured as well as tuning the appropriate monitoring thresholds. Even with
APM solutions that leverage templates, those common settings must be correctly re-tuned
for each particular environment.

Step 6: Improvement
Through any data analysis phase, you will find errors in Service Model development or
inconsistencies in the data being gathered. The sixth step in this process needn’t
necessarily be an isolated step that occurs after step five. This improvement process of
retuning settings can occur in parallel with data analysis, iteratively identifying and
resolving areas of improvement in the overall implementation.


Step 7: Reporting
Lastly is the reporting phase. This final step in an APM installation concerns itself with the
development of visualizations for the rendering of data. This process in and of itself can
require its own project plan, as multiple types and classes of visualizations are likely to be
desired by the different users of the system:

• Triage. Help desk individuals are often the first line of defense in identifying and
isolating problems within IT systems. For this reason, visualizations that enable this
team to recognize that a problem is occurring and provide initial triage support are
necessary. These visualizations tend to be of low data resolution, providing just
enough information to help teams identify which resources to engage for resolution.
• Administrative and Developer. Administrators require their own set of
visualizations for understanding the behaviors of their systems under management.
This set commonly deals with low-level behaviors that can be managed by
administrators as well as visualizations that assist with deep troubleshooting.
Developers need their own capabilities as well; however, these tend to lean towards
detailed information about code operations and performance. Developer
visualizations are commonly used in finding and resolving unoptimized
areas in custom code.
• Management. Technically-minded individuals needn’t be the only consumers of an
APM solution’s data. Those in management as well can leverage an APM solution for
identifying long-term trends in service usage, revenue impacts from both outages
and successes, and high-level project status.
• End User. An APM solution’s visualizations can also provide benefit for a service’s
end users. End users appreciate when they’re given useful information about the
functionality (or non-functionality) of the systems they’re working with. Users who
are given actionable information about system status can make their own decisions
about working with the system or delaying attempts until full functionality is
restored.
Cross Reference
A much larger discussion on the creation and use of APM visualizations will
occur in Chapter 7.

Using Your Service Model


Although much of the deep discussion on actually using your APM solution is left for
Chapter 8, there are a few items associated with the Service Model itself that warrant
review once your Service Model is in place and ready for production. Actually making use
of your APM implementation is an activity that will be quite different based on the role of
the individual in the organization.


Tuning Monitor States


During the initial production use of your APM solution, you are likely to find a large
number of false positives as the monitor tuning process is incrementally completed. This
process can consume a large amount of time, as the threshold values for individual
monitors are adjusted. During this process, you will likely find yourself watching the model
in real time to identify when certain behaviors occur within your system.

Keep in mind that an APM solution comes equipped with a set of tools that enable the rapid
troubleshooting of applications as problems occur. The Service Model is only one of those
tools. Leveraging specific tools for network management, server monitoring, and
transaction following will assist with identifying the behaviors that are potentially of
interest within the model. One example of the use of such tools relates to the server
performance information displayed in Figure 6.7.

Figure 6.7: Server discovery tools identify performance conditions in the existing
infrastructure.

The data gathered from a tool such as this one provides a glimpse into performance
conditions within the existing infrastructure. In this example, the processor utilization and
queue length are measured along with virtual memory usage and interwoven queue
percentage. Information gathered through current and historical analysis of components
helps to identify the actual thresholds being experienced on a particular piece of hardware.

You can see in Figure 6.7 that virtual memory tends to average around 15% utilization over
the measured period. This information becomes a starting point to develop an
understanding for the baseline behavior of a service component. By recognizing that this
server’s baseline utilization of virtual memory remains around 15%, it is possible to tune
its metric to a more precise level.
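As a rough sketch of that tuning arithmetic, a baseline can be computed from historical samples and an alert threshold placed a margin above it. The "mean plus three standard deviations" rule used here is a common starting heuristic, not a recommendation from any specific tool, and the samples are invented:

```python
import statistics

# Hypothetical hourly virtual-memory utilization samples (percent)
samples = [14, 15, 13, 16, 15, 14, 17, 15, 16, 14]

baseline = statistics.mean(samples)   # observed baseline behavior (~15%)
spread = statistics.stdev(samples)

# Alert only when usage is well outside the observed baseline
threshold = baseline + 3 * spread

print(round(baseline, 1), round(threshold, 1))
```

A threshold derived from observed behavior like this is a starting point; it still needs to be validated against periods of known-good and known-bad performance before production use.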


This example shows how the tuning process requires effort to both tune as well as de-tune
configured counters. Finding the correct setting for a metric that is experiencing too many
false positives is relatively easy: Simply dial down the metric’s threshold value until it no
longer turns red during a period of acceptable performance.

However, this is only one half of the necessary tuning activity needed for counters. The
other half involves watching for false negatives. In this case, metrics have been de-tuned
too far, and as such aren’t providing value in alerting on system behaviors. In
these cases, it is often necessary to look back at current and historical performance to
identify when counters aren’t tuned enough.

Eliminating Rapid State Change


Monitors by nature always reside in one state, based on the metrics information being fed
to them. Thus, a monitor can either be alerting on an impacting condition or reporting an
“all clear.” It is critical to recognize not only that a configured monitor has changed state
but also when that state change occurred.

This is particularly important for situations when a monitor may be rapidly shifting
between states due to a problem with a component or when its threshold is tuned to a level
between existing states on a system. Monitors in a rapid state change condition can be
exceptionally difficult to track when historical analysis capabilities are not present in the
APM solution. Your APM solution should include the ability to look into the short-term past
for identifying when a monitor’s state has changed. Situations where rapid state changes
are occurring with a particular monitor threshold should be resolved through retuning.
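One simple way to surface a flapping monitor is to keep a sliding window of its recent state transitions and flag it when the transition count grows too high. This is a sketch of the concept, not any vendor's implementation; the window size and change limit are arbitrary assumptions:

```python
from collections import deque

class FlapDetector:
    """Flags a monitor that changes state too often within a sliding window."""
    def __init__(self, window_seconds=600, max_changes=4):
        self.window = window_seconds
        self.max_changes = max_changes
        self.changes = deque()   # timestamps of observed state transitions
        self.last_state = None

    def observe(self, timestamp, state):
        if self.last_state is not None and state != self.last_state:
            self.changes.append(timestamp)
        self.last_state = state
        # Discard transitions that have aged out of the window
        while self.changes and timestamp - self.changes[0] > self.window:
            self.changes.popleft()
        return len(self.changes) > self.max_changes  # True -> needs retuning

detector = FlapDetector()
states = ["ok", "alert", "ok", "alert", "ok", "alert"]
flapping = [detector.observe(t * 60, s) for t, s in enumerate(states)]
print(flapping[-1])  # True: five transitions inside the ten-minute window
```

A monitor flagged this way is usually a candidate for retuning its threshold rather than a sign of a genuinely unstable component.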

The Service Calendar


Lastly is the addition of Service Calendar data into the APM solution, for further refining
the level of impact associated with service quality reductions. Consider an organization
that has a US-based business service that is used by customers worldwide. It is commonly
expected that there will be periods of high usage along with other periods of low usage.
Perhaps this application is used by customers during the normal work day. In this case,
assuming that the normal workday occurs between the hours of 8:00am and 5:00pm local
time, the period of high usage for users in the continental United States runs, on an Eastern
clock, from 8:00am when the East Coast workday begins until 8:00pm when the West Coast
workday ends. These extra hours take into account the four different time zones that split
the United States.
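The underlying window arithmetic is easy to express in code. This sketch, with an assumed date and zone names, uses Python's zoneinfo to express the earliest workday start (8:00am on the East Coast) and the latest workday end (5:00pm on the West Coast) on a single reference clock:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

day = (2024, 3, 12)  # an arbitrary weekday, chosen for illustration

# Earliest start: 8:00am on the East Coast; latest end: 5:00pm on the West Coast
start = datetime(*day, 8, 0, tzinfo=ZoneInfo("America/New_York"))
end = datetime(*day, 17, 0, tzinfo=ZoneInfo("America/Los_Angeles"))

# Express both boundaries on one reference clock (Eastern)
eastern = ZoneInfo("America/New_York")
window = (start.astimezone(eastern), end.astimezone(eastern))
print(window[0].hour, window[1].hour)  # 8 20 -> 8:00am to 8:00pm Eastern
```

The same approach extends to EMEA and Asia-Pacific zones; the reference-clock window simply widens or wraps around midnight as additional regions are added.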

A business that expects use of its system by only users in the United States can assume that
the remaining period of the day corresponds to their application’s period of low use.
Outages during that period are likely to impact relatively few users compared with an
outage occurring during peak hours. Maintenance activities are also commonly scheduled
during these low-usage hours.


The timing calculations for applications across only a few time zones are relatively easy.
But consider the complexity that occurs when that same application is used by customers
in the EMEA and Asia-Pacific regions as well. The multiple-hour differences in time zones
between the continental United States, EMEA, and Asia-Pacific regions mean that some
traditionally low-usage periods actually correspond to times when other regions are
beginning their workday.

Service Calendar information aids an APM solution by identifying when the high-usage and
low-usage patterns exist for the monitored application. Based on the source for expected
users, an APM’s Service Calendar can assist triage teams with identifying the true level of
impact to global users. It further assists administrators with finding the best time of day to
schedule outages for updates and fixes. When completing your Service Model, consider
including Service Calendar information into its calculations to ensure that a true level of
impact is recognized.

The Service-Centric Approach Quantifies Quality


The goal of this chapter has been to bring a quantitative understanding of how an APM
solution defines the quality of a particular service. This understanding is necessary because
the data on quality is used by an APM solution for identifying many key factors to include
the user impact of outages, the revenue impact when users cannot connect, and the overall
representation of performance. As you can see through this chapter’s discussion, quality
can be quantified through the hierarchical calculation of an APM solution’s Service Model.

Yet the Service Model is not a construct that is designed to be used by APM’s end
consumers. That Service Model exists in the background, to be used by APM’s calculations
engine. For users, a set of visualizations or “dashboards” are commonly constructed that
provide the right level of information. Chapter 7 discusses how those dashboards are
created and used while presenting a run-down on common dashboard elements you’ll find
in an effective APM solution.

120
Chapter 7

Chapter 7: Developing and Building APM Visualizations
It’s 10:22pm and TicketsRus.com COO Dan Bishop finds himself at yet another hotel bar,
celebrating the end of a successful day at the National Ticket Sellers Conference. At the
conference full of business executives like himself, Dan’s been spending the week learning
about new technologies and tactics in servicing his customers.

One of those new tactics he now cradles in his hand as he orders a second drink. Through his
PDA’s Web browser, Dan is showing a fellow conference attendee some of the visualizations
from his new APM solution.

“So, here’s our internal Web site,” Dan explains, “On this site, I can take a look at the rate of
incoming orders. I can see which events are tracking to expectations and which ones might
need a little extra help in getting sold. Over here, I can track revenues on a per-month basis,
per-day, or even down to the individual hour.”

The other attendee is Lee Mitchell, CEO of Dan’s closest competitor and a long-time personal
friend. Lee leans in close to peer at the PDA’s not-entirely-tiny screen. On that screen are
metrics he’s used to seeing in his own systems. But something is different, and he can’t put his
finger on it.

Lee counters, “That’s great, Dan. But we’ve been able to pull these metrics for years. I’ve got
something fairly similar back at the office, although I’ll admit that pulling it up on the PDA
nets you extra ‘gee-whiz’ points. My guys have built something like this that pulls reports right
out of our accounting system to get me the same kinds of information.”

Dan smiles because this is exactly the road he’s been leading Lee down for the past 20
minutes. His old system was a lot like Lee’s in that he could pull metrics. But those metrics
always needed to be pulled. Once a report was generated, it was only a static representation
from a single system. With his APM solution, everything is real time and integrates not only
with the accounting system but also his entire IT infrastructure.

“A-ha!” proclaims Dan with a grin, “I know what you’re talking about, and that’s exactly
where we were about 12 months ago. Now, take a look at this…”


Dan clicks on a few places on the screen and switches the view to a geographic representation
of his multiple data centers. Each data center shows a stoplight chart—green, red, or
yellow—displaying each data center’s representation of overall health. Dan continues, “In this
view, I can see which of my data centers are meeting which parts of their SLAs, and which
aren’t servicing my customers.”

He clicks to drill the view into the metrics for his data center in Rochester, New York, which
curiously shows a yellow condition. Seeking more details, he discovers that his Rochester ISP is
experiencing a problem that impacts his bandwidth to the Internet.

“So, here you can see that we’ve got an issue in Rochester,” Dan continues. “Some Internet
device is probably having a problem, which means that fewer people are connecting through
that point of presence.”

Lee scratches his head, “I’m still not seeing the a-ha moment here, Dan. I’ve got this kind of
data as well. My crew would be looking at this in the NOC right now if we were having a
problem, which, come to think of it, we might in Rochester if you are too.”

Dan chuckles, “Here’s the a-ha. All of this data is being gathered from all of my systems,
crunched through some service quality as well as business logic, and presented to me all at
once. Want to see something really impressive?” Dan clicks a few more links on the page, “This
screen aggregates my revenue impact data with that system performance data. It tells me
exactly how many users are impacted by the Rochester situation, how much money I’m losing,
and where my data center manager needs to send teams to fix the problem. The entire
infrastructure is completely visible, right here through Web pages I can access on my PDA.

“It gets better. Even my developers use it to trace specific lines of code that aren’t working
correctly. Everyone from the techies to my aging brain gets the visualizations they need,” Dan
stops as the formerly-yellow light turns green, “Hey, looks like they’ve fixed the problem!”

Lee’s eyes widen as he realizes the complete vision such a system brings, “Alright, you win.
Drinks tonight are on me. Now, tell me more about this system.”


Visualizations Are the Core of APM


If a picture tells you more than a thousand words, this chapter is one not to miss. This
guide’s growing explanation of APM has introduced each new topic with an end goal in
mind. That end goal—both for this guide as well as APM in general—is to gather necessary
data that ultimately creates a set of visualizations useful to the business.

It is the word “useful” that is most important in the previous sentence. “Useful” in this
context means that the visualization is providing the right data to the right person. “Useful”
also means providing that data in a way that makes sense for and provides value to its
consumer.

The concept of digestibility was first introduced in this book’s companion, The Definitive
Guide to Business Service Management. In both guides, the digestibility of data relates to the
ways in which it can be usefully presented to various classes of users. For example, data
that may be valuable to a developer is not likely to have the same value for Dan the COO.
In Dan’s role, the failure of an individual network component matters less than how that
failure impacts the system’s customers. Each person in the business has
a role to fill, and as such, different views of data are necessary.

Yet what’s interesting is that each of these different types of data must still be gathered to
be useful. Lee’s solution gathers data from only a single location, namely his accounting
system. As a result, his results can only be based on the quality of data available from that
system. If, for example, a network problem occurs in a data center, Lee’s accounting system
can’t factor the problem into its reports. As a result, Lee’s data isn’t fully representative of
the actual conditions “on the ground” across his entire customer services infrastructure.
You’ll find that this integration of business metrics into traditional monitoring represents a
key way in which APM impacts business decisions.

Cross Reference
Chapter 9 will discuss this linkage in further detail, focusing on how the topic
of Business Service Management (BSM) is impacted by the data gained
through an APM solution.

Even when dollars and cents calculations aren’t part of an APM Web page, visualizations
assist the business in other ways: They provide a mechanism for finding faults in the
environment. They enable traceability from the initial discovery of the fault down to its
actual root cause. They also enable an otherwise-impossible glimpse at the medium- and
long-term health of the system, displaying hard-number metrics that report on the quality
of service being delivered to customers.
In this chapter, you’ll step through a series of mock-up visualizations that illuminate these
situations and others. The goal here is to show you smart ways in which visualizations can
be generated out of the monitoring integrations we’ve laid into place in previous chapters.


Note
This chapter’s pictures show how Web-based dashboards and their
visualizations bring value to raw data. Being Web-based, these dashboards
are designed to show a large amount of data at a single glance. As such, they
can appear very small in print. This is done intentionally to illustrate how
much data can be consolidated into a traditional browser-based view.

For our purposes, the design of the visualization and the ways in which it
represents its data is more valuable than the actual data.

Useful Visualizations for Every Data Consumer


No visualization is effective unless it is created first with its consumer in mind. If that
consumer can’t digest what’s being presented to them, the information being displayed is
valueless. Think about the types of consumers who in your business today might benefit
from the data an APM solution can gather:

• Service desk employees and administrators gain troubleshooting assistance and an
improved view into systems health.
• IT managers are assisted in positioning troubleshooting resources to the most
crucial problems and in planning for expansion based on identified problem domains.
• Business executives gain a financial perspective and better quality data that is
formatted specifically for their needs.
• Developers are able to dig into specific areas where code is non-optimized or
requires updating.
• End users are proactively notified when problems occur, maintaining their
satisfaction with your services.
Each of these individuals benefits in some way from higher-quality information.
The next section will look at each of these data consumers in detail with an eye towards the
types of visualizations that are digestible to each. The first and most obvious groups are
service desk employees and administrators, as they represent a class of data consumer who
needs to know first when applications or application components break.

Service Desk Employees


Figure 7.1 shows one example of data that can be useful to this class of consumer. Here, a
stoplight visualization has been created that shows a number of top-level applications.
These top-level systems are represented both by period of time as well as by end user. In
this visualization, the value for the system changes from green to yellow or red when the
application is not meeting its expected levels of service.


Figure 7.1: A top-level stoplight chart that alerts when systems violate established
metrics for service quality.

A visualization like this is useful for service desk employees as well as administrators
because it answers the top-level question “Are you functioning?” If all the lights are green
in this graphic, administrators and service desk employees can be assured that the system
is and has been functioning to expectations.

In contrast, when any of the visualization’s cells changes, it can be assumed that some
change has also occurred in the application. Both “when” and “where” questions are
answered at the same time, with the top representation showing the location and count of
affected users and the bottom showing how long the problem has occurred.

In this example, the second line associated with the online banking system began
experiencing a yellow condition at roughly 4:00a, which escalated to a red condition at
around 6:00a. Not all users are impacted, with those in EMEA experiencing a greater
number of impacted users than others.


The Data Needs are Enormous


You might be wondering why this guide’s APM discussion has waited until
this chapter to start its conversation on visualizations. With the
visualizations being APM’s true value to the environment, delaying this
discussion until Chapter 7 appears outwardly counterproductive.
Yet, consider the amount of data that is required in order to even generate a
visualization like the one shown in Figure 7.1. Health metrics for each and
every server, network component, user experience element, and application
analytic must be identified, monitored, configured, and tailored before a top-
level visualization like this could ever be created. The tasks to get to this
point require no small amount of effort, with the data gathering and
calculating requirements equally as comprehensive.
APM solutions are uniquely capable of creating what appear to be simple
graphics because of the sheer magnitude of their instrumentation. Simply
put, underneath this simple graphic are dozens, if not hundreds, of individual
calculations that occur in real time to determine when a green light turns red.
The information here is only the start of an enlightened service desk’s triaging process.
You’ll see in Figure 7.1 that the problem relates in some way to the online banking system.
Although that information is useful for knowing that a problem exists, it provides nothing
for helping troubleshooting administrators track down where it exists. The service desk
needs more detail that it can use in reporting the issue to those teams.

That information comes through an APM solution’s drill-down visualizations. In Figure 7.2,
the visualization for the online banking system has been drilled down to view a few of the
different technology elements that enable its functionality. Here, servers, databases, the
network infrastructure, and software elements are all shown with a slightly-greater level of
granularity over the very simple graphic shown in Figure 7.1. The additional bits of
information provided here help the service desk identify that the problem is likely due to a
server fault, helping them identify which group of individuals may be best suited to resolve
the situation.

Figure 7.2: Service quality details associated with technology elements.


Yet this level of data is still not something that is useful for a troubleshooting
administrator. At this point, the presence and domain of the problem have been identified,
but its location within that domain remains unrecognized. In order to determine that
information, even deeper monitoring integrations are required.

Remember that an APM solution gathers its metrics from multiple sources. Those sources
can be the instrumentation within the applications themselves, they can come from various
network components and probe devices, or they can come from the actual server metrics
themselves.

Figure 7.3 shows how the information in Figure 7.2 can be expanded further to view the actual
server monitor that initially tripped the alarm. Here, red lights are seen for the online
banking system as originally seen in Figure 7.1, and ultimately that system’s infrastructure
elements. One of those elements is the Web server at 10.4.224.42. Drilling down into the
details of that element, it appears that the server is experiencing a CPU overuse condition.

Figure 7.3: Tracing a problem condition through a service tree.

With this information in hand, the service desk can now transfer ownership of the problem
to the correct set of administrators for its resolution.
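The drill-down behavior just described can be sketched in code as a worst-status rollup over a service tree. The element names, the three-color stoplight model, and the tree structure below are illustrative assumptions, not any particular APM product's API:

```python
# Sketch: roll the worst health status up a service tree, the way a
# drill-down visualization does. Names and the three-state "stoplight"
# model are illustrative only.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def rollup(node):
    """Return the worst status found at this node or any of its children."""
    worst = node.get("status", "green")
    for child in node.get("children", []):
        child_status = rollup(child)
        if SEVERITY[child_status] > SEVERITY[worst]:
            worst = child_status
    return worst

online_banking = {
    "name": "Online Banking",
    "children": [
        {"name": "Database tier", "status": "green"},
        {"name": "Network", "status": "green"},
        {"name": "Web server 10.4.224.42", "status": "red"},  # CPU overuse
    ],
}

print(rollup(online_banking))  # prints "red"
```

In a real solution the same rollup logic runs continuously against live metrics; the sketch simply shows why a single red leaf turns the top-level stoplight red.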

Administrators
Top-level visualizations like the ones previously shown are useful for an IT environment’s
first responders. Once an alert associated with a problem system has been raised,
administrators can be notified to track that problem to its root cause. It is within this
step that APM most obviously streamlines the triaging process.

But triage and resolution are two different things. Recognizing that a CPU overuse
condition has occurred on a server does nothing to assist in bringing that issue to
resolution. That task falls to the business's administrators, who must first identify the
issue's root cause, a process that can also be assisted through APM visualizations.

First, consider a fully-unmonitored environment. In such an environment, the root cause
identification activity tends to consume the largest part of the troubleshooting process.
This is the case because a fault in a system often doesn't manifest directly into something
that is observable by an administrator. Tracing the recognized issue to the actual problem
requires skill and experience, and often a bit of luck or trial and error. Alternatively, it can
be accomplished through a data-driven approach.

Figure 7.4 shows an example of an administrator drill-down that analyzes multiple
perspectives of the business system at the same time. In the top perspective is data relating
to the system's front-end performance, with Web server and application server metrics
being displayed in the middle and lower sections.

Figure 7.4: Using visualizations to trace a fault across multiple tiers.

For each system component in this graphic, multiple types of data are presented. The front-
end server’s count and rate of transactions are specified, along with the C-N-S Spread for
those transactions. Web server and application server transaction details are also aligned,
providing—like before—a single glimpse of system health across each element.

For this example, assume that metric thresholds were put into place prior to the fault.
These thresholds quantified the acceptable and unacceptable behaviors across each
monitored system component. For example, the metric for HTTP server errors might have
been configured to alert when the count of errors grew beyond zero. As a result, a
troubleshooting administrator can quickly identify, by the red-colored columns, that a
greater-than-acceptable quantity of HTTP errors is occurring.
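A threshold rule of this kind reduces to a very small amount of logic. The following sketch uses invented metric names, limits, and sampled values to show how a zero-tolerance HTTP error threshold would flag a red column:

```python
# Sketch: evaluate "alert when a count exceeds its maximum" rules against a
# sample of metrics. Metric names, thresholds, and values are invented.
thresholds = {"http_errors": 0, "server_time_ms": 500}

def breached(metrics, limits):
    """Return each metric whose sampled value exceeds its configured maximum."""
    return {name: value for name, value in metrics.items()
            if name in limits and value > limits[name]}

sample = {"http_errors": 14, "server_time_ms": 820, "transactions": 1200}
alerts = breached(sample, thresholds)
print(alerts)  # both breached metrics would render as red columns
```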

Further, the amount of C-N-S “Server time” spent by both front-end and Web servers is
greater than expected. The combination of these two pieces of information helps the
administrator further track down the possible CPU overuse condition.

Note
It is important to recognize that most APM solutions are equipped with a
dashboard designer. This designer enables these visualizations to be
modified as necessary to suit the needs of the consumer. Thus, if your
business needs a view like the one in Figure 7.2 but with different data, it is
possible to create a slightly different view.

IT Management
Directing these IT personnel in a cohesive manner is another activity that can be an ad hoc
exercise without the right data. Consider the situation when multiple problems occur at the
same time in an IT infrastructure, a situation that isn't terribly uncommon. When multiple
parts of a large and complex system experience problems at once, directing personnel to
resolutions that have the greatest impact is exceptionally important.

You might, for example, send out a team of administrators to fix a database problem when
that team's time would be better spent fixing a simultaneous email server problem. As
systems grow in complexity, determining the right way to provision your human resources
can be as problematic as running the system itself.

An alternative way to handle the provisioning of resources to problems uses the same data-
driven approach as the previous example. Rather than making educated guesses on which
problems impact which users, APM can build this information out of the data it gathers
from your system components. Figure 7.5 shows how this information might look in a
resulting visualization.

Figure 7.5: Affected users by system component.

The module shown here is one piece of a larger visualization; the availability and
service quality information for each application is not displayed. Figure 7.5 does, however,
show a list of applications that may or may not be in a degraded state. For each application
that is experiencing a problem, the count of affected and total users is displayed.

Graphics like this one quickly assist IT management with directing troubleshooting
resources to the applications with the largest impact on operations. Here, the online
banking system’s outage impacts more than 50% of its total users, making it a greater
priority than the email system (or any other outage) for resolution.
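The prioritization logic behind a module like Figure 7.5 can be approximated in a few lines. The application names and user counts below are invented for illustration:

```python
# Sketch: rank simultaneous outages by the share of users they affect,
# as a module like Figure 7.5 implies. All names and counts are invented.
outages = [
    {"app": "Online banking", "affected": 5300, "total": 10000},
    {"app": "Email",          "affected": 1200, "total": 8000},
    {"app": "Trading",        "affected":  300, "total": 2500},
]

for outage in outages:
    outage["impact"] = outage["affected"] / outage["total"]

by_priority = sorted(outages, key=lambda o: o["impact"], reverse=True)
print([o["app"] for o in by_priority])  # online banking first, at 53% affected
```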

Visualizations also help IT management with the planning and budgeting aspects of their
job. In this case, historical data can be used to create visualizations that document where IT
is spending the majority of its time. Figure 7.6 shows a Pareto chart created to
document the number of outages over a period of time for a set of business services. Pareto
charts are used to highlight the most important issues among a set of potential issues. The bar
chart for each business service documents the number of issues for that service, while the
line graph shows the cumulative frequency of occurrence.
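The two data series behind such a Pareto chart, the per-service bars and the cumulative-percentage line, can be computed as follows; the outage counts below are invented for illustration:

```python
# Sketch: the two series behind a Pareto chart -- per-service outage counts
# (the bars) and cumulative percentage of all outages (the line).
outage_counts = {"Trading": 42, "Citrix": 31, "Credit Services": 27,
                 "Email": 9, "Payroll": 5}

ranked = sorted(outage_counts.items(), key=lambda kv: kv[1], reverse=True)
total = sum(outage_counts.values())

cumulative, running = [], 0
for service, count in ranked:
    running += count
    cumulative.append((service, count, round(100 * running / total, 1)))

for row in cumulative:
    print(row)  # e.g. ('Trading', 42, 36.8) ... ('Payroll', 5, 100.0)
```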

Figure 7.6: A Pareto chart shows a historical breakdown of outages.

In this case, a historical Pareto chart gives IT management the data it needs to identify
where the majority of issues occur within an environment. In this example, Trading, Citrix,
and Credit Services represent the top-three issues seen by the environment over the
measured period of time. Because these three services are experiencing the highest count
of issues, they make excellent low-hanging fruit for expansion or re-architecture activities.

Business Executives
Some situations can arise that are not technical in nature and as such are the purview of
business executives. Perhaps an Internet connection from a particular service provider
experiences a problem that is caused by the actions of the service provider itself. There is
no technical problem with the connection; it is merely not meeting its Service Level
Agreement (SLA) obligations.

Traditional monitoring solutions might overlook these types of situations due to their
heavy focus on technical metrics. Yet an APM solution’s widespread reach across systems,
networks, and even external connections can identify when executive-level support is
necessary for solving what ends up being a contractual problem. Figure 7.7 shows a sample
dashboard module that merges business contract logic with availability information to
alert when SLA conditions are in breach of contract.
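The contract logic in such a module is ultimately a comparison of measured availability against the contractual target. The 99.9% target and the downtime figure below are assumptions for illustration:

```python
# Sketch: test measured availability against a contractual SLA target.
# The target and the downtime figure are assumptions for illustration.
SLA_TARGET = 99.9  # percent availability promised by the provider

def sla_check(total_minutes, downtime_minutes, target=SLA_TARGET):
    availability = 100 * (total_minutes - downtime_minutes) / total_minutes
    return availability < target, round(availability, 2)

# A 30-day month with 95 minutes of measured downtime on the link:
breach, measured = sla_check(30 * 24 * 60, 95)
print(breach, measured)  # True 99.78 -> a contractual problem, not a technical one
```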

Figure 7.7: SLA fulfillment that is measured with actual data from monitoring
instrumentation.

Even during nominal activities, business executives struggle with a need for information
that they often lack the technical context to interpret. In our chapter example, Dan the COO
might not understand what a network router is when it fails, but he absolutely needs to be
notified when that router failure causes an impact to his business operations.

It is this dissonance between the data that executives can get and the data that they need
that is a primary motivator for APM adoption. APM solutions—most especially when
they are installed as part of a much larger BSM solution—enable the executive to view
information that they can digest and that they truly care about.

Figure 7.8 shows what could be the most simplistic visualization for a business executive,
presenting the instantaneous performance of the IT environment in a dial format. It shows
very simply an aggregate percentage of how well that business executive’s services are
meeting the needs of their customers. The availability of the overarching service itself (via
an internal perspective) as well as the availability of the service to its users (via an external
user perspective) are noted in these twin graphics.

Figure 7.8: A simple dial module that represents availability metrics.

Yet these two graphics show performance only as it occurs at the moment they are read. To
get an executive-level view of service history over a medium-term period of time, another
module similar to that shown in Figure 7.9 is commonly used. This graphic extends the
instantaneous representation of availability over a configurable period of time. Notice how
the easy-to-read format draws the eyes to situations that most business executives want to
prevent.

Figure 7.9: Another module that displays service quality over a period of time.

Dial and bar chart modules like these are commonly used as components of much larger
dashboards. In Figure 7.10, they and a number of other high-level modules are
consolidated to create a single-glimpse view that is useful for the business executive. This
dashboard includes visualizations that show service and user availability alongside service
quality and impact metrics for the business system’s various components. Notice how each
module presents its information in a slightly different way.

Figure 7.10: An executive-level dashboard that contains multiple visualization modules.

Dashboards like these provide information that gives business executives the confidence
that their systems are meeting the needs of their customers. With information that is
updated in real time, the business executive can reduce their need for operational status
reports from each component owner or manager. The result is that executives can spend
more time on value-added activities while reducing the level of attention necessary to daily
operations.

Executive-Level Dashboards Needn't Be Detail Free

It is important at this point to mention that executive-level dashboards
shouldn’t necessarily obscure the details. Dashboards and their
visualizations are by definition designed to be drill-down capable. This
means that the top-level view within an executive-level dashboard can
include the basic stoplight-style charts seen earlier. At the same time, more
detailed information can be enabled through clickable elements on that
dashboard.
This ability to reposition the executive's perspective at every layer helps to
create a more educated executive while at the same time assuring that the
visualizations carry good-quality data.

Code Developers
With business executives requiring a very high-level view of the environment, their polar
opposite is your group of code developers. This group requires an extremely detailed view
of the individual functions being run within a system, broken down into detailed
explanations of inter-device conversations and transaction data. You can argue that the
data needs of this group go even further than what is needed by systems administrators,
because code developers actually create and manage the code that creates your business
system.

To that end, businesses that consider an APM solution must carefully evaluate the
capabilities that such a system can provide for this group of individuals. For example,
traditional monitoring solutions tend to suffer from the "shrink-wrap support"
phenomenon. Here, a monitoring solution very openly offers support for many common
technology products and platforms, such as specific databases, middleware applications, or
network devices. But your business service is likely composed of as much custom code as
these off-the-shelf applications. Thus, the ability to drill into the specifics behind the
inter-application communication is as important as support for the applications themselves.

For example, consider our previous situation in which a Web server was experiencing a CPU
problem. Knowing that the Web server was experiencing an increase in CPU utilization is
less valuable than recognizing the exact Web page or code method that hasn't been
processor-optimized. An effective APM solution should have the capability to peer directly
into database transactions to find such optimization opportunities and present them via
visualizations to your developers.

Such a deconstruction is shown in Figure 7.11. Here, a very simple table has been
generated that contains details about the performance of the front-end Web site. The
specifics here relate to a series of end user operations and their effective performance.
Transactions associated with login pages, search and view policies, search processes, and
logout operations are aligned with their effective rate of performance.

Figure 7.11: User experience monitoring for a front-end application.

Like before, the acceptable transaction rates for each of these activities have been
preconfigured within the APM solution’s logic. As a result, in this image, a developer can
quickly see that each of the measured activities is performing to desired expectations.

Charts, graphs, and tables are never enough with this group of data consumers, as their role
is to always look for areas in which to improve the application. As such, even when an
application is performing to desired specifications, there are always places in which
database queries can be further optimized, Web pages can be accelerated, and applications
can be given more power to accomplish their jobs. Code developers are also charged with
continuing this process even as changes are requested to their applications, whether those
changes be updates to Web sites or deep-level code updates to support new lines of
business.

One central issue with this dynamic rate of change in many business services is
measuring their effective performance over time. Today's slowdown in performance might
be related to a run on a new product with hundreds or thousands of new customers coming
in as new business. It could also be related to a bug fix that was implemented only to
discover that the fix caused more damage than improvement to the system. To that end,
multi-view visualizations like that shown in Figure 7.12 provide yet another way in which
multiple APM monitoring integrations can be tied together in a time-oriented way to track
down performance issues.

Figure 7.12: A multi-view of application performance.

In Figure 7.12, three different views of a business service are gathered on a single pane of
glass. The top view shows the rate or volume of pages being rendered over a 24-hour
period of time. During that same period, the load time of those pages is plotted
alongside an overall representation of application performance.

By aggregating each of these views over an equivalent period of time, a code developer can
quickly identify where correlations occur between different system activities. In this
example, it is easy to see that between the hours of 8:00a and 12:00p and again between
1:00p and 4:00p there is a substantial spike in the volume of Web site pages being
requested by clients. This volume changes from nearly zero to over a hundred Web pages
being rendered per minute by the Web server. At the same time, however, the load time for
these Web pages remains relatively steady. The application performance index of that Web
server also remains consistent over the monitored period.

With these three graphs aligned, a code developer can quickly determine that there
appears to be no correlation between the volume of rendered pages and their effective load
time (for the volume of pages that were monitored). Such a developer can then be assured
that the volume of pages being rendered is not impacting CPU performance, and as such,
does not need code optimization or a hardware expansion.

If, however, the developer does find that some issue with the actual code is causing the
problem, alternate visualizations can be brought to bear to break down the processing of
that Web page into its disparate elements. As was first discussed in Chapter 3, any Web
page is rendered as the sum of a large number of individual parts. Those parts can gather
their data from internal databases or file structures, or can rely on external sources for
data. As such, when one of those parts or external sources is not performing to the level
needed by the Web server, the result is a reduction in performance.

Breaking down those transactions can be accomplished through a chart that looks similar
to Figure 7.13. In this chart, the individual parts that make up a Web page—graphics, HTML
code, scripts, and so on—are deconstructed by filename. Each file is rendered as part of the
Web page in a particular order, with some components overlapping. With the quantity of
elapsed time shown on the bottom, it becomes very easy for a Web developer to see where
delays in page rendering are impacting the overall perception of Web server performance.

Figure 7.13: A transaction breakdown chart for rendering a Web page.
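The arithmetic behind such a breakdown chart is worth noting: because components load in parallel and overlap, the page's total elapsed time is the span of the merged intervals, not the sum of the parts. The filenames and timings below are invented:

```python
# Sketch: compute a page's total render time from per-component
# (start, end) intervals by merging overlaps, as a transaction breakdown
# chart implies. Filenames and timings (seconds) are invented.
components = [
    ("index.html",  0.00, 0.35),
    ("styles.css",  0.10, 0.40),
    ("app.js",      0.10, 0.90),
    ("logo.png",    0.45, 0.60),
    ("quotes.json", 0.90, 2.10),  # a slow external data source
]

intervals = sorted((start, end) for _, start, end in components)
merged = []
for start, end in intervals:
    if merged and start <= merged[-1][1]:
        merged[-1][1] = max(merged[-1][1], end)  # overlap: extend the span
    else:
        merged.append([start, end])

elapsed = sum(end - start for start, end in merged)
print(round(elapsed, 2))  # 2.1 -- dominated by the external quotes.json call
```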

Remember that each individual Web page is the sum of its parts. With each one requiring
different parts for its complete execution, some pages can perform well while others
experience unacceptable load times. The problem in the unmonitored environment lies in
tracking down the problem pages from among those that are performing well.
To the unaided eye, tracking down one problem Web page can be an extremely time-
consuming process.

An APM solution can leverage its end user experience monitoring to keep records of page
performance on all pages at the same time. Aligning those pages to an index of performance
creates a table similar to Figure 7.14. Here, dozens of pages or more can be ranked by their
performance against each other. Pages that experience the highest levels of performance
are given a rating of one, with all other pages given a decimal rating below that number.
Pages with the lowest ratings are experiencing the worst performance, and as such, require
the greatest amount of attention by developers.

Figure 7.14: Measuring the Application Performance Index across multiple Web
pages at once.
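One common formula for such an index is the Apdex score, which rates each page between zero and one based on how many sampled response times satisfy a target threshold. Whether a given APM product uses Apdex specifically is an assumption here; the page names, samples, and 0.5-second target below are invented:

```python
# Sketch: an Apdex-style performance index per page, ranked worst-first.
# Apdex = (satisfied + tolerating / 2) / samples against a target time T.
T = 0.5  # seconds; an assumed target response time

def apdex(samples, t=T):
    satisfied = sum(1 for s in samples if s <= t)
    tolerating = sum(1 for s in samples if t < s <= 4 * t)
    return round((satisfied + tolerating / 2) / len(samples), 2)

pages = {
    "login.aspx":  [0.2, 0.3, 0.4, 0.3],
    "search.aspx": [0.6, 1.1, 0.4, 2.4],
    "logout.aspx": [0.1, 0.2, 0.2, 0.3],
}

ranked = sorted((apdex(times), name) for name, times in pages.items())
print(ranked)  # lowest-scoring pages deserve developer attention first
```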

End Users
Lastly are the end users themselves. This class of users is your ultimate
consumer, yet they are often forgotten when problems occur. Using a real-world
metaphor, keeping your Internet customers informed about situations is just as important
as the airlines notifying you when your flight will be late. An uninformed customer is an
unsatisfied customer, so keeping them aware of situations with your Internet-facing
systems is critical to keeping them coming back.

The problem in many organizations is in relating the information to end users in ways that
are digestible to them. Also problematic is relating the right amount of information: not too
little so as to annoy users or create situations of distrust, and not too much so as to give
away proprietary information to competitors.

For these reasons, end user dashboards similar to Figure 7.15 can be some of the most
difficult ones to correctly configure. You’ll see that dashboards of this type often include the
lowest resolution of information, while at the same time presenting enough data to users so
that they know when problems occur. Typically, three types of data points are given to end
users when creating dashboards like this one: information about current outages, current
status of the infrastructure, and data about any upcoming or planned outages in the future.
With these three pieces of information in hand, end users find themselves empowered
enough to know when problem situations are actively occurring, and when they should
expect to return to access the business service with its full capabilities.

Figure 7.15: An example end-user dashboard.

APM Visualizations Bring Quantitative Analysis to Operations

This chapter has indeed been all about the pictures. This is a necessary discussion for this
point in the guide because it is those pictures—the way they are designed, the data they
carry, the people for whom they are tailored—that are the primary value generated by an
APM solution. With the right monitoring integrations in place, data can be gathered to fill
out these pictures with a quantitative view of your business operations. Deciding what to
do with the result is up to you.

It is that determination of “what to do next” that is the topic of this guide’s next chapter.
Taking a new approach to the APM topic, Chapter 8 will depart from the traditional
conversation to instead tell a story. That story continues the saga of Dan and John and the
rest of TicketsRus.com, but over the course of an entire chapter. Chapter 8 will give you the
opportunity to see an APM solution in action, showing how TicketsRus.com’s APM
implementation can be used to solve a major problem from start to finish. Told in the
narrative format as each chapter’s story, you’ll learn a bit more about the company. You
might also learn a bit more about your own business, and how it goes about solving similar
problems today with—or without—an APM solution’s objective analysis capabilities.

Chapter 8: Seeing APM in Action

You could easily think of this chapter as the “companion” to the previous chapter. In
Chapter 7, you learned about the best practices in creating APM visualizations. By
analyzing a sample of mocked-up dashboards, the previous chapter presented a number of
ways in which APM visualizations enhance IT operations as well as business decision
making.

However, it’s worth saying again that APM is all about its pictures. The return provided by
an APM solution comes from the data it presents to the multiple classes of users in your
organization: administrators, developers, business executives, Service Desk employees, and
end users. As such, truly understanding the value in those visualizations merits a second look.

You’ve hopefully been enjoying the chapter story of TicketsRus.com, and how Dan and John
and the entire cast of players have evolved along with their monitoring capabilities. You’ve
seen how John’s job has gotten easier as the notifications he and his team receive grow
more useful. You’ve also seen how Dan gets the quality information he needs to make data-
driven decisions as a business executive.

But what you really haven't experienced to this point is a complete walkthrough of the
entire process. Such a walkthrough can tell the extended story of a potentially scary
problem along with how the APM solution set assists. That storytelling happens in this
chapter. Its goal is to add a dash of humanity to the 20,000-foot perspective that has been
at the forefront thus far in this guide.

The story you'll be reading is entirely fictional but is based on the types of problems and
war stories that you probably experience every day. Every IT organization in every
business has its share of technology issues; they’re a fundamental part of doing business
with technology. This chapter’s story is written to show how the presence of a fully-
implemented APM solution enables every actor to do a better job supporting the needs of
the company as a whole.

You’ll also quickly notice that you’ve already seen many of the images here in previous
chapters. Many are slightly adjusted views of those seen in Chapter 7, where this guide's
extended discussion on pictures and data-filled visualizations was most prevalent. Where
feasible, the images have been altered to fit the storyline and its characters. This reuse is
done intentionally, to bring a sense of continuity between the topics that have already been
discussed and the story that ensues.

APM Helps Avoid the "War Room"

Yet before we get into the story proper, let’s step back a minute and think first about the
traditional ways in which IT organizations identify, troubleshoot, and resolve situations as
they occur. These traditional ways have been developed over time as IT professionals
develop a sense for troubleshooting, “sniffing out” root causes to problems based on
previous experience and gut feeling.

When triaging administrators become aware of problems through user interactions, and
when troubleshooting administrators must track down solutions based on gut feelings
rather than hard data, the resulting process is inherently unoptimized. In immature
environments that don't use a data-driven approach to problem resolution, nine steps are
common: awareness, assessment, assignment, handoff, transaction-level triage,
infrastructure-level triage, characterization, handoff, and solution.

Awareness
The first step in the unmonitored environment is often its most painful. Actually becoming
aware that an IT problem exists is one of the hardest hurdles to overcome when IT
organizations lack process maturity and monitoring integrations. Without formalized
systems for watching and reporting on system behaviors, the first acknowledgement of a
problem often comes after the users have been affected.

Assessment
“There’s a problem with the database? We’ll look right into that.” This statement is a
common response in the unmanaged environment. In immature environments, the
assessment of the problem often starts with simply verifying that the user’s called-in
problem actually exists. With little or no instrumentation present, this process requires
walking through the stated problem as presented to the Service Desk. For those with
traditional, stove-piped monitoring solutions, this step can also involve verifying stated
behaviors using each solution’s individual console while attempting to manually aggregate
their information. In short, without monitoring integrations, the assessment process here
consumes time in actually determining whether the problem even exists.

Assignment
All but the most nascent IT organizations today use some form of work order tracking
system to transfer ownership of problems from one team to another. This process, even in
immature environments, can appear to be well solved, but in reality may be masking
processes that are not optimized.

Consider the quality of the data that often makes its way into work orders as they are
created. When a user calls in to announce a problem, the triaging Help desk operator must
insert the correct amount of useful and pertinent data into the work order. If this doesn’t
occur, the assignment process fails and forces the next person in line to again assess the
problem. The result is lost time and effort.

A central problem in this step relates to how problem information is documented as it
transfers from one individual to another. Without a common visualization, like what is
provided through APM, there is no common language between teams. The result is a bit like
the “telephone game,” with information quality growing ever worse as ownership of the
problem changes hands.

Handoff
The handoff process itself also suffers in an immature environment, primarily in how
priority is assigned to the issue. In unmonitored environments, it is functionally
impossible to determine the number of affected users except by sheer guess. Often, the
decibel level of the caller determines the priority of the work order rather than its actual
impact on business operations. The result is that problems get worked on in the wrong
order, satisfying “louder” callers over others with greater needs.

Resource
This problem is, in fact, so endemic to immature organizations that it is a
common theme in this guide’s companion The Executive Guide to Improving
Your Business Through IT Portfolio Management. There, the problem of
“decibel management” as it relates to IT projects is discussed in greater
detail.

Transaction-Level Triage
At the point of handoff, it is common for the ownership of problems to be forked towards
either developer or infrastructure teams. Developer teams by nature look at problems in
relation to their codebase, seeking out areas where individual transactions might be non-
optimized or service components may be broken.

Infrastructure-Level Triage
Infrastructure administrators, in contrast, look at problems with a more big-picture
approach. These individuals leverage their infrastructure experience and cross-system
vision to analyze issues in relation to all the other components in the infrastructure.

Characterization
The two teams discussed in the previous points approach problems along different paths.
This double-pronged approach is particularly suited for the types of customized business
services that are common in customer-facing applications. However, there exists a
shortcoming when the problem cannot be characterized by either team. In this case, the
problem is often bounced back and forth between each team or their sub-teams, such as
“networking” versus “security,” and so on. As will be discussed in a minute, complex
problems that bridge problem domains require the support of multiple teams. Lacking a
common vision, the only way to bring teams together is through a “war room” approach.

Handoff
Handing off the problem once again from troubleshooting to issue management for its
ultimate resolution is yet another step in the process. This step often requires coordination
between change management and configuration management teams as well as
communication between each component’s stakeholders.

Solution
The final step is actually implementing the identified solution, a long way from its initial
awareness eight steps ago. You can see how this multiple-step process ensures that no
issue goes untracked, but at the same time, it creates a burden of process overhead for the
solution. Particularly problematic are those times when issues are much larger than a
single team can handle alone—such as when a code issue impacts the infrastructure as a
whole or when a change to the infrastructure breaks some piece of code.

In situations where multiple teams are needed to solve the problem, the “war room”
approach becomes the tactic of choice in many organizations. By bringing representatives
from every team into a single room, each becomes responsible for tracking down artifacts
of the problem within their zone of control.

The problem with this war room approach is its cost. War rooms are necessary in
the unmonitored environment because data is not consolidated into a single location for
common consumption. Individual team members can’t simply look to a Web page to see
how others are doing with the problem. In order to actually characterize a problem that
exists across multiple domains, the only way in an unmonitored environment to share
metrics is through personal contact.

Bringing together such a large group of individuals is disruptive to the normal flow of
operations and creates an environment of distrust between teams. In most war room
situations, the onus of responsibility lies on each team to prove why their domain isn’t at
fault. Finger-pointing commonly ensues, with problem resolution drawing out over
extended and unnecessary periods of time.

A fully-realized APM implementation is very much like the digital representation of that
war room, but without its actors. As has been stated many times before, an APM solution
gathers its metrics from every part of the IT environment, consolidating them into single-
glimpse views for consumption by all teams. The result is that environment behaviors
across every domain can be seen by everyone without the need for “circling the wagons”
and resorting to qualitative “gut feelings” for potential solutions.

APM Streamlines the Solutions Process

Environments that benefit from APM’s data-driven approach consolidate the problem
resolution process into six very streamlined steps. This new process consolidates many
steps from the traditional approach, while at the same time adding a few new ones that
improve the overall communication between teams and to the rest of the business.
Consider the following six steps as best practices for an APM-enabled environment.

Visibility
Behaviors that occur outside expected thresholds trigger alerts via high-level visualizations.
Through drill-down support, the perspective and data found in that high-level visualization
can be narrowed to one or more systems or subsystems that triggered the failure. Using
tools such as service quality metrics and hierarchical service health diagrams, triaging
administrators can be quickly advised as to initial steps in problem resolution.

Prioritization
Counts of affected users are presented within an APM solution’s interface, enabling triaging teams to identify the actual priority of one incident in relation to others that are outstanding. As a result, incidents with higher numbers of affected users or greater impacts on the business bottom line can be prioritized above those with lesser effect.
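A triage heuristic like the one described here can be sketched in a few lines of Python. The scoring rule and the sample figures are illustrative assumptions, not part of any particular APM product:

```python
def incident_priority(affected_users, revenue_per_user_hour):
    # Illustrative impact score: more affected users and more
    # revenue at risk push an incident up the queue.
    return affected_users * revenue_per_user_hour

incidents = [
    {"id": "INC-101", "affected_users": 2674, "revenue_per_user_hour": 12.50},
    {"id": "INC-102", "affected_users": 40, "revenue_per_user_hour": 3.00},
]

# Work the highest business impact first.
ranked = sorted(
    incidents,
    key=lambda i: incident_priority(i["affected_users"], i["revenue_per_user_hour"]),
    reverse=True,
)
```

Any scoring rule would do; the point is that the ranking comes from measured data rather than from whoever shouts loudest.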

Problem & Fault Domain Isolation


Triaging teams then work with troubleshooting teams, often through a work-order tracking system, to track down the root cause of the problem. The same visualizations used in the visibility step are useful here. Unlike in the unmanaged environment, all eyes share the same view into environment behaviors through their APM visualizations. As such, details about the problem can be quantitatively communicated to the right teams to assist in their further troubleshooting.

Troubleshooting, Root Cause Identification, & Resolution


Using health metrics, the problem is then traced to the specific element that caused the
initial alarm. That alarm describes how the selected element is not behaving within expected
parameters. Here, troubleshooting administrators can work with other teams (networking,
security, developers, and so on) to translate the inappropriate behavior into a root cause
and ultimately a workable resolution.

Communication with the Business


During this entire process, business leaders and end users are kept apprised of the problem through their own set of APM visualizations that have been tailored for their use. Business leaders see in real time who and how many people are affected by the problem, as well as the budget impact. End users receive real-time status on the problem and its fix through notification systems.


Improvement
Throughout the entire process, the APM solution continues to gather data about the
system. This occurs both during nominal as well as non-nominal operations. The resulting
data can later be used by improvement teams to identify whether additional
hardware, code updates, or other assistance is needed to prevent the problem from
reoccurring. By monitoring the environment through the entire process, after-action
review teams can identify whether the resolution is truly a permanent fix or if further work
is needed.

It should be obvious how this six-step process is much more data-driven than the
earlier traditional approach. Here, every team remains notified about the status of the
problem and can provide input when necessary through the sharing of monitoring data.
When problems occur that cross traditional domain boundaries, those teams can work
together towards a common goal without the need for war rooms and their subsequent
finger-pointing.

With this in mind, let’s see how this new approach works for the ongoing story of
TicketsRus.com. In this continuance of the story, John finds himself starting what appears
to be a regular day at the office. He has started his day by completing the tuning of a
behavior threshold within his APM solution.

The problem that is about to ensue can and will be handled much differently than in
previous situations. This time, John and his teams have the data they need at their
fingertips to turn what could be a disaster into a relatively minor impact on their daily
operations. Rather than repeating the public horror of their previous “Labor Day Incident,”
John and his teams keep the problem from growing out of control and affecting their end
customers.

TicketsRus.com—A Day in the Life


It’s 8:18a and TicketsRus.com IT Director John Brown finds himself putting some finishing touches on one of the elements in his APM service model. Today, he is tuning a database performance threshold for a monitor that is attached to the Inventory Processing System.

As he begins this relatively trivial task, he thinks back to an afternoon last week. That day, a
false positive in this particular monitor caused a minor stir at the Service Desk. During that
afternoon, the company had just released a new set of tickets for a fairly popular concert at a large local venue. The band for that concert had just returned from an
extended break, and their fans were eager to see them perform once again. The resulting
rush created an abnormally heavy load on the Web site that the company hadn’t
experienced in a while, and at least one incident that was related to this very counter.

After the incident, a few code changes were also made to the system. The same heavy load
was expected this week as the same band’s second series of concerts was to be made
available to fans.


Clicking through his APM solution’s designer tool, John opens the Service Model for
TicketsRus.com’s external Ticket Sales application. There, he finds the hierarchical diagram
that he and his teams had spent so much time tuning over the past 6 months. Including
representations for every component in the system, from the External Web Cluster all the
way through the Order Management System and others, this designer tool provided the
workspace where they built the logical representation of the overall system.

[Figure: The Service Model for the external Ticket Sales application. The hierarchy includes the External Web Cluster, Kerberos Auth. System, Inventory Processing System, ERP System, Inventory Mainframe, Order Management System, Credit Proxy System, Extranet Router, and Credit Card Extranet Router. Configured Monitors on the Inventory Processing System include Component Health, Database Performance, Transactions per Second (600), Memory Utilization, and Processor Utilization.]

John double-clicks on the Inventory Processing System element and brings up its
properties. There, he is shown the list of health monitors that are attached to this particular
system. Each monitor is configured with one or more settings that define what
TicketsRus.com considers healthy versus unhealthy behaviors. Browsing down through its
fairly comprehensive list, he finds the culprit in last week’s alert brouhaha.

“A-ha!” says John to nobody in particular, “It looks like that last batch of tuning updates
dialed down the Red threshold on Transactions per Second to a number that’s far too low.
Let me just set this back to 600, where it belongs.”


He clicks to view the properties of the Transactions per Second element and makes the
appropriate change. He thinks to himself, “APM is a great solution, but you absolutely have
to keep on top of your tuning. Making sure that every alarm is tuned properly seems to be a
never-ending activity in this tool.”

Thankfully, his chosen APM solution arrived with a set of preconfigured profiles that got
the monitoring team started. Those profiles gave the team a set of starting points for each
class of hardware, application, and service that were based on known best practices—for
example, 80% for Web server processor utilization, 600 for database transactions per
second, and so on.
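Those starting-point profiles amount to named metrics paired with threshold values. A minimal sketch of the idea in Python, using the example numbers above (the data layout is an assumption, not any specific vendor’s format):

```python
# Starting-point thresholds from the preconfigured profiles described above.
PROFILES = {
    "web_server": {"processor_utilization_pct": 80},
    "database": {"transactions_per_second": 600},
}

def health(profile, metric, value):
    # Flag Red once a metric crosses its configured threshold.
    threshold = PROFILES[profile][metric]
    return "Red" if value >= threshold else "Green"
```

Tuning, as John does here, is then just a matter of adjusting the numbers in place until the alerts match what the environment actually considers unhealthy.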

He recalls the first few weeks after its initial installation where a few of those “suggested”
threshold levels weren’t really all that tuned for his environment. During that period, he
had shown up for work with an entire screen of health stats flashing Red for warnings that
really weren’t all that relevant to his environment. “There aren’t that many Web-based
ticket sales applications out there that service an entire continent, so we had our hands full
with customizing for a while,” he reminisces to himself.

It took a period of time to get those metrics tuned just so, but there were always areas for
improvement in the system. That’s why he spent a period of each day keeping tabs on how
well the solution was doing its job. Today’s adjustment for the Inventory Processing
System’s metrics is no different.

Browsing around through some of the other screens, he notices something out of place.
“That’s funny,” he exclaims, “Now why is the country map showing a yellow condition for
one of my sites? The…Rochester site...it seems. We’re not doing any work there today.”

John calls up the Service Desk to see what the matter is. With Dan halfway across the world
at some ticket seller’s conference—“Yet another junket,” thinks John—John is technically in
charge of operations. And a major issue is all he needs.


“Service Desk, this is Eloise.”

“Hey, Eloise, this is John. Tell me about the yellow condition I’m seeing in Rochester.”

“Hi John. Well, it was yellow. Look again. APM now says we’re having a problem with the
entire Ticket Sales application, all across the map. Whatever it was, it started in Rochester
and spread to the entire country map. We’re looking into it now.”

John’s heart skips a beat as he recalls the Finals Week incident from not too long ago, a low
point in his otherwise illustrious career. If this problem is for real, it’ll be the moment of
truth for his relatively new APM solution. Another repeat of that Finals Week situation, and
he could be looking for a new job.

“I’m coming up to the NOC to manage the incident. I’ll be there in a minute.”

John drops the phone into its cradle. Before he leaves, he calls up his highest-level Service
Quality visualization for all his monitored systems. The board shows Green everywhere,
with the exception of a fairly-disturbing strip of Red across the line marked Ticket Sales.
Displaying 2674 in the Overall column, John recognizes that nearly 3,000 people
aren’t able to buy tickets right now. He dashes off to the Service Desk to coordinate his
teams.

The NOC is a flurry of activity. Though, John thinks, this incident’s “flurry” doesn’t seem as
“frantic” as in previous major incidents. Rather than the usual rush of administrators
scurrying in and out, seeking updates and where they can help, this time the hubbub is
more like a murmur.

He heads to Eloise’s desk for an update. “Whatever it is, it appeared to start in Rochester
and spread through other parts of the system. Right now, we’re showing Red conditions in
the NA-East and Canada-East zones. The other zones are showing Yellow, which I’m not
sure what to think,” she states.


“Was anyone doing any work on the system when it went down?” John asks. Although
doing work on the system outside its designated maintenance windows wasn’t supposed to
happen, sometimes his eager administrators tried to sneak in work when they shouldn’t.

“As far as I know,” says Eloise, “nobody’s even been in the data center this morning. The
network administrators have been in a meeting, and most of the systems administrators
have been at their desks prepping for the next Change Management meeting.”

“OK,” John asks of the room, “Will someone call down to the systems admins and see if
anyone’s been up to anything? And will you,” he points towards one of the Service Desk
employees at random, “run down to whatever conference room the network
administrators are in and see what they know?”

By this point, pretty much everyone in the IT organization had probably received some
kind of notification about the problem. Being the money-making system for the company,
almost everyone was on notice when a problem of this scope occurred. The fact that they hadn’t called into the Service Desk yet either means that they’re looking at their own visualizations—proving, he thinks, that the processes surrounding his APM solution are working—or that they’re all huddled around the water cooler. He hopes for the former.

“Eloise, let’s drill down a bit into these alerts and see what’s going on,” he faces the large
screen in the center of the NOC’s alert wall. Recently installed, it now operates as a kind of
giant heads-up monitor for the entire Service Desk to see at once. That monitor has access
to all of his APM solution’s visualizations, from developer to administrator to even Dan’s
business executive views. It has already come in handy on a couple of occasions.

John continues, “Bring up the high-level network view. Are we still seeing those odd
behaviors from last week?”

Eloise takes control of the projected screen’s computer, clicking through visualizations to
find the one John wants. She brings up the high-level network screen he’s interested in
seeing. It discouragingly shows more than a few elements that aren’t in Green.


“Somebody help me here,” John asks no one in particular, “We were working on these
network metrics just last week, and they’re still in Yellow and even a few Red. Are these for
real, or are these problems that are still residual from our tuning activities of last week?”

Right then John’s network engineer pops into the room, “Ignore those colors, John. That’s
the meeting that me and my team were in just a few minutes ago when this whole thing
started. We’re still tuning our part of the system. You aren’t seeing these error conditions
roll up right now because we’re still trying to determine what thresholds make sense for
us.”

“Have these changed from where they were last week?” John asks, “The colors look to be in
the same place, but is today’s ‘real’ situation impacting your numbers in any way?”

The engineer squints at the rows of numbers on the screen. They show his network
performance metrics alongside upstream and downstream loss rates, round-trip time, and
total bytes, “Everything looks OK. The numbers are on the same magnitude of what we
were seeing before any of this happened. It looks like the problem isn’t us this time.”

“Now, I thought we decided we were ending that friendly rivalry?”

The engineer smiles, “Old habits die hard, man. Let me get back with my team and keep
digging around over there. We’ll see what we can find.”

“Hey, John,” Eloise catches his attention. While John’s been chatting with his network
engineer, she’s been browsing through a series of other screens, “It looks like today’s
problem might be with the servers. Check this out.”

Eloise has drilled down past the top-level screens to view each individual technical system.
There, she’s looking at a new visualization that shows health statistics for the system’s
Servers as well as Database, Network, and Software Infrastructure. The Service Quality
indicator for Servers shows Yellow.


“Double-click Servers. Let’s see if we can’t figure out which one’s having a problem today.”

Eloise double-clicks the Servers link to bring forward a new screen. On this one, she’s
presented a list of all the servers that make up the Ticket Selling system. “We know we can
ignore the network metrics right now. Frontline and Changepoint both seem OK, as well as
most of the others. Salesforce is unavailable now, but we know about that one. It’s the Site
HTTP that doesn’t look happy. It’s showing less than 70% for its application performance.”

“Now, we’re getting somewhere,” says John. He looks at his watch. The time shows 8:29a.
Exactly 7 minutes have passed since the first APM light went from Green to Yellow, “Bring
up the Service Tree for our Servers.”

Eloise drills further into the visualization, bringing forward the Service Tree. For
TicketsRus.com, the Service Tree for Ticket Selling servers is broken down into
Infrastructure, Application, Database, and Mainframe servers. The indicator for
Infrastructure Servers glows Red.


She clicks the plus sign next to Infrastructure, and sees that the Web Servers seem to be
triggering the Red alert. Drilling further, she exposes that the problem is with the Web
server at 10.4.224.42, one of the servers in the Ticket Selling application’s External Web
Cluster. Clicking one more time, the error appears specifically to fault its CPU utilization.

“So, here’s a source of the alert,” John tells Eloise, “We now know the predominant
symptom of the problem. Let’s get this info to the systems administrators. I’ll call them. You
draw up a work order with this info.”

---

Half a world away, Dan’s been sitting at the bar with his old friend and now competitor Lee
Mitchell. With his PDA viewing the same Web screens as everyone in his NOC, he’s aware of
the situation as well. Right now he’s showing Lee the country map view on his PDA as one icon
flashes from Green to Yellow.
“So, here you can see that we’ve got an issue in Rochester,” Dan continues in his story to Lee.
“Some Internet device is probably having a problem, which means that fewer people are
connecting through that point of presence.”

Dan clicks past the country map to bring forward his availability dials, the two indicators he
looks to most to identify how well his systems are working. While talking with his friend, he
notices that the dial for User Availability has dropped to 87.4% and continues to go down.
With a couple of drinks in him, he’s in too good a mood to worry too much. He knows that
John’s got him covered.

---

Back at TicketsRus.com’s corporate headquarters, John hopes that Dan isn’t watching this
situation as it unfolds. It’s always easier to talk with him about these situations after
they’ve happened rather than when they’re going on.


He calls down to the systems administrators, “You guys got the work order, yes?”

“Yep,” reports Eric, the administrator who answers the phone, “We’re actually way ahead of
you. We were together working on a few things for today’s Change Management meeting
when the alerts started to come across. So, we’ve been digging into the metrics to see what
we can find for the past few minutes.”

John thanks Eric, “Let me know when you find something.”

While the Service Desk has been focused on high-level quality alerts, the systems administrators have been focusing their attention on the actual performance metrics of their servers. This group suspected pretty quickly that the problem might lie somewhere within their realm of control, considering the data they were seeing.

Eric brings forward what he jokingly likes to call his Monster Dashboard for the Web server
that’s having a problem. He likes this dashboard because it aligns all sorts of metrics by
time. He knows when he looks at this dashboard that every metric shows the same period
in time, giving him and his team a way to correlate different behaviors that may occur
simultaneously or at least very close to each other.
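The value of such a dashboard comes from putting every metric on one shared clock. A toy Python sketch of that alignment, with sample readings invented for illustration:

```python
# Metrics sampled on the same timestamps can be correlated row by row.
timestamps = ["08:10", "08:15", "08:20", "08:25"]
cpu_pct = [35, 60, 88, 97]          # CPU utilization
resp_ms = [120, 340, 900, 2400]     # Web server response time
http_errors = [0, 2, 14, 0]         # HTTP error counts

# One row per moment in time, across all three metrics.
aligned = list(zip(timestamps, cpu_pct, resp_ms, http_errors))
for t, cpu, rt, err in aligned:
    print(f"{t}  cpu={cpu}%  resp={rt}ms  errors={err}")
```

With every metric sharing the same row, a spike in one column can be read against the others at a glance, which is exactly the correlation Eric relies on below.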

The Web server’s Monster Dashboard is equipped with nine performance views all at once, with CPU utilization for the Web server in the upper left, response time for that Web server in the upper middle, and HTTP errors in the upper right. Right away, he sees some
correlations between these three measurements that give him a bit more information
about the problem.

He focuses on each of these three modules in turn. “Yep, we’ve definitely got a CPU problem
here. It actually seemed to start a couple of hours ago, but is only now getting to the point
where it’s a problem. See this spike,” he points out the trailing spike in CPU utilization to
another administrator, “there’s our problem.”


“Now, check this out. You see how the response time for the Web server got quite a bit
worse as this problem escalated. Over the past few hours, we’ve been slowly creeping up to
the point where this problem began to be noticeable by users. That’s what tripped the
alarm.

“Didn’t we drop that second set of tickets today for that band’s new tour, the one that’s
been on hiatus for, like, 4 years? I wonder how much of this has to do with an actual
problem and how much has to do with their screaming fans trying to get their tickets?

“Well, now, doesn’t this just get all. Check out the HTTP Error rate. Looks like we had a
spike there, and then it just dropped to nothingness. That’s not a pattern that we usually
see when the load is super-high. HTTP Errors should be right around zero all the time,
unless there’s something going wrong. To me, that makes this whole situation feel like a
developer problem. Something’s wrong in that latest code drop perhaps…?”


Eric ponders this recommendation for a few seconds more and decides to make the call. In
the old days, he thinks, this is when the resolution to the problem might get worse than the
problem itself. Right around now is when John would have been frantically calling
everyone in for a big meeting. We’d all need to bring our suggestions, draw them out, and
see which ones made sense. Eric didn’t like these meetings very much. They often went far
into the night.

This has him thinking, “You know, I never really liked this new APM system when John was
talking about it. I was always worried that it’d make my job harder. But now that I’m seeing
it here today, I’ve got to give the guy a hand. It’s making this process a lot…well…easier.”

Eric picks up the phone to call the developers as he finishes logging his impressions into
the work order. Unlike in the previous system, he can link his research notes directly to items on his dashboard. Since everyone in IT can read almost every dashboard—
those with the financial info are locked away, but the troubleshooting dashboards are all
common access—he needs only jot down the few notes he’s made over the past minutes
and paste links to the correct dashboard URL right into the work order.

On the other end of Eric’s phone call is Rhonda, one of TicketsRus.com’s lead developers,
“Yep. We’re looking at this too. We got the notifications just after you guys did. It is looking
like a problem that’s on our side. We ran that last batch of code updates through regression
testing for days but we must have neglected to run some kind of test. C’est la vie.”

Rhonda has also been looking at the problem through her own set of eyes. She first caught
wind of the problem while grabbing another cup of coffee down the hall a few minutes ago.
She figured she’d look into the problem a bit even before she got word that the ball was
indeed in her court.

Rhonda finds her answer as she pulls up her Apdex by Page URL report. In looking through
it, she chuckles a bit about how the executives and the Service Desk keep thinking about
this Ticket Selling system in terms of “quality.”


“What a subjective way to think about a deterministic system. Quality, heh,” she mutters to
herself, “What does quality really measure?”

As soon as the words come out, she stops to think about what she’s just said, “Well, I guess,
come to think of it, their numerical value for Quality is a lot like my numerical value here
for Apdex. Apdex in this case measures how well a particular page is performing, with a
value of one meaning ‘perfectly’ and those less than one describing their level of
performance in comparison with each other as well as the established baseline of ‘perfect.’
OK. So, I’ll admit each set of numbers has meaning to each set of people. I’ll buy that.”

Getting back to the issue at hand, Rhonda takes a look through her Apdex numbers and
immediately sees that a few are absolutely not where they need to be. Four in particular
give her cause for concern. She thinks to herself, “That batch file shouldn’t be performing
that poorly, but it’s the gateway.jsp code that’s really at fault here. Three procedures are
absolutely unacceptable in terms of performance.”
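The Apdex index Rhonda is reading follows a published formula: satisfied samples count fully, tolerating samples count half, and frustrated samples not at all. A sketch in Python (the sample response times and threshold are illustrative):

```python
def apdex(response_times, t):
    # Apdex = (satisfied + tolerating / 2) / total samples, where
    # satisfied means response <= t, tolerating means t < response <= 4t,
    # and anything slower counts as frustrated.
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)
```

A page in which every sample beats the threshold scores a “perfect” 1.0, matching Rhonda’s description, while slower pages slide toward 0.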

She continues to herself, “I wonder how much these three procedures are impacting the
overall loading of the page.”

Rhonda switches her view to call up a view of the average page load time by element.
There, she can see each of the pieces of code that goes into a page load, and how much time
each requires to complete executing its portion.


“Wow. Eight, Eight-and-a-half seconds to load that silly .JSP file?” Rhonda admits to herself,
“That’s waaaay out of the ballpark in terms of performance. Here’s why we’re having
problems loading pages for a whole set of our users. This is one of the new code pieces we
updated last week. We’re under pretty heavy load right now. There’s probably some non-
optimized code in the file that didn’t get caught by regression testing.”

Rhonda feels she’s getting closer to a root cause of today’s problem. She thinks, “The big
question now is, ‘How does this load problem relate to the CPU problems that Eric was
seeing? Usually a slow .JSP load doesn’t necessarily equal a processor overuse scenario…”

To answer this question, she needs to see the communication between the Web server and
one of its downstream databases. Rhonda remembers that John’s APM solution now gives
her some very nice transaction-level metrics between servers. In the old days, if she
wanted to see bits and bytes as they crossed the wire, she’d first need to call up the network
engineers and have them set up a sniffer in the environment. With TicketsRus.com’s new
Change Control rules, that sniffer could take a while to get approved.

Now, with the APM solution, the sniffers are already in place. Rhonda needs only to find
and pull up the right trace. She does just that, and discovers that the extended load time has
to do with a double-looping construct in the .JSP code.


“A-ha! There’s this slightly-more-than-seven-second lag in this subroutine. It looks like
we’re looping inside of looping. Amateur mistake. Sorry, everyone,” she mutters to no one
in particular.

Rhonda knows exactly the subroutine she needs to adjust to fix the problem. Rewriting a
half-dozen lines of code, she calls up John to approve fast-tracking the fix.

John answers, “Rhonda. I thought this was a CPU problem.”

“It is,” she tells him sheepishly, “and I’m the one at fault. That last batch of code included
a…ahem…bug that’s causing a delay issue when we get under load. I’ve tracked down the
problem and have already coded the fix. All I need from you is the approval to push it out to
the External Web Cluster.”

John harrumphs. He doesn’t usually implement code fixes this quickly, preferring to run
them through testing and QA first, but he needs to get this system back online quickly,
“Alright. Patch the code.”

Rhonda inserts the patched code into her automated staging and deployment system,
thoughtfully runs a quick test on it in her testing environment first, then pushes it out. She
waits.

John waits also. He looks at his watch again. It’s 8:34a. Exactly 16 minutes have passed since
the system alerted first on this problem. Sixteen minutes in the old days and they might
have gotten everyone into the same room by now. Today, 16 minutes later and John finds
himself again staring back at his country map. Without warning, his Red and Yellow icons
begin flipping back to Green.


His APM solution, it seems, has come in quite handy.

---

On the other side of the world, Dan has finished his second drink at the hotel bar, as well as his
story to Lee about his new APM solution.

“It gets better. Even my developers use it to trace down specific lines of code that aren’t
working correctly. Everyone from the techies to my aging brain gets the visualizations they
need,” Dan stops as the formerly-yellow light turns green, “Hey, looks like they’ve fixed the
problem!”

Lee’s eyes widen as he realizes the complete vision such a system brings, “Alright, you win.
Drinks tonight are on me. Now, tell me more about this system.”

Everyone Benefits by Seeing APM in Action


As mentioned before, this story is entirely fictional. But it’s not far from the mark in terms
of how a fully-realized APM solution brings benefit to IT operations. The Service Desk,
administrators, network engineers, and even developers all share the same workspace and
experience during times of trouble. Each group can independently work on solutions
without the need for war rooms. Even executives gain the comfort that their systems are
absolutely being managed correctly, and that their customers are being serviced.

Yet, there’s one last part of this story that hasn’t been told. This guide has repeatedly
referred to the idea that performance metrics can be aggregated with business data to get a
financial perspective on IT systems. This capability relates to Business Service Management (BSM), the subject of Chapter 9. By incorporating BSM’s financial logic into the technical logic seen through an APM solution, stakeholders like Dan are greeted with real-world impacts to dollars and cents based on situations like the one told in this story. Chapter 9 is next, and it finishes the story.


Chapter 9: APM Enables Business Service Management
This guide’s previous chapter brought closure to the ongoing story of TicketsRus.com,
presenting a comprehensive example of how an APM solution can be leveraged by multiple
stakeholders in an organization. Through that story, service desk employees,
administrators, developers, and even IT and executive management were able to work
together through a common set of visualizations towards the resolution of a major
problem.

That chapter also showed how effectively resolving problems requires a data-driven
approach, one with a substantial amount of granular detail across multiple devices and
applications. Using this approach, it is possible to trace a system-wide performance
problem directly into its root cause. By integrating into databases, servers, network
components, and the end users’ experience itself, a fully-realized APM solution is uniquely
suited to gather and calculate metrics for entire business services as a whole.

Yet the topics in the previous chapter’s story were fundamentally focused on the
technologies themselves, along with the performance and availability metrics associated
with those technologies. Its resulting visualizations were heavily focused on the needs of
the technologist:

• Service desk employees were able to track the larger issue directly into its problem
domain.
• Network administrators were able to identify whether metrics for network
utilization were within acceptable parameters.
• Administrators were able to use health and performance metrics to identify
symptoms of the problem.
• Developers were able to ultimately identify the failing lines of code and quickly
implement a fix.

Missing from the previous chapter’s story, however, is another set of business-related metrics
that convert technology behavior into useable data for business leaders. This class of data
tells the tale of how a business service ultimately adds to—or takes away from—the
business’ bottom line. It also creates a standard by which the quality of that service’s
delivery can be measured. It is the gathering, calculation, and reporting on these business-
related metrics that comprise the methodology known as Business Service Management
(BSM).


What Is Business Service Management?


The IT Infrastructure Library (ITIL) v3 defines BSM as an approach to the management of IT services that considers the business processes supported and the business value provided, as well as the management of business services delivered to business customers. Businesses that leverage BSM look at IT services as enablers for business processes. They also look at the success of IT as driving the ultimate success of the business.

This is a critical approach to how IT brings value to the business; however, it isn’t one that
is used by all organizations. Those without high levels of IT maturity are intrinsically
unable to attain alignment between IT and the business.

Chapter 2 talked at length about this problem of IT and business alignment. It discussed
how different IT organizations display different levels of organizational maturity, with
greater maturity bringing greater business value. Like APM, BSM is a methodology that
both requires IT maturity and develops it.

Note
To better understand the concepts in this chapter, you may consider turning
back to review those first introduced in Chapter 2. As successfully
implementing BSM requires a high level of IT maturity, understanding
exactly how that maturity is developed and measured is important.

BSM and APM are two methodologies that are naturally linked by their requirements for
data. The information gathered through an APM solution's monitoring integrations directly
feeds into the requirements of a BSM calculations engine. Performance, availability, and
behavioral data of the overall business service and its components are all metrics that aid
in calculating that service’s overall return. These metrics also provide the kind of raw data
that helps identify how well a business system is meeting the needs of its customers.

Figure 9.1 shows a logical representation of where BSM links into APM. Here, APM begins
with the creation of monitoring integrations across the different elements that make up a
business service. Those monitoring integrations gather behavioral information about the
end users’ experience. They collect application and infrastructure metrics as well as other
customized metrics from technology components. APM’s data by itself is used primarily by
the IT organization for the problem resolution and service improvement processes
discussed to this point in this guide.


Figure 9.1: BSM converts technology-focused monitoring data into business-centric metrics.

The addition of BSM creates a new layer atop this APM infrastructure. Here, the business
itself becomes a critical component of the monitoring solution. Business processes and
service level expectations are encoded into a BSM solution, with the goal of creating
business service views that validate and report on how well the technology is meeting the
needs of the business.

It Starts with the Service Model


This linkage is realized through an extension of the Service Model first introduced in
Chapter 6. If you turn back to Chapter 6's explanation of APM's Service Model, you can
see the direct linkages between service technologies and their representation within the
model. However, that discussion didn't focus on how representations for business services
should be implemented.

Figure 9.2 shows a graphical example of how Chapter 6’s physical model can be augmented
with additional elements that represent the Ticketing System’s business service. Here, the
technology infrastructure components that comprise that service are abstracted under the
element titled Ticketing System Infrastructure. Above this element, three more are added
to represent the geographic locations where that service is available to customers. Finally,
atop the entire model is the ultimate representation of the Ticketing System itself.


Figure 9.2: The technology underpinnings of a business service feed into BSM’s
Service Model.

This positioning of the Ticketing System at the model’s top is significant. Not shown in
Figure 9.2 but important for BSM’s Service Model is how multiple business services can be
represented in parallel through this augmented model. For example, the same organization
may support business services for Ticket Brokering and/or Vendor Management. These
services, which might or might not be available to customers in different geographic
locations, are added to the topmost level of the model and linked into its geographic
elements. The resulting multiple-service representation allows business leaders a single-
glimpse view across all the services and locations that make up their business.


The Measurement of “Quality”


As with the technology-oriented Service Model represented in Chapter 6, monitoring data
and associated threshold values exist underneath the BSM model’s individual elements. As
with the initial technologist-oriented model, that data provides the logic that determines
whether a business service, geographic location, or any other element is represented in
green versus red. Further, it populates the more granular explanations of system behaviors
when a user drills down into individual elements.

Different here is the type of data that is represented within this level of the model. Here,
business leaders are primarily interested in information that measures “how well the
technology is meeting the needs of the business” and is commonly manifested at a high
level using a metric referred to as Service Quality.

Quality is a term that has been discussed previously in this guide, yet thus far with a strong
focus on the technology underpinnings to business services. In the BSM perspective, the
idea of quality has been formalized. It corresponds to a quantitative and numerical
representation of the success or failure of a business service.

Now, at first blush, assigning some numerical value to an abstract concept such as “quality”
seems inappropriate for a data-driven solution. But think for a minute about what makes a
quality service—one that meets the needs of its end consumers. To borrow a line of
thinking from Chapter 6:

• When the service is fully functional and meeting the needs of its customers, is that
service of high quality?
• When the service is fully non-functional, meeting the needs of no one, is that service
of high quality?
• When the service is functional but with some non-functional actions, is that service
of high quality?
• When the service is functional but with low assurance that user actions are being
fully accomplished per user needs, is that service of high quality?
• When the service is functional in appearance but fully non-functional, is that service
of high quality?
It is easy to see how the first question represents a business service that is operating with
a high level of quality: The service is operational, and users are interacting with it and
accomplishing their needs. It is equally easy to see how the scenario in the second bullet
operates with zero quality: The service is completely down, and no one is accomplishing
anything with it.


Yet complex systems operate in more than just a binary "on" versus "off" state. A system
can be functional or non-functional, but it can also operate in many states in between. For
example, built-in redundancy can mean that individual component failures do not affect a
service's overall availability, or only slightly affect its performance. Reductions in
performance can also reduce a functioning service to a state of slowness that no user
would want to interact with. Here, the service is operational, but not with any level of
performance that can be considered "successful." Quality also suffers when the loss of, or
reduced performance in, down-level components leaves the service only partially
functional.

As you can see, the situation gets murkier when the state of a system exists in this area
between “on” and “off.” In each of the final three bullets, how well is this system operating?
How well is it fulfilling the needs of its users? As an example, in the third bullet’s scenario
the service is mostly functional with some non-functional actions. If some users can
accomplish their tasks, does this make its quality higher than in the fifth bullet’s scenario
where the service appears functional but really isn’t working at all?

The answers to these questions are obviously non-trivial. As such, this extended
conversation is intended to prove that a spectrum measurement of “quality” is necessary in
addition to the simplistic green versus red representation of an element’s state. A
visualization of just that spectrum is shown in Figure 9.3. There, you can see how the
quality of service for multiple business services is shown in a single image. Visualizations
such as this one enable business leaders to identify which services are meeting the needs of
their customers, yet without being bogged down in the minutiae of their technology
underpinnings.

Figure 9.3: A measurement of Service Quality across multiple business services.

Further, BSM's numerical representation of quality needn't be an instantaneous value.
Knowing the quality of a service over the medium or long term enhances a business
leader's situational awareness even further.

Consider the situation where the quality of a business service is constantly changing
between acceptable and unacceptable states (see Figure 9.4). When today's quality
measurement is reported at 77 while yesterday's was 95, it is easy to recognize that the
overall system's effectiveness has diminished between the two days.


Figure 9.4: A historical view of Service Quality.

When quality measurements change rapidly between acceptable and unacceptable
thresholds, this information gives the business leader the data he or she needs to direct
improvement activities. Planning, budgeting, and expansion activities can all be directed to
the services that need them the most. Ultimately, measurements like these give IT as well
as business leaders the information they need to recognize when their services are (or,
more importantly, are not) meeting the needs of their customers.

High Quality Doesn’t Necessarily Equal Complete Quality


You’ll notice some very specific terminology used in this section associated
with quality measurements. The statement “of high quality” is used rather
than “of full quality” or “of complete quality.” This choice of wording is
important because business systems always have areas for improvement.
This means that today’s measurement of high quality may be below
tomorrow’s measurement. Quality improvement activities, an important ITIL
component, enable the measurement of “today’s highest quality” to be
continually moved upwards as improvements are incorporated into the
business service or its technology components.

How Does One Measure Quality?


This concept of a single number that represents a service's quality is great, but how can it
be calculated? How does a system like BSM distill the thousands of raw metrics gathered
by monitoring integrations into a single-number definition of a service's behavior? You'll
find that this process is also non-trivial, but the right software can make it manageable.

The first step in developing this sense of quality is obviously in creating the abstraction of
the business service that is its Service Model. This process has been explained in detail
throughout this guide. By developing the Service Model, IT and the business define the
elements that make up the business service as well as their interconnections. Both the
elements as well as the interconnections are important because each fills out the picture of
the service in its own way.


Once that Service Model is fully realized, the next step requires the mapping of service
levels, Key Performance Indicators (KPIs), user impacts, and revenue impacts atop its
structure. In this process, the metrics that define success or failure within business
processes are used as thresholds. For example, if the business defines a particular rate of
completed transactions to mean that the service is acceptably fulfilling customer needs,
that metric should be added to the appropriate element. Or, if the business expects the user
drop rate from a Web front end to stay within a particular parameter, that metric becomes
another useful threshold in the appropriate place. In the following sections, consider a few
of the metrics you likely already have in your business today.
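
To make the rollup concrete, here is a minimal sketch of how a quality engine might normalize raw metrics against business-defined thresholds and weight them into a single 0-to-100 Service Quality score. All metric names, threshold values, and weights below are invented for illustration; they are not taken from any particular BSM product.

```python
# Illustrative sketch only: rolling raw monitoring metrics up into a
# single 0-100 Service Quality score. Metric names, thresholds, and
# weights are hypothetical.

def metric_score(value, worst, best):
    """Normalize a raw metric to 0.0-1.0 between its worst and best thresholds."""
    return max(0.0, min(1.0, (value - worst) / (best - worst)))

def service_quality(metrics):
    """Combine weighted, normalized metrics into a 0-100 quality score."""
    total_weight = sum(m["weight"] for m in metrics)
    weighted = sum(m["weight"] * metric_score(m["value"], m["worst"], m["best"])
                   for m in metrics)
    return round(100 * weighted / total_weight, 1)

ticketing_metrics = [
    # thresholds the business might define for a Ticketing System
    {"name": "completed_txn_rate", "value": 0.97, "worst": 0.90, "best": 1.00, "weight": 3},
    {"name": "web_drop_rate",      "value": 0.04, "worst": 0.10, "best": 0.00, "weight": 2},
    {"name": "avg_response_sec",   "value": 1.8,  "worst": 5.0,  "best": 0.5,  "weight": 1},
]

print(service_quality(ticketing_metrics))  # a single number on the 0-100 spectrum
```

Note that "best" may be numerically lower than "worst" (as with drop rate and response time); the normalization handles both directions.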

Service Levels
Service level metrics are often first defined by the IT organization or through Service Level
Agreements (SLAs) between the business and IT. In these agreements, the business
identifies that services and their components must be available during certain hours with
an identified maximum of downtime. Mean time between failures, failure rate, allowable
downtime, and expected performance are all common metrics that the business can apply
to network, server, application, or other elements.
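
As a sketch of how such service level metrics translate into numbers, the following compares a month of hypothetical outage records against invented SLA terms (a 99.9% availability target and a 30-minute limit on any single outage):

```python
# Hedged sketch: checking outage records against hypothetical SLA terms.
# The targets and outage durations are invented examples.

def sla_report(period_minutes, outages_minutes, target_availability, max_single_outage):
    """Summarize availability and per-outage breaches for a reporting period."""
    downtime = sum(outages_minutes)
    availability = 1 - downtime / period_minutes
    breaches = [o for o in outages_minutes if o > max_single_outage]
    return {
        "availability_pct": round(availability * 100, 3),
        "meets_target": availability >= target_availability,
        "outages_over_limit": len(breaches),
    }

# a 30-day month with three recorded outages (in minutes)
report = sla_report(30 * 24 * 60, [12, 45, 8], 0.999, 30)
print(report)
```

Here 65 minutes of downtime in the month leaves the service just below a 99.9% target, and one of the three outages exceeds the single-outage limit.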

Highly mature organizations are often capable of adding performance-based metrics into
their SLAs as well. With these types of service levels, specific thresholds for element
performance are known and documented. Lacking an APM solution, these types of
performance metrics are often exceptionally difficult to gather and report on. Their
monitoring may be accomplished through multiple, non-integrated solutions that are
separately controlled by individuals within each technology domain. With such point
solutions in place, the sharing of information between domains can be difficult or even
impossible. However, when APM monitoring is extended across the IT infrastructure, the
business gains the ability to gather these kinds of performance-based metrics across many
different types of technology elements at once.

Service levels with external service providers are another area in which APM provides a
great assist. By gathering availability and performance metrics against contracted service
providers, your organization gains its own set of data to be used during outages. This
information is also useful when contract disputes require chargebacks for SLA breech
events with service providers. Any business service that relies on external services for a
portion of its activities requires this kind of monitoring to truly gain a representation of
overall service success.

KPIs
KPIs are often business-oriented metrics used to define the success level of an organization
or business process. These metrics are often used to quantify activities that are otherwise
subjective in nature. These can be activities such as leadership development, the level of
customer-business engagement, and overall customer satisfaction.


As a rule, KPIs should be designed as actionable metrics, with the value of the metric
driving some necessary action by the organization. KPIs should also be defined to provide
information on status, trends, or variance to business artifacts such as plans, forecasts, or
budgets. These two elements are critically important to KPIs that will eventually be
encoded into BSM, as they provide a basis for quantifying its data. In essence, you need
metrics whose value eventually drives some change to the environment if they are to be
useful in the context of APM. By leveraging metrics that have a known reaction line, your
BSM solution can further be used to provide necessary alerting when that action needs to
be taken.

Mapping KPIs to business artifacts in addition to technology components also enables the
later assignment of dollar values to incidents. As you’ll learn in a minute, a fully-realized
BSM solution can highlight where expensive problems need immediate resolution or when
system or user behaviors are impacting the business bottom line.

Accomplishing all of this requires some sort of design tool. An effective BSM solution will
provide the necessary logic to match incoming KPI data to defined thresholds and
management reaction lines. That logic can be incorporated through user-defined rules,
through relationships between elements, or through complex expressions that use
if-then-else logic or regular expressions to construct the necessary thresholds.
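
The kind of if-then-else threshold logic a design tool encodes can be sketched in a few lines. The two-tier thresholds below (a contractual minimum and an expected level) are hypothetical examples:

```python
# Minimal sketch of two-tier threshold logic: an expected service level
# and a contractual minimum. The threshold values are hypothetical.

def evaluate(availability, minimum=0.95, expected=0.999):
    """Map an availability reading onto a traffic-light state."""
    if availability >= expected:
        return "green"      # meeting the expected service level
    elif availability >= minimum:
        return "yellow"     # degraded, but above the contractual minimum
    else:
        return "red"        # below minimum: trigger the management reaction line

print(evaluate(0.9995), evaluate(0.97), evaluate(0.90))
```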

Figure 9.5 shows an example design tool where a complex expression has been constructed
that validates availability metrics. In this example, both minimum and expected thresholds
are described. The combination of these two values quantifies the behaviors that are
considered appropriate and inappropriate for the assigned element.


Figure 9.5: A BSM solution’s design tool provides a location where expressions can be
constructed based on incoming data.

User Impacts
Chapter 8’s story discussed a few examples of how the level of user impact drives the
targeting of troubleshooting resources. It argued that those problems with a greater user
impact should in most cases be prioritized over those with a smaller impact. Defining those
user impacts is another activity that occurs within a BSM solution.

Using the BSM software's designer tool, it is possible to identify the impact associated with
each of the elements in the model. For example, when a particular network connection to a
geographic site goes down, the number of users affected by the problem at that site is
known. Or, when one of a pair of clustered transaction processing servers goes down, it
can be assumed that the total level of processing will be reduced by half.


Once the user impacts for individual elements are known and entered into the system, the
Service Model with its interconnections is then used to identify the flow-up and flow-down
impacts for each element. This is represented in the simplistic example shown in Figure
9.6. Based on the dependencies encoded through the model’s interconnections, individual
element impacts can be combined to understand how many users are affected by a problem
with any element. In this example, the Inventory Processing System is known to have 1700
users, while the External Web Cluster is known to have 8300 users. Here, the loss of the
Inventory Processing System can impact its 1700 users as well as a portion of those who
use the External Web Cluster.

Figure 9.6: A simplistic example of how user impacts can be assigned to Service
Model elements.
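
A rough sketch of this flow-up calculation follows. The element names and user counts echo the Figure 9.6 example, while the dependency structure and the fraction of Web Cluster users assumed to touch the Inventory Processing System are purely assumptions for illustration:

```python
# Illustrative only: propagating user impact through Service Model
# interconnections. Counts mirror the Figure 9.6 example; the dependency
# fraction is an invented assumption.

# users directly served by each element
direct_users = {
    "Inventory Processing System": 1700,
    "External Web Cluster": 8300,
}

# dependents[x] lists elements that partially depend on x; the fraction
# estimates how much of the dependent's user base a failure of x affects
dependents = {
    "Inventory Processing System": [("External Web Cluster", 0.25)],
}

def impacted_users(failed_element):
    """Estimate total users affected when one element fails."""
    total = direct_users.get(failed_element, 0)
    for dependent, fraction in dependents.get(failed_element, []):
        total += int(direct_users.get(dependent, 0) * fraction)
    return total

print(impacted_users("Inventory Processing System"))
```

A real BSM engine would walk the full model graph rather than a one-level table, but the principle of combining direct and flow-up impacts is the same.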

Revenue Impacts
Knowing how many users are impacted tells one story of a problem. But knowing exactly
how service behaviors impact revenues is yet another. Information about revenue impacts
can be inputted manually into a BSM’s threshold logic. Or, more dynamically, they can be
gathered from business artifacts such as budgets, sales metrics, or other revenue data.

BSM is uniquely suited above all other monitoring solutions in that it includes the capacity
to aggregate traditional technology monitoring with financial data from these kinds of
sources. When sales or budgetary data is available in a format that the BSM solution can
work with, it becomes possible to relate technology and user impacts to hard dollar gains
or losses to the organization.


One example of this can be seen in Figure 9.7, where technical information from a site's
external Web metrics has been related to financial information gathered from a sales or
revenue database. Here, a historical trendline can be developed that shows the
relationship between unique visitors to a Web site and the level of daily revenue that
occurs as a function of those visitors.

Figure 9.7: Using a BSM visualization to relate user count to revenue statistics.
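
The trendline behind a visualization like Figure 9.7 can be approximated with an ordinary least-squares fit. The visitor and revenue figures below are invented sample data:

```python
# Sketch of the trendline idea: fitting daily revenue as a linear
# function of unique visitors. The sample data are invented.

def fit_line(xs, ys):
    """Ordinary least-squares fit: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

visitors = [1000, 1500, 2000, 2500, 3000]     # unique visitors per day
revenue  = [5100, 7400, 10200, 12600, 15000]  # daily revenue (dollars)

slope, intercept = fit_line(visitors, revenue)
print(f"each extra visitor is worth about ${slope:.2f}")
```

With a fitted slope in hand, a business leader can read a change in visitor counts directly as a projected change in revenue.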

This information becomes fantastically useful to the business leader because it provides a
real-time and historical look at the systems under their management. Yet it does so without
exposing the complex technology underpinnings that aren’t part of their job role. Such
visualizations relate the technology behaviors to business successes as a function of
revenue and/or sales. Business leaders with access to this kind of information have a much
greater capability to quickly shift focus, activities, and even entire lines of business as
needed based on quantitative information.

In some cases, a BSM solution needn't be used with its technology monitoring elements at
all. Because a BSM solution provides a single-glimpse visualization of business activities, it
becomes a location where purely financial information can be gathered for regular
consumption. Figure 9.8 shows an example of this, displaying monthly profit versus payout
information sourced directly from a financial database.


Figure 9.8: Information in BSM visualizations can be gathered from purely financial
sources as well.

Nowhere in this visualization is information that arrives through APM monitoring
integrations. The data here is calculated purely from financial databases or other business
artifacts. Visualizations like this are often placed alongside others that contain APM-related
information in a management dashboard. The result is a more holistic situational
awareness of the business service and its revenue impacts.

Real-Time Monitoring Means Real-Time Metrics


There has long been a problem intrinsic to the collection of and reporting on business-
related metrics: the amount of time required to collect, compute, and report metrics
through traditional manual processes. For much of the history of business itself, those
slow, manual processes were the only way to collect these types of metrics.

Notwithstanding the labor cost associated with collecting and creating the necessary
reports, the manual collection process also masks another problem: data granularity. Think
about the situation where it takes a few days of labor to compile and report on business
metrics. The result is that metrics are always at least a few days old, and there is no
capacity to see in real-time how incremental system changes directly impact business
revenues. Lacking data granularity, you’re always making business decisions on old data.


Today's business climate mandates that businesses constantly adapt themselves to
changing situations in the economy. Purchasing or customer trends must be analyzed as
they happen, with decisions made quickly, if the business is to remain agile in its products
or services. This need for a constant flow of real-time information runs completely counter
to traditional manual collection efforts. Needed are solutions that collect data on a
constant basis and present that information to decision makers in what could effectively
be called real time.

BSM represents one solution that can accomplish just that goal. With a fully-realized BSM
implementation in place, business leaders are given access to real-time information about
technology as well as financial impacts. Because the data that drives their visualizations is
gathered constantly through its APM underpinnings, the business leader gains greater
visibility into customer and system behaviors. Leaders with this information can much
better reposition their business when conditions mandate changes to the business model.

The Cost of Poor Quality


Also important here is a recognition of what to do when systems or services simply aren’t
meeting the needs of their customers. It’s been said before that there are an unlimited
number of ways in which a complex business system can be constructed but only a few that
will ultimately provide value. Without the deep instrumentation gained through a BSM-on-
top-of-APM (BSM/APM) solution, you likely don’t have the raw data that quantitatively
validates that your services are fully optimized. Further, you’ll never be able to make
improvements to service delivery if you don’t know where improvements can be made.

One important measurement gained through the creation of quality metrics is actually the
functional inverse of quality: the Cost of Poor Quality (CoPQ). When a business service is
not meeting the needs of its customers, it is not bringing in its maximum level of revenue.
With a BSM solution's metrics in place, it becomes possible to measure the lost revenue
incurred through poor-quality service delivery. That lost revenue is inversely related to
the level of quality in your system.

CoPQ is a term that is defined by Six Sigma as those costs which are generated as a result of
producing defective material. This cost includes the cost involved in fulfilling the gap between
the desired and actual product/service quality. It also includes the cost of lost opportunity due
to the loss of resources used in rectifying the defect.

Although this traditional definition relates primarily to manufacturing environments, CoPQ
can be measured in IT-related environments as well. Think for a minute about the
high-level variables that go into a CoPQ calculation: you need to know the quality of a
system as well as its level of potential quality. Subtracting the one from the other and
applying a cost multiplier gives you this information.
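
Under those assumptions, a back-of-the-envelope CoPQ calculation might look like the following sketch, where the quality figures echo the earlier 77-versus-95 example and the revenue-per-quality-point multiplier is entirely hypothetical:

```python
# Back-of-the-envelope CoPQ sketch: the gap between potential and
# measured quality, scaled by a hypothetical revenue multiplier.

def cost_of_poor_quality(actual_quality, potential_quality, revenue_per_point):
    """Lost revenue implied by the quality gap (quality on a 0-100 scale)."""
    gap = max(0, potential_quality - actual_quality)
    return gap * revenue_per_point

# a service measured at 77 against a potential of 95, with each quality
# point assumed to be worth $1,200 in daily revenue
print(cost_of_poor_quality(77, 95, 1200))
```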

More on Six Sigma and other frameworks in a minute, but for now, recognize that metrics
such as CoPQ become easy to measure once metrics like quality and revenue impacts are
well defined. BSM solutions have the potential to provide all of these numbers.


The Impact of the Business Calendar


Today’s businesses are also more global in nature. Whereas locally-oriented businesses can
easily determine their hours of operation based on those that are industry- and locally-
acceptable, global businesses have a much harder time determining when the sign at their
front door flips from “Closed” to “Open.”

This situation grows even more problematic when global businesses sell their wares on the
Internet. Business on the Internet is commonly considered a 7×24×365 operation, with
services and businesses never really closing for operations. The expectation of never being
down presents a set of problems to the Internet-connected business. If Web services are
always up,

• when can regular maintenance operations be scheduled?
• are there times to swing additional resources to bear?
• when are inbound user rates higher or lower?
With users connecting in from areas around the globe, one solution for the always-on
business is the creation of a business calendar. In the context of BSM, the business calendar
represents the periods of time each day when inbound users are at their peak versus non-
peak hours. Depending on the type of business, this period of time can be at very different
times of the day. For example, a Web service whose customers are primarily other
businesses will see greater attention during the workday. In contrast, others who service
families might see greater attention when users have gone home for the evening.

Creating such a calendar is another non-trivial task. First and foremost, actual metrics of
user counts must be collated and averaged based on time and day. Those metrics must be
aggregated across the multiple geographic locations and time zones where the business
service is primarily located. Internet-based services with replicated infrastructures in other
parts of the world will see greater levels of inbound users at different parts of the day.

Figure 9.9 shows an example of how a simplistic business calendar can be constructed
across United States, EMEA, and Asia-Pac localities. Here, the areas shaded in red indicate
peak hours for each locality; those in green represent non-peak hours. A BSM solution that
calculates a service's business calendar must then combine those regional calendars into
an aggregated business schedule. That schedule can be used to answer the previously
posed questions as well as define the hours when servicing affects the fewest users.
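
One simple way to sketch that aggregation is to convert each region's peak window to UTC and take the union of the results. The regions, UTC offsets, and peak hours below are invented for illustration:

```python
# Sketch of aggregating per-region peak windows into one business
# calendar. Regions, UTC offsets, and peak hours are invented examples.

# peak hours expressed in each region's local time (start, end), 24h clock
regional_peaks = {
    "US":       {"utc_offset": -6, "peak": (9, 17)},
    "EMEA":     {"utc_offset": 1,  "peak": (8, 16)},
    "Asia-Pac": {"utc_offset": 9,  "peak": (10, 18)},
}

def aggregated_calendar(regions):
    """Return the set of UTC hours during which any region is at peak."""
    peak_utc = set()
    for info in regions.values():
        start, end = info["peak"]
        for local_hour in range(start, end):
            peak_utc.add((local_hour - info["utc_offset"]) % 24)
    return peak_utc

busy = aggregated_calendar(regional_peaks)
quiet = sorted(set(range(24)) - busy)
print("UTC hours safe for maintenance:", quiet)
```

With these sample windows, nearly the whole day is someone's peak, which is exactly the scheduling squeeze the always-on business faces.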


Figure 9.9: An example business calendar.

Note
It is worth mentioning here that replicated service infrastructures across
multiple localities can impact how the business calendar is used. For
example, if a maintenance activity needs to occur on a local device, the local
business calendar should be consulted. If the device under maintenance is
used by the entire infrastructure, the aggregated calendar will determine its
maintenance time.

Another important point is that the BSM business calendar is intended to be a real-time
metric. As your business evolves over time, so will your business calendar.

BSM’s Impact on ITIL and Six Sigma Activities


The metrics gained through a BSM implementation are also useful when fed into
management frameworks such as ITIL or Six Sigma. Like BSM, with its roots in APM data,
these frameworks are highly data-driven in how they accomplish and improve upon the
tasks of IT. A common limitation in successfully implementing ITIL and Six Sigma
processes, however, is gathering enough data of the right kind to be useful.
The data gathering and calculation potential of a BSM/APM solution enables greater
success with both frameworks. Without delving too deep into their details, let’s take a quick
look at the areas of each where BSM and APM can both provide added value.


ITIL Integration
ITIL comprises a set of industry best practices that identify the necessary activities
common to an IT environment. Specifically, ITIL defines 5 stages and 24 activities within
those stages. Activities within ITIL span the life cycle of service strategy, design,
transition, operations, and continual improvement.

A full discussion on ITIL can take an entire book (or in fact six books, which is what
comprises the entire library to date). For the purposes of ITIL’s linkages with BSM,
consider the activities in Figure 9.10. The 12 activities highlighted in red are those that
stand to gain a direct benefit from the quantitative data gathered and calculated by a
BSM/APM solution.

Figure 9.10: ITIL’s 5 stages and 24 activities. Those that are directly impacted by BSM
are highlighted in red.

Note
You can view more detailed information about the ways in which BSM
improves each of these activities by turning to Chapter 9 of The Definitive
Guide to Business Service Management.

Many of these activities have been discussed in this guide so far, although without
specifically calling them out by their ITIL nomenclature. Of particular note is the entire
fifth stage of the ITIL service life cycle, Continual Service Improvement. In this stage,
services in operation are analyzed with an eye towards their capacity to meet their
original stated goals as well as the needs of their consumers.


BSM provides substantial added value to this process through its identification and
quantification of service quality. This quantification enables improvement teams to
precisely identify gaps in service delivery, develop appropriate solutions, and visibly
measure how those solutions impact the overall quality of service delivery. In essence,
using BSM's metrics, service improvement teams can measure the difference between
as-is and to-be levels of service quality, proving that their improvement activities have
indeed brought about improvement.

Six Sigma Improvement Activities


Whereas ITIL focuses broadly on the required activities of an IT organization and its
services, Six Sigma focuses entirely on the improvement process. Again, without delving
too deep into the purpose and history of Six Sigma, it is important to recognize here that
Six Sigma's improvement activities also gain quantitative measurements through the data
within a BSM/APM solution.

The Six Sigma Define, Measure, Analyze, Improve, and Control (DMAIC) process comprises
five phases: defining the services or components that are critical to quality, measuring
their behaviors, analyzing those measurements with an eye towards finding gaps,
implementing and validating improvements, and finally, building the controlling
structures that ensure the improvement remains in place over time.

Figure 9.11 highlights each of these five phases as well as some of the common activities
that are accomplished during each phase. Important to recognize here is that each of the
activities noted in Figure 9.11 can actually be augmented through the data provided by a
BSM/APM solution. For example, sampling data can be gathered through APM monitoring
integrations. That data can be used to create a baseline of configuration and behaviors,
which is then continually measured through the same APM monitors.
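The baseline-and-measure loop just described can be sketched in a few lines. The following is an illustrative control-chart check in the Six Sigma spirit, assuming response-time samples in milliseconds; the function names, the sample values, and the choice of 3-sigma limits are this sketch's own, not those of any particular BSM/APM product:

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Derive control limits (mean +/- 3 sigma) from baseline samples."""
    mu = mean(samples)
    sigma = stdev(samples)
    return mu - 3 * sigma, mu + 3 * sigma

def out_of_control(measurements, limits):
    """Return measurements that fall outside the control limits."""
    lower, upper = limits
    return [m for m in measurements if m < lower or m > upper]

# Hypothetical response times (ms) captured during normal operation.
baseline_samples = [120, 125, 118, 130, 122, 127, 121, 124, 119, 126]
limits = build_baseline(baseline_samples)

# Later measurements from the same monitor; the 310 ms outlier falls
# outside the control limits and is flagged for improvement work.
flagged = out_of_control([123, 128, 310, 121], limits)
```

In the Control phase, the same check can keep running against live APM data to verify that an improvement stays in place over time.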


Figure 9.11: The activities and phases of Six Sigma.

Those behaviors can then be analyzed for poor quality and associated costs, creating failure
mode and effects analyses and Pareto charts to identify the areas of highest impact.
Ultimately, gaps can be identified and improved upon, with the same BSM/APM data
validating the positive impact of the improvement. As in the ITIL process improvement
example, BSM/APM quality data establishes the metric against which all improvement
activities are ultimately measured.

BSM: The Bottom Line


BSM’s bottom line arrives with its unique capability to convert technology metrics into
dollars-and-cents impacts. By seeing how technology behaviors affect the business’s
bottom line, technologists and business leaders alike are better empowered to make smart
decisions about service delivery.

Most importantly, business processes are the backbone of a well-managed business. Their
efficient completion ensures that business activities are executed properly and ultimately
drive value—rather than cost—to the business. The historical problem, however, with
business processes has been their integration into IT technologies. Too often in immature
IT organizations, business processes are forced to function within the capabilities of the IT
infrastructure rather than the opposite. In the most egregious of examples, business
processes are simply not fulfilled by the technologies that are deployed by the IT
organization.


Alignment between your IT processes and your business processes is critical to
transforming IT from its traditional role as business cost center into a new role of business
partner. As should be obvious in this chapter, the right data and the right solutions enable
you to do just that.

This chapter effectively concludes this guide’s discussion of APM. Its focus on the business
side of technology is fundamentally critical: every IT organization is a function of its
business, and most businesses today can’t function without their IT organizations. In the
end, the data-driven approach that such an organization gains through the
implementation of an APM solution gives a far greater situational awareness of the
technology environment than domain-focused point solutions.

Although this chapter concludes the discussion, it does not conclude the guide. The final
chapter, Chapter 10, arrives as a sort of primer on the topics discussed throughout this
entire guide. It summarizes the important points discussed in each chapter and is intended
to be the handout you can use to educate others about what you’ve learned in the other
nine chapters. Most importantly, Chapter 10 serves as a more concise explanation that you
can deliver to the decision makers in your organization should you determine that an APM
solution is a necessary addition to your IT environment.


Chapter 10: Your APM Cheat Sheet


APM is obviously a big and complex topic. With monitoring integrations that span
networks, servers, and end users, along with all the transactions in between, a fully-
realized APM solution quickly finds itself wrapped around every part of your network.
Truly understanding how an APM solution adds value to your business services requires
every part of a multi-chapter book. This book’s definitive approach to explaining APM
strategy and tactics gives you the information you need to make smart purchasing
decisions. Once you’ve decided, its chapters help guide you towards the best ways to lay a
solution into place.

However, not everyone has the time or the interest to pore through a 200-page tome.
Digesting this guide’s 200-odd pages will consume more than an afternoon, making the
topic hard to approach for the busy executive or IT director. To remedy this situation, this
final chapter is published as a sort of “Cheat Sheet” for the other nine. Using excerpts from
each of the previous chapters, this “shortcut” guide distills the information you need to
know into a more bite-sized format.

So, how is this chapter best utilized? Hand it out to your business leaders as a walkthrough
for APM’s business value. Pass it around your IT department to give them an idea of APM’s
technical underpinnings. Show it to your Service Desk employees as an example of the
future you want to implement. Then, for those who show particular interest, clue them in
on the other chapters for the full story. In the end, you’ll find that APM benefits everyone.
You just have to show them how.

Part 1—What Is APM?


If you’re an IT professional reading this guide, you’ve heard these stories many times
before. You know about the host of potential problems that an IT infrastructure can and
does experience on any particular day. You’ve experienced the nightmare situation where a
critical service goes down and no one can track down exactly why. You’ve sat in the “war
room” where highly-skilled individuals from every IT domain—network engineers,
systems administrators, application analysts—sit around the conference table for hours
attempting to prove that the problem isn’t theirs. Whether you’re an IT professional, or
someone who directs teams of them, you know that any downed service immediately
signals the beginning of a bad day.


The problem is that a “service that is down” is often much more than a simple binary
answer: on versus off, working versus not working. As you can see in Figure 10.1, IT
services are made up of many components that must work in concert. Servers require the
network for communication. Web servers get their information from application servers
and databases. Data and workflow integrations from legacy systems such as mainframes
must occur. These days, even data storage must be accessible over that same network.

Figure 10.1: An IT service comprises numerous components that rely on each other for
successful operation.

If any of those pieces experiences an unacceptable condition—an outage, a reduction in
performance, an inappropriate or untimely response, and so on—the functionality of the
entire service is affected. This can happen in any number of ways:

• The service or hardware hosting the service is non-functional
• A server or service that is relied on is non-functional
• One or more servers or services that make up the service are not performing at an
acceptable level
• An individual component or function of the service is non-functional or is not
performing at an acceptable level
All of these are situations that can and will impact the ability of your critical IT services to
complete their stated mission. No matter whether the actual service itself is down or the
cause is some component that feeds into the functionality of that service, the ultimate
result to the end customer is a degradation in service. The ultimate result to your business
is a loss of revenue, a loss of productivity, and the inability to fulfill the regular needs of
business.


Defining APM: More than “On vs. Off”


Fixing IT’s former “on versus off” approach to service management is therefore a critical
step. As such, smart organizations are looking to accomplish this through a more
comprehensive approach to defining their services, the quality of those services, and their
ability to meet the needs of users. Application Performance Management (APM) is one
systems management discipline that attempts to provide that perspective. Consider the
following definition:

APM is an IT service discipline that encompasses the identification, prioritization, and
resolution of performance and availability problems that affect business applications.

Organizations that want to take advantage of APM must lay in place a workflow and
technology infrastructure (see Figure 10.2) that enables the monitoring of hardware,
software, business applications, and, most importantly, the end users’ experience. These
monitoring integrations must be exceptionally deep in the level of detail they elevate to the
attention of an administrator. They must watch for and analyze behaviors across a wide
swath of technology devices and applications, including networks, databases, servers,
applications, mainframes, and even the users themselves as they interact with the system.

Figure 10.2 shows an example of how such a system might look. There, you can see how the
major classes of an IT system—users, networks, servers, applications, and mainframes—
are centered under the umbrella of a unified monitoring system. That system gathers data
from each element into a centralized database. Also housed within that database is a logical
model of the underlying system itself, which is used to power visualizations, suggest
solutions to problems, and assist with the prioritization of responses.
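A minimal sketch can illustrate how such a logical service model might roll component health up to a service-level status. All of the names, statuses, and the "worst dependency wins" rule below are hypothetical, invented purely for illustration:

```python
# Severity ordering for component health states (hypothetical labels).
SEVERITY = {"normal": 0, "degraded": 1, "down": 2}

# A toy service model: one service depending on three components.
DEPENDENCIES = {
    "order-service": ["web-cluster", "app-server", "database"],
}

def service_status(service, component_status):
    """A service is only as healthy as its worst dependency."""
    return max(
        (component_status[c] for c in DEPENDENCIES[service]),
        key=lambda state: SEVERITY[state],
    )

# One impaired component is enough to affect the whole service.
status = service_status(
    "order-service",
    {"web-cluster": "normal", "app-server": "degraded", "database": "normal"},
)
```

A real APM product's model is far richer, but the same idea—propagating component health through known dependencies—is what lets it prioritize problems by their impact on the service.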



Figure 10.2: An APM solution leverages monitoring integrations and service model
logic to drive visualizations, prioritize problems, and suggest solutions.

With its monitoring integrations spread across the network, such a system can then assist
troubleshooting administrators with finding and resolving the problem’s root cause. In
situations in which multiple problems occur at once—not unheard of in IT environments—
an APM system can assist in the prioritization of problems. In short, an effective APM
system will drive administrators first to those problems that have the highest level of
impact on users.


Part 2—How APM Aligns with the Business


For an organization to efficiently make use of the kinds of information that an APM solution
can provide, it must operate with a measure of process maturity. IT organizations that lack
configuration control over their infrastructure don’t have the basic capability to maintain
an environment baseline. Without a baseline of your applications, the quality of the
information you gather out of your monitoring solution will be poor at best and wrong at
worst.

But how does an IT organization know when it has the right level of process in place to best
use such a solution? Or, alternatively, if an organization recognizes that it doesn’t have the
right level, how can an APM solution help it get there?

One way to evaluate and measure the “maturity” of IT is through a model developed as
part of a 2007 Gartner analysis titled Introducing the Gartner IT Infrastructure and
Operations Maturity Model (Scott, Pultz, Holub, Bittman, and McGuckin). This
groundbreaking research note defined IT across a spectrum of capabilities, each relating to
the way in which IT actually goes about accomplishing its assigned tasks. An IT culture
with a higher level of process maturity will have the infrastructure frameworks in place to
make better use of technology solutions, solve problems faster, plan better for expansions,
and ultimately align better with the needs and wants of the business it serves.

Process maturity within an organization is defined as quite a bit more than simply having
the ability to solve problems. Within Gartner’s maturity model, the capacity of IT to solve—
and prevent—ever more complex problems was defined largely by its level of process
maturity.

Secondly, and arguably more importantly, smart organizations can leverage an APM
solution itself to rapidly develop process maturity in an otherwise immature organization. By
reorganizing your IT operations around a data-driven approach with comprehensive
monitoring integrations, you will find that you quickly begin making IT decisions based on
their impact on your business’s applications. You will better plan for augmentations based
on actual data rather than the contrived anticipation of need. You will better budget your
available resources based on actual responses you get out of your existing systems.


What Changes with APM?


With IT’s movement from one stage to the next, the entire culture of the organization
changes as well. IT at higher levels of maturity has the capacity to accomplish bigger and
better projects. But IT at higher levels of maturity also thinks entirely differently about the
tasks that are required:

• The ways IT looks at itself. In earlier stages of maturity, IT sees itself as an entity
fully segregated from the business. In many cases, IT can see itself as a different
business entirely! Individuals in IT find themselves concerned with the daily
processing of the servers and the network, to the exclusion of the data that passes
through those systems. As IT matures, its natural culture is to begin thinking of
itself as a partner of the business, and ultimately as the business itself.
• The ways IT looks at data & applications. Data and applications in the immature
IT organization are its bread and butter. These are the elements that make up the
infrastructure, and they are worked on as individual and atomic elements. IT in
earlier stages will find itself leveraging manual activities and shunning automation
out of distrust for how it interacts with system components. Applications in early-
stage IT are most often those that can be purchased off the shelf, with customization
often very limited or non-existent. Later-stage IT organizations needn’t necessarily
build their own applications; however, they do see applications as solutions for
business processes as opposed to fitting the process around the available
application.
• The ways IT looks at the business. Immature IT organizations are incapable of
understanding how their activities impact the business as a whole. Lacking a holistic
view of their systems, they focus on availability as their primary measure of success.
Yet business applications require more than a ping response for them to be truly
available to users. More mature IT organizations find themselves implementing
tools to measure the end user’s experience. When that level of experience is better
understood, IT gains a greater insight into how their operations impact business
operations.
• The tools IT uses. IT’s tools also mature as its culture does. IT organizations with
low levels of maturity are hesitant to incorporate holistic solutions, often because
they can’t see themselves actually using or getting benefit from those solutions. As
such, immature IT organizations lean on point solutions as stopgap resolutions for
their problems. The result is that collections of tools are brought to bear while
unified toolsets are ignored. Mature IT organizations are better able to understand
the operational expense of an expanding toolset, while being more capable—both
technically and culturally—of leveraging the information gained from unified
solutions.


As the maturity of IT’s tools grows, so does their predictive capacity. Chapter 1 discussed
how solution platforms such as those that fulfill APM’s goals extend their monitoring
integrations throughout the technology infrastructure of a business. Because APM reaches
so far into each of a business application’s components, it grows more capable than point
solutions at finding the real root cause behind problems or reductions in performance.

Part 3—Understanding APM Monitoring


Part 2 discussed the concepts of IT organizational maturity. Although that conversation
has little to do with monitoring integrations and their technological bits and bytes, it serves
to illuminate how IT organizations themselves must grow as the systems they manage
grow in complexity. As an example, a Chaotic or Reactive IT organization will simply not be
successful when tasked to manage a highly-critical, customer-focused application. The
processes, the mindset, and the technology simply aren’t in place to ensure good things
happen.

To that end, IT has seen a similar evolution in the approaches used for monitoring its
infrastructure. IT’s early efforts towards understanding its systems’ “under the covers”
behaviors have evolved in many ways similar to Gartner’s depiction of organizational
maturity. Early attempts were exceptionally coarse in the data they provided, with each
new approach involving richer integrations at deeper levels within the system.

IT organizations that manage complex and customer-facing systems are held to a greater
level of due diligence than those that manage a simple infrastructure. As such, the tools
used to watch those systems must meet a similarly high standard. As monitoring
technologies have evolved over time, new approaches have been developed that extend the
reach of monitoring, enhance data resolution, and enable rich visualizations to assist
administrative and troubleshooting teams:

• Simple availability with ICMP
• Richer information with SNMP
• Device details with the agent-based approach
• Situational awareness with the agentless approach
• Application runtime analysis for deep monitoring integration
• Complete recognition of the end user’s experience
Chapter 3 discusses how this evolution has occurred and where monitoring is today. As
you’ll find, APM aggregates the lessons learned from each previous generation to create a
unified system that leverages every approach simultaneously.
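To ground the first rung of that evolution, the following sketch approximates a simple availability probe. True ICMP requires raw-socket privileges, so this illustrative example substitutes a TCP connect test; like a ping, it answers only "on versus off" and says nothing about how the service behind the port is actually performing:

```python
import socket

def is_reachable(host, port, timeout=2.0):
    """Crude availability probe: can we open a TCP connection at all?

    This is the coarsest form of monitoring -- a binary up/down answer
    in the spirit of an ICMP ping, with none of the richer detail that
    SNMP, agents, or application runtime analysis later added.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Each later generation in the list above keeps this basic reachability signal but layers progressively deeper data on top of it.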


Part 4—Integrating APM into your Infrastructure


Integrating an APM solution into your environment is no trivial task. Although best-in-class
APM software comes equipped with predefined templates and automated deployment
mechanisms that ease its connection to IT components, its widespread coverage means
that the initial setup and configuration are quite a bit more than any “Next, Next, Finish.”

That statement isn’t written to scare any business away from a potential APM installation.
Although a solution’s installation will require the development of a project plan and
coordination across multiple teams, the benefits gained go a long way toward assuring
quality services to customers. Any APM solution requires the involvement of each of IT’s
traditional silos. Each technology domain—networks, servers, applications, clients, and
mainframes—will have some involvement in the project. That involvement can span from
installing an APM solution’s agents on servers and clients, to configuring SNMP and/or
NetFlow settings on network hardware, to integrating APM monitoring into off-the-shelf or
homegrown applications. As a result, an APM solution enables a level of objective analysis
heretofore unseen in traditional monitoring.

The realities of that objective data are best exemplified through APM’s mechanisms to
chart and plot its data. Figure 10.3 shows a sample of the types of simultaneous reports
that are possible when each component of an application infrastructure is consolidated
beneath an APM platform. In Figure 10.3, a set of statistics for a monitored application is
provided across a range of elements.

Take a look at the varied ways in which that application’s behaviors can be charted over
the same period of time. Measuring performance over the time period from 10:00 AM to
7:00 PM, these charts enable the reconstruction of that application’s behaviors across each
of its points of monitoring.


Figure 10.3: APM’s integrations enable real-time and historical monitoring across a
range of IT components, aggregating their data into a single location for analysis.

With the data you see in Figure 10.3, consider the points of integration where you might
want monitors set into place. You will definitely want to watch server processing. You’ll
need to record your network bandwidth utilization and throughput. And you’ll need to
know transaction rates between mainframes and inventory processing.

All these monitors illuminate different behaviors associated with the greater system at
large, and all provide another set of data that fills out the picture in Figure 10.3’s charts and
graphs. Now take a look at Figure 10.4, which shows how some of these monitoring
integrations can be laid into place for an example customer-facing business service.


Figure 10.4: Overlaying potential monitoring integrations onto a complex system shows
the multiple areas where measurement is necessary.

One end goal of all this monitoring is the ability to create an overall sense of system
“health.” As should be obvious in this chapter, an APM solution has a far-reaching capability
to measure essentially every behavior in your environment. That’s a lot of data, and the
resulting problem with this sheer mass of data is ultimately finding meaning in it. You can
gather lots of data, but it isn’t valuable if you don’t use it to improve the management of your
systems.

As a result, APM solutions include a number of mechanisms to roll up this massive quantity
of data into something that is usable by a human operator. For most APM solutions, this
process is relatively automatic, yet it requires definition by the IT organization that
manages it.


The concept of “service quality” is used to explain overarching environment health. The
idea is quite simple: the “quality” of a service is a single metric—like a stoplight—that tells
you how well your system is performing. In effect, if you roll up every system-centric
counter, every application metric, every network behavior, and every transaction
characteristic into a single number, that number goes far in explaining the highest-level
quality of the service’s ability to meet the needs of its users.
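Such a roll-up might be sketched as a weighted average of normalized metric scores mapped onto a stoplight. The metric names, weights, and thresholds below are invented for illustration; a real APM solution derives them from its service model and its administrators' configuration:

```python
def service_quality(metrics, weights):
    """Combine normalized metric scores (0.0 bad .. 1.0 good) into one number."""
    total_weight = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight

def stoplight(score, green=0.9, yellow=0.7):
    """Map the rolled-up score onto the familiar stoplight display."""
    if score >= green:
        return "green"
    if score >= yellow:
        return "yellow"
    return "red"

# Hypothetical normalized scores; transactions weighted most heavily
# because they track the end user's experience most directly.
metrics = {"cpu": 0.95, "network": 0.90, "transactions": 0.60}
weights = {"cpu": 1, "network": 1, "transactions": 2}

score = service_quality(metrics, weights)  # about 0.7625
light = stoplight(score)                   # "yellow": degraded but not down
```

The single number hides detail by design; the detail remains available underneath for drill-down when the light changes color.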

Consider the graphic shown in Figure 10.5. Here, a number of services in different locations
are displayed, all with a health of “Normal.” This single stoplight chart very quickly enables
the IT organization to understand when a service is meeting demands and when it isn’t.
The graph also shows how long each service has operated in the “normal” state, as well as
a monthly trend. This single view provides a heads-up display for administrators.

Figure 10.5: The quality of a set of services is displayed, showing a highest-level
approximation of their abilities to serve user needs.

Yet actually getting to a graph like this requires each of the monitoring integrations
explained to this point in this chapter. The numerical analysis that goes into identifying a
service’s “quality” requires inputs from network monitors, on-board agents, transactions,
and essentially each of the monitoring types provided by APM.

Part 5—Understanding the End User’s Perspective


With APM solutions, you’ll hear the term “perspective” used over and over in relation to the
types of data that can be provided by a particular monitoring integration. But what really is
perspective, and what does it mean to the monitoring environment?

It is perhaps easiest to consider the idea of perspective as relating to the orientation of a
monitor’s view, which determines the kinds of data it can see and report on. Although the
computing environment is the same no matter where a monitor is positioned, different
monitors in different positions will “see” different parts of the environment.


Consider, for example, a set of fans watching a baseball game. If you and a friend are both
watching the game but sitting in different parts of the stadium, you’re sure to capture
different things in your view. Your friend sitting in the good seats down by the batter is
likely to pick up on the subtle non-verbal conversations between pitcher and catcher. In
contrast, from your seats deep in the outfield you’re far more likely to see the big picture of
the game—the positioning of outfielders, the impact of wind speed on the ball, the emotion
and effects of the crowd on the players—than is possible through your friend’s close-in
view.

Relating this back to applications and performance, it is for this reason that multiple
perspectives are necessary. Their combination assists the business with truly
understanding application behaviors across the entire environment. An agent that is
installed to an individual server will report in great detail about that server’s gross
processing utilization. That same agent, however, is fully incapable of measuring the level
of communication between two completely separate servers elsewhere in the system.

Monitoring from the End User’s Perspective


Thus far, this guide has discussed how the wide range of different monitors enables metrics
from a vast number of perspectives: server-focused counters are gathered by agents,
network statistics are gathered through probes and device integrations such as Cisco
NetFlow, transactions and application-focused metrics are gathered through application
analytics; the list goes on. Yet it should be obvious that this guide’s conversation on
monitoring remains incomplete without a look at what end users see in their interactions
with the system.

This view is critically necessary because it is not possible—or, at the very least,
exceptionally difficult—to construct this experience using the data from other metrics.
Relating this back to the baseball example, no matter how much data you gather from your
seat in the outfield, it remains very unlikely that you’ll extrapolate from it what the pitcher
is likely to throw next.

For the needs of the business application, end user experience (EUE) monitoring enables
administrators, developers, and even management to understand how an application’s
users are faring. First and foremost, this data is critical for discovering how successful that
application is in servicing its customers. Applications whose users experience excessive
delays, drop off before accomplishing tasks, or don’t fulfill the use case model aren’t
meeting their users’ needs. And those that don’t meet user needs will ultimately result in a
failure for the business.


This line of thinking introduces a number of potential use cases where EUE monitoring can
benefit an application’s quality of service. EUE monitoring works for evaluating the
experience of the absolute end user as well as in other ways:

• Quantifying the performance characteristics of connected users, as well as
differences in performance between users in different geographic locales
• Simulating user behaviors through the use of robots for the purpose of predicting
service quality degradations
• Identifying where internal users, as opposed to the absolute end user, are seeing a
loss of service
• Keeping external service providers honest through independent measurements of
their services

Where Does EUE Fit?


It should be obvious at this point that there are a number of areas where EUE provides
benefit to the business and its applications. Yet this chapter hasn’t yet discussed how EUE
goes about gathering its data. If end users are scattered around the region or the planet,
how can an EUE monitoring solution actually come to understand their behaviors? Simply
put, the metrics are right at the front door.

Think for a moment about a typical Internet-based application such as the one being
discussed in this chapter. Multiple systems combine to enable the various functions of that
application. Yet there is one set of servers that interfaces directly with the users
themselves: the External Web Cluster. Every interaction between the end user and the
application must proxy in some way through that Web-based system. This centralization
means that every interaction with users can also be measured from that single location.

EUE leverages transaction monitoring between users and Web servers as a primary
mechanism for defining the users’ experience. Every time a user clicks on a Web page, the
time required to complete that transaction can be measured. The more clicks, the more
timing measurements. As users click through pages, an overall sense of that user’s
experience can be gathered by the system and compared with known baselines. These
timing measurements create a quantitative representation of the user’s overall experience
with the Web page, and can be used to validate the quality of service provided by the
application as a whole.
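The timing mechanism described above might be sketched as a simple wrapper at the web tier. This is an illustrative stand-in, not any vendor's implementation; the class name, the baseline value, and the "slow fraction" summary are all invented for the sketch:

```python
import time

class TransactionTimer:
    """Record per-click response times at the web tier (illustrative only)."""

    def __init__(self, baseline_seconds):
        self.baseline = baseline_seconds  # known-good response time
        self.samples = []                 # observed durations, in seconds

    def measure(self, handler, *args):
        """Run a page handler, timing it the way a front-door EUE monitor would."""
        start = time.perf_counter()
        result = handler(*args)
        self.samples.append(time.perf_counter() - start)
        return result

    def slow_fraction(self):
        """Share of observed transactions slower than the known baseline."""
        if not self.samples:
            return 0.0
        slow = sum(1 for s in self.samples if s > self.baseline)
        return slow / len(self.samples)
```

In use, every page handler would be routed through measure(), and slow_fraction() compared against a service-level target—the more clicks, the more timing measurements, exactly as the text describes.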


It is perhaps easiest to explain this through the use of an example. Consider the typical
series of steps that a user might undergo to browse an e-commerce Web site, identify an
item of interest, add that item to their basket, and then complete the transaction through a
check out and purchase. Each of these tasks can be quantified into a series of actions. Each
action starts with the Web server, but each action also requires the participation of other
services in the stack for its completion:

• Browse an e-commerce Web site. The External Web Cluster requests potential
items from the Java-based Inventory Processing System, which gathers those items
from the Inventory Mainframe. Resulting items are presented back to the External
Web Cluster, where they are rendered via a Web page or other interface.
• Identify an item of interest. This step requires the user to look through a series of
items, potentially clicking through them for more information. Here, the same
thread of communication between External Web Cluster, Inventory Processing
System, and Inventory Mainframe is leveraged during each click. Further
assistance from the ERP system can be used to identify additional or alternative
items of interest based on the user’s shopping habits.
• Add that item to the basket. Creating a basket often requires an active account by
the user, handled by the ERP system with its security handled by the Kerberos
Authentication System. The actual process of moving a desired item to a basket can
also require temporarily adjusting its status on the Inventory Mainframe to ensure
that item remains available for the user while the user continues shopping.
Information about the successful addition of the item must be rendered back to the
user by the External Web Cluster.
• Complete the transaction through a check out and purchase. This final phase
leverages each of the aforementioned systems but adds the support of the Credit
Card Proxy System and Order Management System.
In all these conversations, the External Web Cluster remains the central locus for
transferring information back to the user. Every action is initiated through some click by
the user, and every transaction completes once the resulting information is rendered for
the user in the user’s browser. Thus, a monitor at the level of the External Web Cluster can
gather experiential data about user interactions as they occur. Further, as the monitor sits
in parallel with the user, any delay in receiving information from down-level systems is
recognized and logged.

A resulting visualization of this data might look similar to Figure 10.6. In this figure, a top-
level EUE monitor identifies the users who are currently connected into the system.
Information about the click patterns of each user is also represented at a high level by
showing the number of pages rendered, the number of slow pages, the time associated with
each page load, and the numbers of errors seen in producing those pages for the user.


Figure 10.6: User statistics help to identify when an entire application fails to meet
established thresholds for user performance.

Adding a bit of preprogrammed threshold math to the equation, each user is then given
a metric associated with their overall application experience. In Figure 10.6, you can see
how some users are experiencing a yellow condition. This means that their effective
performance is below the threshold for quality service. Although this information exists at
a very high level, and as such doesn’t identify why performance is lower than expected,
it does alert administrators that some users are experiencing degraded service.
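The threshold math described here can be sketched in a few lines of code. The metric names and threshold values below are illustrative assumptions, not the configuration of any particular APM product:

```python
# Hypothetical per-user EUE evaluation: page statistics roll up into a
# green/yellow/red condition against fixed slow-page thresholds.
from dataclasses import dataclass

@dataclass
class UserStats:
    pages_rendered: int
    slow_pages: int
    errors: int

def user_condition(stats: UserStats,
                   slow_ratio_warn: float = 0.10,
                   slow_ratio_crit: float = 0.25) -> str:
    """Return 'green', 'yellow', or 'red' for one user's experience."""
    if stats.errors > 0:
        return "red"        # any failed page is treated as critical
    if stats.pages_rendered == 0:
        return "green"      # no activity yet, nothing to flag
    ratio = stats.slow_pages / stats.pages_rendered
    if ratio >= slow_ratio_crit:
        return "red"
    if ratio >= slow_ratio_warn:
        return "yellow"
    return "green"
```

Under these assumed thresholds, a user with 3 slow pages out of 20 lands in the yellow condition shown in Figure 10.6, while any page error pushes the user straight to red.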

An effective APM solution should enable administrators to drill down from high-level
information like that shown in Figure 10.6 to more detailed statistics. Those
statistics may illuminate more information about why certain users are experiencing
delays while others are not. Perhaps one server in a cluster of servers further down in the
application’s stack is experiencing a problem. Maybe the items being requested by some
users are not being located quickly enough by inventory systems. Troubleshooting
administrators can drill through EUE information to server and network statistics, network
analytics, or even individual transaction measurements to find the root cause of the
problem.

Part 6—APM’s Service-Centric Monitoring Approach


This guide has spent a lot of time talking about monitoring and monitoring integrations. It
discussed the history of monitoring. It explained where and how monitoring can be
integrated into your existing environment. It outlined in great detail how end user
experience (EUE) monitoring layers over the top of traditional monitoring approaches. Yet
in all these discussions, there has been little talk so far about how that monitoring is
actually manifested into an APM solution’s end result.


It is this process that requires attention at this point in our discussion. In reading through
the preceding chapters of this guide, you’ve learned where monitoring fits into your
environment. The next step is creating meaning out of its raw data. As mentioned earlier
and as you’ll discover shortly, the real magic in an APM solution comes through the
creation and use of its Service Model.

To fully understand the quantitative approach to Service Quality, one must understand how
the different types of monitoring are aggregated into what is termed a Service Model. This
Service Model is the logical representation of the business service, and it is the structure
and hierarchy in which each monitoring integration’s data resides. The Service Model is
functionally little more than “boxes on a whiteboard,” with each box representing a
component of the business service and each connection representing a dependency. It
resides within your APM solution, with the sum total of its elements and interconnections
representing the overall system that the solution is charged with monitoring.

But before actually delving into a conversation of the Service Model, it is important to first
understand its components. Think about all the elements that can make up a business
service. There are various networking elements. Numerous servers process data over that
network. Installed to each server may be one or more applications that house the service’s
business logic. All these reside atop name services, file services, directory services, and
other infrastructure elements that provide core necessities to bind each component.

Take the concepts that surround each of these and abstract them to create an element on
that proverbial whiteboard. This guide’s External Web Cluster becomes a box on a piece of
paper marked “External Web Cluster.” The same happens with the Inventory Processing
System and the Intranet Router, and eventually every other component.

By encapsulating the idea of each service component, it is now possible to connect those
boxes and design the logical structure of the system. This step generally occurs after
implementation, with the implemented service’s architecture defining the model’s
structure and not the other way around. Figure 10.7 shows a simple example of how this
might occur. There, the External Web Cluster relies on the Inventory Processing System for
some portion of its total processing. Both the External Web Cluster and the Inventory
Processing System rely on the Intranet Router for their networking support. As such, their
boxes are connected to denote the dependency.
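The boxes-and-connections structure described above maps naturally onto a directed graph. The following sketch uses this guide’s component names; the data structure itself is an illustration, not any vendor’s actual model format:

```python
# A minimal Service Model sketch: each whiteboard box becomes a node,
# and each dependency becomes a directed edge.
service_model = {
    "External Web Cluster": ["Inventory Processing System", "Intranet Router"],
    "Inventory Processing System": ["Intranet Router"],
    "Intranet Router": [],
}

def all_dependencies(component: str) -> set:
    """Every component this one relies on, directly or transitively."""
    found = set()
    for dep in service_model[component]:
        found.add(dep)
        found |= all_dependencies(dep)
    return found
```

Asking for the dependencies of the External Web Cluster returns both the Inventory Processing System and the Intranet Router, mirroring the connections drawn in Figure 10.7.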


Figure 10.7: Abstracting each individual component to create connected elements on
a whiteboard.

This abstraction and encapsulation of components can grow as complex or as simple as
your business service (and your level of monitoring granularity) requires. One simple
system might have only a few boxes that connect. An exceptionally complex one that
services numerous external customers—such as the one used by TicketsRUs.com—might
require dozens or hundreds of individual elements. Each element relies on others, and all
must work together for the success of the overall system.

This abstraction and connection of service components only creates the logical structure
for your overall business service. Internal to each individual component are metrics that
quantify the internal behaviors of that component. As you already saw back in Figure 10.4,
those metrics for a network device might be Link Utilization, Network Latency, or Network
Performance. An inventory processing database might have metrics such as Database
Performance or Database Transactions per Second. Each individual server might have its
own server-specific metrics, such as Processor Utilization, Memory Utilization, or Disk I/O.
Even the installed applications present their own metrics, illuminating the behaviors
occurring within the application.

With this in mind, let’s redraw Figure 10.7 and map a few of these potential points of
monitoring into the abstraction model. Figure 10.8 shows how some sample metrics can be
associated with the Inventory Processing System. Here, the Database Performance and
Transactions per Second statistics arrive from application analytics integrations plugged
directly into the installed database. Agent-based integrations are also used to gather whole
server metrics such as Memory Utilization and Processor Utilization.
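One way to picture this mapping is a small table of monitors attached to a single component. The readings and thresholds below are made-up values for illustration only:

```python
# Hypothetical monitors attached to the Inventory Processing System.
# Each metric carries its latest reading, a threshold, and a direction.
inventory_monitors = {
    "Database Performance":    {"value": 0.92, "threshold": 0.80, "higher_is_better": True},
    "Transactions per Second": {"value": 340,  "threshold": 250,  "higher_is_better": True},
    "Memory Utilization":      {"value": 0.61, "threshold": 0.90, "higher_is_better": False},
    "Processor Utilization":   {"value": 0.97, "threshold": 0.85, "higher_is_better": False},
}

def metric_status(metric: dict) -> str:
    """Compare a reading against its threshold in the correct direction."""
    if metric["higher_is_better"]:
        ok = metric["value"] >= metric["threshold"]
    else:
        ok = metric["value"] <= metric["threshold"]
    return "green" if ok else "red"
```

A direction flag is needed because utilization thresholds are upper bounds, while performance and throughput thresholds are lower bounds.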


Figure 10.8: Individual monitors for each element are mapped on top of each
abstraction.

You’ll also notice that the colors of the elements have changed. At the moment Figure
10.8 is drawn, the Inventory Processing System’s box is colored red, indicating that it is
experiencing a problem. Drilling down into that Inventory Processing System, one can
identify from its associated metrics that the server’s Processor Utilization has risen above
its acceptable level and has switched to red.

Each of the metrics assigned to the Inventory Processing System’s box is itself part
of a hierarchy. The four assigned metrics fall under a fifth that represents the overall
Component Health. This illustrates the concept of rolling up individual metrics to those that
represent larger and less granular areas of the system. It enables the failure of a down-level
metric to quickly rise to the level of the entire system.

Flow Up, Drill Down


Drilling down in this model highlights the individual failure that is currently impacting the
system, but that specific problem is only one piece of data found in this illustration. As you
drill upwards from the individual metrics and back to the model as a whole, you’ll notice
that the individual boxes associated with each component are also active participants in the
model. Because the overall Component Health monitor associated with the Inventory
Processing System has changed to red, the representation of the Inventory Processing
System itself changes to red as well.


Going a step further, this model flows up individual failures to the greater system through
its individual linkages between components that rely on each other. In this example, the
External Web Cluster relies on the failed Inventory Processing System. Therefore, when the
Inventory Processing System experiences a problem, it is also a problem for the External
Web Cluster. The model as a whole is impacted by the singular problem associated with
Processor Utilization in the Inventory Processing System.
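This flow-up behavior can be sketched as a recursive walk over the dependency graph: a component is unhealthy if any of its own monitors fail, or if anything it relies on is unhealthy. Real APM products combine states in more sophisticated, weighted ways; this is only a simplified illustration:

```python
# Simplified health flow-up over the Service Model's dependency links.
depends_on = {
    "External Web Cluster": ["Inventory Processing System", "Intranet Router"],
    "Inventory Processing System": ["Intranet Router"],
    "Intranet Router": [],
}

# Current monitor states per component (hypothetical readings; the
# single red entry represents the failed Processor Utilization metric).
monitor_states = {
    "External Web Cluster": ["green", "green"],
    "Inventory Processing System": ["green", "green", "green", "red"],
    "Intranet Router": ["green"],
}

def component_health(name: str) -> str:
    """Red if any local monitor, or any dependency, is red."""
    if "red" in monitor_states[name]:
        return "red"
    if any(component_health(dep) == "red" for dep in depends_on[name]):
        return "red"
    return "green"
```

With this data, the Intranet Router stays green while both the Inventory Processing System and, through its dependency link, the External Web Cluster report red.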

It is the summation of all these individual threshold values that ultimately drives the
numerical determination of Service Quality. A business service operates with high quality
when its configured thresholds remain in the green. That same service operates with low
quality when certain values flip from green to red and is no longer available when other
critical values become unhealthy. The levels of functionality between these states become
mathematical products of each calculation.

In effect, one of APM’s greatest strengths is in its capacity to mathematically calculate the
functionality of your service. Taking this approach one step further, IT organizations can
add data to each element that describes the number of potential users of that component.
Combining this user impact data with the level of Service Quality enables the system to
report on which and how many users are impacted by any particular problem.
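A very rough sketch of that calculation might score Service Quality as the fraction of configured thresholds currently in the green, and report user impact from per-component user counts. Both the formula and the numbers here are illustrative assumptions:

```python
# Illustrative Service Quality score plus user-impact reporting.
components = {
    "External Web Cluster":        {"monitors": ["green", "green"],        "users": 5000},
    "Inventory Processing System": {"monitors": ["green", "green", "red"], "users": 3000},
    "Intranet Router":             {"monitors": ["green"],                 "users": 5000},
}

def service_quality() -> float:
    """Fraction of all configured thresholds currently in the green."""
    states = [s for c in components.values() for s in c["monitors"]]
    return states.count("green") / len(states)

def users_impacted() -> int:
    """Largest user population behind any component in a red state."""
    return max((c["users"] for c in components.values()
                if "red" in c["monitors"]), default=0)
```

With this data, the service scores 5 of 6 thresholds green, or roughly 0.83, with up to 3,000 users potentially affected by the single red monitor.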

Part 7—Developing & Building APM Visualizations


This guide’s growing explanation of APM has introduced each new topic with an end goal in
mind. That end goal—both for this guide as well as APM in general—is to gather necessary
data that ultimately creates a set of visualizations useful to the business.

It is the word “useful” that is most important in the previous sentence. “Useful” in this
context means that the visualization is providing the right data to the right person. “Useful”
also means providing that data in a way that makes sense for and provides value to its
consumer.

The concept of digestibility was first introduced in this book’s companion, The Definitive
Guide to Business Service Management. In both guides, the digestibility of data relates to the
ways in which it can be usefully presented to various classes of users. For example, data
that may be valuable to a developer is not likely to have the same value for Dan the COO.
In Dan’s role, the failure of an individual network component matters less than how
that component impacts the system’s customers. Each person in the business has
a role to fill, and as such, different views of data are necessary.


No visualization is effective unless it is created first with its consumer in mind. If that
consumer can’t digest what’s being presented to them, the information being displayed is
valueless. Think about the types of consumers in your business today who might benefit
from the data an APM solution can gather:

• Service desk employees and administrators gain troubleshooting assistance and an
improved view into systems health.
• IT managers are assisted in positioning troubleshooting resources toward the most
crucial problems as well as in planning for expansion based on identified problem
domains.
• Business executives gain a financial perspective and better quality data that is
formatted specifically for their needs.
• Developers are able to dig into specific areas where code is non-optimized or
requires updating.
• End users are proactively notified when problems occur, maintaining their
satisfaction with your services.
A fully realized APM implementation will include visualizations that provide the right kind
of data to each of these stakeholders. Technical stakeholders get readouts on the stability of
their devices and applications. Business leaders get a financial perspective. But each class
of individual receives the information they need, which has been calculated from APM’s
singular database.

Note
With a picture really being worth a thousand words, consider turning back to
Chapter 7 to see examples of APM visualizations for each of these classes of
consumer. There, you’ll see how APM’s graphical representation of your
business services enables a much improved situational awareness of their
inner workings and impacts to the business.

Part 8—Seeing APM in Action


Environments that benefit from APM’s data-driven approach streamline the problem
resolution process into six steps. This new process consolidates many steps from the
traditional approach while adding a few new ones that improve communication between
teams and to the rest of the business.
Consider the following six steps as best practices for an APM-enabled environment.

Visibility
Behaviors that occur outside expected thresholds trigger alerts in high-level visualizations.
Through drill-down support, the perspective and data found in that high-level visualization
can be narrowed to one or more systems or subsystems that triggered the failure. Using
tools such as service quality metrics and hierarchical service health diagrams, triaging
administrators can be quickly advised as to initial steps in problem resolution.


Prioritization
Counts of affected users are predefined within an APM solution’s interface, enabling
triaging teams to identify the actual priority of one incident in relation to others that are
outstanding. As a result, incidents with higher numbers of affected users or greater impacts
on the business bottom line can be prioritized above those with lesser effect.
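Once each incident carries an affected-user count, prioritization reduces to a sort. The incident records below are invented for illustration:

```python
# Rank outstanding incidents by user impact, highest first
# (hypothetical incident data).
incidents = [
    {"id": "INC-101", "summary": "Slow checkout pages", "affected_users": 1200},
    {"id": "INC-102", "summary": "Report export fails", "affected_users": 45},
    {"id": "INC-103", "summary": "Login errors",        "affected_users": 3800},
]

def prioritized(queue: list) -> list:
    """Return incidents ordered by number of affected users, descending."""
    return sorted(queue, key=lambda i: i["affected_users"], reverse=True)
```

A triage team working this queue would pick up the login errors first, since they affect the largest user population.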

Problem & Fault Domain Isolation


Triaging teams then work with troubleshooting teams, often through a work-order
tracking system, to track the root cause of the problem. The same visualizations used
before in the visibility step are useful here. What differs from the unmanaged environment
is that all teams share the same view into environment behaviors through their APM
visualizations. As such, details about the problem can be quantitatively communicated to
the right teams to assist in their further troubleshooting.

Troubleshooting, Root Cause Identification, & Resolution


Using health metrics, the problem is then traced to the specific element that caused the
initial alarm. That alarm describes how the selected element is not behaving within
expected parameters. Here, troubleshooting administrators can work with other teams (networking,
security, developers, and so on) to translate the inappropriate behavior into a root cause
and ultimately a workable resolution.

Communication with the Business


During this entire process, business leaders and end users are kept apprised of the
problem through their own set of APM visualizations that have been tailored for their use.
Business leaders see in real time who and how many people are affected by the problem as
well as how much budget impact occurs. End users are notified through notification
systems that give them real-time status on the problem and its fix.

Improvement
Throughout the entire process, the APM solution continues to gather data about the
system. This occurs during both nominal and non-nominal operations. The resulting
data can then be later used by improvement teams to identify whether additional
hardware, code updates, or other assistance is needed to prevent the problem from
reoccurring. By monitoring the environment through the entire process, after-action
review teams can identify whether the resolution is truly a permanent fix or if further work
is needed.

It should be obvious how this six-step process is much more data driven than the
earlier traditional approach. Here, every team remains notified about the status of the
problem and can provide input when necessary through the sharing of monitoring data.
When problems occur that cross traditional domain boundaries, those teams can work
together towards a common goal without the need for war rooms and their subsequent
finger-pointing.


Note
For a fictional narrative of the entire six-step process, consider turning back
to Chapter 8. There, a made-up storyline is used to show how a fully realized
APM solution can and does improve the process of triaging, troubleshooting,
provisioning resources, and eventually solving what would otherwise be an
exceptionally painful problem.

Part 9—APM Enables Business Service Management


Thus far, this chapter has shown how effectively resolving problems requires a data-driven
approach, one with a substantial amount of granular detail across multiple devices and
applications. Using this approach, it is possible to trace a system-wide performance
problem directly into its root cause. By integrating into databases, servers, network
components, and the end users’ experience itself, a fully realized APM solution is uniquely
suited to gather and calculate metrics for entire business services as a whole.

Yet the topics in this chapter’s story so far have been fundamentally focused on the
technologies themselves, along with the performance and availability metrics associated
with those technologies. Its resulting visualizations were heavily focused on the needs of
the technologist:

• Service desk employees were able to track the larger issue directly into its problem
domain.
• Network administrators were able to identify whether metrics for network
utilization were within acceptable parameters.
• Administrators were able to use health and performance metrics to identify
symptoms of the problem.
• Developers were able to ultimately identify the failing lines of code and quickly
implement a fix.
Missing from this chapter’s story, however, is another set of business-related metrics
that convert technology behavior into useable data for business leaders. This class of data
tells the tale of how a business service ultimately benefits—or takes away from—the
business’ bottom line. It also creates a standard by which the quality of that service’s
delivery can be measured. It is the gathering, calculation, and reporting on these business-
related metrics that comprise the methodology known as Business Service Management
(BSM).


Linking BSM to APM


The IT Infrastructure Library (ITIL) v3 defines BSM as an approach to the management of IT
services that considers the business processes supported and the business value provided, as
well as the management of business services delivered to business customers. Businesses
that leverage BSM look at IT services as enablers for business processes. They also look at
the success of IT as driving the ultimate success of the business.

BSM and APM are two methodologies that are naturally linked by their requirements for
data. The information gathered through an APM solution’s monitoring integrations feeds
directly into the requirements of a BSM calculations engine. Performance, availability, and
behavioral data of the overall business service and its components are all metrics that aid
in calculating that service’s overall return. These metrics also provide the kind of raw data
that helps identify how well a business system is meeting the needs of its customers.

Figure 10.9 shows a logical representation of where BSM links into APM. Here, APM begins
with the creation of monitoring integrations across the different elements that make up a
business service. Those monitoring integrations gather behavioral information about the
end users’ experience. They collect application and infrastructure metrics as well as other
customized metrics from technology components. APM’s data by itself is used primarily by
the IT organization for the problem resolution and service improvement processes
discussed to this point in this guide.
Figure 10.9: BSM converts technology-focused monitoring data into business-centric
metrics.


The addition of BSM creates a new layer atop this APM infrastructure. Here, the business
itself becomes a critical component of the monitoring solution. Business processes and
service level expectations are encoded into a BSM solution, with the goal of creating
business service views that validate and report on how well the technology is meeting the
needs of the business.

The metrics gained through a BSM implementation are also useful when fed into
management frameworks such as ITIL or Six Sigma. Like BSM’s roots in APM data, these
frameworks are often highly data-driven in how they accomplish and improve upon the
tasks of IT. One of the common limitations, however, in successfully implementing ITIL and
Six Sigma framework processes is in gathering enough data of the right kind to be useful.
The data gathering and calculation potential of a BSM/APM solution enables greater
success with both frameworks.

BSM provides substantial added value to this process through its identification and
quantification of service quality. This quantification enables improvement teams to
precisely identify gaps in service delivery, develop appropriate solutions, and see how
those solutions impact the overall quality of service delivery. In essence, using BSM’s
metrics, service improvement teams can measure the difference between as-is and to-be
levels of service quality, proving that their improvement activities have indeed brought
about improvement.

APM Is Required Monitoring for Business Services


If this short “Cheat Sheet” has piqued your interest in the technologies and the business
relevance of an APM solution for your business, consider turning back to its other chapters.
Through both regular reporting as well as narrative storytelling, this Definitive Guide
attempts to relate the tale of why APM should be required monitoring for business services.
Through its comprehensive approach, you’ll quickly find that an APM solution can and will
bring vast amounts of value to your critical business services.
