The Definitive Guide to Monitoring the Data Center, Virtual Environments, and the Cloud
Don Jones

Introduction to Realtime Publishers
by Don Jones, Series Editor
For several years now, Realtime has produced dozens and dozens of high‐quality books
that just happen to be delivered in electronic format—at no cost to you, the reader. We’ve
made this unique publishing model work through the generous support and cooperation of
our sponsors, who agree to bear each book’s production expenses for the benefit of our
readers.
Although we’ve always offered our publications to you for free, don’t think for a moment
that quality is anything less than our top priority. My job is to make sure that our books are
as good as—and in most cases better than—any printed book that would cost you $40 or
more. Our electronic publishing model offers several advantages over printed books: You
receive chapters literally as fast as our authors produce them (hence the “realtime” aspect
of our model), and we can update chapters to reflect the latest changes in technology.
I want to point out that our books are by no means paid advertisements or white papers.
We’re an independent publishing company, and an important aspect of my job is to make
sure that our authors are free to voice their expertise and opinions without reservation or
restriction. We maintain complete editorial control of our publications, and I’m proud that
we’ve produced so many quality books over the past years.
I want to extend an invitation to visit us at http://nexus.realtimepublishers.com, especially
if you’ve received this publication from a friend or colleague. We have a wide variety of
additional books on a range of topics, and you’re sure to find something that’s of interest to
you—and it won’t cost you a thing. We hope you’ll continue to come to Realtime for your
educational needs far into the future.
Until then, enjoy.
Don Jones

Introduction to Realtime Publishers
Chapter 1: Evolving IT—Data Centers, Virtual Environments, and the Cloud
  Evolving IT
    Remember When IT Was “Easy?”
    Distributed Computing: Flexible, But Tough to Manage
    Super‐Distributed Computing: Massively Flexible, Impossible to Manage?
  Three Perspectives in IT
    The IT End User
    The IT Department
    The IT Service Provider
  IT Concerns and Expectations
    IT End Users
    IT Departments
    IT Service Providers
  Business Drivers for the Hybrid, Super‐Distributed IT Environment
    Increased Flexibility
    Faster Time‐to‐Market
    Pay As You Go
  Business Goals and Challenges for the Hybrid IT Environment
    Centralizing Management Information
    Redefining “Service Level”
    Gaining Insight
    Maintaining Responsibility
  Special Challenges for IT Service Providers
  The Perfect Picture of Hybrid IT Management
    For IT End Users
    For IT Departments
    For IT Service Providers
  About this Book
Chapter 2: Traditional IT Monitoring, and Why It No Longer Works
  How You’re Probably Monitoring Today
    Standalone Technology‐Specific Tools
    Local Visibility
    Technology Focus, Not User Focus
  Problems with Traditional Monitoring Techniques
    Too Many Tools
    Fragmented Visibility into Deep Application Stacks
    Disjointed Troubleshooting Efforts
    Difficulty Defining User‐Focused SLAs
    No Budget Perspective
  Evolving Your Monitoring Focus
    The End User Experience
    The Budget Angle
  Traditional Monitoring: Inappropriate for Hybrid IT
  It’s Your Business, So It’s Your Problem
    Provider SLAs Aren’t a Business Insurance Policy
    Concerns with Pay‐As‐You‐Go in the Cloud
  Evolving Monitoring for Hybrid IT
    Focusing on the EUE
    Monitoring the Application Stack
    Keeping an Eye on the Budget
  Coming Up Next…
Chapter 3: The Customer Is King: Monitoring the End User Experience
  Why the EUE Matters
    Business‐Level Metric
    Tied to User Perceptions
  Challenges as You Evolve to Hybrid IT
    Geographic Distribution
    Deep, Distributed Application Stacks
  Techniques for Monitoring the EUE
    Platform‐Level APIs
    Data from Providers
    Distributed Monitoring Agents
    Click‐to‐Click Monitoring
  Why We Often Don’t Monitor EUE Today
    Complexity
    Lack of Tools
    Cost
    Component‐Level Monitoring Can Be “Close Enough”
  Why We Must Monitor EUE Going Forward
    Vastly More Complex Environments
    Business‐ and Perception‐Level Focus
    Too Much Is Out of Your Control
  The Provider Perspective: You Want Your Customers Measuring the EUE
    The Provider Isn’t 100% Responsible for Performance
    You Gain a Competitive Advantage
  Coming Up Next…
Chapter 4: Success Is in the Details: Monitoring at the Component Level
  Traditional, Multi‐Tool Monitoring
    Client Layer
    Network Layer
    Application and Database Layers
    Other Concerns
  Multi‐Discipline Monitoring and Troubleshooting
    Applications Are Not the Sum of Their Parts
    Tossing Problems Over the Fence: Troubleshooting Challenges
  Integrated, Bottom‐Up Monitoring
    Monitoring Performance Across the Entire Stack
    Integrated Troubleshooting Saves Time and Effort
  The Provider Perspective: Providing Details on Your Stack
  Coming Up Next…
Chapter 5: The Capabilities You Need to Monitor IT from the Data Center into the Cloud
  Business Goals for Evolved Monitoring
    EUE and SLAs
    Budget Control
  Technology Goals for Evolved Monitoring
    Centralized Bottom‐Up Monitoring
    Improved Troubleshooting
  A Shopping List for Evolved Monitoring
    High‐Level Consoles
    Domain‐Specific Drilldown
    Performance Thresholds
    Broad Technology Support: Virtualization, Applications, Servers, Databases, and Networks
    End‐User Response Monitoring
    SLA Reporting
    Public Cloud Support: IaaS, PaaS, SaaS
  The Provider Perspective: Capabilities for Your Customers
  Coming Up Next…
Chapter 6: IT Health: Management Reporting as a Service
  The Value of Management Reporting
    Business Value
    Technology Value
  Reporting Elements
    Performance Reports
    SLA Reports
    Dashboards
  The Provider Perspective: Reports for Your Customers
  Conclusion

Copyright Statement
© 2010 Realtime Publishers. All rights reserved. This site contains materials that have
been created, developed, or commissioned by, and published with the permission of,
Realtime Publishers (the “Materials”) and this site and any such Materials are protected
by international copyright and trademark laws.
THE MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE,
TITLE AND NON-INFRINGEMENT. The Materials are subject to change without notice
and do not represent a commitment on the part of Realtime Publishers or its web site
sponsors. In no event shall Realtime Publishers or its web site sponsors be held liable for
technical or editorial errors or omissions contained in the Materials, including without
limitation, for any direct, indirect, incidental, special, exemplary or consequential
damages whatsoever resulting from the use of any information contained in the Materials.
The Materials (including but not limited to the text, images, audio, and/or video) may not
be copied, reproduced, republished, uploaded, posted, transmitted, or distributed in any
way, in whole or in part, except that one copy may be downloaded for your personal, non-
commercial use on a single computer. In connection with such use, you may not modify
or obscure any copyright or other proprietary notice.
The Materials may contain trademarks, service marks and logos that are the property of
third parties. You are not permitted to use these trademarks, service marks or logos
without prior written consent of such third parties.
Realtime Publishers and the Realtime Publishers logo are registered in the US Patent &
Trademark Office. All other product or service names are the property of their respective
owners.
If you have any questions about these terms, or if you would like information about
licensing materials from Realtime Publishers, please contact us via email at
info@realtimepublishers.com.

Chapter 1: Evolving IT—Data Centers,
Virtual Environments, and the Cloud
In the beginning, data centers were giant buildings housing a single, vacuum tube‐driven
computer, tended to by people in white lab coats whose main job was changing the tubes as
they burned out. Today’s data centers are so much more complicated that it’s like a
completely different industry: We not only have dozens or hundreds or even thousands of
servers to worry about, but now we’re starting to outsource specific services—like email,
spam filtering, or customer relationship management (CRM)—to Web‐based companies
selling “Software as a Service (SaaS)” in “the cloud.” How do we manage it all, to ensure that
all of our IT assets are delivering the performance and service that our businesses need?
Evolving IT
Every decade or so, the IT industry pokes its toes into the waters of a new way of
computing. I’m not talking specifically about the revolving thin client/thick client
computing model that comes and goes every few years; I’m talking about major paradigm
shifts that take place because of radical new technologies and concepts. Shifts that
permanently change the way we do business. In some cases, these shifts can resemble past
IT techniques and concepts, although there are always crucial differences as we move
forward. This is how IT evolves from one state to another, and it’s often difficult and
complex for the human beings in IT to keep up.
Remember When IT Was “Easy?”
I started in IT almost two decades ago—that’s several lifetimes in technology years. When I
started, we had relatively simple lives—my first IT department didn’t even have a local
area network (LAN). Instead, our standalone computers connected directly to an AS/400
located in the data center, and that was really our only server. IT was incredibly easy back
then: Everything took place on the mainframe. We didn’t worry about imaging our client
computers because we ultimately didn’t care very much about them. Security was simple
because all our resources were located on one big machine, and the only connections to it
were basically video screen and keyboard feeds. Monitoring performance was incredibly
straightforward: We called up an AS/400 screen (I think the command was WRKACTJOB, for
“work with active jobs”) and looked at every single IT process we had in a single place. We could
bump the priority on important jobs or depress the priority on a long‐running job that was
consuming too many cycles.
Ah, nostalgia.
Distributed Computing: Flexible, But Tough to Manage
We soon made the move into distributed computing. Before long, we had dozens of Novell
NetWare servers and Windows NT servers in our expanding data center. Our computers
were connected by blazing‐fast Token Ring networks. We shifted mail off our AS/400 onto
an Exchange Server. For the first time, our IT processes were starting to live on more and
more independent machines, and monitoring them—well, we didn’t actually monitor them.
If things were a bit slow, there wasn’t much we could do about it. I mean, the network was
16Mbps and the processors were Pentiums. “Slow” was kind of expected. And, at the time,
the best performance tool we had was Windows’ own Performance Monitor, which wasn’t
exactly a high‐level tool for managing anything like Service Level Agreements (SLAs). Our
basic SLA was, “If it breaks, yell a lot and we’ll get right on it. We have a pager.”
That’s the same basic computing model that we all use today: Bunches of servers in the
data center, connected by networks—100Mbps or better Ethernet, thankfully, rather than
Token Ring—and client computers that we have to spend a significant amount of time
managing. Gone are the days of applications that ran entirely on the mainframe; now we
have multi‐tier applications that run on our clients, on mid‐tier servers, and in back‐end
databases. Even our “thin client” Web apps are often multi‐tier, with Web servers,
application servers, and database servers participating.
We’re also more sophisticated about management. Companies today can use tools that
monitor each and every aspect of a service. For example, some tools can be taught to
recognize the various components (middle‐tier, back‐end, and so forth) that make up a
given application. As Figure 1.1 shows, they can monitor each aspect of the application, and
let us know when one or more elements are impacting delivery of that application’s
services to our end users.

Figure 1.1: Monitoring elements in an application or service.
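To make the idea concrete, here is a minimal Python sketch of how such a tool might model an application as a set of tiered components and flag the ones that are affecting delivery. The component names, metrics, and thresholds are hypothetical and chosen only for illustration; real monitoring products define their own models.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One monitored element of an application (all values hypothetical)."""
    name: str
    tier: str            # e.g. "web", "middle-tier", "back-end"
    response_ms: float   # latest measured response time
    threshold_ms: float  # response time considered acceptable


def impacting_components(components):
    """Return the components whose response time exceeds their threshold."""
    return [c for c in components if c.response_ms > c.threshold_ms]


def service_status(components):
    """Summarize the application: healthy only if every component is."""
    slow = impacting_components(components)
    if not slow:
        return "healthy"
    names = ", ".join(f"{c.name} ({c.tier})" for c in slow)
    return f"degraded: {names}"


# Hypothetical order management application with three tiers.
order_app = [
    Component("web-01", "web", response_ms=40, threshold_ms=200),
    Component("app-01", "middle-tier", response_ms=850, threshold_ms=500),
    Component("db-01", "back-end", response_ms=120, threshold_ms=300),
]

print(service_status(order_app))
```

In this sketch, only the middle‐tier server is over its threshold, so the application as a whole is reported as degraded and the offending element is named.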
We’ve mastered distributed computing, and we have the means to monitor and manage the
distributed elements quite effectively. To be sure, not every company employs these
methods, but they’re certainly available. So what’s next?
Super‐Distributed Computing: Massively Flexible, Impossible to Manage?
The common theme behind all of today’s distributed elements is that they live in our data
centers. Location, however, isn’t as important as what our own data centers provide us—
absolute control. For every server in our data center, we’re free to install management
agents, monitor network traffic, and even stick thermal sensors into our servers if we want
to. They’re our machines, and we can do anything with them that, from a corporate
perspective, we want to.
But we’re starting to move outside of our own data centers. What marketing folks like to
broadly call “the cloud” is offering a variety of services that live in someone else’s data
center. For example—and to define a few terms—we can now choose from:
• Hosted services, such as hosted Exchange Server or hosted SharePoint Server. In
most cases, these are the same technologies we could host in our own data center,
but we’ve chosen to let someone else invest in the infrastructure and to bear the
headache of things like patching, maintenance, and backups.
• Software as a Service, or SaaS, such as the popular Salesforce.com or Google Apps.
Here, we’re paying for access to software, typically Web‐based, that runs in
someone else’s data center. We have no clue how many servers are sitting behind
the application and we don’t care—we’re just paying for access to the application
and the services it provides. Typically, these are applications that we couldn’t host
within our own data center even if we wanted to, although they compete
with on‐premises solutions that provide the same kind of services.
• Cloud computing, which, from a strict viewpoint, doesn’t include either of the
previous two models. Cloud computing is a real computing platform where we
install our own applications, often with our own data in a back‐end database, to be
run on someone else’s computers. Cloud computing is designed to offer an “elastic”
computing environment, where more computing resources can be engaged to run
our application based on our demand. Cloud apps are more easily distributed
geographically, too, making them more readily available to users all over the world.
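As a rough illustration of what “elastic” means in the last item, the following Python sketch computes how many server instances a provider might engage for a given level of demand. The per‐instance capacity and the instance limits are assumptions made up for this example; actual cloud platforms apply their own scaling rules.

```python
import math


def instances_needed(requests_per_sec, capacity_per_instance,
                     min_instances=1, max_instances=20):
    """Return how many instances are needed to serve the current demand.

    All parameters are hypothetical: capacity_per_instance is the number
    of requests per second one instance is assumed to handle.
    """
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    # Stay within the (assumed) floor and ceiling for the deployment.
    return max(min_instances, min(needed, max_instances))


# Demand rises through the day, then falls off again overnight.
for demand in (50, 400, 1250, 90):
    print(demand, "req/s ->", instances_needed(demand, 100), "instance(s)")
```

The point of the sketch is that capacity follows demand in both directions: instances are added as load grows and released when it drops.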
All of these services are provided to us by a company of some kind, which we might
variously call a hosting provider or even a managed service provider (MSP). Ultimately, this
is still the same distributed computing model we’ve known and loved for a decade or more.
We’re just moving some elements out of our own direct control and into someone else’s,
and often using the Internet as an extension to our own private networks. This new model
is increasingly being referred to as hybrid IT, meaning a hybrid of traditional
distributed computing and this new, super‐distributed model that includes
outsourced services as a core part of our IT portfolio.
But there’s the key phrase: Out of our own direct control. Without control over the servers
running these outsourced services, how can we manage them? We can’t exactly install our
own management agents on someone else’s computers, can we? And for that matter, do we
really need to monitor performance of these outsourced services? After all, isn’t that what
the providers’ SLAs are for—ensuring that we get the performance we need? These are all
questions we need to consider very carefully—and that’s exactly what we’ll be doing in this
chapter and throughout the rest of this book.
Three Perspectives in IT
The world of hybrid IT consists of three major viewpoints: the IT end user, or the person
who is the ultimate consumer of whatever technology services your company has; the IT
department, tasked with implementing and maintaining those services on behalf of the end
user; and the IT service provider, which is the external company that provides some of
your IT services to you. It’s important to understand the goals and priorities of each of
these viewpoints, because as you move more toward a hybrid IT model, you’ll find that
some of those priorities tend to shift around and change their importance.
The IT End User
The IT end user, ultimately, cares about getting their job done. They’re the ones on the
phone telling their customers, “sorry, the computer is really slow today.” They’re the ones
who don’t ultimately care about the technology very much, except as a means of
accomplishing their jobs.
Here’s something important:
The IT end user has the most important perspective in the entire world of business
technology because without the end user’s need to accomplish their job, nobody in
IT has a job.
I’m going to make that statement a manifesto for this book. In fact, I’ll introduce you to an
IT end user (whose name and company name have been changed for this book) whom I’ve
interviewed. You’ll meet him a few times throughout this book, and I’ll follow his progress
as his IT department shifts him over to using hybridized IT services.
Ernesto is an inside sales manager for World Coffee, a gourmet coffee
wholesaler. Ernesto’s job is to keep coffee products flowing to the various
independent outlets who resell his company’s products. Like most users,
Ernesto consumes some basic IT services, including file storage, email, and
so on. He also interacts with a CRM application that his company owns, and
he uses an in‐house order management application. Ernesto works on a
team of more than 600 salespeople who are distributed across the globe: His
company sells products to resellers in 46 countries and has sales offices in
12 of those countries.

Ernesto’s biggest concern is the speed of the CRM and order management
applications. He literally spends three‐quarters of his day using these
applications, and much of his time today is spent waiting on them to process
his input and serve up the next data‐entry screen. His own admittedly
informal measurements suggest that about one‐third of that time—just
under 2 hours a day—is spent waiting on the computer. He knows exactly
how much he generates in sales every hour, and since he’s paid mainly on
commission, he knows that those 9 hours a week—almost a quarter of his
work time—are costing him dearly.

He complains and complains to his IT department, as does everyone else, but
feels that it’s mostly falling on deaf ears. The IT guys don’t seem to be able to
make things go any faster. There’s talk now of outsourcing some of the
applications Ernesto uses, such as the CRM application. Ernesto just hopes it
doesn’t run any slower—he can’t afford it.

If you work in IT, you know how common a scenario that is. Nothing’s ever fast enough for
our users, and it can be incredibly difficult to nail down the exact cause of performance
problems—so we tend to file them all in the “like to help you, but can’t, really” folder and
go on with our other projects. It’ll be interesting to see this same situation from the
perspective of Ernesto’s IT department.
The IT Department
The IT department, on paper, cares about supporting their users. You and I both know,
however, that what IT really cares about is technology. Tell us that “email is slow” and
we’re less interested because that’s a big, broad topic. We need to narrow it down to “the
network is experiencing high packet loss” or “the email server’s processor is at 90%
utilization all the time” before we can start to solve the problem. We think in those
technical terms because we’re paid to; interfacing with end users, who typically can’t
provide anything more definitive than “it seems slower than yesterday,” can be challenging.
And so we create SLAs. Typically, however, those SLAs are not performance‐based but
rather are availability‐based. “We promise to provide 99% uptime for the messaging
application, and to respond within 2 hours and correct the problem within 8 hours when
the application goes down.” That means we can have up to 87.6 hours of downtime a year
(1% of 8,760 hours, or more than two full work weeks) and still meet our SLA! 99% sounded good, though, and hopefully
nobody will think to do the math. But we still haven’t addressed slow messaging
performance because it’s difficult to measure. What do we measure? How long it takes to
send a message? How long it takes to open a message? What’s good performance—5
seconds to open a message? Honestly, if you’ve ever had to actually wait that long, you
were already drumming your fingers on the mouse. A second? That seems like a tough goal
to hit. And how do you even measure that? Go to a user’s computer, click “Open,” and start
counting, “one one‐thousand, two one‐thousand, three one‐thou… oh, there, it’s open.
That’s about two and a half seconds.”
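The downtime arithmetic behind an availability promise is easy to check for yourself. Here is a minimal sketch in Python, assuming a 365-day year measured around the clock; the function name `allowed_downtime_hours` is purely illustrative and not part of any monitoring product.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap, around-the-clock year


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime that an uptime percentage still permits."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # Compare a few common availability promises side by side.
    for pct in (99.0, 99.9, 99.97):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% uptime allows {hours:.1f} hours of downtime a year")
```

Running the numbers this way before signing an SLA makes it harder for a comfortable-sounding percentage to hide weeks of permitted outage.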

Instead we tend to measure performance in terms of technical things that we can
accurately touch and measure: Network consumption, processor utilization, memory
utilization, internal message queue lengths, and so on. Nothing the end user cares about,
and nothing we can really map to an end‐user expectation—how does a longer message
queue or higher processor consumption impact the time it takes to open a message?—but
they’re things we can see and take action on, if necessary.
John works for World Coffee’s IT department, and is in charge of several
important applications that the company relies upon—including the CRM
application and the in‐house order management application.

John has set up extensive monitoring to help manage the IT department’s
SLAs for these applications. They’ve been able to maintain 99.97%
availability for both applications, a fact John is justifiably proud of. The
monitoring includes several front‐end application servers, some middle‐tier
servers, and a couple of large back‐end databases—one of which replicates
data to two other database servers in other cities. John primarily monitors
key metrics for each server, such as processor and memory utilization, and
he monitors response times for database transactions. He also has to
monitor replication latency between the three database servers. Generally
speaking, all of those performance numbers look good. As an end‐point
metric, he also monitors network utilization between the front‐end servers
and the client applications on the network. He doesn’t panic until that
utilization starts to hit 80% or so, which it rarely does. When it does, he’s
automatically alerted by the monitoring solution, so he feels like he has a
pretty good handle on performance.

The company’s users complain about performance, of course, but the client
application has always run fine on John’s own client computer, so he figures
the users are just being users.

The company plans to start moving the CRM application to an outsourced
vendor, probably using a SaaS solution. They also plan to move the in‐house
order management application into a cloud computing platform, which
should make it easier to access from around the world, and help ensure that
there are always computing resources available to the application as the
company grows. John is relieved because it’ll mean all this performance
management stuff will be out of his hands. He just needs to make sure they
get a good SLA from the hosting providers, and he can sit back and relax at
last.

The IT Service Provider
As we start to move to a world of hybridized IT, it’s also important to consider the
perspective of the IT service provider—the person responsible for whatever IT services are
being hosted “in the cloud.” These folks have a unique perspective: In one way, they’re like
an IT department, because they have to manage a data center, monitor performance, patch
servers, and do everything else a standard IT department would do. Their “customers,”
however, aren’t internal users employed by the same company. They don’t get to use the
term customers in the same touchy‐feely, but ultimately meaningless, way that standard IT
departments do. An IT service provider’s customers are real customers, who pay real
money for services—and when someone pays money for something, they expect a level of
service to be met. So service providers’ SLAs are much more serious, legally‐binding
contracts that often come with real, financial penalties if they’re not met.
Service providers are also in the unusual position of having to expose some of their IT
infrastructure to their customers. In a normal IT department, the end users—or
“customers,” if you like—don’t usually care about technology metrics. End users don’t care
about processor utilization, and might not even know what a “good” utilization figure is.
With a service provider, however, the customer is an IT department, and they know exactly
what some of those technology metrics mean—and they may want to know what they are
from moment to moment. At the very least, a service provider’s customers want to see
metrics that correspond to the provider’s SLA, such as metrics related to uptime,
bandwidth used, and so forth.
Li works for New Earth Services, a cloud computing provider. Li is in charge
of their network infrastructure and computing platform, and is working with
World Coffee, who plans to shift their existing Web services‐based order
management application into New Earth’s cloud computing platform.

Li knows that he’ll have to provide statistics to World Coffee’s IT department
regarding New Earth’s platform availability, because that availability is
guaranteed in the SLA between the two companies. However, Li is worried
because he knows most of World Coffee’s end users already think their
order management application is slow. He knows that, once the application
is in the cloud, those “slow” complaints will start coming across his desk. He
needs to be able to prove that his infrastructure and platform are
performing well so that World Coffee can’t pin the blame for slowness on
him. He knows, too, that he needs to be able to provide that proof in some
regular, automated way so that World Coffee has something they can look at
on their own to see that the New Earth platform is running efficiently. He
knows his customers aren’t asking for that kind of detail yet—but he knows
they will be, and he doesn’t yet know how he’s going to provide it.

IT Concerns and Expectations
With those three perspectives in mind, let’s look at some of the specific concerns and
expectations that each of those three audiences tends to have. This is a way of summarizing
and formalizing the most important points from each perspective so that we can start to
think of ways to meet each specific expectation and to address each specific concern. Think
of these as our “checklists” for a more evolved, hybrid IT computing model.
IT End Users
As I stated previously, IT end users ultimately care about getting their jobs done. That
means:
• They expect their applications to respond more or less immediately. They may
accept slower responses, but the expectation is that everything they need comes up
pretty much instantly.
• They expect applications to be available and stable pretty much all the time. This is
often referred to as dial tone availability, because one of the most‐reliable consumer
services of the past century was the dial tone from your land telephone—which
typically worked even if your home’s power was out.
And you know what? That’s about it. Users don’t tend to have complex expectations—they
just want everything to be immediate, all the time. That may not always be reasonable, but
it’s certainly straightforward.
IT departments, as a rule, have never done much to manage this expectation, which is why
many end users have a poor perception of their IT department. IT has, in fact, found it to be
very difficult to even formally define any alternate expectations that they could present to
their users.
IT Departments
IT departments tend to have availability as their first concern. Performance is important,
but it’s often somewhat secondary to just making certain a particular service is up and
running at all. In fact, one of the main reasons we monitor performance at all is because
certain performance trends allow us to catch a service before it goes down—not necessarily
before it becomes unacceptably slow, but before it becomes completely unavailable. We
also tend to monitor technology directly, meaning we’re looking at processor utilization,
network utilization, and so on. So you can summarize the IT department’s concerns and
expectations as follows:
• They want to be able to manage technology‐level metrics, such as resource
utilization, across servers.
• They want to be able to map raw performance data to thresholds that indicate the
health of a particular service—such as knowing that “75% processor utilization” on
a messaging server really means “still working, but approaching a bad situation.”
• They want to be able to track performance data and develop trends that help predict
growth.
• They want to be able to track and manage uptime and other metrics so that they can
comply with, and report on their compliance with, internal SLAs.
• They typically want to be able to track all the low‐level metrics associated with a
service. For example, messaging may depend on a server, the underlying network,
and infrastructure services such as a directory, name resolution, and so on, as well
as infrastructure components such as routers, switches, and firewalls.
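As a rough illustration of the threshold mapping described in the list above, the sketch below turns a raw utilization figure into a health label. The 75% and 90% thresholds and the names `Health` and `classify_utilization` are assumptions chosen for the example; real thresholds depend on the service being monitored.

```python
from enum import Enum


class Health(Enum):
    OK = "working normally"
    WARNING = "still working, but approaching a bad situation"
    CRITICAL = "at risk of becoming unavailable"


def classify_utilization(percent: float,
                         warning_at: float = 75.0,
                         critical_at: float = 90.0) -> Health:
    """Map a raw utilization percentage to a health state using two thresholds."""
    if percent >= critical_at:
        return Health.CRITICAL
    if percent >= warning_at:
        return Health.WARNING
    return Health.OK


if __name__ == "__main__":
    # Hypothetical processor readings from a messaging server.
    for reading in (40.0, 75.0, 93.5):
        state = classify_utilization(reading)
        print(f"{reading}% processor utilization: {state.value}")
```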
IT departments, in other words, are end‐to‐end data fiends. A good IT department—in
today’s world, at least—wants to be able to track detailed performance numbers on each
and every element of their data center, right out to the network cards in client computers,
although they typically stop short of trying to track any kind of performance on client
computers. The theory is that if everything inside the data center is running acceptably,
then any lack of performance at the client computer is the client computer’s fault.
IT Service Providers
IT service providers have, as I’ve stated already, a kind of hybrid perspective. They need to
have the same concerns as any IT department, but they have additional concerns because
their customers—other IT departments—are technically savvy and spending real money
for the services being provided. So in addition to the concerns of an IT department, a
service provider has these concerns and expectations:
• They need to be able to provide performance and health information about their
infrastructure to their customers.
• In many cases, slow performance at the customer end may be due to elements on
the customer’s network, which is out of the service provider’s control. Service
providers need to be able to quantify performance of their infrastructure so that
they can defend themselves against performance accusations from their customers.
• They need to be able to prove, in a legally‐defensible fashion, their compliance with
the SLAs between themselves and their customers.
• They need to be able to communicate certain health and performance information to
their customers so that customers have some visibility into what they’re paying for.
It’s actually kind of unfair to service providers, in a way. Most IT departments would never
be expected to provide, to their own end users, the kind of metrics that the IT department
expects from their own service providers.
Business Drivers for the Hybrid, Super‐Distributed IT Environment
Let’s shift gears for a moment. So far, we’ve talked mainly about the perspectives and
expectations of various IT‐centric audiences. As we move into a hybrid IT environment,
with some services hosted in our own data center and others outsourced to various
providers, meeting those expectations can become increasingly complex and difficult.
But what does the business get out of it? IT concerns aside, why are businesses driving us
toward a hybrid IT model? I can assure you that if the business didn’t have some vested
interest in it, we wouldn’t be doing it; outsourcing services is never completely free, so the
business has to have some kind of ulterior motive. What is it?
Increased Flexibility
Flexibility is one of the big drivers. Let me offer you a story from my own experience from
around 2000, when the Internet was certainly big but nothing called “cloud computing”
was really in anyone’s mind.
Craftopia.com (now a part of Home Shopping Network) was a small arts and
crafts e‐tailer based in the suburbs of Philadelphia, PA. The company’s
brand‐new infrastructure consisted of two Web servers and a database
server (which, in a pinch, could be a third Web server for the company’s
site), hosted in an America Online‐owned data center (with tons of available
bandwidth). The company generally saw fewer than 1,000 simultaneous hits
on the Web site, and their small server farm was more than up to that task.

One day, the IT department—all four people, including the CTO—was
informed that the company was up for a feature segment on the Oprah
television show. Everyone gulped because they knew Oprah could generate
the kinds of hits that would melt their little server farm, even though that
level of traffic would likely only last for a few days or even hours. If the
servers could manage to stay up, they might pull in a lot of extra sales, but
not enough to justify adding the dozen or so servers needed to meet the
demand. Especially since that surge in demand would be so short. The
servers didn’t manage to stay up: It was a constant battle to restart them
after they’d crash, and it made for a long few days.

Had all this taken place in 2010, the company could simply have put its Web
site onto a cloud computing platform. The purpose of those platforms is to
offer near‐infinite, on‐demand expansion, with no up‐front infrastructure
investment. You simply pay as you go. Sure, the Oprah Surge would have
cost more—but it would presumably have resulted in a compensating
increase in sales, too. Once the Surge was over, the company would simply
be paying less for their site hosting, since the site would be consuming fewer
resources again. There wouldn’t be any “extra servers” sitting around idle,
because from the company’s perspective, there weren’t any servers at all. It
was all just a big cloud.

That’s the exact argument for cloud computing, as well as for SaaS and even hosted
services: Expand as much as you need without having to invest in any infrastructure. If you’ve
grown just beyond one server, you don’t have to buy another whole server—which will sit
around mostly idle—just to add a tiny bit of extra capacity. The hosting provider takes care
of it, adding just what you need.
Faster Time‐to‐Market
Today’s businesses need to move faster and faster and faster, all the time. It used to be that
taking a year or more to bring a new product or service to market was fast enough; today,
product life cycles move in weeks and months. If you need to spin up a new CRM
application in order to provide better customer service, you need it now, not in 8 months.
With hosted services and SaaS, you can have new services and capabilities in minutes. After
all, the provider has already created the application and supporting infrastructure; you just
need to pay and start using it. This additional flexibility—the ability to add new services to
your company’s toolset with practically zero capital investment and zero notice—is
proving invaluable to many companies. They no longer have to figure out how their
already‐overburdened IT department will find the time to deploy a new solution; they
simply provide a purchase order and “turn on” the new solution as easily as flipping a light
switch.
Pay As You Go
Massive capital investment is something that companies have long associated with IT
projects. Roll out a major solution like a new Enterprise Resource Planning (ERP) or CRM
application, and you’re looking at new servers, new network components, new software
licenses, and more. It’s an expensive proposition, and in many cases, you’re investing in
capacity that you won’t be using immediately. In fact, a quick survey of some of my industry
contacts suggests that most data centers use about 40 to 50% of their total server capacity.
That means companies are paying fully double what they need simply because we all know
you have to leave a little extra room for growth. You want Exchange Server, and you have
500 users, but think you’ll have 1500 within 3 years? Well, then you spend for 1500.
That’s why the “pay as you go” model offered by service providers is so attractive. If you
need 500 mailboxes today, you pay for 500. When you need 501, you pay for 501. It’s
possible that what you eventually pay for all 1500 will cost more than if you were hosting
the service in your own data center, but the point is that you didn’t have to pay for all 1500
all along. If you were wrong about your growth, and only needed 1000 mailboxes, then
you’re not paying for the excess one‐third capacity. Pay as you go means you don’t have to
plan as much, or as accurately, and you’re less likely to pay a surcharge for overestimating.
Pay as you go lets you get started quickly, with less up‐front investment.
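The mailbox example can be turned into a quick comparison. The sketch below is illustrative only: the prices are invented, the growth is assumed to be linear, and the up-front figure ignores operating costs, so treat it as a way to frame the question rather than as a real cost model. All function names are hypothetical.

```python
def upfront_cost(planned_mailboxes: int, cost_per_mailbox: float) -> float:
    """Cost of buying capacity for the full forecast on day one."""
    return planned_mailboxes * cost_per_mailbox


def pay_as_you_go_cost(mailboxes_by_month: list[int],
                       monthly_fee_per_mailbox: float) -> float:
    """Total fees when each month is billed for the mailboxes actually in use."""
    return sum(count * monthly_fee_per_mailbox for count in mailboxes_by_month)


def linear_growth(start: int, end: int, months: int) -> list[int]:
    """Mailbox counts growing in equal steps from start to end over the months."""
    if months < 2:
        return [end]
    step = (end - start) / (months - 1)
    return [round(start + step * month) for month in range(months)]


if __name__ == "__main__":
    # Assumed figures: planned for 1500 mailboxes, actually grew from 500 to 1000.
    usage = linear_growth(500, 1000, 36)
    owned = upfront_cost(1500, cost_per_mailbox=100.0)
    hosted = pay_as_you_go_cost(usage, monthly_fee_per_mailbox=2.0)
    print(f"Up-front purchase for 1500 mailboxes: {owned:,.0f}")
    print(f"Pay as you go over 36 months:         {hosted:,.0f}")
```

Changing the assumed fee or the growth curve shows how quickly the comparison can swing in either direction, which is the point made above: pay as you go may cost more in total, but it never bills you for a forecast that did not come true.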
Business Goals and Challenges for the Hybrid IT Environment
If there are business‐level drivers for hybrid IT, there are certainly business‐level
challenges to go with them. Remember, we’re talking about the business here, rather than
specific IT concerns. These are the things that a business executive will be concerned with.
Centralizing Management Information
One major concern is where management information will come from. Today, many
businesses are already getting IT management information from too many separate places
and tools. Managers are often forced to look at one set of reports for Microsoft portions of
the environment, for example, and a separate set for the Unix‐ or Linux‐based portions.
When some services move out of the data center and into “the cloud,” the problem becomes
even more complex. In some cases, there’s a concern about whether management
information will even be available for the outsourced services; at the very least, there’s an
expectation that the outsourced services will be yet another set of reports.
What kind of management reports are we talking about? Availability, at one level, which is
a high‐level metric but is still important to know. Managers need to know that they’re
getting what they paid for, and that includes the availability of in‐house services as well as
outsourced ones.
At another level, consumption is important. Some companies may need to allocate service
costs—whether internal or external—across business units or divisions. In other cases,
managers need to see consumption levels in order to plan for growth and the
accompanying expenses. A definite business goal is to get all this information in one place,
regardless of whether a particular service is hosted in‐house or on a provider’s network.
Redefining “Service Level”
Businesses really need to redefine their top‐level SLAs. Rather than worrying so much
about uptime—which seems to be the primary focus of most of today’s SLAs—businesses
should manage to the end‐user experience (EUE). In other words, regardless of the service’s
basic availability, how is it performing from the perspective of the end user? If end users
are spending half their time waiting for a computer to respond, the company is potentially
wasting a lot of money on that service, regardless of where it’s hosted.
This sounds complicated, but that’s only because I—and probably you, since you’re an IT
person—tend to start thinking about the underlying technology. “Do we start guaranteeing
a transaction processing time in the database? Do we guarantee a certain network
bandwidth availability?” Nope. We guarantee a specific EUE. For example, “When you
search for a customer by name, you will receive a first page of results within 3 seconds.”
You need to identify key tasks or transactions as seen from the end user perspective, and
write an SLA that sets a goal for a specific time to complete that task or transaction from the
end user’s perspective.
If you’re not able to meet that SLA, then you dive into technology metrics like network
bandwidth, processor utilization, and database response times; the end metric that you
drive to is what the end user actually experiences on their desktop. That may sound
impossible to even measure, let alone guarantee, but as you move into a hybrid IT
environment, it’s absolutely essential—and most of the rest of this book will talk about
how you’ll actually achieve it.
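To make the idea of a task-based SLA concrete, here is a minimal sketch that times one user-visible task and compares it with a target. The `fake_customer_search` function is a hypothetical stand-in for a real "search for a customer by name" transaction, and the 3-second target is taken from the example above; nothing here reflects how any particular monitoring tool does it.

```python
import time
from typing import Callable


def measure_seconds(task: Callable[[], object]) -> float:
    """Run a task once and return how long it took in wall-clock seconds."""
    start = time.perf_counter()
    task()
    return time.perf_counter() - start


def meets_target(elapsed_seconds: float, target_seconds: float) -> bool:
    """True if the measured time is within the SLA target."""
    return elapsed_seconds <= target_seconds


def fake_customer_search() -> list[str]:
    """Hypothetical stand-in for a real customer search transaction."""
    time.sleep(0.05)  # simulate the application doing some work
    return ["Example Customer"]


if __name__ == "__main__":
    elapsed = measure_seconds(fake_customer_search)
    status = "within" if meets_target(elapsed, 3.0) else "over"
    print(f"Customer search took {elapsed:.2f}s, {status} the 3-second target")
```

The design choice worth noticing is that the measurement wraps the whole task as the user sees it, not a database query or a network hop inside it.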
Gaining Insight
IT departments—and thus, the business—have the option to get as much insight as they
need into their existing data centers. That is, plenty of tools and techniques exist, although
not every business chooses to utilize them. Going forward, businesses are going to have to
have deep insight into the technology assets, because that insight is going to be the only
way to achieve that EUE‐based SLA that businesses need to establish.
Hybrid IT makes this vastly more complicated to achieve. If your EUE isn’t where you want
it to be with a cloud‐based application, where do you start looking for the problem? Is it the
Internet connection between your Taiwan office and the Paris‐based cloud provider data
center? Is it processing time within your cloud‐based application? Is it response time
between the cloud‐based application server and the cloud‐based back‐end database? Or is
it network latency within the Taiwan office itself? The term hybrid IT is an apt one because
you’re never truly outsourcing the entire set of elements that comprise an IT service: Some
elements will always remain under your control, while other elements—like the public
Internet—may be out of the control of both you and your service provider. You’re going to
need tools that can give you insight into every aspect so that you can spot the problem and
either solve it or adjust your EUE expectations accordingly.
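One simple way to think about "where do you start looking" is to break a slow transaction into the time spent in each segment and see which one dominates. The sketch below assumes you already have per-segment timings from some tool; the segment names and numbers are invented for illustration, and `slowest_segment` is a hypothetical helper.

```python
def slowest_segment(segment_seconds: dict[str, float]) -> tuple[str, float]:
    """Return the segment with the largest time and its share of the total."""
    if not segment_seconds:
        raise ValueError("at least one segment timing is required")
    total = sum(segment_seconds.values())
    name = max(segment_seconds, key=lambda key: segment_seconds[key])
    share = segment_seconds[name] / total if total > 0 else 0.0
    return name, share


if __name__ == "__main__":
    # Invented timings for one slow transaction, in seconds.
    timings = {
        "office network": 0.2,
        "Internet path to provider": 2.4,
        "application processing": 0.9,
        "application to database": 0.5,
    }
    name, share = slowest_segment(timings)
    print(f"Largest contributor: {name} ({share:.0%} of total time)")
```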
Maintaining Responsibility
Here’s another major business challenge in hybrid IT: It’s still your business. Let’s consider a
brief section from Amazon’s EC2 SLA (you can read the entire thing at
http://aws.amazon.com/ec2‐sla/):
AWS will use commercially reasonable efforts to make Amazon EC2 available with
an Annual Uptime Percentage (defined below) of at least 99.95% during the Service
Year. In the event Amazon EC2 does not meet the Annual Uptime Percentage
commitment, you will be eligible to receive a Service Credit as described below.
They define a year as 365 days of 24‐hour days, meaning the service can be unavailable for
up to about 4.4 hours a year (0.05% of 8,760 hours). However, if they don’t meet that SLA, you’re only entitled to a
service credit:
If the Annual Uptime Percentage for a customer drops below 99.95% for the Service
Year, that customer is eligible to receive a Service Credit equal to 10% of their bill
(excluding one‐time payments made for Reserved Instances) for the Eligible Credit
Period. To file a claim, a customer does not have to wait 365 days from the day
they started using the service or 365 days from their last successful claim. A
customer can file a claim any time their Annual Uptime Percentage over the trailing
365 days drops below 99.95%.
That means you can’t even file a claim until you’ve had more than about 4.4 hours of outage in a
365‐day period. If you do file a claim, you’re eligible to receive a credit—not a refund—of
up to 10% of your bill. I’m not trying to pick on Amazon.com, either, because most service
providers in this part of the industry have very similar SLAs.
My point, rather, is that the SLA is not protecting your business. If you have a mission‐
critical application hosted by a service provider, and that application goes down, your
business is impacted. You’re losing money—potentially tens of thousands of dollars an hour,
depending on what service is impacted. The SLA is never going to pay for that damage; at
best, it’s going to refund or credit a portion of your provider fees, and that’s all.
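To see how far apart the credit and the damage can be, here is a small sketch based on the terms quoted above: a 99.95% commitment over a trailing 365 days and a credit of 10% of the bill. The bill amount and the hourly loss figure are invented, and the function names are hypothetical; check your provider's actual SLA for the real terms.

```python
HOURS_PER_YEAR = 365 * 24  # trailing 365 days of 24-hour days


def uptime_percent(outage_hours: float,
                   period_hours: float = HOURS_PER_YEAR) -> float:
    """Uptime over the period, as a percentage."""
    return 100 * (period_hours - outage_hours) / period_hours


def service_credit(bill: float, outage_hours: float,
                   committed_percent: float = 99.95,
                   credit_rate: float = 0.10) -> float:
    """Credit owed if trailing uptime fell below the commitment, else zero."""
    if uptime_percent(outage_hours) < committed_percent:
        return bill * credit_rate
    return 0.0


def outage_loss(outage_hours: float, loss_per_hour: float) -> float:
    """Estimated business loss for the same outage."""
    return outage_hours * loss_per_hour


if __name__ == "__main__":
    # Assumed figures: a 2,000 bill, 6 hours of outage, 20,000 lost per hour.
    credit = service_credit(bill=2000.0, outage_hours=6.0)
    loss = outage_loss(outage_hours=6.0, loss_per_hour=20000.0)
    print(f"Service credit: {credit:,.0f}")
    print(f"Business loss:  {loss:,.0f}")
```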
The moral is that you need to remain responsible for your entire business, and all the services
you rely upon to operate that business. You may outsource a service, but you can’t outsource
responsibility for it. You need to have insight into its performance levels and availability,
and you need to be able to engage your service provider when things start to look bad, not
when they go completely awful or offline. This can be a major challenge with some of
today’s service providers—and most of them know it, and are struggling to provide better
metrics to those customers who demand it. You need to be one of those customers who
demands it, because it’s your business that’s on the line.
Special Challenges for IT Service Providers
If you’re an IT service provider with a hosted service, SaaS offering, or even a cloud
computing platform, then you know the difficult situation that you’re in. On the one hand,
you’re an IT department. You have data centers, and you need to manage the performance
and availability of those resources that are under your control. When things slow down or
problems arise, you need to be able to quickly troubleshoot the problem by driving directly
toward the root‐cause element, whether that’s a server, a network infrastructure device, a
network connection, a software problem, or whatever.
On the other hand, you’re providing a service to a technically‐savvy customer—typically,
another IT department that’s paying for the services you provide. Unlike end users, your
customer is accustomed to highly‐technical metrics, and they’re used to having complete
control and insight over IT services because those have traditionally been hosted in the
customer’s own data center. Just because they’re moving service elements out of their data
center doesn’t mean they want to give up all the control they’re accustomed to. In fact,
smart customers will demand deep metrics so that they can continue to manage an EUE‐
based SLA. They’ll want to know when slowdowns are on their end, or when they can hold
you responsible and ask you to work on the problem.
Competitively, you want to be the kind of provider that can offer these kinds of metrics and
this kind of insight and visibility. As the world of hybrid IT grows more prevalent and more
accepted, it will also grow more competitive—and providers that can become a seamless
extension of the customer’s IT department and data center will be the preferred providers
who earn the most money and trust from their customers.
So start thinking: How can you provide your customers with deep metrics in a way that
only exposes the metrics related to them and not information related to other customers?
How can you provide this information in a way that will integrate with your customers’
existing monitoring tools so that they can treat your services as a true extension of their
own data center rather than as yet another set of graphs and reports that they have to look
at?
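As a starting point for the first question, the sketch below shows the basic idea of tagging every metric with the customer it belongs to and filtering on that tag before anything leaves the provider. The `Metric` record, its fields, and the customer identifiers are all assumptions made for the example; a real provider would also need authentication and access control around this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    customer_id: str  # which customer this measurement belongs to
    name: str         # what was measured
    value: float      # the measured value


def metrics_for_customer(metrics: list[Metric], customer_id: str) -> list[Metric]:
    """Return only the metrics that belong to the requesting customer."""
    return [metric for metric in metrics if metric.customer_id == customer_id]


if __name__ == "__main__":
    # Invented sample data for two customers sharing one platform.
    collected = [
        Metric("world-coffee", "order_app_response_seconds", 1.8),
        Metric("other-customer", "crm_response_seconds", 2.6),
        Metric("world-coffee", "platform_uptime_percent", 99.98),
    ]
    for metric in metrics_for_customer(collected, "world-coffee"):
        print(f"{metric.name}: {metric.value}")
```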
The Perfect Picture of Hybrid IT Management
Let’s talk about what the perfect hybrid IT world might look like. This is the pie‐in‐the‐sky
view; for now, let’s not worry about what’s possible but rather focus on what would be best
for the various IT audiences and for the business as a whole. We’ll use this “perfect picture”
to drive the discussion throughout this book, looking at whether this picture is achievable,
and if so, how we could achieve it. Where we find that this perfect
picture isn’t yet fully achievable, we can at least outline the capabilities and techniques that
need to exist in order to make this dream a reality.
For IT End Users
Remember, for end users, getting the job done is the key. And while they sort of naturally
expect everything to be instant and always‐available, we can reset that expectation if we
explicitly do so in terms they can understand and relate to.
Ernesto has been using the company’s newly‐outsourced applications for
several months now, and he’s quite satisfied with them. The company has
published target response times for key tasks, such as locating a customer
record in the CRM application and processing a new order in the order‐
management application.

On Ernesto’s computer (and the computers of several of his colleagues
across the world) is a small software agent that continually
measures the response times of these applications as Ernesto is using them.
The information collected by that agent is, he’s told, forwarded to his
company’s IT department, which compiles the data and publishes the actual
response times from across the company as an average. Anytime Ernesto
feels that the application is slow, he can visit an intranet Web page and see
the actual, measured performance. Oftentimes, he realizes, the “slowdown”
is more his impatience at getting a big new order into the system. On a
couple of occasions, though, he’s noticed the measured response times
falling short of the published standard, and he’s called the help desk. They’ve
always known about the problem before he called, and were able to let him
know which bit of the application was slow, and about when he could expect
it to return to normal. He hasn’t the slightest idea what any of that means,
but it feels good to know that the IT department seems to have a handle on
things.

There are actually numerous ways to measure the end user experience—and better ways
don’t require any kind of agent to be installed on actual end‐user computers. That’s
something we’ll explore in later chapters of this book.
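However the measurements are collected, the reporting side of Ernesto's scenario is simple: average the measured response times per task and compare them with the published targets. Here is a minimal sketch, assuming samples arrive as (task, seconds) pairs; the task names, targets, and function names are invented for illustration.

```python
from statistics import mean


def average_by_task(samples: list[tuple[str, float]]) -> dict[str, float]:
    """Average the measured response times (in seconds) for each task."""
    grouped: dict[str, list[float]] = {}
    for task, seconds in samples:
        grouped.setdefault(task, []).append(seconds)
    return {task: mean(values) for task, values in grouped.items()}


def tasks_over_target(averages: dict[str, float],
                      targets: dict[str, float]) -> list[str]:
    """List tasks whose average response time exceeds the published target."""
    return sorted(task for task, average in averages.items()
                  if task in targets and average > targets[task])


if __name__ == "__main__":
    # Invented samples reported from several users' computers.
    samples = [
        ("find customer", 2.0),
        ("find customer", 4.0),
        ("process order", 1.0),
    ]
    targets = {"find customer": 2.5, "process order": 5.0}
    averages = average_by_task(samples)
    for task in tasks_over_target(averages, targets):
        print(f"{task}: averaging {averages[task]:.1f}s against a "
              f"{targets[task]:.1f}s target")
```

An average is the figure the scenario mentions, but it can hide a bad experience for a minority of users, so a real report might also publish a high percentile.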
For IT Departments
The IT department serves as a human link between the end users and the technologies
those users consume. Rather than holding themselves accountable to standards that only
they can interpret and understand, however, they’re now setting goals that the end users—
their “customers”—can comprehend. Fortunately, they’re also able to manage to those
goals, even across services that are outsourced. By having the right tools in place, the IT
department can treat outsourced services just like any other element of the IT portfolio.
John was concerned about setting SLAs based on end user experience, but
because they started with real‐world measurements, and used those as the
performance baseline, he’s found that he no longer has to fend off as many
“things are slow” complaints. End users know what kind of performance to
expect, and so long as John provides that performance, they’re satisfied if not
always delighted.

He was especially worried about providing those SLAs for services that were
outsourced. However, World Coffee now receives a stream of performance
metrics directly from their hosting providers. When things are slow at the
end user computer, John can see exactly where the slowdown is occurring.
Sometimes it’s in the communication between networks, and John can bug
his ISP about their latency. Sometimes it’s the communication within the
provider network, between database and application server, and John can
call their help desk and find out what’s going on. They’re defining new,
performance‐based SLAs with the providers, which will help ensure that the
provider is engineering their network to keep up with demand.
For IT Service Providers
Service providers want to do a good job for their customers—after all, that’s what earns
new business, retains business, and grows business relationships. They’re discovering that
the way to do that is not always by being a “black box” but by offering some visibility. After
all, customers are betting a portion of their business on the provider, and they deserve a
little insight into what they’re betting on.
Li’s work with World Coffee is going well. Thanks to the detailed metrics he’s
able to provide them, and because they’re using a single tool to monitor their
entire service portfolio, they tend to call only when there’s legitimately a
problem on Li’s end. Best of all, the same tools that provide customers like
World Coffee with data are also providing him with performance
information, helping him spot declining performance trends before actual
performance starts to near the thresholds that might trigger an SLA
violation.

About this Book
This chapter has really been an introduction, with a goal of helping you to understand the
goals and challenges you face as you evolve your IT services to a hybrid IT model. We’ve
outlined the evolution from today’s IT models to the future’s super‐distributed, hybrid
model, and covered some of the key concerns and problems you’re likely to face as you
move along that path. What we need to cover—and what the rest of this book is about—is
how you actually accomplish it.
Chapter 2 will dive into the issue of monitoring in some detail. I want to look at how
companies monitor their IT environments today, and discuss how they probably should be
monitoring those same environments—because we all know that not every environment is
doing all it can in terms of service monitoring! But then I’ll look specifically at why
today’s accepted practices really start to fall apart when you move into a hybrid IT model,
and explore new goals that we can set for monitoring as our IT environment moves toward
that super‐distributed model.
In Chapter 3, I’ll propose a new model for defining SLAs internally. This isn’t a radical new
model by any stretch, but in the past, it’s been impractical to achieve. I want to really lay
out what we should be looking for in terms of IT service levels, and look at some of the
techniques that we can employ to do so—right now, not years in the future.
Chapter 4 is an acknowledgement that although the EUE is a great top‐level metric, it
doesn’t actually help us solve performance problems. We still need to be able to dive into
performance at a very detailed, very granular component level—but how can you
accomplish that in a world where half of your “components” live on someone else’s
network and are even abstracted away from the hardware they run on? I’ll propose some
capabilities for new monitoring tools that can help not only solve the super‐distributed
challenge but also streamline your everyday troubleshooting processes and procedures.
Chapter 5 is the real‐world, nitty‐gritty look at what you’re going to need to successfully
manage a hybrid IT environment, where you’ve got services hosted in your own data center
as well as in someone else’s. Although I’m not going to compare and contrast specific
vendor tools, I will provide you with a shopping list of capabilities so that you can start
engaging vendors and truly evaluating products with an eye toward the value they bring to
your environment.
Chapter 6 is going to take this idea of hybrid IT monitoring and move it up a level in the
organization, proposing the idea that IT health reporting can become just another one of
the services you offer to your customers—such as managers. I’ll also spend some time
covering this topic from the perspective of an IT service provider company, when their
customers truly are paying customers, and where management reporting becomes a
significant value‐add.
We’ve got a lot of ground to cover, but I think this is one of the most important topics that
IT faces as we begin hybridizing our IT environments. Sure, issues like security and user
access are important, but in the end, we need to be able to ensure that these outsourced IT
services can support our businesses. That’s what this book is all about.

Chapter 2: Traditional IT Monitoring, and
Why It No Longer Works
Those of us in the IT world think we know monitoring. After all, we’ve been doing it, in
various ways and using various tools, for decades. We collect performance data, we look at
charts, and…well, that’s monitoring. Sadly, that kind of monitoring just doesn’t meet today’s
business needs.
How You’re Probably Monitoring Today
IT monitoring has evolved over the past few decades, but that evolution has pretty much
consisted of continuing refinements to a basic model. Today’s monitoring techniques
evolved more out of what was possible and less out of what the business actually needed.
Let’s take some time to look at the monitoring techniques of today, because we’ll want to
carefully consider which techniques we need to keep—and which ones we should ditch.
Standalone Technology‐Specific Tools
Today, you’re probably relying heavily on monitoring tools that are standalone and
technology‐specific. That is, once you move beyond the collection of basic performance
data, you start to move into extremely domain‐specific tools that are geared for a particular
task. You might, for example, use a tool like SQL Profiler (see Figure 2.1) to capture
diagnostic information from a Microsoft SQL Server, or you might use a tool like Network
Monitor (see Figure 2.2) to capture network packet information.

Figure 2.1: SQL Profiler.

The problem with these tools is that they are domain‐specific, and they require a great deal
of domain‐specific knowledge. You have to know what you’re looking at. Although these
tools will always provide us with valuable troubleshooting information, they don’t tell us
much about the health of an application that runs across multiple technology domains. In
fact, in some instances, these domain‐specific tools can lead to longer and more convoluted
troubleshooting processes.

Figure 2.2: Network Monitor.
For example, when a user complains of a slow application, a database administrator might
grab SQL Profiler to see what’s hitting the database server. At the same time, a network
administrator might start tracing packets in Network Monitor to see if the network looks
healthy. Both of them are failing to see the forest for their individual trees, and failing to
recognize that the application’s health isn’t driven entirely by one or the other technology
domain.
Local Visibility
Our current monitoring tools are, quite understandably, limited to our local environment.
We monitor our servers, our network, our infrastructure components, our software
applications. The minute we leave our last firewall, we start to lose our ability to accurately
measure and monitor; at best, we can get some response time statistics from routers and so
forth on our Internet service provider (ISP) network, but once we’re out of our own
environment, our vision becomes severely limited.


Technology Focus, Not User Focus
Even tools that profess to monitor an entire application “stack” still take a very domain‐
centric approach. For example, it’s not uncommon to have tools that continuously collect
performance information from individual servers and network components, compare that
performance to pre‐determined thresholds, and then display any problems. These tools can
be configured to understand which components support a given application, so they can
report a problem that is affecting the application’s health and help you trace to the root
cause of that problem. Figure 2.3 shows an example of how these solutions often present
that kind of problem to an administrator.

Figure 2.3: Tracing application problems to a specific component.
These tools, however, don’t encourage a user‐centric view of the application; they
encourage a technology‐centric view. They concern themselves with the health and
performance of application components, not with the way the end user is currently
experiencing that application. These kinds of tools can absolutely be valuable, but only
when they can also include the end users’ experience at the very top of the application’s
stack, and when they can do a better job of correlating observed application performance
to specific component health or performance.


Problems with Traditional Monitoring Techniques
In addition to the problems I pointed out already, our traditional monitoring techniques
have some severe shortcomings that actually make monitoring and application health
maintenance more difficult than it should be.
Too Many Tools
For starters, we simply have too many tools. They deliver too much different information in
too many different ways. There’s no way to correlate information between them, and we
have to spend a ton of time becoming an expert on every single tool’s nuances and tricks.
Consider a modern, multi‐tier application: It might rely on several servers, a database,
network connectivity, and so on. When the application “seems slow” to the users, you have
to reach for a dozen tools to troubleshoot each element of the application stack—and
you’re still not looking at the application as a whole.
We need to get centralized visibility of the entire application and all its components. We
need information presented in a uniform, consistent fashion, and we need it correlated so
that we can tell which bit of the application stack is contributing to an observed problem
with overall application health.
Fragmented Visibility into Deep Application Stacks
Another problem with our traditional monitoring tools is that they’re really not built for
today’s deep application stacks. Consider what might seem like a fairly straightforward
multi‐tier application, consisting of:
• Client application
• Middle‐tier application server
• Back‐end database server
That doesn’t include the connectivity components, though, so let’s include them:
• Client application
• Network switch
• Network router
• Network switch
• Middle‐tier application server
• Network switch
• Back‐end database server


Each of those individual elements, however, is really a stack unto itself:
• Client application
o Java runtime engine
o Java libraries
o Operating system (OS)
• Network switch
• Network router
• Network switch
• Middle‐tier application server
o .NET Framework runtime engine
o .NET Framework classes
o Database drivers
o Operating system
o Memory
o Processor
• Network switch
• Back‐end database server
o OS
o Database management system
o Disk subsystem
o Memory
o Processor
All these sub‐components can have a significant impact on application health, but it’s very
difficult to get traditional monitoring tools that can “see inside” all of them. We might be
able to get excellent data on the database management system’s performance and use of
memory and processor resources, while having virtually no idea how the middle‐tier
application’s database drivers are performing. This fragmented visibility makes it tough to
find the root cause of problems.


Disjointed Troubleshooting Efforts
Domain‐specific tools lead to domain‐specific troubleshooting. Let’s revisit my case study
illustration from the previous chapter and see how this domain‐specific troubleshooting
usually works in the real world:
John, the IT specialist at World Coffee, is trying to find the cause of
performance problems that Ernesto has reported in the company’s order
management application.

John initially suspected that the database server was running slowly, and he
notified the company’s DBA. The DBA, however, said that the individual
queries are executing within the expected amount of time, and that the
database server’s overall performance looks good. John then called a
desktop support technician to look at Ernesto’s computer. The technician
said that everything on the computer seems to be running smoothly—the
problem seems isolated to this one application. A software developer that
John contacted insists that the application is running fine on his computer.
John is now analyzing network packet captures to see whether there’s some
latency between the server network segment and the client segment that
Ernesto’s computer is connected to.

Sound familiar? This is how most companies deal with application health issues today: A
bunch of domain experts jump on their particular application component, tending to look
at that component in isolation and tossing the problem “over the wall” to another specialist
when they can’t find an obvious problem with their particular component.
Application stack tools like the one shown in Figure 2.3 are a good starting point for solving
this problem because they help pinpoint the component that isn’t performing to
specification. But they fail in that they’re still looking at each component as a standalone
entity, and measuring performance against predetermined thresholds. It is entirely
possible for a server to be within its performance tolerances and yet still be the root cause
for a poorly‐performing application; these tools don’t go far enough in that they don’t
correlate observed application behavior with component performance.
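As a rough illustration of what correlating observed application behavior with component performance could mean, the following Python sketch ranks components by how closely their metrics track end‐user response time. The metric names, the sample values, and the use of a Pearson correlation coefficient are all assumptions made for this example; a real monitoring tool may use quite different techniques.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series cannot explain any variation
    return cov / sqrt(var_x * var_y)


def rank_suspects(eue, metrics):
    """Order component metrics by how strongly they track the EUE series."""
    scores = {name: pearson(eue, series) for name, series in metrics.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical samples, all taken at the same five moments.
eue_seconds = [1.0, 1.2, 2.5, 3.9, 1.1]
component_metrics = {
    "db_cpu_percent": [40, 42, 41, 43, 40],         # well within tolerance
    "app_server_queue_depth": [2, 3, 11, 20, 2],    # rises with the slowdown
}

suspects = rank_suspects(eue_seconds, component_metrics)
```

In this made‐up data, the database CPU never leaves the low 40s and would pass any reasonable threshold check, while the queue depth rises and falls with the response time and ends up first in the list. Correlation only suggests where to look first; it does not prove cause.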
Difficulty Defining User‐Focused SLAs
We don’t tend to offer user‐centric service level agreements (SLAs) because our monitoring
tools don’t really let us figure out what “good” end user experiences should look like. We
can tell you that the database server running at 90% processor utilization isn’t good, but
we can’t tell you exactly how that will manifest in end user experience. Simply put, we’re
too focused on the technology and the components and not on the application and its end
users.
As we start moving into hybrid IT and outsourcing some of our applications, we need to
focus less on the technology—which isn’t going to be in our control anyway—and focus
instead on getting what we’re paying for, which means focusing on the service that our end
users are receiving.

No Budget Perspective
Finally, another growing problem with traditional monitoring tools is that they really don’t
have any budgetary focus. That’s not necessarily a huge deal for in‐house applications, but
as we start moving into hybrid IT and outsourcing portions—or all—of an application, we
need to know that we’re getting what we paid for. When we start moving to pay‐as‐you‐go
cloud computing, we need tools that will help correlate application health and use to that
pay‐as‐you‐go model so that we can accurately forecast and plan for those cloud computing
expenses.
Evolving Your Monitoring Focus
IT is rapidly evolving toward hybridization; every day, companies adopt Software as a
Service (SaaS) solutions, outsource specific services to Managed Service Providers (MSPs),
and move applications and components into cloud computing platforms. As IT evolves, so
must our ability to monitor these assets to ensure they are performing to our needs.
The End User Experience
The first point of evolution is to focus entirely on the end‐user experience (EUE) as your
top‐level metric. The first and foremost thing you should care about is how quickly your
end users are able to perform selected tasks with an application.
When a problem occurs in an application—whether it’s something in your control, like
your local network, or something outside of your control, like a back‐end database server
in a service provider’s data center—that problem will “flow up” through the application
stack, resulting in a problem with the EUE. That should be your indication that there’s a
problem: When the end user experiences the problem.
Does every application problem impact the EUE? No, of course not. A service provider
might lose a server but might also have redundancy built in to handle that exact situation.
If you don’t see a problem in the EUE, you don’t have a problem. With the right tools, you’ll
be able to start at that EUE and drill down to find the root cause of problems that are under
your control, making the EUE the perfect place to begin problem diagnoses and
troubleshooting activities.
Figure 2.4 shows how a monitoring solution can expose that EUE in a simple fashion, such
as through a color‐coded “response time” indicator, as shown. Green means the end users
are going to have an acceptable experience; anything else requires your attention.

Figure 2.4: Top‐level monitoring of the EUE.
The Budget Angle
As you move into pay‐as‐you‐go services, you’ll want your monitoring to be correlated to
your expenditure. Bringing all that information together into a single console, such as the
one illustrated in Figure 2.5, can help you predict expenses and plot growth in your service
consumption and the associated expenses.

Figure 2.5: Monitoring cloud computing consumption.


Traditional Monitoring: Inappropriate for Hybrid IT
If you’ve been following my logic closely to this point, you may be ready to make a
significant argument: Traditional monitoring can do all of this.
True. To a point. We already do have tools that let us measure things like service response
times, and many companies use those to develop a top‐level view of application health—
almost a sort of EUE metric. That works fine when you’re completely inside your own
network, but the minute you start creating a hybrid IT environment, you lose whatever
deep monitoring ability you may have. Understand, too, that hybrid IT doesn’t just mean
that you’ve outsourced a few services. It means that you may have internal services that
depend on external services. For example, consider Figure 2.6.

Figure 2.6: A truly hybridized IT environment.
Illustrated here is an e‐commerce application, hosted in the Rackspace Cloud platform.
That application requires access to SalesForce.com, a SaaS customer relationship
management (CRM) solution. It also depends on an Exchange Server messaging system,
which is hosted by an MSP. Finally, it relies on data from within your own data center,
perhaps running on Windows or Linux computers, which might in turn be virtual machines
that depend on a virtualization platform such as VMware. Your traditional IT monitoring
tools simply can’t put all those varied components into a single view for you. And how will
you monitor the EUE? Ask your customers to install a little monitoring agent on their
computers? Probably not—that’s usually known as “spyware” even if it’s being used for
noble purposes.
When you start moving to complex, hybrid IT environments, the old let’s‐get‐a‐bunch‐of‐
tools approach just doesn’t work anymore. There’s another problem, too, that can’t be
corrected using traditional monitoring tools: The problem of correlating performance to
your real‐world business.

It’s Your Business, So It’s Your Problem
Let’s be very clear on one thing: Hybrid IT involves outsourcing business services; it does
not entail handing off responsibility for your business. And hybridization is not a panacea
for all your IT woes; although it can solve many business problems, it can also introduce
unique challenges.
Provider SLAs Aren’t a Business Insurance Policy
Consider this troubling possibility:
John is having a bad day. Last month, World Coffee moved one of its smaller
applications to New Earth’s cloud computing platform, and for most of
today, that application has been completely unavailable. John’s boss—and
his boss, and his boss—is furious; the company estimates that it’s losing
$250,000 every hour that the application is unavailable. Much of that is in
lost sales, and the company will not only have difficulty recouping the
money but also may lose some of those customers to other distributors. So
far, they’ve lost more than a million and a half dollars.

Just then, the application comes back online. Relieved, John calls his contact
at New Earth to follow up. He explains to his contact that the company has
lost hundreds of thousands of dollars in sales, and that he wants to invoke
New Earth’s SLA terms.

His contact agrees but regretfully informs John that World Coffee isn’t due
any money from New Earth. The SLA provides for a refund of fees paid to
New Earth for any services that were not provided in accordance with the
SLA. However, World Coffee is on an entirely pay‐as‐you‐go plan, meaning
they have no fees beyond those they pay for what they actually used. Since
they haven’t been using the application for the past 6 hours, they would have
paid New Earth nothing, and so New Earth can’t offer any refund.

That’s how most SLAs work: They’re there to guarantee the service or your money back (or
a portion of it). They’re not there to cover the business losses that an outage or
performance problem may cause. For example, Microsoft’s Windows Azure Compute SLA
states that customers can receive a 10% service credit when monthly uptime percentage
falls below 99.95%, or a 25% credit if uptime is below 99%. The SLA doesn’t provide any
performance guarantees; if the service is up but responding slowly, customers aren’t due
any credit.
This is exactly why it’s so crucial that we be able to monitor performance across our
hybridized IT infrastructure. If we’re not seeing the EUE that we need, we can take action,
either by working with service providers to raise performance or finding new service
providers.
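To put rough numbers on the gap between an SLA credit and a business loss, here is a small Python sketch. The credit tiers are the ones quoted above; the monthly fee of $2,000 and the 720‐hour month are hypothetical values chosen for illustration, and the hourly loss comes from the World Coffee scenario.

```python
def service_credit_percent(monthly_uptime_percent):
    """Credit tier using the figures quoted above: 10% below 99.95%, 25% below 99%."""
    if monthly_uptime_percent < 99.0:
        return 25
    if monthly_uptime_percent < 99.95:
        return 10
    return 0


def outage_summary(hours_down, hours_in_month, monthly_fees, loss_per_hour):
    """Compare the SLA credit with the business loss for one outage."""
    uptime = 100.0 * (hours_in_month - hours_down) / hours_in_month
    credit = monthly_fees * service_credit_percent(uptime) / 100
    business_loss = hours_down * loss_per_hour
    return {
        "uptime_percent": uptime,
        "credit": credit,
        "business_loss": business_loss,
        "uncovered_loss": business_loss - credit,
    }


# Hypothetical inputs: a 6-hour outage in a 720-hour month, $2,000 in fees.
summary = outage_summary(
    hours_down=6, hours_in_month=720, monthly_fees=2000.0, loss_per_hour=250_000
)
```

With these assumed inputs, a six‐hour outage leaves uptime at roughly 99.17%, which earns a 10% credit of $200 against a $1.5 million loss.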


Concerns with Pay‐As‐You‐Go in the Cloud
Many cloud computing platforms operate on a pay‐as‐you‐go model, which is one of the
main points that make them so attractive to businesses. They offer the ability to scale to
almost infinite resources, provided you’re willing to pay for that. When you don’t need a lot
of resources, you’re not paying much. So you can get a giant potential infrastructure with
essentially zero capital investment.
But pay‐as‐you‐go can be full of surprises. Ever go shopping for songs on Apple’s iTunes
store? Each song is only a dollar or so, so you click, click, click—and then a few days later
get a bill for a couple of hundred dollars. Oops. These things add up, don’t they?
Cloud computing can be the same way. To continue picking on Windows Azure as an
example, pricing at the time of this writing is as follows:
• Computing power costs between $0.12 and $0.96 per compute‐hour, depending on
the size of the compute instance.
• Storage costs $0.15 per month per gigabyte, and $0.01 per 10,000 storage
transactions—meaning data reads and writes.
• Data transfers cost $0.10 to $0.45 per gigabyte, depending on the direction (in or
out) and the region of the world.
Sounds cheap—pennies for gigabytes! But how much will your application use? This is
where the performance=budget angle comes in. You don’t want to get that surprise bill at
the end of the month; you want to be able to monitor your use of the application and
predict what your bills will be, and even use trending to predict changes in usage patterns
and the resulting bills.
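A back‐of‐the‐envelope estimate along these lines can be sketched in Python. The storage rates are the ones listed above, and the compute and transfer rates are picked from the low end of the quoted ranges; the usage figures (two instances, 500 GB stored, and so on) are purely hypothetical.

```python
STORAGE_RATE_PER_GB = 0.15          # dollars per gigabyte per month (quoted above)
RATE_PER_10K_TRANSACTIONS = 0.01    # dollars per 10,000 storage transactions


def estimate_monthly_bill(compute_hours, rate_per_compute_hour,
                          storage_gb, storage_transactions,
                          transfer_gb, rate_per_transfer_gb):
    """Add up compute, storage, and data transfer charges for one month."""
    compute = compute_hours * rate_per_compute_hour
    storage = (storage_gb * STORAGE_RATE_PER_GB
               + (storage_transactions / 10_000) * RATE_PER_10K_TRANSACTIONS)
    transfer = transfer_gb * rate_per_transfer_gb
    return round(compute + storage + transfer, 2)


# Hypothetical usage: two small instances running all month (2 x 720 hours).
bill = estimate_monthly_bill(
    compute_hours=2 * 720, rate_per_compute_hour=0.12,
    storage_gb=500, storage_transactions=50_000_000,
    transfer_gb=300, rate_per_transfer_gb=0.10,
)
```

Under these assumptions the month comes to about $327.80, of which $50 is storage transactions alone. Feeding measured usage from a monitoring tool into a calculation like this is one way to avoid the surprise bill.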
Evolving Monitoring for Hybrid IT
So let’s talk about what kind of monitoring our hybrid IT environment is going to need. I’ll
refer to this as evolved monitoring, to draw a distinction between this new approach and
the more traditional techniques you’re already using. Hybrid IT is an evolution in how we
deploy IT services, so it only makes sense that we’d need some evolved monitoring capacity
to go with it.
Focusing on the EUE
The first thing we need is a monitoring solution that focuses on the EUE—it’s the first and
foremost metric that we should care about, and our tools should focus on what we care
about. A tool should be able to help us test the EUE of an application, either by passively
observing our application in action or by actively “probing” our application to measure the
results. As Figure 2.7 shows, the EUE should be broken down into the various contributing
components so that we can see how each component drives the overall EUE.

Figure 2.7: Breaking down the EUE.
This approach not only helps us manage to that all‐important EUE metric but also provides
a starting point for diagnosing problems when that metric strays out of our comfort zone.
Our tool should allow us to define our EUE‐based SLAs, and should help us monitor
compliance with those SLAs. Figure 2.8 shows what that might look like.

Figure 2.8: Managing end user response SLAs.


At the top of this console, you can see that specific end user transactions, like logging in and
logging out, have been defined with SLAs, and the console is monitoring those transactions
and telling us how often the application fails to meet our defined response times. We
should even, as Figure 2.9 shows, be able to pull up a history of SLA compliance.

Figure 2.9: Monitoring SLA compliance over time.
Note that these SLAs aren’t being defined by simple response times like pinging a service;
they’re times to complete a specific end‐user task. That’s a distinction we’ll drill into in the
next chapter.
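As a minimal sketch of task‐based SLA tracking, the Python below times a scripted transaction and reports how often each transaction type met its target. The transaction names, target times, and sample timings are hypothetical, and a real tool would gather its samples by observing or probing the live application rather than from a hard‐coded list.

```python
import time


def time_transaction(transaction):
    """Return how many seconds one scripted end-user transaction takes."""
    start = time.perf_counter()
    transaction()
    return time.perf_counter() - start


def sla_compliance(samples, targets):
    """Percentage of samples per transaction that met the target time.

    samples: list of (transaction_name, seconds)
    targets: dict mapping transaction_name to the maximum acceptable seconds
    """
    totals, met = {}, {}
    for name, seconds in samples:
        totals[name] = totals.get(name, 0) + 1
        if seconds <= targets[name]:
            met[name] = met.get(name, 0) + 1
    return {name: 100.0 * met.get(name, 0) / totals[name] for name in totals}


# Hypothetical SLA targets and measured task times, in seconds.
targets = {"login": 2.0, "logout": 1.0}
samples = [
    ("login", 1.4), ("login", 1.9), ("login", 2.6), ("login", 1.2),
    ("logout", 0.4), ("logout", 0.5),
]

compliance = sla_compliance(samples, targets)
```

Here one of the four logins misses its two‐second target, so login compliance is 75% while logout stays at 100%. In practice, `time_transaction` would wrap a scripted sequence of user actions, not a single network ping.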
Monitoring the Application Stack
Being aware of the EUE metric is important; however, when it’s outside of our acceptable
range, we need to be able to drill further to find a problem. That’s why an evolved
monitoring solution needs to be able to drill deeper, measuring specific application
components. Figure 2.7 provided one example of how that might be visualized in a
monitoring application; Figure 2.10 shows another.

Figure 2.10: Monitoring the entire application stack.
From this kind of view, you should be able to drill deeper into specific problem areas. That
drill‐down should help you spot problems with that particular component. For example,
clicking on “Enterprise Server” in the end‐to‐end view might bring up a drill‐down console
like the one Figure 2.11 shows, where we can get a high‐level view of that particular
server’s performance, and start looking for problems.

Figure 2.11: Drilling down into a server’s performance.

This drill‐down must be aware of the kind of component we’re looking at. For example, this
view might be sufficient for a typical Windows server, but it’s not showing us anything
specific for a database server. If we’d clicked on a database server, we’d expect to see
information related to the database management platform, such as the console shown in
Figure 2.12.

Figure 2.12: Drilling down into a database server.
Now, we’re really troubleshooting problems. We can see that the server’s CPU is
dangerously close to maximum, and database response times have crept firmly into the red.
We’re running low on free space, too. With a couple of clicks, this all‐in‐one console might
take us from a potentially‐problematic EUE metric right to the root cause of the problem—
which we can now confidently assign to a DBA to start fixing.
Keeping an Eye on the Budget
Because it’s continually monitoring our application, this evolved console can also help us
keep an eye on our expenses for pay‐as‐you‐go services. By simply tracking the things we
pay for—compute time, bandwidth, storage, and so on—we can use the console to help
plan growth and learn what to expect in our bills. We might even be able to export those
numbers into a spreadsheet and play “what if” scenarios to see what our bills would be like
if we increased or decreased our application activity.
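One simple way to play that kind of “what if” game is to fit a straight line through past bills and extend it. The Python sketch below does that with an ordinary least‐squares fit; the billing history is invented, and a straight‐line trend is only an assumption that may not hold for seasonal or bursty workloads.

```python
def linear_forecast(history, periods_ahead=1):
    """Project a value by fitting a least-squares line through past values."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two data points to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # The last observed period has index n - 1; step forward from there.
    return intercept + slope * (n - 1 + periods_ahead)


# Hypothetical monthly bills, in dollars, for the last four months.
monthly_bills = [100.0, 110.0, 120.0, 130.0]

next_month = linear_forecast(monthly_bills)
in_six_months = linear_forecast(monthly_bills, periods_ahead=6)
```

With bills growing by $10 a month, the projection is $140 for next month and $190 six months out.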


Coming Up Next…
We need to redefine what “service level” means, and to do that, we need to think about
what really matters in IT: the EUE. In the next chapter, we’ll go into detail about why that
EUE is so important from a business perspective, and why that must drive the technology
perspective. We’ll also look at the unique challenges imposed by a hybrid IT environment,
and how that kind of environment can make it even more challenging to accurately assess
the EUE. We’ll examine the technologies and techniques available for monitoring the users’
experience, and some of the reasons why that kind of monitoring isn’t already in your
arsenal of tools. We’ll also look at some of the unique perspectives that service providers
have on the EUE, and how they can help themselves and their customers do a better job of
maintaining application performance.


Chapter 3: The Customer Is King:
Monitoring the End User Experience
I’ve already presented the End User Experience (EUE) as the ultimate metric for an
application’s overall performance, whether that application lives entirely in your data
center, entirely in the cloud, or in some combination of the two. In this chapter, I’ll explore
the EUE in greater detail: How do you actually measure it? What, specifically, are you
looking at? What contributes to a good—or poor—EUE in an application? If you see your
EUE metric starting to head toward “poor,” what can you do about it? I’ll also examine
some of the reasons companies traditionally haven’t measured the EUE—and why doing so
can still be tremendously difficult, especially in newer, highly‐distributed applications that
involve elements of cloud computing.
Why the EUE Matters
First, however, let’s really pin down why the EUE is so important. As I outlined in the
previous chapter, businesses have typically measured application performance solely from
technical measurements—database response times, for example. What does the EUE really
offer that the more traditional measurements don’t?
Business‐Level Metric
Take a look at Figure 3.1. It’s a common‐enough chart in a business technology
environment, displaying a variety of performance metrics for a database. The bottom graph
shows, in orange, physical reads from disk. I guess those look normal. The middle graph
shows, also in orange, logons to the database. Had a little peak around 6pm and 5:30. Guess
that’s okay. The top graph shows a variety of statistics, primarily user input/output in blue
and CPU utilization in green. That’s a lot of I/O, I guess. CPU looks okay. At least, it’s lower
than the top of the graph.

Figure 3.1: Database performance.
So what does all of this mean to the business? That’s harder to gauge. Are we making money
or losing money? Do the users of our application believe it’s performing well? Do we have
phone agents sitting on the phone telling customers, “I’m sorry, the computer is slow
today,” or is everything popping up pretty quickly for them? It’s impossible to tell from this
chart.
What about Figure 3.2? This is a VMware vSphere performance graph showing CPU
utilization for a week. Got a little spike late Monday, but there’s no way to tell whether that
impacted our business operations. In fact, it might have been late at night, so possibly it
was related to a maintenance task or anything. It’s impossible to tell if any users were
impacted, though.

Figure 3.2: Virtualization host CPU performance.
These two examples precisely illustrate the problem with traditional performance
measurement: There’s no tie to anything that matters to the business. Memory, I/O, CPU,
disk—none of these things tell us whether the application is performing well for the
business. Sure, we could add some thresholds to those charts, maybe generating an alert
when CPU utilization spiked above 80%, as it did in Figure 3.2. But that still doesn’t tie
directly to what matters to the business.
Here’s how businesses have traditionally tried to make these technical measurements
relate to business concerns:
1. We observe technical performance metrics.
2. When we start getting end user calls about application performance, we make a note
of the performance metrics at that time and draw a threshold line.
3. From then on, if performance is on one side of the threshold, we assume that
equates to good end user performance. On the other side, we assume it means poor
end‐user performance.
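To make those three steps concrete, here’s a minimal sketch in Python. The Sample structure, the function names, and every number in it are made up for illustration; nothing here comes from a real monitoring product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One observation of a technical metric (hypothetical structure)."""
    cpu_percent: float
    user_complained: bool  # True if help-desk calls arrived around this sample


def derive_threshold(history: List[Sample]) -> Optional[float]:
    """Step 2: draw the line at the lowest CPU value seen during a complaint."""
    complaint_values = [s.cpu_percent for s in history if s.user_complained]
    return min(complaint_values) if complaint_values else None


def assumed_experience(cpu_percent: float, threshold: float) -> str:
    """Step 3: assume the user experience from which side of the line we are on."""
    return "poor" if cpu_percent >= threshold else "good"


# Illustrative numbers only; they are not taken from any real system.
history = [
    Sample(45.0, False),
    Sample(62.0, False),
    Sample(83.0, True),
    Sample(91.0, True),
]
threshold = derive_threshold(history)
print(threshold)                            # 83.0
print(assumed_experience(70.0, threshold))  # good
```

Notice what the sketch never looks at: how long anything actually took for the user. The "good" or "poor" label is an assumption layered on top of a CPU number.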
The problem with this approach is that we can’t continuously measure every technical
element that contributes to good or bad end user performance. We continue to use
technical performance elements—CPU, disk, memory, and so forth—as our only real
metrics, even though we can’t directly relate them to anything the business cares about.

Note
It’s even worse when your end users don’t work for your company because
in that case, you might never know that there’s a perceived problem. For
example, if your customers feel that your online shopping cart is too slow,
they won’t call you—they’ll simply give up and shop elsewhere. Only
measuring technical elements can make it easy to miss the fact that you’re
losing sales and customers.
Tied to User Perceptions
As I discussed in the first chapter, end users don’t usually care about CPU utilization and
disk throughput: They care about whether the application seems slow. They base their
judgment on how long it takes them to complete common transactions, such as looking up a
customer order, completing a checkout, and so on. Ultimately, your end users are the ones
who can tell you whether your application is performing well and doing its job—
unfortunately, end users have some significant problems:
• They aren’t consistent—One user might feel the application is performing well,
while another feels it’s too slow
• They aren’t accurate—An application with identical performance might be
perceived as slightly slow one day and just fine on another
• They don’t always report—Even users within your own organization tend to not
report poor performance because they usually get a brush off; external users—such
as customers—will generally just take their business elsewhere
That’s why monitoring the EUE is so important. You get that “end user perspective,” which
is ultimately the only thing that matters to the business. But you get it more accurately,
more consistently, and without the need for users to actually report in on their own.
This is really where the term application performance management (APM) comes from. In
fact, let’s create a formal definition:
APM can be defined as the process and use of related IT tools to detect, diagnose,
remedy, and report on an application’s performance to ensure that it meets or exceeds
end users’ and businesses’ expectations. Application performance relates to how
fast transactions are completed on behalf of, or information is delivered to, the end
user by the application via a particular network, application, and/or Web services
infrastructure.

Challenges as You Evolve to Hybrid IT
Hybrid IT, as I described in the previous chapter, is the evolution of business technology
services that incorporate a variety of highly‐distributed technologies; super‐distributed is
another term that refers to this evolution. In the previous chapter, I outlined how our
applications have evolved from monolithic code that ran on a single computer to today’s
applications that rely on resources from the cloud, from your own data center, and from
service providers. Figure 3.3 illustrates this kind of application.

Figure 3.3: Modern “hybrid IT” application.
With this kind of application, customers interact with a Web site that is hosted by a Web
hosting company—Rackspace Cloud, in this example. That application is not self‐contained,
however. It depends on a Software as a Service (SaaS) offering from SalesForce.com to
track customer information, and sends email messages through a hosted Microsoft
Exchange service. On the backend, additional services—virtualization hosts, Windows
servers, Linux servers, Oracle databases, and network elements—provide data and support
to the application. This application depends upon multiple data centers, numerous
disparate software elements, and more. Monitoring this by using traditional technology‐
centric techniques is impractical if not impossible, meaning the only way to make sure our
customers are happy is to measure the EUE directly. What specific challenges come along
with this kind of super‐distributed application?
Geographic Distribution
One significant problem is the geographic distribution of today’s more complex
applications. For example, suppose you’ve decided to host a Web service or Web site in the
Windows Azure cloud. You have no idea where your application is physically located. The
whole point of the cloud, in fact, is to make your application available more globally so that
users all across the world can access it more or less equally.

By contrast, consider an application that’s hosted in a more traditional fashion, in a shared
hosting environment, or on a dedicated server that’s located in a hosting company’s data
center. One server, or one group of collocated servers, hosts the application. Monitoring the
EUE is straightforward because you’ve only got one thing to manage. Figure 3.4 illustrates.

Figure 3.4: Monitoring the EUE in a single‐location application.
Move into the cloud, however, and your application is inherently able to be hosted—
transparently—on multiple servers that are geographically distributed. Figure 3.5
illustrates the difficulty in monitoring the EUE: An EUE monitor in a given location will
probably be connecting, most of the time, to a geographically‐close server; end users in
other locations, however, may be connecting to entirely different servers across entirely
different infrastructure. Your EUE monitoring won’t be getting an accurate picture of the
entire application’s performance because it’s only seeing part of it.

Figure 3.5: Geo‐dispersed cloud applications.
You could, of course, start deploying distributed EUE monitors as well. But that means you
have to start building a giant, globally‐distributed monitoring infrastructure—and I’ll go
out on a limb and guess that doing so isn’t part of your business’ overall goals.
Another option is to work with service providers—such as cloud hosting companies—that
provide you with distributed monitoring capabilities, meaning they’ve deployed those
capabilities within their own infrastructures and make data related to your application
available to you. If you are a service provider, that’s a very competitive feature to offer to
your customers.
The last option is to find a monitoring solution that comes with a globally‐distributed
monitoring service. In other words, rather than just buying yet another monitoring console,
you’re buying a console that integrates with an in‐place ability to monitor distributed
applications hosted in commonly‐used services, such as specific cloud computing
providers. As Figure 3.6 shows, that can give you access to performance information on
these cloud providers without requiring you to deploy your own extensive monitoring
infrastructure.

Figure 3.6: Monitoring cloud provider performance.
Deep, Distributed Application Stacks
Figure 3.3 showed how complicated today’s applications can really become, relying on
numerous hosted services, different SaaS offerings, and on your own backend
infrastructure—which may be running on a variety of operating systems (OSs),
virtualization hosts, and so forth. Every single element in this stack can contribute to
performance problems, and even if you’re monitoring the EUE directly, you’ll still want
insight into individual component performance for troubleshooting purposes. Considering
just the example in Figure 3.3 (which I’ll repeat here for your convenience), you’re looking
at a huge variety of things to monitor:

• SalesForce.com’s overall responsiveness
• Performance of your Web site hosted in the Rackspace Cloud—which may involve a
quantity of servers that expands and shrinks transparently in response to demand,
and which likely involves servers that are globally distributed
• The performance of the hosted Exchange service
• The performance of your own Windows and Linux OSs, including their memory,
CPU, disk throughput, and so forth
• The responsiveness of your backend Oracle database
• The performance of the VMware virtualization infrastructure that hosts some or all
of your servers
• The performance of related services such as a Cisco VoIP solution
• The performance of the various Internet connections that let all of these
components communicate
That’s a lot. Although measuring the EUE doesn’t necessarily require deep insight into this
highly‐distributed application stack, you will need that deep insight in order to tune
performance and troubleshoot problems. We’ll cover that in more detail in the next
chapter.
Techniques for Monitoring the EUE
So how do you go about actually monitoring the EUE? What tools are needed, and what
skills are involved? How do those tools actually gather EUE information, store it, and
present it to you? Because today’s applications are themselves constructed from numerous
different components, we’ll have to adopt a number of techniques and technologies for
monitoring the EUE.
But let’s first talk about the ideal, if unachievable, way to monitor the EUE: A little monitoring
agent installed on all your users’ computers. That might be possible if your end users are
entirely within your organization, but it’s outright impractical to distribute that many
agents and collect information from them. It’s impossible if your end users are external
users, such as customers; such agents are often referred to as spyware no matter how
beneficial they may seem. But an EUE‐based agent would be the perfect monitor: It would
“see” exactly what the end user sees and be able to monitor specific transactions and report
back with real‐time, real‐world performance information.
Given the impossibility of using such an agent, however, we have to look at other
techniques. In fact, because we can’t directly and empirically monitor the EUE, we may
have to use several techniques to monitor different aspects of the EUE, depending upon the
exact elements of our application. Again, we want to make sure we’re monitoring things
that map directly to the end user’s actual experience; we can’t just fall back to monitoring
CPU utilization and other technology‐centric metrics.
Platform‐Level APIs
One approach is to use application programming interfaces (APIs) for specific platforms.
For example, pulling performance information from a Windows server or an Oracle
database requires specific knowledge of how those platforms are built and how they
expose that information.

Data from Providers
In some cases, we might be able to draw performance information directly from our service
providers. Some managed service providers (MSPs) can offer us detailed performance
information not only at the technology component level but also at the EUE level, helping
us to see the time it takes to complete a specific transaction, for example.
Distributed Monitoring Agents
One excellent tool for monitoring the EUE is, as I’ve already described, a network of
distributed monitoring agents. Properly deployed, these can help us measure elements of
the EUE from all around the globe—which is especially useful when we’re dealing with a
highly‐distributed application.
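As a rough illustration of what you might do with the data such agents report, the following Python sketch groups response times by agent location and compares each location with a baseline. The region names, the measurements, and the function names are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_by_region(measurements):
    """Group (region, response_ms) pairs and return the median per region."""
    grouped = defaultdict(list)
    for region, response_ms in measurements:
        grouped[region].append(response_ms)
    return {region: median(values) for region, values in grouped.items()}


def slowdown_vs_baseline(medians, baseline_region):
    """Express each region's median as a percentage slower than the baseline."""
    base = medians[baseline_region]
    return {
        region: round((value - base) / base * 100, 1)
        for region, value in medians.items()
    }


# Hypothetical measurements reported by agents in three locations.
measurements = [
    ("us-east", 400), ("us-east", 420), ("us-east", 380),
    ("europe", 450), ("europe", 470), ("europe", 430),
    ("asia", 510), ("asia", 530), ("asia", 520),
]
medians = median_by_region(measurements)
print(medians)                                   # {'us-east': 400, 'europe': 450, 'asia': 520}
print(slowdown_vs_baseline(medians, "us-east"))  # {'us-east': 0.0, 'europe': 12.5, 'asia': 30.0}
```

The median is used here rather than the mean so that a single slow sample doesn’t dominate a region’s number; that’s a design choice for the sketch, not a rule.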
Click‐to‐Click Monitoring
Click‐to‐click monitoring is really the ultimate in EUE monitoring because it operates at the
EUE level. It involves measuring the exact amount of time that it takes to complete a sample
transaction or even discrete steps within a transaction. Literally, “click to click” means
measuring the time between an end user’s specific actions—clicking “Next” in a shopping
cart to clicking “Submit Order,” for example.
The reason this is such a good EUE monitoring technique is that it encompasses the entire
underlying application, including whatever components are included in it. If a shopping
cart submission requires the coordination of twelve backend components, we’ll capture all
of that delay in the click‐to‐click time; by using the techniques I’ve described earlier—
platform APIs, provider data, and monitoring agents—we can even break down the amount
of time each element contributes to the final EUE.
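Here’s a minimal sketch of the idea in Python. It’s an illustration under stated assumptions, not how any particular product works: TransactionTimer and simulated_click are hypothetical names, and the short sleeps stand in for real user actions that a scripted browser or test client would perform.

```python
import time
from contextlib import contextmanager


class TransactionTimer:
    """Records how long each named step of a sample transaction takes."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock   # injectable so the timer can be tested
        self.steps_ms = {}    # step name -> elapsed milliseconds

    @contextmanager
    def step(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the step raises an error.
            self.steps_ms[name] = (self._clock() - start) * 1000.0

    def total_ms(self):
        return sum(self.steps_ms.values())


def simulated_click(delay_seconds):
    """Stand-in for a real user action such as submitting a form."""
    time.sleep(delay_seconds)


timer = TransactionTimer()
with timer.step("click Next"):
    simulated_click(0.01)
with timer.step("click Submit Order"):
    simulated_click(0.02)

for name, elapsed in timer.steps_ms.items():
    print(f"{name}: {elapsed:.1f} ms")
print(f"total: {timer.total_ms():.1f} ms")
```

Whatever happens behind each click, however many backend components are involved, ends up inside the elapsed time for that step.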
For example, let’s revisit a figure from the previous chapter. Figure 3.7 shows a distributed,
multi‐component application. At the EUE level, labeled “End‐user Response View,” we get
the “rollup” of the time it takes to complete specific transactions. This is a Web application,
so we see the TCP/IP response time, the HTTP connect time, and the time it takes to get a
response from the URL. Those are the “wait times” that an end user would experience, and
they’re our top‐level EUE metrics.

Figure 3.7: Click‐to‐click monitoring.
Literally, those times are what the user waits on between clicking their mouse button and
seeing a final result on the screen, and being able to click their mouse button again to take
their next action. Those are the times we manage to, and those are the times we should
create SLAs around.
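As a rough sketch of what an SLA built on those times might look like, the Python below checks what fraction of sampled transactions finished within a limit. The 500ms limit, the 95% target, the function name, and the sample values are all assumptions chosen for illustration.

```python
def sla_report(samples_ms, limit_ms, target_fraction=0.95):
    """Summarize whether enough transactions finished within the limit."""
    if not samples_ms:
        raise ValueError("no samples to evaluate")
    within = sum(1 for s in samples_ms if s <= limit_ms)
    fraction = within / len(samples_ms)
    return {
        "worst_ms": max(samples_ms),
        "fraction_within_limit": fraction,
        "sla_met": fraction >= target_fraction,
    }


# Hypothetical end-to-end response times for one transaction type.
samples = [310, 420, 390, 480, 800, 450, 470, 360, 495, 440]
report = sla_report(samples, limit_ms=500)
print(report)  # {'worst_ms': 800, 'fraction_within_limit': 0.9, 'sla_met': False}
```

Expressing the SLA as "this fraction of transactions within this limit" is one common way to keep a single slow sample from triggering an alarm; your own agreements may be worded differently.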
But when those EUE times don’t look good, we immediately need to find out why, and that’s
where the “End‐to‐End Component View” comes into play (it’ll be the subject of a much
more detailed discussion in the next chapter). The idea here is to use platform‐specific
APIs, monitoring agents, and so forth to find out how that EUE time breaks down. Let’s look
at a detail of that portion, in Figure 3.8.

Figure 3.8: End‐to‐end component view.

We can see that Web server to Application Server communications, in the top left, used
193ms of time; the LDAP communications from our authentication server—on the lower
left—took 889ms.
Here’s the trick, though: These times can be summed to provide us with the EUE time. That
matters because sometimes it might not be possible or practical to measure the EUE
empirically. Although we cannot derive the EUE by monitoring technical elements like CPU
utilization, we can derive the EUE by monitoring response times between specific elements
of the application architecture. That’s the secret of the EUE:
The EUE is simply a measurement of response time. It does not take into
consideration resource utilization, such as CPU, memory, or disk; it is entirely the
sum of the response times of individual application elements. Although we cannot
derive the EUE by monitoring individual elements’ consumption of resources, we
can derive the EUE by monitoring individual elements’ response times.
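A quick sketch in Python shows the arithmetic. The 193ms and 889ms figures are the ones called out from Figure 3.8; the other two entries are hypothetical placeholders, and the simple sum assumes each hop runs one after another rather than in parallel.

```python
def derive_eue_ms(component_times_ms):
    """Sum per-hop response times; assumes the hops run one after another."""
    return sum(component_times_ms.values())


def largest_contributor(component_times_ms):
    """Return the hop that contributes the most time and its percentage share."""
    name = max(component_times_ms, key=component_times_ms.get)
    share = component_times_ms[name] / derive_eue_ms(component_times_ms)
    return name, round(share * 100, 1)


component_times_ms = {
    "web server -> application server": 193,  # value quoted in the text
    "LDAP authentication": 889,               # value quoted in the text
    "application server -> database": 120,    # hypothetical
    "network round trips": 48,                # hypothetical
}
print(derive_eue_ms(component_times_ms))        # 1250
print(largest_contributor(component_times_ms))  # ('LDAP authentication', 71.1)
```

The same per-hop numbers that add up to the EUE also tell you where to look first when the total is too high.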
That’s the big paradigm shift in IT monitoring for APM. We’re shifting from monitoring
resources against predefined thresholds to monitoring actual response times. End users’
sole metric for their perception of application performance is speed, which equates to
response time, and so that’s what we measure. “Monitoring the EUE” is really just a fancy
way of saying “measure how long it takes.”
Why We Often Don’t Monitor EUE Today
So if the EUE is such an amazing thing to monitor, why don’t we do more of it? There are a
number of reasons, one of which is sheer inertia: The IT industry hasn’t ever really
capitalized on EUE monitoring until recently, and the IT industry—for all that it is a force of
change in many companies—doesn’t like change. But there are also practical reasons.
Complexity
Measuring response times—which, again, is what the EUE is all about—can be difficult. For
one, it almost always requires external instrumentation. You can’t always ask a Windows
server, for example, to measure its response time for looking up a piece of data because
that measurement will be affected by the server’s performance. If the server isn’t
performing well, the measurement won’t be accurate. That means our traditional technical
monitoring—which just draws on performance information directly from whatever we’re
monitoring—can’t always deliver accurate response time information. We’re then faced
with the complexity of creating new monitoring schemes and implementing entirely‐new
tools. That can be complicated.
Lack of Tools
Speaking of tools, until the last half‐decade or so, EUE wasn’t on anyone’s minds. That
means our technology components—servers, databases, networks, and so on—weren’t
built to deliver response time information. It’s only in the past 5 years or so that APM has
really become a major thing, inviting innovative third parties to create a marketplace to fill
the business need. In other words, until fairly recently, there’s simply been a lack of tools
that could monitor the EUE.

Even now, the industry is still figuring out what works. The very first vendors in the space
adopted approaches that were suitable for applications of that time, and are often still
applying those same approaches. Today’s newer applications, however—and this has just
been happening in the past couple of years—are built so differently and along such
distributed models that the original EUE monitoring tools often can’t keep up. That means
we’ve been waiting on an entirely new set of tools and vendors who can specifically
address EUE monitoring for the highly‐distributed applications of the “hybrid IT” age.
For example, consider Figure 3.9. This is a portion of a fairly traditional APM console. You
can see that it is indeed focused on response times, which contribute to the EUE, and it is
generating alarms and alerts on response times that are exceeding predefined SLAs—in
other words, it’s helping to alert us to a problem in the EUE and helping us find the root
cause. That’s great.

Figure 3.9: Traditional EUE monitoring.
The problem is that much of this intelligence is gleaned from a series of agents installed
directly on servers, and even perhaps from network probe appliances connected to the
network. After all, you have to collect the response times somehow, and the probe‐ and
agent‐based approaches are very common and effective—in traditional applications. Take a
look at Figure 3.10, which shows the application stack from a more physical perspective,
and see if you can spot the problem with this approach.

Figure 3.10: Physical application model.
The problem is that these tools assume we own the entire application infrastructure. In
order to install agents and probes, we need to have all of these components in our own data
centers, under our own control.
How do you propose to install monitoring agents on SalesForce.com’s servers? How do you
think you would monitor response times from the Windows Azure cloud’s globally‐
distributed data centers? It’s impractical, and that’s why the approaches taken by
traditional APM tools aren’t always workable with highly‐distributed, partially‐outsourced
“hybrid IT” applications. But there is a new breed of tools, both from innovative new
vendors and from the traditional monitoring vendors, all of whom recognize the
special challenges created by hybrid IT.
Cost
On the face of it, EUE monitoring for hybrid IT applications seems very expensive to build
yourself. Simply deploying worldwide agents to monitor response times from a cloud
computing provider would be prohibitive for most companies. Working with MSPs to gain
instrumentation into their infrastructure, for the purposes of monitoring your use of that
infrastructure, would have to be incredibly expensive for every single business to do—and
that’s why most businesses haven’t done these things.

The trick is to leverage some economies of scale. Rather than each company building its
own global monitoring network, companies need to look to application performance
vendors who can provide that network, spreading its cost across many monitoring
customers and making it affordable for all of them to use.
Component‐Level Monitoring Can Be “Close Enough”
And do you know the number one reason so many companies have been content to ignore
the EUE for so long? Because, in old‐style applications, monitoring technical elements has
always been close enough. “Look,” you might say, “we know that users are happy so long as
CPU stays under 70% or so, and so long as disk queues don’t get longer than 5 or 6. So we
monitor those things, and if they start to go awry, we know we have a problem that needs
to be fixed.”
And I can’t argue with that logic—for old school applications. That is, for applications that
involve one, two, or maybe even three tiers. Where every element of the application lives
right within your own data center. Where you have full control of everything. Where you
can see the end users, get phone calls from them, and talk to them about how the
application is performing. Plus, most IT shops are really good at this kind of monitoring.
They’ve learned how to do it over decades, and refined the process down to a true science.
The problem is, applications don’t look like that anymore.
Why We Must Monitor EUE Going Forward
Applications are simply growing too complex. Outsource a single component of your
application—take a dependency on an SaaS provider, for example—and almost all your
old‐school monitoring techniques go out the window. Sure, you could just build
applications that don’t use outsourced elements, but that’s letting your technical
limitations drive the business rather than letting what’s right for the business drive your
technical decisions. The fact is, to support modern business requirements, we’re going to
be seeing much more distributed, hybridized applications—and we need to figure out how
to monitor them effectively. And the bottom line there is that the EUE is a more
effective metric for any application, whether it’s entirely in‐house or mostly delivered
from the cloud.
Vastly More Complex Environments
The old performance‐and‐threshold world really didn’t offer fantastic performance
management; it was simply “good enough” and it was readily achievable within the
relatively simplistic application environments of the past. And that’s the past.

We are very rightly building ever‐more‐complicated application environments today:
• Cloud computing offers the potential for near‐infinite application scalability,
without massive infrastructure investments. Even for internal line‐of‐business
applications in geographically distributed companies, businesses are crazy not to at
least investigate and consider putting a portion—if not all—of their applications
into a cloud environment. It makes an incredible amount of sense in many cases.
• Anyone who’s suffered through agonizingly‐long implementations for applications
like Customer Relationship Management (CRM) solutions can appreciate the ease,
convenience, and lowered overhead of SaaS offerings. Very few businesses are in the
business of supporting massive software installations, and SaaS can deliver the
capabilities businesses need without the overhead and distractions.
• Like SaaS, managed services help businesses lower the cost and overhead—not to
mention the distraction factor—of critical business services that aren’t central to the
business’ competencies. Say you’re a retailer. You obviously need email, but you’re
not in the business of providing email services. Why not let an MSP worry about it for
you?
The arguments for this kind of piecemeal IT outsourcing are compelling, and thousands of
businesses are benefitting from these new models. But the fact remains that we still need to
build our own applications that depend upon these outsourced services. An online retailer
might not want to deal with their email or CRM systems, but they do want to be responsible
for their e‐commerce systems—which unfortunately need to interface with the email
system and the CRM system. The “old school” of monitoring would say, “well, we can’t
outsource email and CRM because they’re critical to the e‐commerce app, and the only way
we can monitor them is if they’re in‐house.” No more: We have to focus on EUE monitoring
because it lets us outsource key pieces of our infrastructure while still maintaining
visibility into what matters the most to our business.
Business‐ and Perception‐Level Focus
Measuring technical elements like CPU, memory, and disk may have been “good enough” to
spot impending performance problems, but it did nothing to help drive a business focus
within the IT group. Let’s face it, “The CPU is running at a steady 80% utilization” isn’t quite
as compelling, from a business perspective, as, “The Web site is running slowly and we’re
losing 10% of our shopping carts—that’s money out the door!”

Although I do feel that purely‐internal applications—ones completely under your control—
can be effectively monitored even if you’re ignoring the EUE, I don’t think any application’s
top‐level performance metric should be anything except the EUE. The speed of the
application as perceived by the end user and by the business is the only thing that matters.
Knowing your EUE can help make a ton of other business decisions much easier:
1. You: “Boss, the server’s running 80% all the time. Can we add a processor?”

Boss: “Um, no.”

versus:

You: “Boss, the server is only able to maintain a 2‐second response time for
shopping cart checkouts, and we’re losing about $12,000 in carts per day. Can
we buy a new processor for $200?”

Boss: “Um, yes. Immediately, please.”
2. You: “Hi, Mr. Cloud Provider? Yeah, we have a guy in the office who feels that our
Web site is taking a really long time to load. Can you do something about it?”

Cloud Provider: (laughter)

versus:

You: “Hi, Mr. Cloud Provider? We’re seeing 800ms response times from our Web
application in your cloud. We agreed that 500ms was the limit, and we’ve
narrowed the problem down to a 200ms extra delay in your database layer’s
response time. Fix it.”

Cloud Provider: “Wow—okay. We’ll get on it.”
3. Boss: “Sales for Asia are down. We’re blaming the response times of the Web
site. You need to get on it.”

You: “Sure, just let me update my resume, first.”

versus:

You: “Boss, we’re noticing 30% slower response times for our users based in
Asia.”

Boss: “That’s a million dollars in business a year! We’ll get the provider on the
phone and dig into this immediately.”
The point is that moving to a business focus helps make a number of technology decisions
easier because it helps put a business‐colored spotlight on everything.

Too Much Is Out of Your Control
The last reason to move to EUE monitoring is that, quite simply, it’s the only metric you can
accurately and consistently obtain when much of the rest of the application infrastructure
is completely out of your control. Yes, you might need help in obtaining accurate EUE
numbers—especially with globally‐distributed application components and end users—
but the EUE is the only thing you can point to, with confidence, that tells you that “the
business is doing okay.” The less of your application infrastructure you can touch, the more
the EUE is going to mean to you.
The Provider Perspective: You Want Your Customers Measuring the EUE
If you’re an MSP, the preceding discussion should tell you what your customers are going to
be looking for from you. In fact, there are some excellent reasons for you to begin providing
performance metrics to your customers immediately.
The Provider Isn’t 100% Responsible for Performance
When customers rely on your services as a part of their application, it’s all too easy for
them to point to you as the weak link when things aren’t looking perfect. By providing your
customers with accurate metrics—preferably in a way that can be consumed by the
customers’ own performance monitoring tools—you can help not only dispute claims that
your performance is at fault but also avoid those claims entirely.
When customers can “see into” your network to some degree, you’re giving them a number
of benefits:
• You’re proving that you don’t have anything to hide.
• You’re making yourself a partner in their business, not just a vendor.
• You’re helping them eliminate you as a potential “weak link” because they can see
the performance they’re getting from you.
Everyone benefits. You’re able to define more granular and accurate SLAs, and your
customers are able to more easily verify that you’re meeting them. When you and your
customer have the same set of performance data in front of you, you’re both able to have a
stronger business relationship.
You Gain a Competitive Advantage
There’s an enormous competitive advantage in being able to provide your customers with
performance metrics. For one, doing so proves that you’re a mature, competent, confident
service provider. You’re not a “fly by night” company who promises amazing performance
for a too‐good‐to‐be‐true price, then doesn’t deliver. By providing metrics—and
challenging customers to evaluate your competition’s ability to do so—you’re making
yourself accountable, and providing customers with a transparency that they’ll appreciate.

You’re also making yourself much more of a partner in your customers’ business. Look, the
bottom line is that all service providers want to gain and retain customers; customers, for
the most part, want to stick with their providers—switching is an enormous hassle with
little value‐add. If you can help to integrate your infrastructure with your customers’, you’ll
help them manage their businesses and IT investment more easily and accurately. They’re
a lot less likely to want to switch—you’ll have gone a long way toward making a customer
for life.
Coming Up Next…
We’re hopefully agreed that the EUE is the way to go for managing application performance
at the top level. If you see an EUE problem, though, you’re going to have to be able to dig
deeper to find the root cause. That’s what the next chapter will focus on: Monitoring at the
component level. This isn’t about ensuring good application performance; it’s about fixing
performance problems that you’ve noticed in your EUE measurements. I’ll look at the
traditional monitoring stack and some of the challenges that come into play with today’s
multi‐discipline applications. I’ll also look at newer monitoring techniques, and provide
insight for service providers who want to offer their customers deeper insight into their
service offerings.

Chapter 4: Success Is in the Details:
Monitoring at the Component Level
The EUE is your ultimate metric for whether an application is performing well. It’s what
you should base SLAs upon, and it’s certainly your ultimate measure of success or failure
with an application. The EUE isn’t, however, very useful at helping you troubleshoot
problems when they occur. For that, you’ll need a deeper, more detailed level of
monitoring. In this chapter, I want to compare and contrast two approaches for that more‐
detailed kind of monitoring: the traditional, multi‐tool monitoring stack, and a more
modern approach that focuses on getting everything into a single view.
Traditional, Multi‐Tool Monitoring
In a traditional monitoring environment, IT experts tend to rely on single‐discipline tools to
troubleshoot the application stack. That’s an approach that has served for years, although
as applications become more complex and distributed, we’ve needed a greater number and
variety of tools to get insight into everything. Typically, separate tools exist for each major
layer of the application. Consider the example in Figure 4.1, which illustrates the
application stack I’ll be using for this section of this chapter.
This diagram is more hardware‐centric, as it is intended to represent the major physical
elements of the application: client application, network infrastructure, application server
(which might be a Web server, for example), and a back‐end database. Notice that this
example—in keeping with the “traditional” approach—does not incorporate any cloud‐
based elements; we’ll come to the cloud issue shortly, though.

Figure 4.1: An example application stack.
Let’s look at each of these elements in turn.

Client Layer
The client obviously plays a major role in an application’s performance. A slow client
computer can do more to ruin the perception of an application than almost any other
component in the stack. Unfortunately, it’s often impractical to monitor performance
directly on the client, particularly in a commercial application setting where the client
belongs to your customer and not to you.
You can, of course, have a test client computer with hardware and software configurations
that resemble your average client computer or customer computer, and you can install
monitoring software on that computer to see what kind of performance problems, if any,
the client is introducing into the equation. That doesn’t always help you troubleshoot
problems at the client level. In fact, in some cases, client‐specific problems can’t practically
be solved. For example, suppose some of your customers run a particular brand of antivirus
software that simply slows their computers—there’s nothing you can really do about that,
aside from making sure that your client application performs as well as it can under the
circumstances.
There’s unfortunately not much else to say about the client layer. It’s often outside your
control, and although you can and should test your client applications on representative
client computers, that’s about all you can do in a traditional setting.
Network Layer
At the network layer, however, you can begin to get more involved. You can certainly
monitor your network’s performance, and many tools exist to help you do so. For example,
Figure 4.2 shows a tool that helps to monitor network performance at a particular network
node.

Figure 4.2: Monitoring network performance.

For a larger view of your entire network, there are tools to roll up per‐node performance
monitoring into a “whole network” view, as illustrated in Figure 4.3. But there’s a problem
with this kind of monitoring: It’s absolutely unaware of the application. This type of
infrastructure‐level monitoring can tell you whether a particular router is overloaded, for
example, but it can’t correlate that problem to a particular application performance issue.

Figure 4.3: Monitoring the entire network.
Another problem with these types of tools is that they start to fail you once you begin
relying on someone else’s network. As you move toward a hybrid IT environment, and you
begin deploying elements of applications in the cloud—meaning in someone else’s data
centers—you lose the ability to closely track network device performance.
I would argue that, although this type of monitoring tool is often widely used within
organizations, it isn’t really all that useful for application performance monitoring because
it isn’t application aware—it’s simply reporting the state of the network. Also, as
applications begin to move toward a hybrid or cloud‐based model, this type of tool simply
loses all its utility, as it can’t do its job across a super‐distributed, partially‐outsourced
network.


Application and Database Layers
As you move into the application itself, performance monitoring can become even more
complicated. Ignoring the hardware that connects and runs the application, modern
applications consist of numerous interconnected components—many of which may be
shared by other applications. This is especially true in a cloud computing scenario: If you’re
storing data in Microsoft’s SQL Server for Windows Azure, there are likely multiple other
customers doing the same thing. Thus, your performance could potentially be impacted by
others’ use of the cloud infrastructure. That kind of sharing can even occur within
applications that live entirely within your own data center, making performance
troubleshooting complex and difficult. Consider a relatively straightforward Web
application, as Figure 4.4 shows.

Figure 4.4: Application software stack.
This application consists of a Web server, application code running on an AS/400, and
back‐end data from two different databases. The application itself utilizes Exchange Server
for messaging, which in turn has a dependence on Active Directory (AD) for authentication,
address books, and so on. Much of the application is running inside VMware virtual
machines, which add another layer of performance complexity. How do you monitor an
application like this?


Traditionally, you’d probably need about seven tools, each one to monitor a specific
component: IIS, VMware, AD, Exchange, DB2, the AS/400, and SQL Server. Each tool would
be entirely focused on a single element of the application stack. For example, Figure 4.5
shows what you might expect from a VMware monitoring solution: information on CPU,
disk, memory, and network utilization.

Figure 4.5: VMware monitoring.
A SQL Server monitoring tool, in contrast, would present a completely different set of
statistics and views—and wouldn’t be at all aware of the underlying impact from VMware
itself. As Figure 4.6 shows, you’d be dealing with two completely different tools, with
different goals and methodologies—neither of which would have the slightest idea about
your application’s performance.

Figure 4.6: Monitoring SQL Server performance.
That’s ultimately the problem with traditional application performance monitoring even
before you start involving hybrid IT elements like cloud computing: every tool is focused on
one bit of the application, and those tools don’t even monitor the application. Those tools
just monitor their one element, in a completely standalone and out‐of‐context fashion. Start
involving cloud computing and things get even more difficult because you aren’t likely to
even get a tool that will let you directly monitor your cloud database performance.
Other Concerns
So let me clearly state the problem: Performance tools that focus only on a single
element of the application are ineffective. In fact, they can actually hinder your
application performance monitoring and troubleshooting efforts, as you’ll see in the next
section. And that’s assuming your applications are entirely under your control, within your
own data center; start moving application elements into the cloud—whether as Software as
a Service (SaaS) solutions, hosted services, or true cloud computing—and these kinds of
tools become even more impractical because in many cases you can’t use them to monitor
the cloud elements of your application.


Multi‐Discipline Monitoring and Troubleshooting
Part of the problem with element‐specific tools like those I discussed earlier is that they
encourage “siloing” within your IT organization. IT professionals tend to specialize in just
one or two elements of an application, and the fact that their tools only monitor those
elements enables them to put on blinders for the rest of the application. That simply causes
more difficulty when performance problems arise.
Applications Are Not the Sum of Their Parts
The problem is that you can never look at a single element of an application and judge its
overall performance. That’s a very intuitive concept that you might not have actively
thought about, but do so for a second: Do you think you could look strictly at a single
application element—say, the database, the Web server, or a network router—and
determine the overall performance of the application? No, of course not. Therefore, you
can’t look at every component in a standalone fashion and determine performance
information, either. You can’t take a bunch of disparate statistics from multiple tools and
merge them in your head for a “performance picture” of an application—it just isn’t
practical.
That’s because applications aren’t simply the sum of their parts. In other words, you can’t
simply add up or average the performance numbers for various application elements and
arrive at an accurate top‐level performance number for the entire application.
Application performance is, in this respect, a lot like a sports team. You can’t just add up the
average performance of each player and get a feel for the team’s overall success. Nor can
you simply look for the worst‐performing player and focus all your troubleshooting efforts
on that person. Individuals might perform very well on their own in practice, and perform
entirely differently when the whole team is in action. You have to manage the team’s
performance as a team by watching the entire group of players work together. When it
comes to an individual’s performance, you might coach them to adopt certain behaviors or
to correct specific problems. Doing so, however, might not change their interactions with
other team members during actual play. You can’t simply “tune” an individual player on
their own; you need to “tune” everyone’s performance as a part of the team.


Tossing Problems Over the Fence: Troubleshooting Challenges
Tuning individual application elements is what IT professionals tend to do well, but it can
result in significant delays and dead‐ends when it comes to managing the performance of
an entire application. The week before I wrote this chapter, for example, I visited a
consulting client who was experiencing performance problems with an application. I sat
down with representatives from their various IT disciplines, and had a conversation
something like this:
Me: So what’s the problem?
Manager: We’re seeing a slowdown in one of our applications. We’ve traced
the problem to a particular query for data from the database—sometimes, it
can take more than 4 minutes to execute that one query.
DBA: I’ve looked at the SQL Server performance, though, and there doesn’t
seem to be a problem. Anytime I run that same query manually, it works fine.
I’ve rebuilt the indexes on the affected tables. I’ve even run the query
repeatedly, using a load‐simulation tool, in a test environment and it works
fine.
Network Engineer: We’re not seeing anything in the network. Whenever
someone reports the slowdown in the application, the network is at the exact
same performance levels it always is.
Developer: It is definitely not the client application. We’ve tested it
extensively. Anytime this delay occurs, it happens as the client application is
waiting for data to return from the query. The application pauses, but it isn’t
doing anything—it’s just waiting on the database.
DBA: It can’t be waiting on the database; I’m not seeing any indication that
the database is taking more than a few milliseconds to process that query! It
must be the amount of data you’re querying.
Developer: No, it isn’t—we’re only grabbing two rows of data, three at most.
That continued for a half‐hour or so, with all sides offering up charts and other evidence
that their element wasn’t causing the problem. So if nothing was causing the problem, what
was the problem? This is a classic example of what I call “tossing the problem over the
fence.” Each individual IT discipline has set itself up in a fenced‐off silo, and they only
concern themselves with what’s happening inside their fence. If they determine, in their
own judgment, that their bit of the world is fine, then they toss the problem over the fence
to another discipline.


That conversation was about an application living entirely within the company’s data
center: Just imagine how much more fun it would have been if outside vendors—SaaS
providers, Managed Service Providers (MSPs), cloud computing vendors, and so on—were
involved! With absolutely no insight into the outsourced components’ performance,
everyone within the organization would likely have tossed the problem right outside the
organization and into the laps of those vendors. Unless those vendors had tools that could
show they weren’t the problem, the argument would have gone on forever.
So what’s the solution? Integrated monitoring, or what some would call unified monitoring.
You still have to monitor the individual application elements; you just do so all in one place.
Integrated, Bottom‐Up Monitoring
The idea behind integrated, or unified, monitoring is to give everyone access to the same
information, in the same way, in the same place. You’re still monitoring components, to be
sure—each component does contribute to or take away from the overall application
performance, after all. But you’re doing so in a way that puts all the information in front of
all the experts at the same time. That helps to break down the fences between IT
disciplines, gives everyone the same evidence to work from, and helps to integrate the
performance troubleshooting effort more effectively.
Monitoring Performance Across the Entire Stack
Consider a completely unified dashboard, like the one shown in Figure 4.7. Here, you can
get a top‐level view for all your application components—not just one, and without the
need to open multiple tools. You can establish thresholds for “good” and “bad”
performance, and get a quick at‐a‐glance view of how your components are all doing.
Anything other than green in any one spot may indicate a performance problem that will
impact your EUE.

Figure 4.7: Dashboard of all components.
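To make the idea of "good" and "bad" thresholds concrete, here is a minimal Python sketch of how a dashboard might turn raw readings into status colors. It is not taken from any particular monitoring product; the metric names, the threshold values, and the `classify` and `dashboard` functions are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Inclusive upper bounds for each status; anything above `orange` is red."""
    green: float
    yellow: float
    orange: float


def classify(value: float, threshold: Threshold) -> str:
    """Map one raw reading to a status color."""
    if value <= threshold.green:
        return "green"
    if value <= threshold.yellow:
        return "yellow"
    if value <= threshold.orange:
        return "orange"
    return "red"


# Hypothetical thresholds; real values would come from your own baselines.
THRESHOLDS = {
    "web_response_ms": Threshold(green=500, yellow=1000, orange=2000),
    "vm_cpu_percent": Threshold(green=70, yellow=85, orange=95),
}


def dashboard(readings: dict) -> dict:
    """Return a status color for every component reading."""
    return {name: classify(value, THRESHOLDS[name]) for name, value in readings.items()}


statuses = dashboard({"web_response_ms": 320, "vm_cpu_percent": 90})
```

With these assumed thresholds, the Web response time would show green and the virtualization CPU reading would show orange, which is the cue to drill down.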
When you do see something other than green, you need the ability to drill‐down into
component‐specific performance information. For example, the VMware and Hyper‐V
status for this application is orange, meaning there’s some kind of problem. Drilling‐down
might reveal a screen like the one in Figure 4.8, which breaks down the problem in domain‐
specific terms. The dashboard lets everyone see where the problem lies; the drill‐down
helps to start the troubleshooting process. There’s no need to toss anything over any
fences, because it’s clear where the problem exists.

Figure 4.8: Detail for virtualization problem.
Here, we can see that there are specific service alerts for both VMware and Hyper‐V. By
viewing those alerts, we can begin to troubleshoot the problem more quickly. We’ve gotten
the right domain expert involved, made it clear that there’s no over‐the‐fence option
because we know the problem is in his or her domain, and have gotten troubleshooting
started.
The key to a toolset like this is having every part of the application represented. As Figure
4.9 shows, that representation can even include infrastructure components—in this case,
Cisco‐powered Voice over IP (VoIP) services.

Figure 4.9: Including infrastructure services in the monitoring.
It’s critical that we be able to see performance for everything that the application depends
upon. That way, no IT discipline is excluded or forgotten, and there aren’t any “hidden
dependencies” impacting performance under the hood or behind the scenes. As you can see
from Figure 4.9, however, including everyone doesn’t mean making the performance
information generic: This example clearly illustrates the domain‐specific details that a tool
must be able to deliver. VoIP is entirely different from, say, a database server, and the
troubleshooting tools have to respect that difference and represent appropriate
information for each element.
Our own data center needs to be included (see Figure 4.10). The idea is to roll up all the
resources under our control so that we can get a top‐level performance or health metric for
our self‐managed resources. This drill‐down lets us see network devices, individual
servers, virtualization hosts, and the other elements under our direct control. When
something exceeds a threshold, we can immediately dispatch the right domain expert to
begin troubleshooting the problem—and we would have some confidence that the problem
is on our end and within our control.

Figure 4.10: Including our data center in monitoring.
The corollary, of course, is that the monitoring solution also needs a way to monitor the
resources not under our direct control. As Figure 4.11 shows, that might include cloud
computing services like Amazon Web Services. This is where a monitoring solution really
needs to break with traditional techniques: Instead of monitoring these external services
directly, the tool must often do so through vendor‐supported application programming
interfaces (APIs), through direct observation of service response times, and so forth. This is
really a new field in performance monitoring, so as you begin evaluating solutions, it will be
important to consider different vendors’ ability to draw information from the outsourced
services that you rely upon to run your application.
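As a rough illustration of "direct observation of service response times," the Python sketch below times repeated calls to a probe and summarizes the results. The `time_probe` function and the stand-in `fake_service_call` are assumptions for this example; in practice the probe would wrap whatever request your vendor's API or service endpoint actually supports.

```python
import time
from typing import Callable


def time_probe(probe: Callable[[], object], attempts: int = 3) -> dict:
    """Call `probe` several times and summarize how long the successful calls took."""
    durations = []
    failures = 0
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            probe()
        except Exception:
            # A failed call is recorded as a failure, not as a fast response.
            failures += 1
            continue
        durations.append(time.perf_counter() - start)
    return {
        "attempts": attempts,
        "failures": failures,
        "worst_seconds": max(durations) if durations else None,
        "mean_seconds": sum(durations) / len(durations) if durations else None,
    }


def fake_service_call() -> None:
    """Stand-in for a real request to an outsourced service (hypothetical)."""
    time.sleep(0.01)


summary = time_probe(fake_service_call, attempts=2)
```

Counting failures separately matters: an outage should never look like a quick response in the numbers you feed to a dashboard.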

Figure 4.11: Monitoring cloud computing services.
I have to emphasize tight integration with outsourced services. For example, in the original
dashboard view, you’ll notice that SalesForce.com shows a status color of blue—not the
green we’d hope for, but not as immediately alarming as yellow, orange, or red. Drilling‐
down reveals a service alert, shown in Figure 4.12. As you begin working with services
outside your own data center, you need to become more aware of their data center
operations—including, in this case, a scheduled maintenance window that may result in
diminished performance for our application that relies on SalesForce.com. By having this
information right within the monitoring console, we can help manage our future
performance more effectively. We might choose to temporarily turn off portions of our
application that depend on SalesForce.com during that maintenance window, or we might
simply need to be alert for performance problems that result from the maintenance
window.

Figure 4.12: Viewing service alerts.
As you move toward a hybrid IT model, this kind of integration with external service
providers will prove absolutely invaluable.
Integrated Troubleshooting Saves Time and Effort
I presented this unified monitoring concept to the client I was with, and they gave it a shot.
As I was writing this chapter, they contacted me to let me know they had deployed a
unified monitoring solution and that they were quite happy with it. It turns out the problem
was in the SQL Server, and had something to do with the way the server was recompiling
that query. The DBA was, unfortunately, a bit stubborn about admitting it, but with a single
tool showing perfect performance in all but his application component, he was forced to
start troubleshooting the problem. By getting the same tool in front of everyone, each
discipline’s fences started to come down a bit. Sure, they were all responsible for their
individual elements, but it gets a lot harder to toss a problem over the fence when everyone
can clearly see that it’s on your side.


The Provider Perspective: Providing Details on Your Stack
MSPs have a more challenging time of monitoring. They must not only monitor their own
data centers—contending with all the multi‐discipline problems I’ve described in this
chapter—but also provide their customers with a rolled‐up view of their services. Ideally,
they should do so in a way that customers can integrate into their monitoring tools so that
service alerts and other information “rolls down” from the MSP’s data center into the
customers’ monitoring tools.
With the right monitoring tools, you can do that pretty easily. What you’re really after is a
tool that gives you all the detailed, cross‐discipline, unified monitoring you want within
your data center, with the ability to aggregate some of that data into a “status indicator” for
your data center, done in such a way that your aggregate indicator can become a part of
your customers’ dashboards. Figure 4.13 illustrates the concept.

Figure 4.13: Aggregating your MSP network for customer information.
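One simple way to aggregate detailed internal monitoring into a single status indicator is to report the most severe status among all components. The Python sketch below assumes the color ordering used in this chapter's dashboard examples (green, then blue, yellow, orange, and red); the `roll_up` and `customer_view` functions, the component names, and the provider name are hypothetical.

```python
# Status colors ordered from least to most severe (assumed ordering).
SEVERITY = ["green", "blue", "yellow", "orange", "red"]


def roll_up(statuses) -> str:
    """Return the most severe status found; an empty collection counts as green."""
    return max(statuses, key=SEVERITY.index, default="green")


def customer_view(provider: str, internal_statuses: dict) -> dict:
    """Expose one aggregate indicator without revealing per-component detail."""
    return {"provider": provider, "status": roll_up(internal_statuses.values())}


# Hypothetical internal view of an MSP's data center.
internal = {
    "network": "green",
    "virtualization": "orange",
    "database": "green",
    "storage": "blue",
}
indicator = customer_view("Example MSP", internal)
```

A worst-of roll-up is deliberately conservative. A provider might instead weight components by how much each customer depends on them, but that is a design choice beyond this sketch.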

Of course, you’re quite likely going to want to provide your customers with more detailed
information as well—because they’re probably going to demand it: custom dashboards
specific to your service offerings, quality of service (QoS) reports, SLA reports, and more.
The last chapter of this book will dive more into those offerings.
Coming Up Next…
You should now have a vision for how your cloud‐based or hybrid applications can be
monitored from the EUE and the component level. In the next chapter, we’ll start exploring
some of the specific capabilities you need to start monitoring a hybrid IT environment—
from your data center into the cloud. Consider the next chapter to be a sort of “shopping
list” layout of all the features that you should at least be considering in a monitoring
solution.


Chapter 5: The Capabilities You Need to
Monitor IT from the Data Center into the
Cloud
In the previous chapters of this book, I’ve covered a lot of the “why” and “how” of unified
monitoring. In this chapter, I want to start focusing specifically on the “what”: what
capabilities you need to bring into your environment to successfully manage and monitor a
hybrid IT infrastructure and its applications. Think of this chapter as an instructional guide
for building a shopping list. I’ll cover not only features but also some of the finer, easily‐
overlooked details that can make all the difference in a successful implementation.
First, though, let’s clearly define some of the major business and technical goals for this
kind of evolved, hybrid IT monitoring. As I do so, I want to re‐introduce the folks from
World Coffee, the case study I introduced in Chapter 1.
Business Goals for Evolved Monitoring
I want to cover the business’ goals for monitoring first because in reality the business’ goals
are the only ones that matter. The business is paying for not only the monitoring solution
but also the applications and infrastructure being monitored; meeting the goals of the
business is really the whole point of all of this. So what might a business hope to gain from
a more evolved form of monitoring?
EUE and SLAs
The business’ primary concern, of course, is to have applications and an infrastructure that
perform well. A problem—one I discussed in the first chapter of this book—is that many
businesses have given up on simply saying, “we want our applications to perform well,” and
have instead gotten themselves bogged down in the minute details of application
performance. But knowing that “Server5 is running at 80% utilization” isn’t truly a
business goal—although many businesses have accepted that this is how they have to
define “good performance.”


They shouldn’t accept that. Instead, businesses should back off a level and concern
themselves with what really concerns them: Applications that perform, from an end‐user
perspective, in a way that supports the business’ requirements. That’s the end‐user
experience, or EUE, that I’ve referred to throughout this book. “We want users to be able to
complete their checkout process in 5 seconds or less from the time they click ‘Submit
Order.’” That’s an EUE‐focused metric, and it can become a part of formal Service Level
Agreements (SLAs), which are used to communicate the desired EUE‐level performance
across the business and its IT team.
Let’s be clear on something: A monitoring solution that doesn’t allow you to quickly
determine the current EUE metrics and that doesn’t help you manage to an SLA‐defined
EUE metric is not a monitoring solution you should be using. Different vendors take very
different approaches to how they show you the EUE, how they determine the EUE, and so
forth; those are technical details that are important, but the most important thing is that
the solution give you some means of managing the EUE.
EUE: All That Matters to the Business
Businesses have been unaccustomed to dealing with an EUE metric for so long
that many will resist the concept of relying solely on the EUE—simply
through force of habit. I had a recent consulting client tell me that they were
happy defining an EUE like, “Sales orders must be accepted by the system in
2 seconds or less after submission.” But they still wanted to add other things
to their SLA, like that the system must have “a minimum 99.5% uptime.”
Think about it: If the system is down, it isn’t accepting orders in 2 seconds or
less. So you didn’t meet the EUE metric. There’s no need to specify anything
else.
“We want to put maintenance windows in the SLA,” they told me. Well, that’s
fine—make it part of the EUE. Sales orders must be accepted within 2 or
fewer seconds…“between the hours of 7am and 7pm; outside of those hours,
sales orders do not need to be accepted by the system.” That makes it clear
what end users should expect of the system—it might not be available to
them from 7pm to 7am. By stating things in that kind of end‐user context,
you’re communicating not only your desires to your IT team but also your
commitment to your end users. Everyone’s using the same language. End
users don’t have “maintenance windows” after all; they have expectations for
when they’ll be able to do their jobs. State your SLAs in those terms, and
manage those expectations.


“We want to add a clause that systems must not run at more than 80%
capacity.” They really insisted on that one, but I eventually convinced them
that although such a metric might be a good IT management guideline, it didn’t belong
in an SLA. So long as the EUE metrics were met, it wouldn’t matter how
burdened the systems were. If IT could meet that 2‐second rule with a 95%‐
loaded server—more power to them! They’d be saving money by doing more
work with fewer resources. And what does “95% loaded” mean, anyway?
95% processor capacity? Network throughput? Disk I/O? Don’t dive into the
technical details within an SLA: Try to stick with EUE‐based metrics that
describe your desired bottom‐line performance, and make sure IT has the
tools they need to manage the technical components to your EUE‐based SLA.
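The EUE-based SLA described in the sidebar above can be expressed as a small check. The following Python sketch assumes the 2-second sales order rule and the 7am to 7pm service window from the sidebar; the `EueSla` class, the `compliance` function, and the sample format are hypothetical. A sample of `None` stands for a failed or unavailable request, which counts as a miss, so no separate uptime clause is needed.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Optional


@dataclass(frozen=True)
class EueSla:
    """An end-user-experience target that applies only during service hours."""
    max_seconds: float
    window_start: time
    window_end: time

    def applies_at(self, when: datetime) -> bool:
        return self.window_start <= when.time() < self.window_end


def compliance(sla: EueSla, samples) -> Optional[float]:
    """Fraction of in-window samples that met the SLA, or None if there were none.

    `samples` is a list of (timestamp, seconds) pairs; seconds is None when the
    request failed or the system was down.
    """
    in_window = [secs for ts, secs in samples if sla.applies_at(ts)]
    if not in_window:
        return None
    met = sum(1 for secs in in_window if secs is not None and secs <= sla.max_seconds)
    return met / len(in_window)


# Sales orders accepted in 2 seconds or less, between 7am and 7pm.
SALES_ORDER_SLA = EueSla(max_seconds=2.0, window_start=time(7, 0), window_end=time(19, 0))
```

Measurements taken outside the window are ignored, which mirrors the idea that end users simply should not expect the system to accept orders during those hours.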
Let’s check back in with World Coffee on this. I’ll re‐introduce some of these folks, as we
haven’t heard from them in a couple of chapters.
Ernesto is an inside sales manager for World Coffee, a gourmet coffee
wholesaler. Ernesto’s job is to keep coffee products flowing to the various
independent outlets who resell his company’s products. Like most users,
Ernesto consumes basic IT services, including file storage, email, and so on.
He also interacts with a customer relationship management (CRM)
application that his company owns, and he uses an in‐house order
management application. Ernesto works on a team of more than 600
salespeople that are distributed across the globe: His company sells
products to resellers in 46 countries and has sales offices in 12 of those
countries.

Ernesto’s biggest concerns are the speed of the CRM and order management
application. He literally spends three‐quarters of his day using these
applications, and much of his time today is spent waiting on them to process
his input and serve up the next data‐entry screen. He needs that process to
be quick and efficient. When he looks up information or submits new
information, such as sales orders, he needs the system to be responsive.
Every minute he spends waiting is a minute he’s not selling, and those
wasted minutes can add up quickly.
Budget Control
As businesses start moving toward hybrid IT environments and applications and
incorporating outsourced components, budget starts to become a very real concern. For
example, consider how internally‐hosted applications and services are priced: The business
pays some up‐front cost to acquire a solution, often some kind of recurring maintenance,
and will have some amount of IT staff time spent on supporting the solution. Those costs
are relatively easy to determine, and are pretty much fixed. If the business has a really busy
week, the application will cost about as much to support as it would in a really slow week.


Now consider how cloud services are priced. Some, like many SaaS solutions, may be priced
per user, based on storage consumption, and so forth—relatively easy to track, trend,
predict, and control. Other services, however, are priced based on actual usage. Here’s the
current pricing, as of August 2010, for Microsoft’s Windows Azure cloud computing
platform:
• Compute = $0.12/hour
• Storage = $0.15/GB stored/month
• Storage transactions = $0.01/10K
• Data transfers = $0.10 in/$0.15 out/GB—($0.30 in/$0.45 out/GB in Asia)
This is a fairly typical pricing model for cloud computing; companies like Rackspace,
Amazon, and others all have similar pricing models. You’re paying based on usage. What
will your monthly bill be? There’s no way to know in advance.
This is where a truly unified monitoring system can help. In addition to tracking raw
performance and high‐level EUE metrics, a solution can also keep track of your service
consumption. It can help you see how much you’re using, and therefore how much you’ll be
paying. You’ll be able to keep an eye on your costs, and relate those costs to the income
produced by your hybrid applications. If an application is consuming more than it is
returning, you’ll be able to address the problem before your bills start getting out of hand.
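To see how tracked consumption translates into a bill, here is a short Python sketch that applies the August 2010 Windows Azure rates quoted above (the non-Asia data transfer rates) to a month of usage. The `estimate_monthly_cost` function and the usage figures are hypothetical; a real monitoring solution would pull the usage numbers from the provider.

```python
from decimal import Decimal

# Rates as quoted in the text (August 2010, non-Asia data transfer).
RATES = {
    "compute_hour": Decimal("0.12"),
    "storage_gb_month": Decimal("0.15"),
    "storage_tx_per_10k": Decimal("0.01"),
    "transfer_in_gb": Decimal("0.10"),
    "transfer_out_gb": Decimal("0.15"),
}


def estimate_monthly_cost(compute_hours, storage_gb, storage_transactions, gb_in, gb_out):
    """Return a per-item cost breakdown plus a total, using exact decimal math."""
    def d(value):
        # Convert via str so float inputs do not carry binary rounding error.
        return Decimal(str(value))

    items = {
        "compute": RATES["compute_hour"] * d(compute_hours),
        "storage": RATES["storage_gb_month"] * d(storage_gb),
        "transactions": RATES["storage_tx_per_10k"] * d(storage_transactions) / Decimal(10000),
        "transfer_in": RATES["transfer_in_gb"] * d(gb_in),
        "transfer_out": RATES["transfer_out_gb"] * d(gb_out),
    }
    items["total"] = sum(items.values(), Decimal("0"))
    return items


# Hypothetical month: two instances running 720 hours each, 50 GB stored,
# one million storage transactions, 20 GB transferred in and 100 GB out.
bill = estimate_monthly_cost(1440, 50, 1_000_000, 20, 100)
```

For this assumed usage, compute comes to $172.80 of a $198.30 total, so compute hours are the figure to watch. Comparing a running estimate like this against the income an application produces is the kind of budget tracking described above.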
This is an entirely new territory for monitoring software. It’s made more complicated by
the fact that some applications are really super‐distributed across different hosting
providers. Consider, for example, Figure 5.1. This is an example I’ve used before, but it
bears repeating in this new, budgetary context.

Figure 5.1: Example of a super‐distributed hybrid application.

This application has a certain number of resources in your own data center—shown by the
Windows, Linux, Oracle, and VMware icons. Those costs, as I’ve said, are relatively fixed
and predictable. The application also relies on SalesForce.com, which is an SaaS offering,
and on the cloud computing platform Rackspace Cloud. Your use of SalesForce.com might
be per‐transaction (perhaps it’s being used to issue license keys to customers), and your
cloud‐computing costs will likely be based on actual usage as well. Being able to track that
usage, and therefore those costs, across all those different platforms can be quite complex.
The business has a clear need for a monitoring platform that can unify all of that
information into a single place so that you can get a true and accurate picture of your costs.
John works for World Coffee’s IT department and is in charge of several
important applications that the company relies upon—including the CRM
application and the in‐house order management application. World Coffee
has moved to a hybrid IT application for its in‐house order management.
The CRM element is now outsourced to SalesForce.com, an SaaS provider;
the in‐house order management application is Web‐based, and runs in
Amazon’s EC2 cloud computing platform. Amazon is also used for customer‐
facing ordering applications.

John needs to make sure that every user of every system is experiencing
response times at or better than those defined by the company’s SLAs. This
task is difficult because many of the applications’ elements are outside his
direct control. He needs ways to directly test the EUE as well as ways to
check on the direct response times for the various SaaS, cloud computing,
and other outsourced IT elements.

John is also responsible for providing his boss with information on how
much all of this is costing the company. Since the switch to hybrid IT, the
company has spent more on outsourced IT services than they anticipated.
They’ve seen an increase in sales volume, so it’s likely that the additional
expenses are justified, but they need some way to closely track actual usage
and charges so that they can relate that more directly to the resulting sales
volume.
Technology Goals for Evolved Monitoring
IT’s job is to implement the SLA that the business has agreed to. Their job is to make sure
that the EUE metrics needed by the business can be delivered within whatever parameters
the SLA outlines. That means IT essentially needs tools that can tell them when the EUE is
starting to go wrong, and help them find the root cause of the problem so that they can fix
the problem before the SLA is missed.


Centralized Bottom‐Up Monitoring
Today, one of IT’s biggest challenges is that they simply can’t get enough information onto a
single, consolidated screen. Instead, they’re stuck looking at numerous consoles, as
illustrated in Figures 5.2 and 5.3. One for the database. One for VMware. One for Windows.
One for Active Directory (AD). One for Exchange. One for the other database. In addition,
few of these consoles have any ability to look into outsourced systems like SalesForce.com,
Rackspace Cloud, Amazon EC2, Google AppEngine, and so forth. These consoles offer no
means of calculating the EUE; thus, they can’t tell you whether you’re meeting the EUE
metric.

Figure 5.2: Database performance.

Figure 5.3: VMware performance.

EUE metrics cannot be derived from looking at component performance; the EUE must be
directly observed, and it takes a central, unified monitoring solution capable of doing so to
get an accurate EUE measurement. If you’re not meeting your EUE metric, however, you
still need to look at the individual components’ performance to find out which ones are
contributing to the reduced performance. Again, a unified monitoring console gives you this
capability by putting every component’s performance right in front of you—including
outsourced elements like cloud computing platforms, hosted services, SaaS services, and so
on. So that’s the technical need: Everything in one place. It’s the only way to meet the
business’ EUE‐centric SLAs.
Improved Troubleshooting
IT departments also have a need for more streamlined, efficient troubleshooting. When
something does start to go wrong with application performance, the IT team needs to be
able to quickly pinpoint the cause of the problem and bring domain‐specific tools to bear so
that the problem can be solved quickly.
A unified monitoring platform is generally regarded as the best way to avoid the “siloing”
that can occur during IT troubleshooting scenarios. In other words, by getting every team
member on the same screen, with the same information, everyone can agree more quickly
on which major application element is causing or contributing to a problem—rather than
each team member using individual domain‐specific tools and independently stating that
“their” component is “working fine.” Once affected application elements are identified,
either the unified monitoring solution or domain‐specific troubleshooting tools, or a
combination of both, can be used to further refine the root cause of the problem and to
discover a solution.
The key is getting everyone on the same page. A unified monitoring solution does this by
presenting similar statistics for a database server, virtualization host, Windows server,
cloud computing platform, and so on. While each of these elements will obviously have a
variety of different performance metrics that need to be examined, by bringing them all
together into the same place, and presenting them similarly, the solution can create a sort
of level playing field, providing a more authoritative starting point for troubleshooting.
A Shopping List for Evolved Monitoring
Those are our business and technology goals. Now we need to outline the exact capabilities
that an evolved, hybrid IT monitoring system needs.
High‐Level Consoles
It’s a screen shot I’ve shown you before, but the first and most important aspect of a truly
unified monitoring system is a high‐level console that gives you a broad, dashboard‐style
view of your overall application. Figure 5.4 shows an example.

Figure 5.4: Top‐level, unified monitoring dashboard.
This dashboard provides at‐a‐glance information for several cloud‐based services, and
offers an overview chart of response time from those services. A line chart shows historical
response times for the past few days or hours, helping administrators quickly identify
performance trends, spot weak links, and so forth. Drilling down into any of these services
provides additional information. Additional panels might include your own datacenter;
Figure 5.5 shows what a next‐level drill‐down into that data center might look like. Here,
we can see a more‐detailed view of what’s happening in the data center. Busy routers are
highlighted, and graphs show top utilization levels for storage, memory, processor, and so
forth. We’re essentially treating our data center as a kind of “cloud” that’s fully under our
control. This second‐level drill‐down lets us dive into the cloud and see some of the
individual elements that run it.

Figure 5.5: Drilling down into a data center.
For other data centers—our cloud providers, for example—we might not get that same
level of detail. After all, the “cloud” is supposed to just be a big bucket of functionality and
services, not individual servers. So the drill‐down here might offer a different kind of detail,
as Figure 5.6 shows.
Here, we’re presented with information on specific service instances, response times (over
time), and the number of transactions we’ve been sending to this provider—a key in
helping us meet that business requirement of monitoring actual usage. Every cloud
provider’s drill‐down might be a bit different, as each one works somewhat differently.

Figure 5.6: Drilling down into a cloud provider.
When there’s a performance problem with a cloud provider, this is likely as far down as
we’d drill; the next step is to get them on the phone and find out what’s happening. Within
our own data center, however, we’d likely want to drill a bit deeper.
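The transaction counts in a provider drill‐down are what let you relate usage charges to sales volume. Here is a minimal Python sketch of that arithmetic. The function name, the month keys, and the figures are all hypothetical; real charges would come from your provider's billing data and real sales counts from your order system.

```python
from typing import Dict, Optional

def cost_per_sale(
    usage_charges: Dict[str, float],
    sales_count: Dict[str, int],
) -> Dict[str, Optional[float]]:
    """Divide each period's provider charges by that period's sales.

    Periods with no recorded sales map to None rather than dividing by zero.
    """
    result: Dict[str, Optional[float]] = {}
    for period, charge in usage_charges.items():
        sales = sales_count.get(period, 0)
        result[period] = charge / sales if sales > 0 else None
    return result

# Made-up figures, for illustration only.
charges = {"2010-01": 1200.0, "2010-02": 1500.0}
sales = {"2010-01": 400, "2010-02": 0}
per_sale = cost_per_sale(charges, sales)
```

A rising cost per sale is the kind of number a manager can act on, whereas a raw invoice total is not.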
Domain‐Specific Drilldown
Within our own data center, having additional levels of detail can help focus
troubleshooting efforts. For example, in Figure 5.7, we can see alerts generated from a
specific server as well as summary information for key performance metrics from a variety
of servers.
By configuring performance thresholds, we can receive alarms when something looks
wrong. These should be from across all our servers, whether they’re running Windows,
Linux, Unix, or whatever. In fact, in the figure, you can see that both Windows servers
running SQL Server and Red Hat Linux computers are included in the list of alarms. Getting
all of this information onto the same page will help direct efforts to resolve these alarms.

Figure 5.7: Drilling deeper into specific servers.
Performance Thresholds
Ideally, a monitoring solution will come preconfigured with performance thresholds based
on the vendor’s experience with the technologies involved. Commonly, you’ll also be able to
define your own thresholds, perhaps defining a larger pad between “good performance”
and “bad performance” to give your team more time to react. As Figure 5.8 shows, these
thresholds should be used to create instantly‐readable visual displays: Indicators, graphs,
and gauges that help draw your eye to elements that need the most immediate attention.

Figure 5.8: Performance thresholds help drive graphical displays.
Thresholds will be different across different technologies. This example is for Exchange
Server, so it includes information about message queues and transfer agents in addition to
more generic metrics such as memory, CPU, disk, and network measurements.
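To make the threshold idea concrete, here is a minimal Python sketch of how a monitoring tool might turn raw readings into the ok, warning, and critical states that drive indicators and gauges. The metric names and limits are assumptions chosen for illustration, not any vendor's defaults, and the sketch assumes that higher values are worse.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class Threshold:
    """Warning and critical limits for one higher-is-worse metric."""
    warning: float
    critical: float

def classify(value: float, threshold: Threshold) -> str:
    """Return 'ok', 'warning', or 'critical' for a single reading."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"

# Hypothetical limits; a real product would ship its own defaults.
THRESHOLDS: Dict[str, Threshold] = {
    "cpu_percent": Threshold(warning=75.0, critical=90.0),
    "message_queue_length": Threshold(warning=100.0, critical=500.0),
}

def evaluate(sample: Dict[str, float]) -> Dict[str, str]:
    """Classify every metric in the sample that has a configured threshold."""
    return {
        name: classify(value, THRESHOLDS[name])
        for name, value in sample.items()
        if name in THRESHOLDS
    }
```

Widening the gap between the warning and critical limits is one way to get the larger pad described above, because the warning state fires earlier and gives the team more time to react.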
Broad Technology Support: Virtualization, Applications, Servers, Databases, and Networks
A unified monitoring solution is only as useful as the number of your technologies it can
unify into a single console—and ideally, you want a solution that can handle everything
you’ve got. There are subtle differences in how solutions work. Here are some
considerations:
• Virtualization
o Look for VMware, Hyper‐V, Citrix, Sun, and IBM support
o Consider support for agentless monitoring, which requires less impact on
your infrastructure and means less long‐term support and maintenance
o Look for templates that provide a starting point for Quality of Service (QoS)
metrics
• Applications
o Look for Messaging and Directory Services support for Exchange and AD at a
minimum, along with any other technologies you have in house, such as
Lotus Notes/Domino
o For Web servers, Internet Information Services and Apache support are both
desirable
o Collaboration platforms should be supported, too, including Lotus
Notes/Domino and Microsoft SharePoint
o You might need support for Voice over IP (VoIP) services, such as Cisco VoIP
components
o Also consider support for application services, such as those from Citrix, IBM
WebSphere, Oracle WebLogic, JBoss, Tomcat, and Sun’s Java Virtual Machine
• Servers—You probably have a variety of servers in your environment, and you want
to make sure they’re all included in your monitoring solution:
o AS/400 (for example, Figure 5.9 shows how a monitoring solution might
raise alarms for IPL or reboot conditions)
o Linux
o Unix
o NetWare
o Windows

Figure 5.9: Viewing AS/400 alarms.
• Databases—These form the backbone of most modern applications, and you don’t
want to be tied to just one or two simply because they’re all your monitoring
solution supports; get maximum flexibility by looking for support for:
o Sybase
o Informix
o Oracle
o Microsoft SQL Server
o DB2
o MySQL
• Networks—The infrastructure that connects everything can play a critical role in
your applications’ performance; look for a solution that can monitor:
o Cisco’s IP SLA—Figure 5.10 shows a monitoring dashboard for Cisco IP SLA,
which is critical on modern converged networks carrying voice, data, video,
and other traffic
o Core DNS, DHCP, and LDAP services
o SNMP management information
o Routers and switches
o Raw network traffic

Figure 5.10: Monitoring Cisco IP SLA.
End‐User Response Monitoring
There are two main kinds of EUE monitoring: Active and Passive. With Active, you can
actually set up “synthetic” transactions to feed into your system. The monitoring solution
can trace those, and accurately report on response times. You might end up with a screen
like the one in Figure 5.11, showing granular detail for user transactions.

Figure 5.11: Active user transaction monitoring.
Passive monitoring doesn’t inject transactions into the system but rather monitors the
individual system elements and creates an aggregated EUE metric. It isn’t always as
accurate as active monitoring, but it’s completely non‐intrusive. Most organizations use
both active and passive monitoring, and you should look for a solution that presents both
because the information they generate is largely complementary.
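As a rough illustration of active monitoring, the following Python sketch times each step of a synthetic transaction and records whether it succeeded. The step functions here are stand‐ins that I made up; in practice each step would drive the real application (for example, by requesting a page or submitting a login form).

```python
import time
from typing import Callable, Dict, List, Tuple

def time_transaction(name: str, step: Callable[[], None]) -> Dict[str, object]:
    """Run one synthetic step and record its elapsed time and outcome."""
    start = time.perf_counter()
    try:
        step()
        ok = True
    except Exception:
        # A failed step still reports how long it took before failing.
        ok = False
    elapsed = time.perf_counter() - start
    return {"step": name, "seconds": elapsed, "ok": ok}

def run_script(steps: List[Tuple[str, Callable[[], None]]]) -> List[Dict[str, object]]:
    """Run the steps in order and collect one timing record per step."""
    return [time_transaction(name, step) for name, step in steps]

# Hypothetical stand-ins for real user actions.
def fake_login() -> None:
    time.sleep(0.01)

def fake_lookup() -> None:
    raise RuntimeError("simulated failure")

results = run_script([("login", fake_login), ("lookup", fake_lookup)])
```

Running a script like this on a schedule, from the locations where your users actually sit, is what produces the per‐step response times shown in an active monitoring screen.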
SLA Reporting
This is something we’ll dive into more in the next chapter, but SLA reporting is absolutely a
capability your monitoring solution must provide. Even a straightforward report like the
one Figure 5.12 shows can be tremendously useful. It shows a breakdown of specific
elements of the application—launch, login, lookup, and logoff—and indicates which
elements are meeting their SLA. It also gives you the good or bad news, indicating whether,
at current rates, you are trending toward a breach in your SLA.

Figure 5.12: An SLA report.
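The per‐step breakdown in a report like this can be sketched in a few lines of Python. The assumption here is that the SLA is expressed as a response‐time target per step plus a required fraction of transactions that must meet it. The names, targets, and the 95% default are hypothetical and would come from your own SLA.

```python
from typing import Dict, List, Optional

def sla_report(
    samples: Dict[str, List[float]],
    targets: Dict[str, float],
    required_fraction: float = 0.95,
) -> Dict[str, Dict[str, Optional[object]]]:
    """Report, per step, the fraction of samples within target and a met flag.

    Steps with no samples get a compliance of None and are marked as not met.
    """
    report: Dict[str, Dict[str, Optional[object]]] = {}
    for step, target in targets.items():
        times = samples.get(step, [])
        if not times:
            report[step] = {"compliance": None, "met": False}
            continue
        within = sum(1 for t in times if t <= target)
        compliance = within / len(times)
        report[step] = {
            "compliance": compliance,
            "met": compliance >= required_fraction,
        }
    return report

# Made-up response times in seconds, for illustration only.
observed = {"login": [1.0, 1.2, 0.9, 3.5], "launch": [0.5, 0.6]}
limits = {"launch": 1.0, "login": 2.0, "logoff": 1.0}
report = sla_report(observed, limits)
```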
Public Cloud Support: IaaS, PaaS, SaaS
Finally, because so many applications are becoming reliant on cloud‐based services, your
monitoring solution must be able to include them, whether you’re using Infrastructure as a
Service (IaaS—think cloud computing like Amazon’s), Platform as a Service (PaaS—think
cloud computing like Microsoft’s Azure), or Software as a Service (SaaS—think
SalesForce.com). Popular choices you should absolutely be able to include in your
monitoring are:
• Amazon EC2 and S3
• Rackspace Cloud
• Google AppEngine
• Windows Azure
• SalesForce.com CRM
The Provider Perspective: Capabilities for Your Customers
As a Managed Service Provider (MSP), you become a part of your customers’ IT teams.
Typically, that means you have a dual problem that your customers don’t often face:
• You need to maintain, monitor, and manage your own network for your own
reasons. After all, you want to provide excellent services to your customers.
• You need to help your customers include your network and applications in their own
monitoring. There are a number of ways in which you can do this (many of which
we’ll discuss in the next chapter), but the bottom line is that you need to provide
your customers with some visibility into your infrastructure so that they can treat
your systems as a true part of their systems.

Li works for New Earth Services, a cloud computing provider. Li is in charge
of their network infrastructure and computing platform and is working with
World Coffee, who plans to shift their existing Web services‐based order
management application into New Earth’s cloud computing platform.

Li knows that he’ll have to provide statistics to World Coffee’s IT department
regarding New Earth’s platform availability because that availability is
guaranteed in the SLA between the two companies. However, he also knows
that he’ll need to provide them with more detailed insight into certain
aspects of New Earth’s infrastructure. After all, World Coffee is essentially
making New Earth’s network a part of World Coffee’s network through their
hybridized IT applications—so as a customer, they deserve some insight.

Li plans to search for monitoring applications that can be used by his
internal network engineers that will also allow him to provide dashboards
and reports directly to customers like World Coffee. That way, he doesn’t
have to build his own monitoring and data‐provisioning mechanism.

MSPs often look for monitoring systems that have multitenant capabilities. In other words,
the MSP can buy a monitoring system that lets them monitor their entire infrastructure
while also providing monitoring capabilities directly to their customers—restricting each
customer so that they can only see their portion of the MSP’s infrastructure. Such
monitoring systems are obviously more complex.
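At its core, multitenancy means filtering one shared pool of monitoring data by customer. The Python sketch below shows that idea under simplifying assumptions: every reading is tagged with a tenant name, and the customer names, resource names, and values are invented for illustration. A real product would add authentication and access control on top of this.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Reading:
    """One metric reading, tagged with the customer it belongs to."""
    tenant: str
    resource: str
    metric: str
    value: float

def visible_to(tenant: str, readings: List[Reading]) -> List[Reading]:
    """Return only the readings a given customer is allowed to see."""
    return [r for r in readings if r.tenant == tenant]

# Hypothetical data: the provider sees all of it, each customer a slice.
all_readings = [
    Reading("world-coffee", "web-01", "cpu_percent", 42.0),
    Reading("world-coffee", "db-01", "cpu_percent", 71.0),
    Reading("other-customer", "web-07", "cpu_percent", 15.0),
]
world_coffee_view = visible_to("world-coffee", all_readings)
```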
In some cases, the monitoring itself might be something that the MSP offers as an additional
service. Imagine a conversation between Li, who works for an MSP, and John, who works
for one of Li’s customers:
John: Will we be able to monitor the servers that we’re using on your
network?

Li: It’s possible. Let me ask, what sort of monitoring tools do you have now?

John: We use a lot of different ones. We have some for Oracle, others for
VMware, and others for Microsoft Windows. We don’t have anything
specifically designed for monitoring cloud‐based resources.

Li: You know, in addition to the cloud computing that we’re providing you,
we can provide you with a complete unified monitoring solution. It’s
basically a Software‐as‐a‐Service offering. It can monitor your entire
infrastructure in one console, including your Oracle, VMware, and Windows
systems. It can also include monitoring of our systems so that you’ll get your
entire network—including those bits you’ve outsourced to us—on the same
screen. There’s minimal impact on your network, and you wouldn’t be
responsible for maintaining or patching the monitoring system—it’s just a
service we’d provide to you.

John: I never knew such a thing was possible.

It is possible. Vendors are becoming increasingly creative and efficient at handling this kind
of hybrid IT environment, and the ability for service providers to offer monitoring as just
another service to their customers—well, it’s compelling.
Coming Up Next…
In the next and final chapter of this book, I’ll focus on one last set of capabilities that your
monitoring solution should offer: reporting. It’s very easy to get caught up in the actual
details of monitoring, concerns about performance alerts, and setting thresholds and forget
that management reporting is equally important. I’ll look at different kinds of reports that
can be used to help manage SLAs and keep the business on‐budget as well as reports that
show component‐level health, usage trends, and so on. I’ll also look at newer kinds of
reports, including dashboards, in addition to ways in which you might want to leverage
performance information elsewhere, such as data stores and application programming
interfaces (APIs). For the MSP perspective, I’ll also look at how things like multi‐tenant
capabilities can help deliver added value to your service offerings.

Chapter 6: IT Health: Management Reporting as a Service
I’ve spent a lot of time in this book explaining the capabilities and technologies you need to
add to your environment in order to enable truly hybrid, data center‐to‐the‐cloud
application and service monitoring. But all of that monitoring is useless without output:
One of the end goals of this entire effort is to provide your managers and executives with
effective reports—whether they are internal “customers” or external customers.
That means dashboards and other elements that show a manager that the environment is healthy and
on budget or that show them which service (not IT component) isn’t doing well. The goal of
this final chapter is to focus on these reports, what they should look like, and what value
you can expect to derive from them.
Note
This is an unusual chapter in that I’ll mainly be presenting examples of
reports. My goal is to help you develop a kind of “shopping list” for the types
of reports you should look for in systems that you’re evaluating and to
explain some of the finer details that I like in these reports. Most of these
examples are taken from live systems, so in some cases, I’ve obfuscated
customer‐specific information such as publicly‐accessible server names, IP
addresses, and so forth.
The Value of Management Reporting
There’s no question that reporting has value, but what, specifically, is that value? In other
words, what should you expect reports to provide other than pretty graphs? What will you
get out of reports? Let me quickly outline the major points so that I can then show you
examples of monitoring system reports that deliver those benefits.
Business Value
Businesses look, primarily, for reliability and return on investment (ROI). Specifically,
businesses want reports that can:
• Monitor compliance with service level agreements (SLAs)
• Monitor application performance from an end‐user perspective
• Track utilization, especially when that utilization relates to cost, as it does with most
cloud‐computing platforms
• Help predict growth in utilization (to help estimate the costs of supporting that
growth)
• Assist in maintaining maximum uptime and responsiveness for entire applications

Technology Value
Technologists need reports and tools that can help them achieve the business’ goals. That
means technology‐focused reports are often a bit (or a lot) more detailed, focusing on
implementation details that support the business’ high‐level views and metrics.
Technologists look for reports that can:
• Quickly detail key performance metrics for components, highlighting out‐of‐
tolerance metrics that require attention
• Show usage trends so that IT can predict when usage will exceed the system’s ability
to perform within tolerance
• Dive from high‐level metrics, like end‐user experience measurements, into deeper,
technology‐specific metrics for troubleshooting purposes
Reporting Elements
In the examples that follow, I will highlight specific capabilities of a monitoring system. In
most cases, I’ll call out specific features of these reports that I find especially useful, and
that I think you should look for in your own monitoring solution. I’ll spend the most time
on detailed performance reports because those provide the bulk of the intelligence you’ll
need to operate your infrastructure. I’ll also look at SLA‐specific reports and a few
dashboards that help provide a high‐level, at‐a‐glance view of the environment or specific
applications and services.
Performance Reports
Let’s dive into the examples I’ve gathered. First up, in Figure 6.1 is a look at Exchange
Server availability. This report really highlights the value of being able to monitor a hybrid
IT environment: This Exchange system is hosted at Rackspace, not in our own data center.
Being able to monitor system uptime—especially when other applications depend on this
system—is crucial to maintaining the overall performance of our environment.

Figure 6.1: An example Exchange availability report.

Note
You’re seeing a really good day on my Exchange Server; here the
performance line is the blue one at the very top of the graph, indicating 100%
uptime.
Figure 6.2 shows another way of monitoring email availability, and it illustrates a key
capability that you should look for. This chart is showing overall email availability from a
variety of services—Web mail, SMTP, POP, and so on. Those are the services you rely on, so
rather than monitoring the system, this report is monitoring those services. This kind of
service‐level availability is important for anyone who is relying on hosted or cloud‐based
services as a part of their IT infrastructure.

Figure 6.2: An example general email availability report—Rackspace hosted services.
Note
Once again, everything’s looking good—all of these services are at 100%.
Boring‐looking performance charts are the ones you hope to see all the time!
Figure 6.3 shows another hosted element: Salesforce.com statistics. This is a more basic
performance report, showing the number of transactions as well as a couple of service
level‐type statistics: transaction speed and overall system status. There are numerous
other stats you would want to track for Salesforce.com if you relied upon it, and your
monitoring system should deliver in this kind of easy‐to‐read, live report.

Figure 6.3: Example Salesforce.com statistics.
Note
Notice that a drop in the number of transactions isn’t bad, although the drop
in transaction speed might be worrying. Neither of these statistics affects the
system’s total uptime, shown on the bottom graph as 100%.

If you’re using a network (and what else would you be using?), you should be concerned
about its performance and availability. Figure 6.4 highlights another key capability I’ve
discussed throughout this book: getting everything into one monitoring system. Even
though you can monitor network protocol statistics using other tools, you should still
want them monitored in the same place as everything else. That way, when a problem
occurs, you have all the troubleshooting information you need in one place.

Figure 6.4: Network statistics from Cisco NetFlow.

Note
No DNS traffic at all? That could potentially be a bad sign, except that in my
network, most DNS is resolved internally, so we’re not seeing DNS use much
bandwidth.
Figure 6.5 is another example of a service‐level report, showing overall network bandwidth
utilization. I especially like the inclusion of a “high water mark” line, showing where
bandwidth has maxed out in the past 12 hours. Notice that there’s also a breakdown of
protocol traffic, so if there is a problem, you can get a good idea of what protocol is
contributing to that problem.

Figure 6.5: Network bandwidth.
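The “high water mark” is simply the peak reading within a trailing window. Here is a minimal Python sketch, assuming bandwidth samples arrive as (timestamp, value) pairs. The function name and the sample data are hypothetical, and the 12‐hour default mirrors the window mentioned above.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Sample = Tuple[datetime, float]

def high_water_mark(
    samples: List[Sample],
    now: datetime,
    window: timedelta = timedelta(hours=12),
) -> Optional[Sample]:
    """Return the (timestamp, value) of the peak reading inside the window.

    Returns None when no samples fall inside the window.
    """
    recent = [(ts, v) for ts, v in samples if now - window <= ts <= now]
    if not recent:
        return None
    return max(recent, key=lambda pair: pair[1])

# Made-up utilization readings in Mbps, for illustration only.
now = datetime(2010, 6, 1, 12, 0)
readings = [
    (datetime(2010, 5, 31, 20, 0), 950.0),  # older than 12 hours, ignored
    (datetime(2010, 6, 1, 3, 0), 410.0),
    (datetime(2010, 6, 1, 9, 30), 620.0),
]
peak = high_water_mark(readings, now)
```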
If you have any service or application that relies on external services—such as an external
LDAP server—then you need to be able to monitor that. Figure 6.6 shows how a monitoring
system can do so, connecting to an external LDAP system and measuring response times.
By establishing health thresholds for these response times, you can start to create alerts
and other notifications when response times exceed your tolerances.

Figure 6.6: Service response time—LDAP.
Traditional server monitoring should be included as well, as illustrated in Figure 6.7, which
shows common stats for a Windows server. Again, it’s not so much that you don’t already
have monitoring tools that can do this; it’s that you want all your monitoring information in
one place, whether it’s a simple Windows server or a completely‐outsourced server or
service.

Figure 6.7: Windows server utilization.

Note
Figure 6.7 is huge—and I actually cropped off additional charts. This is one
reason I prefer Web‐based reports, because the browser can scroll as much
as it needs in order to display large, detailed reports.
Broad platform coverage is a must. Even if you don’t have Unix (or Linux) today, for
example, you might well have a server or two in the future. As Figure 6.8 shows, your
monitoring system needs to be able to accommodate that growth. I like to see reports like
this, which essentially mirrors the Windows report and includes specifics for Unix.
Windows and Unix are similar, and their performance would be monitored similarly, but a
monitoring system can’t ignore their unique aspects.

Figure 6.8: Linux server utilization.

As the use of virtualization grows, so must your ability to monitor it—no matter which
brand you’re using. Figure 6.9 shows guest performance statistics on an IBM virtualization
host—using terms and elements that are specific to IBM’s implementation.

Figure 6.9: IBM virtualization guest statistics.
Note
Another excellent server—notice the stable performance over time.

Figure 6.10, however, shows that a monitoring solution can include other brands—such as
VMware. This figure focuses on host statistics, showing key performance indicators for
CPU, network, and so forth. Again, you want cross‐platform reports to look similar so that
you can do a sort of “apples to apples” mental comparison, but you don’t want to exclude
vendor‐specific information.

Figure 6.10: VMware vCenter monitoring.

More and more companies are adding VoIP to their technology mix, and there’s no reason a
monitoring system can’t include that. Figure 6.11 shows a report for Cisco’s CallManager
system, providing a way to monitor and troubleshoot VoIP performance.

Figure 6.11: Cisco CallManager monitoring.

Every business has databases, and some of your applications will access those via Java
Database Connectivity (JDBC), so you need to be able to monitor JDBC. Figure 6.12 is my
first example of low‐level, under‐application monitoring, showing JDBC statistics. This
might not be a report you look at first when a problem arises, but the point is that a
monitoring system should provide this kind of information to help you dive deeper into a
problem and either confirm or eliminate potential systems and technologies as the source
of, or contributor to, a performance problem.

Figure 6.12: JDBC statistics.

You’ll want to be able to monitor the database connectivity as well as the database platform
itself. Figure 6.13 shows an example of MySQL monitoring, but your monitoring solution
should include support for all major platforms, including Microsoft SQL Server, IBM DB2,
Oracle, Sybase, and so on.

Figure 6.13: MySQL statistics.
Getting back to service levels for a moment, take a look at Figure 6.14, which shows how a
monitoring system can also provide high‐level, service‐focused information—such as email
round‐trip times. This is a good indicator of overall system health, and I especially like the
inclusion of a trend line that shows where performance is heading. That’s a great way to get
on top of a problem before it becomes a problem.

Figure 6.14: Email response times (SMTP).
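A trend line like the one described above is typically a straight‐line fit through recent measurements. The Python sketch below uses an ordinary least‐squares fit and then projects when the line would cross a limit. I'm assuming the monitoring product does something along these lines; the function names, the hourly spacing, and the numbers are hypothetical.

```python
from typing import List, Optional, Tuple

def fit_trend(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or len(ys) != n:
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def hours_until_breach(
    hours: List[float], seconds: List[float], limit: float
) -> Optional[float]:
    """Project how long until the trend line reaches the limit.

    Returns None when the trend is flat or improving.
    """
    slope, intercept = fit_trend(hours, seconds)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - hours[-1])

# Made-up hourly round-trip times in seconds, for illustration only.
remaining = hours_until_breach([0, 1, 2, 3], [1.0, 1.5, 2.0, 2.5], limit=4.0)
```

With these sample numbers the fitted line rises half a second per hour, so the projection gives about three hours of warning before the 4‐second limit is reached.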
Other services contribute to your overall IT performance, such as Active Directory (AD)—a
lynchpin for many Microsoft‐based (and third‐party) services. Figure 6.15 shows that AD
responsiveness can be monitored right within the same monitoring solution, watching
statistics like connect time, replication speed, search load, and so on. Again, notice the
inclusion of an average line, which lets you visually ignore peaks and valleys and focus on
the overall average performance of a given service.

Figure 6.15: AD response times.

Web servers are running more and more applications, both internal and external, and
monitoring the Web server platform is critical. Again, having this in a single solution makes
it easier to monitor the entire application stack: Web server, database platform, database
connectivity, network protocols, and so forth. You can start to see how this kind of system
gives you insight into every aspect of the application, making it easier to spot and solve
problems. Figure 6.16 looks at Microsoft’s IIS Web server.

Figure 6.16: IIS Web Server statistics.

Because few shops are completely homogeneous these days, Figure 6.17 shows that the
same solution can also monitor the Apache Web server. Again, this report is similar to the
IIS one, as both IIS and Apache are quite similar, but the Apache report includes specifics to
that platform.

Figure 6.17: Apache Web Server statistics.
Note
See that vertical red area on each of the three charts? That’s a time period
during which my monitoring system wasn’t able to talk to the Web server
being monitored, so it couldn’t draw the chart accurately.

Figure 6.18 once again returns to a service level‐focused report, showing responsiveness
for a Customer Relationship Management (CRM) solution. In fact, this particular report
shows the results of synthetic transactions injected into the system to measure real‐world
performance from the end‐user perspective—the end‐user experience (EUE) that I’ve
discussed in prior chapters. Here, we can see real‐world response times for end‐user tasks
such as opening the application’s home page, logging in, and searching. A problem at this
level would drive us to dive deeper—into the Web platform, database platform, network
utilization, and so on.

Figure 6.18: CRM system responsiveness.

Finally, you can’t ignore the physical aspect of your infrastructure, and Figure 6.19 shows
that a monitoring system can include considerations such as your server room’s
temperature—provided, of course, you have the right measurement probes in place to
gather this information.

Figure 6.19: Server room temperature.
SLA Reports
Having inundated you with examples, I’ll just provide a couple for this section. The idea
here is to roll up performance into something that can map to your SLAs, making it easier
for you to manage those SLAs. Figure 6.20 shows the first example, rolling up numerous
statistics into a simple “you made it or you didn’t” measurement for several services,
including a database server, Web server, Web response times as measured from two
locations, and so forth. There’s a trend analysis, too, indicating that the SLA is in no danger
of being breached given current performance.

Figure 6.20: An SLA report.
A historical look is also nice, and Figure 6.21 shows an example.

Figure 6.21: Historical SLA performance.
Here, we can easily see when the SLA was breached and how badly. This can be useful
when it comes time to negotiate pricing or performance, especially for hosted services and
applications.

Dashboards
I love dashboards, and I dislike monitoring solutions that don’t provide lots of ‘em. These
are a great tool for quickly checking the status of your environment at a high level, and for
starting the detail dive when something is wrong. Figure 6.22 provides an excellent
example, showing the end‐to‐end component view of an application, including individual
servers, connectivity between them, response times for specific services such as SQL or
LDAP queries, and so on. Problem systems are conveniently highlighted in orange,
directing my attention to the components that require it.

Figure 6.22: End‐to‐end performance dashboard.
Figure 6.23 is an EUE dashboard, showing me—in simple colors and graphs—what my
users are experiencing when they use a particular application (in this case, my Bugzilla
bug‐tracking application). I can see how fast the home page is launching, how long it takes
to log in, how long it takes to find bugs and open them, and so on. I get a quick view (on the
right) of the major platforms that comprise this application: the application code, MySQL
database, Apache Web server, and Linux operating system (OS).

Figure 6.23: EUE dashboard.
For highly‐distributed applications, geo‐views like the one in Figure 6.24 are tremendously
useful. This helps you see, at a large scale, where your application may be having specific
problems.

Figure 6.24: Geographic dashboard.

Finally, I especially like monitoring systems that can provide a customized, whole‐
environment rollup like the one shown in Figure 6.25, which was developed for a hospital.
This dashboard provides an at‐a‐glance view of everything critical to healthcare
applications in the environment. It’s literally the thing you want running all the time on
some monitor somewhere so that everyone can be assured that all the systems are okay—
or quickly take action if something isn’t.

Figure 6.25: High‐level applications and services dashboard.
The Provider Perspective: Reports for Your Customers
Managed Service Providers (MSPs) will appreciate most of the reports and dashboards I’ve
shown so far, but they also need something specific to the kind of business they’re in. Most
MSPs will also need the ability to look at performance from a client perspective so that they
can see how a given client’s services are performing. A monitoring solution should
absolutely be able to provide that, and Figure 6.26 shows one way in which it might do so:
grouping services by customer and showing the overall utilization of each customer.

Figure 6.26: MSP dashboard.
Conclusion
There you have it: Advanced, modern monitoring for the hybrid IT environment—from the
data center to the cloud and even for MSPs that need to provide customers with insight into
their own networks and systems. It is possible, using the right tools and the right
techniques—and some vendors can even provide you with these monitoring capabilities as
an SaaS solution, giving you an almost instant implementation, if desired. The solutions are
out there—time to start looking.