High Availability

Reshaping IT to Improve Service Delivery
A proposal for implementing ITIL to increase service availability, reshape IT as a business enabler, reduce costs, and decrease the time it takes to resolve problems.

Nathan W. Lindstrom
3/7/2010

Cover Letter
Dear CEO,

As your new CTO it is incumbent upon me to lead the Information Technology organization to accomplish two simultaneous goals:

1. Enable our customers to enjoy the best possible day-to-day experience by keeping site uptime as close to 100% as is humanly possible, and
2. Enable our customers to enjoy the latest features and services by empowering the engineering group in their pursuit of the fastest possible concept-to-implementation lifecycle.

These two goals, when consistently realized, are the primary cause of this company's continued success and exponential growth. Conversely, when the IT group falls short of one or both of these goals, the company is held back, customers slow their spending, and everyone suffers.

This proposal provides the framework for a complete overhaul of the existing IT organization, one that will place our IT group among the ranks of the world's best high-performing IT operations. This improvement will ripple across the breadth of our company, radically revitalizing our ability to reach new levels of success. Most important of all, IT will move from being a roadblock and a cost center to being the engine behind customer growth and increasing revenue.

I look forward to hearing your thoughts and feedback on this vital proposition.

Sincerely,

Nathan W. Lindstrom

Table of Contents
Cover Letter
Executive Summary
Something Clearly Needs Improvement
Shared Characteristics of High-Performing IT Organizations
    High Availability
    Clear Metrics
    Early Investment
    Compliance is Easy
    High Server Ratios
Step One: Stop the Bleeding
Step Two: Stabilize the Patient
    Dress the Wound
    Heal the Patient
Step Three: Stop Reinventing the Wheel
    Configuration Management Database
    Centralize Configuration Management
    Repeatable Build Process
    Comprehensive Monitoring
Step Four: Continual Service Improvement
    Known Errors
    Product Pipeline
    Business Enablement
Conclusion

Executive Summary
People both within and without an Information Technology organization often feel that there are only two choices when it comes to conducting the day-to-day operations of IT:

• The “console cowboy” approach, where everyone is free to make changes under the illusion of “agility” and “nimble responsiveness.”
• The bureaucratic change management approach, where everyone is subject to draconian controls that stifle change and slow response to a snail’s pace.

These people are mistaken. While those two approaches represent the opposite extremes of IT change management policy, and both bring significant and serious problems which are destructive to the business as a whole, they are not the only choices. It is possible to be both agile and rigorous in our change management approach. We can simultaneously reduce human error and increase responsiveness. In other words, this proposal will detail how we can realize world-class uptime and also be the fastest kid on the block when it comes to releasing new features to our customers. These two seemingly conflicting goals, maintaining stability and promoting rapid change, can comfortably coexist within the best practices framework that this proposal sets forth.

This proposal makes extensive use of the Information Technology Infrastructure Library (ITIL) framework of best practices. With its origins in the early 1980s, the ITIL concepts and practices have withstood the test of time and are at present the most widely adopted approach to IT in the world. Other quality assurance methods, such as Six Sigma and TQM, are used nowhere near as much as ITIL, and do not speak to many of the specific challenges seen within IT. Additionally, the basic concepts of ITIL are simple and easily taught to the average user, while implementing Six Sigma or similar often requires extensive training coupled with expensive systems and policy shifts.

Something Clearly Needs Improvement
This is the time and place where I would typically insert a graph which demonstrates our uptime and availability over the past month. It would probably look something like this:

[Example graph: monthly site uptime and availability]

But in our case, even creating this graph is impossible: we lack the metrics to understand just how bad things really are! Anecdotal evidence, such as this recent email from the customer support group, points to our web site availability being truly abysmal:

    From: Customer Support
    To: Operations
    Subject: Ongoing outage

    Do you guys have any sense at all for when the site will be back up? It’s been down since last night, and this morning alone we’ve received over 100 calls from irate customers demanding to know when we’ll have this fixed. This is absolutely unacceptable! The lack of communication regarding status is troubling. Let us know when the site will be back so we can tell the people who are calling! Also, once this is finally fixed, be sure to communicate the root cause, as a lot of people are going to be demanding to know both what caused this outage and what we’ll be doing to prevent another, as we seem to be having outages practically every week! Please reply immediately so we know you read this!

This email is a classic example of the kind of emails I’ve been seeing ever since I started here just a week ago. Not only does it point to a serious and systemic problem within IT, but it also says loud and clear that the rest of the business has lost all faith in the IT department’s ability to do its job. Clearly something needs to be done, and needs to be done quickly.

Shared Characteristics of High-Performing IT Organizations
High Availability
This may seem like an obvious one, but it is at the heart of where the practices and outcomes of all high-performing IT organizations overlap. Put succinctly, the sites and services that are the responsibility of a world-class IT group routinely and consistently experience extremely high uptime – typically no less than 99.9% for the year. That is just under nine hours of downtime for the entire year.
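As a quick sanity check on that figure, the arithmetic behind a downtime budget can be sketched in a few lines of Python; the availability targets below are common industry thresholds used purely for illustration.

    # Back-of-the-envelope downtime budget for a given availability target,
    # assuming a 365-day year.
    HOURS_PER_YEAR = 365 * 24  # 8,760 hours

    for availability in (0.999, 0.9995, 0.9999):
        downtime_hours = HOURS_PER_YEAR * (1 - availability)
        print(f"{availability:.2%} uptime allows about {downtime_hours:.1f} hours "
              f"({downtime_hours * 60:.0f} minutes) of downtime per year")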

Clear Metrics
You can’t achieve success if you can’t define it. And even if you have defined it, you can’t tell when you’ve achieved it unless you’re measuring yourself day in, and day out. One very common characteristic of elite IT organizations is that they have clear metrics in place. The data is consistently gathered; the measurements make sense and are consistently applied; and the results are reported in a lucid manner to the rest of the business.

Early Investment
Great IT organizations behave in a proactive manner, as opposed to the reactive manner of failing organizations. This proactive stance is seen early in the product pipeline, as the IT organization invests time and energy early in the lifecycle of the company’s products and services. This allows them to be proactive in solving challenges that might otherwise impede progress, and to serve as an enabler from product conception all the way to end-user delivery. There are no surprises, no last-minute scrambling.

Compliance is Easy
Non-high-performing IT organizations dread the thought of needing to achieve compliance with industry or regulatory standards. This is because such compliance comes at a very high cost in terms of needing to shoehorn a broken environment into some semblance of compliance. On the other hand, world-class IT organizations always behave in a manner consistent with compliance. They are able to successfully blend enabling the business with remaining in full compliance with SOX, PCI, security standards, and so forth. Thus, when the time comes for an audit, the high-performing organization is calm and confident.

High Server Ratios
The most common measure of the efficiency and cost-effectiveness of an IT organization is the ratio of servers to system administrators. It stands to reason that the more “noisy” and troublesome the servers are, the more sysadmins are needed to keep them functioning. A good line dividing the chaotic, failing organizations from the successful ones is 100:1 – one hundred servers to every one sysadmin. But in a truly high-performing IT organization, that ratio can comfortably go even higher – ratios of 200, 300, or even 500 to 1 are not unusual. As salaries are usually the single highest cost of doing business, anything that allows us to do more with less brings a positive impact to the bottom line.

This is clearly the state in which we want to be. It is also clearly the state we are not presently in – far from it. So how do we get from here to where we want to be?

Step One: Stop the Bleeding
When a person staggers into the emergency room and collapses on the floor in a rapidly spreading pool of blood, what is the first action? To find the source of the bleeding and staunch it. Until the loss of blood is stopped, all other considerations take a back seat. Let’s take this emergency room analogy and apply it to our own situation. Historically, and across every industry, what is the primary source of outages (bleeding)?

In brief, an “act of god” is something like a hurricane or tornado wiping out the data center where the servers are located, or a flood washing away a critical fiber line that connects the data center to the Internet. “Environment failure” is something like a widespread power loss or the failure of air conditioning equipment in the data center, leading to hardware failure. “Hardware failure” comes in many forms, the most common being hard drive failures.

All of those pale in significance when compared with human error. “Human error” comprises all those things that can go wrong every time there is a change made to the production environment. No matter how seemingly small or insignificant the change, every time someone touches something there is the likelihood of catastrophic failure. Donna Scott, VP of Research at the Gartner Group, notes that “80 percent of unplanned downtime is caused by people and process issues, including poor change management practices, while the remainder is caused by technology failures and disasters.” If you consider that even a single hour’s outage can cost the company hundreds of thousands of dollars in lost revenue, not to mention damage to goodwill (“I don’t go there anymore, their site is always down”) and the loss of customers to our competitors (“they were down, so I placed my order elsewhere”), the cost of repeated outages is truly staggering.

Once failure does happen, the next variable to come into play is Mean Time to Resolution, or MTTR. This is the measure of how long it takes for the offending change to be identified and reversed. Of course, the longer it takes to fix the cause of the outage, the more revenue is lost. MTTR has a dramatic impact on company profitability, and it is therefore critical that we understand how to shorten it. Two factors have a huge impact on the duration of MTTR:

1. How often are changes being made?
2. How well understood and recorded is each change?

To answer these questions, two processes must be immediately put into effect:

• Each and every IT person uses a uniquely-identifiable login to access each and every server
• Each and every change is recorded within a ticketing system

Within our environment, the literal implementation of the above will involve eight hours of time from two system administrators; the distribution of user logins to every server; the installation of “sudo” on every server; changing the root password on every server; and purchasing, installing, and configuring JIRA as the ticket system. At this point, the following policies will be put into effect:

• To access a server, you must log in as yourself.
• To temporarily elevate execution privilege, you must use sudo.
• Before making a change, you must log a Request for Change (RFC) within the ticket system.
• The status and outcome of every change must be added to the corresponding RFC.

Implementing the above policies, providing any required user training in the use of new technologies such as sudo, and locking down the root password will take no more than one week.
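As a side benefit, once every outage and its resolving change are captured in the ticket system with timestamps, MTTR can be measured rather than guessed at. A minimal Python sketch of that calculation follows; the incident records are invented, and in practice the timestamps would come straight from the ticket system.

    from datetime import datetime

    # Illustrative incident records: when each outage was detected and when
    # the resolving change (its RFC) closed it out.
    incidents = [
        {"detected": datetime(2010, 2, 1, 22, 15), "resolved": datetime(2010, 2, 2, 3, 40)},
        {"detected": datetime(2010, 2, 9, 14, 5),  "resolved": datetime(2010, 2, 9, 16, 20)},
    ]

    total_seconds = sum((i["resolved"] - i["detected"]).total_seconds() for i in incidents)
    mttr_hours = total_seconds / len(incidents) / 3600
    print(f"MTTR: {mttr_hours:.1f} hours")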

Step Two: Stabilize the Patient
Once the bleeding has been staunched, the overall health of the patient becomes the next concern. So too, within our organization, the continued enforcement of the policies outlined in the previous section (which stopped the bleeding) is of paramount importance.

Dress the Wound
After the root password is changed site-wide by a single trusted member of the team, the password is distributed to several department heads for safe keeping. Unless exceptional circumstances call for it, the root password is not disclosed to any of the IT team members.

How to use sudo, and how to open, modify, and close an RFC within the ticket system, is taught to every member of the IT group: policy enforcement only begins after there can be no doubt that everyone understands the policies and the tools needed to comply with those policies. The expectations of each member of the IT organization are made crystal clear. For example, I will say something like this to the team on more than one occasion:

    “Team, let me be clear on this: these processes are here to enable the success of the entire team, not just individuals. Anyone making a change without following procedure undermines the success of the team, and we’ll have to deal with that. At a minimum, you’ll have to explain to the entire team why you made your cowboy change. If it keeps happening, you may quickly find that you’re no longer welcome in this team.”

Heal the Patient
Once we have moved past the triage phase, with the root password secured, everyone using uniquely-identifiable access methods, and all changes being logged as RFCs, we can work toward healing the patient by further refining the processes we have implemented. Two more events now take place, typically within the same timeframe of one week:

• Establish the Change Advisory Board (CAB) and schedule regular meetings
• “Electrify the fence” through establishing the audit of all changes

The Change Advisory Board, or CAB, is typically comprised of the following individuals:

• The Director or VP of Information Technology
• A network engineer (or system administrator who is knowledgeable about networking)
• A database administrator (or system administrator who is knowledgeable about databases)
• A system administrator
• A resource from the engineering or development team
• A resource from the QA or load testing team

This group meets once per week and does the following:

1. Reviews the success or failure of the RFCs implemented since the last CAB meeting
2. Reviews and then approves or rejects new RFCs
3. If an RFC is approved, schedules its implementation with the person who requested the change; if an RFC is rejected, communicates the reasons for its rejection to the person who made the request

The second critical event which now takes place is the implementation of a technology such as AIDE or Tripwire, which performs a nightly scan of every server for changes. The audit logs produced by this nightly scan are then examined by the CAB, and changes which took place but for which no corresponding RFC exists are raised to my attention. As each individual is now logging in as themselves, and not using an “anonymous” access method via the root account, unauthorized changes may be traced to the responsible individual, and that person dealt with accordingly.

The typical phases that a “repeat offender” moves through in terms of policy enforcement are as follows:

1. The policy is made clear, and required training provided.
2. Any lack of clarity about the policy or lack of understanding about the tools is rectified.
3. The individual is warned that a fourth unauthorized change will result in them being put on a Performance Improvement Plan (PIP).
4. The individual is put on a 30-day PIP.
5. The individual is let go.

Why all these seemingly strict measures? Because all high-performing IT organizations have only a single acceptable number of unauthorized changes: zero. Without exception, unauthorized changes lead directly to outages and lost revenue.
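To make the “electrified fence” concrete, the Python sketch below shows the kind of comparison a nightly audit performs: hash a set of watched files, diff against yesterday’s baseline, and flag any change that no approved RFC accounts for. In production this job is done by AIDE or Tripwire; the watched paths, baseline location, and RFC list here are purely illustrative.

    import hashlib
    import json
    from pathlib import Path

    WATCHED = [Path("/etc/passwd"), Path("/etc/ssh/sshd_config")]   # illustrative paths
    BASELINE = Path("/var/lib/audit/baseline.json")                 # illustrative location

    def current_hashes():
        return {str(p): hashlib.sha256(p.read_bytes()).hexdigest()
                for p in WATCHED if p.exists()}

    def nightly_audit(paths_covered_by_approved_rfcs):
        baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
        now = current_hashes()
        for path, digest in now.items():
            # A change with no corresponding approved RFC is an unauthorized change.
            if baseline.get(path) != digest and path not in paths_covered_by_approved_rfcs:
                print(f"UNAUTHORIZED CHANGE: {path} -- escalate to the CAB")
        BASELINE.write_text(json.dumps(now))  # becomes tomorrow's baseline

    nightly_audit({"/etc/ssh/sshd_config"})  # paths touched by this week's approved RFCs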

Step Three: Stop Reinventing the Wheel
Once the patient is back on his feet, it is time to prevent a recurrence of the injury that brought him to the hospital in the first place. Now that we have basic change management in place, it is time to turn our attention to creating an environment that is both predictable and repeatable. This will hold chaos at bay and free the IT organization to devote its energies to enabling the success of the business.

Configuration Management Database
A Configuration Management Database (CMDB) must be created which contains the vital statistics of every single piece of hardware within the production environment. It is the “source of truth” where every other piece of technology is concerned; other components, like those that follow, will look to the CMDB for information. As such, it is vital that the CMDB be as accurate and up-to-date as possible.
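As an illustration of the kind of record the CMDB holds for each configuration item, consider the minimal Python sketch below; the field names and hosts are assumptions, and a real CMDB would of course live in a proper database rather than in code.

    from dataclasses import dataclass

    @dataclass
    class ConfigurationItem:
        hostname: str
        role: str          # e.g. "web", "app", "db"
        datacenter: str
        serial_number: str
        os_version: str
        owner: str         # team responsible for this host

    cmdb = [
        ConfigurationItem("web01", "web", "DC-1", "SN1234", "CentOS 5.4", "Operations"),
        ConfigurationItem("db01",  "db",  "DC-1", "SN5678", "CentOS 5.4", "Operations"),
    ]

    # Other tooling (monitoring, configuration management, capacity planning)
    # treats the CMDB as the source of truth:
    web_servers = [ci.hostname for ci in cmdb if ci.role == "web"]
    print(web_servers)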

Centralize Configuration Management
A technology such as Puppet or cfengine should be deployed, and every configuration file that is changed subsequent to the installation of the operating system must be placed under its control. This will permit site-wide changes to be made in a controlled, auditable, and repeatable fashion. This is also the single greatest contributing factor to increasing the ratio of servers to system administrators.
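Conceptually, what Puppet or cfengine does for each managed resource is simple: declare the desired state once, then converge every server toward it on a schedule. The toy Python sketch below illustrates that idea for a single file; it is not Puppet syntax, and the path and content are placeholders.

    from pathlib import Path

    # Desired state, declared once and applied everywhere.
    DESIRED_FILES = {
        Path("/etc/motd"): "Authorized use only. All changes require an approved RFC.\n",
    }

    def converge():
        for path, content in DESIRED_FILES.items():
            if not path.exists() or path.read_text() != content:
                path.write_text(content)              # correct the drift
                print(f"corrected configuration drift on {path}")

    converge()  # run periodically on every server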

Repeatable Build Process
A technology such as Kickstart should be deployed, and over time every single server should be rebuilt from the “bare metal” to use the standard operating system image deployed by Kickstart. This will permit individual servers within a functional group (for example, web servers, application servers, or database servers) to be identical and homogeneous. This makes troubleshooting and repair significantly faster, sharply decreases MTTR for problems seen across more than one server, and is the second greatest contributing factor to having a high ratio of servers to sysadmins.
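One way to see the payoff of a standard build is that every server in a functional group should report an identical package set, so drift stands out immediately. A small Python sketch, using invented hostnames and package versions:

    installed = {
        "web01": {"httpd-2.2.3", "openssl-0.9.8e", "php-5.1.6"},
        "web02": {"httpd-2.2.3", "openssl-0.9.8e", "php-5.1.6"},
        "web03": {"httpd-2.2.3", "openssl-0.9.7a", "php-5.1.6"},  # drifted from the standard image
    }

    reference = installed["web01"]
    for host, packages in sorted(installed.items()):
        drift = packages ^ reference   # symmetric difference against the reference build
        if drift:
            print(f"{host} differs from the standard build: {sorted(drift)}")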

Comprehensive Monitoring
A technology such as OpenNMS or Nagios should be deployed, and every single server, as well as each service provided by each server, should be monitored for availability and responsiveness. It is only with this final step that metrics can be accurately captured and reported, and Key Performance Indicators (KPIs) created and tracked. These in turn can be reported to the rest of the business, giving all the stakeholders the visibility into IT operations that they need to understand and trust IT as their business partners.
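The Python sketch below shows the essence of what a tool such as Nagios or OpenNMS does at scale: probe a service on a schedule and roll the results up into an availability figure that can feed a KPI. The URL and the probe loop are placeholders; a real monitoring system would schedule, store, and alert on these results.

    import urllib.request

    def probe(url, timeout=5):
        """Return True if the service answered with HTTP 200 within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status == 200
        except Exception:
            return False

    # One probe per minute for an hour (illustrative only).
    results = [probe("http://www.example.com/") for _ in range(60)]
    availability = 100.0 * sum(results) / len(results)
    print(f"Availability this hour: {availability:.2f}%")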

Step Four: Continual Service Improvement
Once unauthorized changes have remained at zero for some time, KPIs are in place and being effectively tracked, and the overall IT environment is both stable and easily scaled, the IT group can focus on several things:

• The myriad little “problems” that have a minor but irritating impact on users, and which previously were overshadowed by the huge stability challenges
• Understanding the product development pipeline and “getting out in front” of upcoming changes
• Working with the rest of the business to identify ways that IT can make everyone’s job easier and more efficient

Known Errors
In any organization, regardless of age or size, there exist dozens to hundreds of little annoyances that plague individual users and IT team members alike, but that have been ignored or deferred because of the far larger problems that were consuming IT’s time. As these minor issues surface they should be captured within the ticket system as a special kind of entry, called a Known Error (KE). The collection of known errors within the ticket system is called the Known Error Database (KEdb). Stated simply, a KE is a problem that either impacts so few people that it is considered minor, or where the amount of effort required to fix it far outweighs the benefit of fixing it. However, as the IT organization gains more time and energy to devote to fine-tuning the environment, the KEdb becomes a source of vital information about things that should be fixed in the near future, and which may be underlying causes of larger problems.
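The exact shape of a Known Error entry will depend on the ticket system; the record sketched below is only an illustration of the minimum worth capturing, namely the symptom, its impact, and any workaround. The field names and values are invented.

    known_error = {
        "id": "KE-0042",                       # illustrative identifier
        "summary": "Nightly report job occasionally times out",
        "impact": "Low -- affects two analysts; the report can be re-run manually",
        "workaround": "Re-run the job after 06:00 when database load drops",
        "related_problems": [],                # linked later if a larger problem traces back here
    }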

Product Pipeline
To remain in a proactive state, and avoid returning to the trap of reactivity, it is critical that the IT organization embed itself within every level of product design, development, testing, and implementation. By the time a product reaches production, IT should possess a thorough understanding of it, have implemented whatever hardware and software is needed to support it, and have worked closely with product engineering to have the documentation needed to effectively support the product. The words “we didn’t know that” should never enter the vocabulary of the IT organization.

Business Enablement
The true role of IT within the business is to enable the business to be successful. But beyond the obvious means of accomplishing this (like maintaining excellent uptime) are more subtle but equally important services, such as:

• Keeping users informed of what’s going on so that there are never any surprises
• Educating users (particularly developers) as to the production environment, so that they understand the technology on which their product will run
• Constantly checking for performance bottlenecks or possible scaling issues, and solving them proactively before business is impaired

Conclusion
Getting from where we are now (bleeding out on the floor) to where we should be (enjoying exponential success as a world-class organization) is actually quite simple. It just takes guts and a lot of hard work. But the path is clear, the rewards huge, and the costs of getting there quite minimal. 100% uptime. That is our goal, and with your support, we will realize that goal within 30 days. This means that 30 days from now, the business can officially enter the “hockey stick” of exponential growth. At that point, the sky will be the only limit.
