Can your organisation deal with a sudden and immediate loss of its IT systems?

And can you avoid it?
A guide to removing user downtime as part of a core business initiative. Includes a practical check-list for assessing the effectiveness of your proposed High Availability, Business Continuity or Disaster Recovery solution.

By Neil Robertson

© Copyright of the Neverfail Group 2005


FOREWORD


It is no secret that technology moves forward at a seemingly ever-increasing pace. As a result, it becomes progressively more difficult for senior management to keep up to date with what is technically possible and therefore what has become commercially desirable. It is a fact that almost every organisation is continually increasing its dependency on IT to do business. Critical applications are woven into the very fabric of day-to-day business operations, yet the senior management of the vast majority of organisations still fail even to review their vulnerability, exposure and liability to IT downtime.

The options available to organisations for IT Disaster Recovery (DR), High Availability (HA) and Business Continuity (BC) planning have fundamentally changed in what is achievable, whilst the complexity and cost of deployment and on-going management have reduced significantly. The IT industry has been built on the delivery of solutions that are faster, easier to use and cheaper than their predecessors. This process delivers better value to all businesses, whilst making the solution available to an ever-increasing market. For the larger organisations this means better protection at lower cost. For the smaller organisation, it means that vastly superior protection is not only available to them for the first time, but is financially justifiable and, perhaps more importantly, commercially desirable.

This book will help corporate executives and senior IT staff to re-evaluate their core strategy in the use, protection and availability of the critical IT systems that underpin the moment-by-moment running of their business.

It will also help IT staff to cut through the marketing spin and address the critical issues when purchasing systems and software that claim to eliminate user-downtime. The emphasis has moved from protecting data to protecting the productivity of the user, whether that user is a member of your staff or one of your customers, suppliers or prospects, and whether they are internal, on-line or accessing applications and information on your intranet, extranet or website. Every business recognises its increasing reliance on IT for its day-to-day operation. Management now has a clear choice: whether it is prepared to recognise and quantify its corporate exposure to downtime and, based on those findings, whether to tolerate downtime or not. The significance of that decision is relevant to your shareholders, stakeholders, employees, partners, customers and prospects, who look to you to protect and advance their respective interests.

Neil Robertson
April 2005


Connected! A management guide to removing user downtime as part of a core business initiative.


CONTENTS

Foreword
Contents
A Glossary of Terms
Why do we tolerate Downtime?
Quantification of Downtime Risk
High Availability or Disaster Recovery?
Does Downtime really Matter?
Who is Responsible and therefore who is to Blame?
Next Steps
Critical Server Identification
Data Protection
High Availability and Disaster Recovery: Selection criteria
Architecture: Basic Structure of the HA / DR solution
Reliability & Monitoring
Application Software Protection / Auxiliary Software Protection
Switch / Failover Criteria and Performance
Bandwidth Considerations
Summary
The Critical Components required to removing Downtime
About the Author


www.neverfailgroup.com


A GLOSSARY OF TERMS


Users:
Your staff: the company’s most valuable asset and probably its greatest overhead. Users also include prospects, customers and suppliers that access IT information and services via email, extranet and your website.

Critical Server Downtime:
The failure of the business and the users to undertake and execute critical actions required in the moment-by-moment operation of the business.

Repeated Critical Server Downtime:
Career threatening, sleep inhibiting, worst nightmare scenario, because whenever it happens, for some users, customers, suppliers and stakeholders, it will be the worst possible and most damaging moment.

Downtime:
The disconnection of users from the software & data they require in order to work effectively (for whatever reason).

User Downtime:
The real cost of server downtime: lost user productivity, broken commitments, reduced performance and missed expectations.

High Availability (HA) Strategy:
Ensuring that the users either remain connected, or can be reconnected to a working application in the shortest time possible.

Planned Downtime:
Deliberately disconnecting the users from software and data they require in order to work effectively.

Disaster Recovery (DR) Strategy:
Ensuring that the users either remain connected, or can be reconnected to a working application in the shortest time possible, in the event of a disaster such as fire, flood, hurricane or terrorism.

Unplanned Downtime:
Randomly disconnecting the users from software and data they require in order to work effectively.

Disaster:
Re-definition: The disconnection of users from a working critical application / data for an extended period of time, for WHATEVER reason.

Critical Server:
A server that enables user access to software and data that is fundamental to the user in their moment-by-moment operation and in the moment-by-moment operation of the business.


WHY DO WE TOLERATE DOWNTIME?


Two dynamics have changed.
The first is that business dependency on critical applications is increasing and will continue to do so. Therefore the cost and risk associated with downtime are also increasing. The second is the significant reduction in the complexity and cost of IT solutions that remove the threat of downtime. The objective of this book is to help senior management to re-define their priorities and objectives in maximising the benefits of technology within their businesses, whilst removing the threat and cost of user-downtime for the most critical IT services.

The world has changed.
Any downtime is now a matter of choice - provided that management have the knowledge and information to make that choice and the will to execute on that decision.

Within your business, if you could remove the risk of any downtime at zero cost, with zero complexity, zero additional resource, and with zero overhead, wouldn’t you do it for every server in your business?

Surely, logically, it would be stupid not to!
The ONLY commercially justifiable reason for not protecting every server against downtime is because historically it has been seen as too expensive and too complex when compared to the probable risk and perceived cost of downtime.


QUANTIFICATION OF DOWNTIME RISK


Downtime is the period during which the users are unable to access the software and / or data they require in order to work effectively. Therefore we can commercially define downtime as:

“The period of time required to re-connect the users to a working application / data.”

This definition may seem obvious, but a detailed market survey showed that almost all organisations focus on protecting data as the cornerstone of a high availability / disaster recovery solution. User re-connection is a secondary objective and therefore a secondary purchasing requirement. This focus on data is a reflection of the heritage of HA / DR solutions, as protecting data is what the vast majority of them do. Yet the real risk of significant damage to a business from downtime is ONLY the impact on the users and their ability to function effectively, or at all, without access to the critical application and the most up-to-date data. Eliminating user-downtime delivers the optimum solution for both High Availability and Disaster Recovery strategies. The commercial objective is to do so at a cost and a level of simplicity that makes it highly relevant and financially justifiable to all. The delivery of such a solution starts at the design stage, not as an afterthought to a ‘heritage’ data protection solution.

Is downtime something that happens to a server, or something that happens to the user?
The answer is both, but which is more important? The physical problem is a technical issue that requires fixing. This is relatively low cost and low risk. A small number of people can eventually get a system operational again in a matter of days with varying degrees of data loss. In terms of cost of resource and equipment, this is inexpensive. The commercial damage is the impact on the users and the consequences of that. The disconnection of users from critical applications and data can seriously damage a business. This is high cost and high risk.

That is how ‘next generation’ products are born.


HIGH AVAILABILITY OR DISASTER RECOVERY?


The definitions of High Availability (HA) and Disaster Recovery (DR) have evolved over time to mean different things.
The most common understanding is that HA addresses the need to improve the reliability and availability of critical IT systems in the normal day-to-day working environment, whereas DR is the planning and execution of a process that enables the restoration of critical IT systems after a catastrophic event, such as the total loss of premises through fire, flood, hurricane, tornado, earthquake or terrorism. It is obvious, both statistically and logically, that a total failure of a critical server for whatever reason is far more likely to happen in the normal day-to-day working environment than the occurrence of a disaster that destroys the office. So it would be reasonable to assume that providing protection to critical servers would focus first on the higher, more common risk that HA addresses. Yet, in practice, the opposite is true. Most organisations have formal DR plans for their IT systems (even if it is just the off-site copy of a daily backup) whilst few have any HA plans. This usually reflects the focus of the senior management and the availability of budget. The purpose of both the HA and the DR plan is to get the users working as fast as possible after a failure, with the minimal loss of data. Yet more often than not, the method of achieving this is to focus all the attention and budget on an electronic copy of data, leaving the complex process of building the environment to take that data until after the disaster has occurred.

The difference between HA and DR is the location of the recovery server. In an HA environment, it is in the same location. In a DR environment, it is in a second, separate location. If your business has two locations, the difference in cost should only be the cost of bandwidth. If you are looking for HA or DR, make sure you are buying both. If you want to quantify the level of HA / DR you are purchasing, measure it by the time the user is disconnected from the critical application and data.

A good example of an HA / DR requirement is email. This is now accepted as fundamental to business in much the same way as a phone system is. There is no argument that email downtime is damaging; however, the impact of downtime after a disaster can literally kill a business. If your primary place of business was destroyed, how long would you want to wait before restoring communication between the key constituents of your business?

• Staff
• Customers and Prospects
• Suppliers
• Shareholders, Stakeholders
• Market & Press
• Insurance organisation and builders

The full recovery process might take months to complete, but the communication needs to be uninterrupted. You need all of your people on-line and communicating comfort, clarity and contingency actions, instantly. For most organisations, the process commences with an IT person holding a digital tape of yesterday’s data, a box of software CDs (operating systems, application software, anti-virus software), some notes on the original configuration (if you’re lucky) and a blank hardware server. Usually, that means you are at least 36 to 48 hours away from re-connecting email to anybody. That is enough time for all parties to form an opinion on whether or not to risk continuing to trade with you, based on what little they know and what they have heard and read. If you are not communicating with them, the information is probably coming from your competitors. One disaster precipitates another, both of which have a disastrous impact upon your business. Yet the loss of the same service for an extended period of time in a normal trading environment is also a disaster and can deliver a similarly disastrous impact.

The ideal scenario is to protect the critical applications and servers to ensure that the user remains connected to a working application / data, whatever happens, perhaps locally in HA mode, perhaps in a second location offering HA and DR. Despite the obvious sense this makes, the message is not reaching the organisation’s decision makers and budget holders. They are living in the past, because the information they have is simply out of date.

DOES DOWNTIME REALLY MATTER?

The IT industry has grown up with an acceptance of the ‘break-fix mentality’. This is essentially based upon the premise that ‘It is going to break. Then we are going to fix it.’
In the 1980s and 1990s, few IT systems were truly critical and downtime was an acceptable fact of life. That statement is no longer true for a growing number of software solutions in almost every company. The impact of downtime will change based upon a number of criteria:
1. The number of users of the software application / data.
2. The level of dependency of those users on that application to perform their jobs.
3. The importance of the activity of the user to the moment-by-moment running of the business.
4. The ability of the user to “catch up” on lost time through out-of-hours work, or whether the downtime cannot be re-captured (e.g. loss of a telesales function for a period).
5. The importance and value of the role of the users affected.
6. The implications of those users not working effectively as a result:
   a. The direct cost of the users’ lost productivity.
   b. The risk / impact of that lost productivity elsewhere, both internally and externally.

For example, for most organisations, email represents over 70% of all their communication internally and externally. If it stops working for a couple of minutes, most organisations can live with that.


However, if email is lost for an entire day, the impact on the business is going to be significant. Email is used in every aspect of the day-to-day operation of the business, internally and externally. Marketing, sales, management, services, finance, partners, customers and suppliers will all be affected. It becomes obvious that not all IT systems are equal, as some are far more important than others. Whilst downtime on some applications is an inconvenience, for others it is a disaster in terms of productivity and potential real commercial damage.

One size does not fit all.
Therefore, should one HA / DR strategy fit every server? Probably not! To establish the financial justification and commercial need, we need to think through the justification rationale.

Let’s do some mathematics:
Simple Justification
This cost justification is deliberately conservative in all aspects of the maths.

Assume that Microsoft Exchange email is being used for 30% of the working day.
Assume that there are 200 such users in the company.
Assume that the role of these users spans the organisation, covering Senior Management, Marketing, Sales, Finance, Production and Services.
Assume the average annual ‘fully loaded’ cost of an individual is £75,000 (for most organisations this is extremely low). A simple way to get the average cost per full-time employee is to divide the total overhead of the business for last year by the average number of employees last year.
Assume each individual works 40 hours a week for 50 weeks, which equates to 2,000 hours per annum. Then each hour costs the company £37.50.

A four-hour system outage has a quantifiable cost of:
200 users x 30% usage x 4 hours = 60 x 4 = 240 lost hours
240 x £37.50 = £9,000 for one 4-hour outage

But this is not the real cost, as it is easy to argue that the user could find something else to do (although a quick survey of your users will deliver a very different message).
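
The direct-cost arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch rather than a costing tool: the function and variable names are illustrative assumptions, and the figures are the same illustrative assumptions used in the text (200 users, 30% usage, £75,000 fully loaded annual cost, 40 hours a week for 50 weeks).

```python
# Illustrative sketch of the direct-cost estimate; all names are hypothetical
# and the figures are the assumptions stated in the text.

def hourly_cost(annual_cost_gbp, hours_per_week=40, weeks_per_year=50):
    """Average cost of one employee hour: annual cost divided by hours worked."""
    return annual_cost_gbp / (hours_per_week * weeks_per_year)


def direct_outage_cost(users, usage_fraction, outage_hours, cost_per_hour_gbp):
    """Return (lost_hours, direct_cost_gbp) for a single outage."""
    lost_hours = users * usage_fraction * outage_hours
    return lost_hours, lost_hours * cost_per_hour_gbp


rate = hourly_cost(75_000)
lost_hours, cost = direct_outage_cost(
    users=200, usage_fraction=0.30, outage_hours=4, cost_per_hour_gbp=rate
)

print(f"Cost per employee hour: GBP {rate:.2f}")
print(f"Lost hours: {lost_hours:.0f}")
print(f"Direct cost of outage: GBP {cost:,.2f}")
```

Changing the inputs to match your own headcount, usage and outage length gives a first-pass figure. It covers only the direct cost of lost time; the indirect consequences are not modelled.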


The real cost is the consequence of 240 lost hours of optimum productivity in one four hour period during which 80% of all communication has ceased. During that time, all those users are unable to carry out the tasks and activities they are aware of, or to address the demands and requests that they did not receive during this period. The commercial consequences of this lost time are almost impossible to accurately calculate, as they are dependent on what didn’t take place for every individual. For example, if customer service calls are captured within your extranet, but transported by email, no extranet service calls will be received for 4 hours. This may well exceed your service level agreements. An existing customer may decide to cancel their relationship due to this experience, but action it on the renewal date six months later. A critical quote may get delayed, or a “red hot” enquiry has a slow response. Given the diversity of the moment-by-moment use of email within almost all organisations, the list of potential disasters caused by downtime is very long.

There are three questions that arise when reviewing your strategy for downtime:
1. What is a reasonable direct cost that can be placed against user-downtime of a critical IT server?
2. What is a reasonable estimate of the indirect cost / RISK of lost revenue, damaged reputation, legal liability, etc. of user-downtime for 2 / 4 / 8 / 24 / 48+ hours?
3. Who is responsible in your business for undertaking that calculation and making the recommendations of appropriate action and priority?

If you can’t name the individual, or identify the documentation that this individual has produced and regularly maintains, the answer is probably no-one. Without wishing to “scaremonger”, having zero knowledge of your business’ exposure to critical IT infrastructure failure could be seen as anything from poor management through to negligence.

That is a real risk.


WHO IS TO BLAME?


What becomes immediately apparent is that the real cost to the business is potentially going to be many tens of thousands of pounds. If the sales and customer services department could cross charge another department for the lost productivity, and make up the revenue shortfall through inter-departmental billing, which department would have to pay? Would that bill be based on the tangible damage of lost man-time, or would it include the real cost of lost customers who don’t buy; lost customers who don’t renew their service contracts at some later stage, or intangibles such as the damaged reputation of the business? If there is not an individual who has the responsibility for reporting on the potential risk of downtime, for making a formal effort to quantify the likely costs and looking for cost effective solutions to minimise that risk, then I am sorry to say that the ‘Three monkey rule’ applies:

IF A COMPANY’S BOARD OF DIRECTORS CAN’T HEAR IT, SEE IT AND REFUSE TO DISCUSS IT, THEN THE THREAT DOES NOT EXIST

The ‘break-fix mentality’ has compounded the ownership problem, as it has allowed us to ignore the issue. The mentality of “it breaks and we fix it” may be acceptable for low-level non-critical servers, but it is no longer commercially sensible for critical applications. It is remarkable that whilst we have all adopted technology into the very fabric of our businesses, most organisations have no reporting process, knowledge or understanding of the regularity or implications of downtime. A major outage of a CRM solution may take the entire telesales activity offline for a day, stop web-based enquiries from reaching the sales department, remove the ability to take new support calls and manage existing ones, and stop the sales staff undertaking planned negotiation activities and sending out quotes. But who is counting that cost?

If a company board of directors can’t hear it, see it and refuse to discuss it, then the threat does not exist.
The alternative scenario is that the threat has never been brought to the board’s attention. It should come as no surprise that the alternative scenario is usually the way the board will act after a disaster. If there is an individual with the responsibility for ensuring user uptime, you may want to give them a copy of this book. If not, then I recommend a copy is given to each member of the board and to the investors. Better still, provide a report on the risk and support it with the book! (They may even thank you at some later date.)


NEXT STEPS:


Let us assume that you fully recognise that user-downtime is bad for business and warrants re-investigation. The key criterion to consider is whether the risk / consequential cost of user-downtime is greater than the cost of protecting against it, making protection a commercial necessity. This is driven by the fact that the dependency of your business on IT and its risk from downtime have grown, whilst the cost and complexity of removing downtime have reduced.

The first issue to recognise is that the criticality is always driven by the nature of the application, not the data. The most common and obvious applications include:
• Email
• Document Management Systems
• Customer Relationship Management
• Sales Force Automation
• Support & Help Desk
• Critical business applications
• Intranet / Extranet / Web applications

It is also important to recognise that the level of criticality increases with the period of downtime. For example, losing a non-critical application for a day is a nuisance, but not commercially dangerous. In some cases even losing the data from today and resorting to last night’s backup tape is acceptable. However, with critical applications the level of risk and probable commercial damage increases exponentially the longer they are unavailable. Being one hour late may be redeemable whereas being 48 hours late is probably not.

The recovery time from the failure of a critical server is dependent on what went wrong, when it happened and the steps required to deliver a fully operational solution re-connected to the users. It is almost impossible to determine the likely period of downtime for any given server, so rather than try to determine every possible eventuality and what would be required to recover, review the probable commercial risk and damage associated with a given length of downtime. Keep in mind that a 60 minute outage is often as likely as a 12 hour outage.

The level of urgency of deployment should be driven by the recognition and quantification of the size of the risk / commercial exposure. How are you going to identify and prioritise the criticality of your servers?

Critical Server Status:
A survey of the business activities of an organisation will usually quickly reveal the business IT applications that are critical to the smooth operation of the business.


CRITICAL SERVER IDENTIFICATION:


Produce a list of servers and the applications they run. Consider the implications of downtime of each of these AFTER a discussion with a sample of users and their manager. Good decisions are made on good information, so it is worth getting feedback. In a recent survey, over 75% of email users concluded that loss of email was more stressful than divorce. That kind of feedback provides some insight into users’ dependency on applications, yet most email servers remain unprotected.

Put an estimate of loss against the lost productivity of those unproductive hours based on the purpose and use of the software application and user feedback. This is very simplistic, but you will notice that the real cost of downtime is probably much higher than you anticipated and critical applications are clearly identified. Rather than look at and address every single server at once, focus in on the most critical and address the highest area of risk.

Suggestions for calculating the likely costs of downtime
Detail the number of users for each application and the number of hours that they would use it in a normal day. Add a rough valuation of the cost per hour for that group (£37.50 is a commonly used generic total employee cost). The length of downtime is a major variable in terms of risk and likely cost, so calculate for a number of periods:
15 / 30 / 60 minutes
4 / 8 / 16 / 24 / 48 hours

Calculate the total lost hours of productivity and the direct costs for each downtime period. The real cost is in the lost productivity and time sensitivity of the user, and that depends on the use of the application. For example, a telesales operation may go offline for a day. That is lost time that cannot be “caught up” and would represent a loss of 0.4% of revenues per annum from this source alone.
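
The suggestion above lends itself to a small table, one row per downtime period. The Python sketch below is an illustration only: the function names are hypothetical, the 8-hour working day and 250 working days a year are assumptions (consistent with 40 hours a week for 50 weeks on a five-day week), and every outage hour is treated as a working hour, which overstates long outages that span nights or weekends.

```python
# Illustrative sketch: direct cost of downtime for a range of outage lengths.
# Names are hypothetical; GBP 37.50 per hour is the generic figure quoted in the text.

OUTAGE_HOURS = [0.25, 0.5, 1, 4, 8, 16, 24, 48]  # 15 / 30 / 60 minutes, then 4 to 48 hours


def downtime_table(users, hours_used_per_day, cost_per_hour_gbp=37.50,
                   working_day_hours=8, outages=OUTAGE_HOURS):
    """Return rows of (outage_hours, lost_hours, direct_cost_gbp).

    Simplification: every outage hour counts as a working hour.
    """
    usage_fraction = hours_used_per_day / working_day_hours
    rows = []
    for outage in outages:
        lost_hours = users * usage_fraction * outage
        rows.append((outage, lost_hours, lost_hours * cost_per_hour_gbp))
    return rows


def revenue_share_lost(days_lost, working_days_per_year=250):
    """Fraction of annual revenue lost when the lost days cannot be caught up."""
    return days_lost / working_days_per_year


for outage, lost_hours, cost in downtime_table(users=200, hours_used_per_day=2.4):
    print(f"{outage:>5} h outage: {lost_hours:8.1f} lost hours, GBP {cost:10,.2f}")

print(f"One lost telesales day: {revenue_share_lost(1):.1%} of annual revenue")
```

Under the 250-day assumption, one lost telesales day is 1/250 of the working year, which matches the 0.4% figure quoted in the text.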


DATA PROTECTION


Every server already has some form of protection, even if it is just a regular tape backup. The purpose of this is to provide a copy of the data that can be used in the event of a failure that permanently removes access to that data. Whilst all data updates that took place after that backup was completed will have been lost, at least the company can revert to “yesterday’s” copy. Data is obviously important, but it is the first and most basic form of availability. Tape backup offers the slowest form of recovery available and therefore resides at the bottom of the high availability and disaster recovery food chain. If the object is to get a user connected to a working application and its data as quickly as possible, then tape backup is as far away as you can get (even assuming that the backup tape actually works, which tape failure statistics show is no guarantee). If there is an area of misrepresentation by the high availability and disaster recovery industry providers, it is in the quantification of what is really required in time, effort and risk to achieve full recovery from a failure. Full recovery is defined as the users being connected to the working application on a complete set of data.

If the only protection of a critical server’s data is a tape backup, then the next question is whether there is any means of capturing and re-entering the data that has been entered since that backup. For some applications, the information can be sourced and re-entered; for many more, the data is lost forever. Backup tapes are prone to failure when they are most needed. The time taken to restore the data onto a server can run to 12 or more hours. Everything that was done between the tape backup and the failure will, at best, need to be done again and, at worst, be permanently lost. If you add the probable risk / cost of the loss of a day’s data to the cost of the downtime itself, then the true risk / liability of downtime becomes significant. However, to get data to work, you need a server, operating system, application and connectivity to a network as a minimum.

The “real-time” data replication software market has grown rapidly to address the loss of data since the last backup. It is proven, inexpensive and well understood. Data replication is an ideal solution for less critical servers where user-downtime for a few hours or even a day is acceptable. But the goal for a critical server is that the user remains working whatever failure occurs and whenever it occurs, using the very latest and up-to-date dataset, without any action being required by anyone. This level of protection is a natural upgrade to data replication products.

The protection of data is a given, but it is the non-productive user that represents the risk and cost.


HIGH AVAILABILITY AND DISASTER RECOVERY: SELECTION CRITERIA
The objective is to keep the users seamlessly connected to a working application / data irrespective of the nature of failure, with very low costs, minimal disruption and minimal risk. To re-cap: historically, HA and DR have been seen as two different components of protection. HA addresses the day-to-day issues (and the likelihood that critical systems will fail) whilst DR is a costly commercial necessity to protect against a much less likely threat of disaster, usually associated with a ‘loss of site’. What has changed is that the level of criticality of a growing number of IT systems means that any downtime is a disaster, and therefore a disaster can occur without the loss of site. The result is that a comprehensive HA & DR solution should be one and the same thing. The only question is how it is deployed. So what are the critical components of a full HA / DR solution that ensures that the user remains connected to a working application, irrespective of the nature of failure of hardware, operating system, application software or network?

The objective of the next section is to help IT management identify and review the critical components of integrated HA / DR solution providers’ offerings. The areas covered include:

- Architecture: basic structure of the combined HA / DR solution
- Reliability & monitoring
- Data protection
- Specific application software protection
- Switch and fail-over, switch and fail-back
- LAN and WAN bandwidth considerations

The HA (High Availability) component is required to address user-downtime locally, whilst the DR (Disaster Recovery) component is the off-site protection required to address a site failure.

ARCHITECTURE: BASIC STRUCTURE OF THE HA/DR SOLUTION
A comprehensive HA and DR solution should enable both local and, if required, offsite protection against downtime from a single solution. That solution can be implemented locally and stretched to cover DR if and when required, without change or cost. The critical objective is that the user remains seamlessly connected to the working application irrespective of the nature of failure of hardware, operating system, application or network, without any intervention by the user or the IT department.

In an ideal world, if one thing breaks, then you have another, just like it, available to take over immediately. This type of solution is available but historically required the two systems to be identical. That approach is fine until you are in an environment where everything is constantly changing, thousands of times per second. The common approach to this has been to protect a single aspect, usually the data, by copying it onto tape or replicating it in real time to another hard disk. Replication involves placing another (sometimes identical, sometimes similar) server alongside the “primary” server and copying data in real time from one server to the other.

For convenience, we will call these the primary server and secondary server. The advantage of this approach is that the amount of data lost in a failure is significantly reduced, potentially to zero, as there is almost immediate access to the data as it stood at the time of failure. The limitation is that all other changes to the environment are not being captured, let alone updated. Changes to the operating system, the database management software, the application, the anti-virus protection and even the network will often need to be addressed before the secondary server can be relied upon to function correctly. That of course assumes that somebody knows exactly what has changed and what relevance it has to the replicated data before restarting the secondary. This is very rarely the case, as detailed change control on IT servers is expensive in time and effort to maintain.

Data replication is a significant improvement over tape backup, as it removes the data loss from the last backup to the point of failure. But more often than not, it is still a long way from a working user, as so much more is required to ensure that the full system is restored and the user can commence work again.
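The gap described above, where data is replicated but changes to the rest of the environment are not, can be made visible with a simple comparison of the two servers' profiles. The following is a minimal Python sketch only: the component names and version values are hypothetical, and how such a profile would be collected from a real server is outside its scope.

```python
def find_drift(primary: dict, secondary: dict) -> dict:
    """Return every component whose recorded version differs between servers.

    Each result maps a component name to (primary_value, secondary_value);
    None means the component is missing on that server.
    """
    drift = {}
    for component in sorted(primary.keys() | secondary.keys()):
        on_primary = primary.get(component)
        on_secondary = secondary.get(component)
        if on_primary != on_secondary:
            drift[component] = (on_primary, on_secondary)
    return drift


# Hypothetical profiles: the data is in step, the environment is not.
primary_profile = {
    "os_patch_level": "SP2",
    "antivirus_signatures": "2005-06-01",
    "database_engine": "8.0",
}
secondary_profile = {
    "os_patch_level": "SP1",
    "antivirus_signatures": "2005-06-01",
}

for name, (on_primary, on_secondary) in find_drift(primary_profile, secondary_profile).items():
    print(f"{name}: primary={on_primary} secondary={on_secondary}")
```

Any entry reported here is a change that somebody would need to know about, and apply, before the secondary could be trusted to take over.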

True Pair Architecture
In an ideal world, the primary and secondary servers would remain completely in sync, with every update that occurred on one also happening on the other.

This basic architecture is a form of ‘cluster’, where in the initial state the primary is active, connected to the network and servicing the users, whilst the secondary is passive, invisible to the network but operational. To maximise the value of this architecture, you would want to address a number of requirements:

It is important to remove the historic characteristic that delivery of a true pair requires 100% identical hardware. Identical hardware is expensive to provide at the outset and difficult and expensive to maintain going forward. The advantage of being able to use dissimilar hardware is that current equipment can be used for the primary and, if available, for the secondary. It also removes complexities in upgrades and maintenance on either server in the future.

As a true pair, the solution can be deployed in a LAN, extended LAN or WAN environment. The advantage is the immediate delivery of an HA solution when implemented locally, and of both an HA and a DR solution when the pair is split across two locations.

As a true pair, it becomes possible for the passive server to undertake comprehensive monitoring of the entire primary server environment. The advantage is the ability to automatically undertake intelligent pre-emptive actions to address problems as they arise, to ensure the smooth operation of the primary and to switch or failover only if absolutely required.

As a true pair, we want the ability to switch back and forth between the servers at will, with a single keystroke, with minimal to zero user disruption and guaranteed zero data loss. The advantage is the ability to undertake maintenance work on the primary at will, to test upgrades to the environment without risk and, once tested, to upgrade the secondary.

This approach provides a technology platform that addresses the basic requirements of ensuring there is an operational server and data available to the user, whatever happens. However, the most significant characteristic of this architecture is the speed of switchover, failover and switch back.

The architecture enables a completely automated switch or failover in 1 – 4 minutes. As important and often overlooked, a switch or failover requires no action from the users; they are automatically reconnected to the secondary as it becomes active.

Automated user re-connection is fundamental. How would you notify your email users to log out and log back in again when email is down? It’s going to be a busy hour or two on the phone!

The architecture delivers the foundation; the strategy is about user uptime.

RELIABILITY & MONITORING

Prevention is better than cure. The majority of failures relate to reliability issues, many of which might have been addressed if they had been discovered in time. A credible HA / DR solution has to address downtime prevention, or it is no better than offering a safety net to a high-wire act: you still fall immediately and still have to take the time to recover back to where you were prior to the crash.

In its most basic form, reliability has to address the current and ongoing health of the entire protected server environment. Ideally, reliability and monitoring will address both the primary and secondary server and be pro-active in ensuring on-going health and performance as the first step in providing HA / DR. A critical failure of the hardware, operating system, application, supporting application (e.g. anti-virus software) or network will take all the users off-line. Yet each of these components is changing all the time. Hardware utilisation is changing, with memory and disk usage evolving on a moment-by-moment basis. Operating systems are now updated automatically via the web with service packs and “hot fixes”, as are many applications such as anti-virus, spam filters and security software. Users join and leave the environment / company, and there are many examples of staff who have left the company regaining access to critical systems after a restore from backup, because the subsequent action of removing them never took place again.

There is a real and significant overhead in the day-to-day review and maintenance of servers to make sure they remain healthy and can meet the changing demands. But very few IT departments have the resources required to maintain full change control on critical servers, so that there is a documented recovery process to ensure they are able to rebuild the primary, undertake all the housekeeping activities and be sure that the protected data will function on a restored system. The consequence is that a relatively minor failure can take many hours, and sometimes days, to fix as the IT staff try to sort out the mess. So it obviously makes a great deal of sense to automate this entire function of change control and monitoring, providing a very valuable tool to fully utilised IT staff and significantly reducing the likelihood of downtime.

There are two other reasons why this approach is useful. The first is that experience has shown that only a few organisations can easily provide an accurate, comprehensive and complete report on the profile of the server they wish to protect. The provision of a simple utility to perform this function automatically removes the costs and resource requirement of a manual investigation. It ensures
that the implementation process can be planned and executed successfully, within timescale and budget, as there can be no surprises. The second reason is that, more often than not, a primary server will benefit from a detailed review. In the majority of cases the server’s health can be significantly improved with a small amount of work once the issues have been clearly identified.

We now have a server environment in prime health and we want to keep it there. But if that becomes impossible, the solution needs to know about potential downtime threats and take the agreed steps to address them effectively. This requires pro-active monitoring functionality. In the past, monitoring solutions have often been passive, delivering alerts but taking no action. A classic example in a market leading data replication product is that nothing happens at all when the critical application fails. Because the software only checks to see if the primary server is still turned on, automated failover does not happen, as it has no way of knowing the status of the application. That leads to the 3am phone call where the users are unable to gain access to software (from wherever they may be), yet the high availability solution has failed to even notice, let alone take action (even if that action is just to let you know!).

When looking at this area, the focus of investigation should be built around the ability of the IT management to determine the initial status and health of the critical server. It should also review what is being checked in the live environment, what manual and automated actions to take on the first identification of a problem and what steps to take should it recur a second and even a third time. Would you be confident that, should a disaster occur in the middle of the night, the system will take all the necessary steps to try to fix the issue automatically and then, only if necessary, initiate a switchover without disrupting the users or losing data? Or perhaps your preferred action is an email / text / paging / phone message to allow human intervention. It is about choice. At what point and by what means should the solution notify the IT department when a problem has been encountered? There is no single answer; it is a matter of choice, and that choice should depend on the nature of the problem. Complete flexibility.

Reliability and Monitoring functionality is not a “nice to have”. Both are essential to maximise the effectiveness of your current IT resources, whilst keeping your users operational.
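The idea of agreed steps on the first, second and third occurrence of a problem can be illustrated with a short Python sketch. It is an assumption-laden outline only: the class name, the action names and the three-step ladder are hypothetical choices made for illustration, not a description of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class EscalationPolicy:
    """Chooses an action from how many times in a row a check has failed."""

    # Hypothetical actions for the 1st failure, 2nd failure, 3rd and later.
    actions: list = field(default_factory=lambda: [
        "restart_service",
        "restart_service_and_notify",
        "switch_to_secondary",
    ])
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> str:
        """Record one check result and return the action to take."""
        if healthy:
            self.consecutive_failures = 0  # problem cleared, start again
            return "none"
        self.consecutive_failures += 1
        # Stay on the final action once the ladder is exhausted.
        step = min(self.consecutive_failures, len(self.actions)) - 1
        return self.actions[step]


policy = EscalationPolicy()
for check_result in [True, False, False, False]:
    print(policy.record(check_result))
```

Because the ladder is just a list, the choice of what happens at each step, including whether a person is notified before anything automatic is attempted, stays with the IT department.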

APPLICATION SOFTWARE PROTECTION / AUXILIARY SOFTWARE PROTECTION
A normal server will have a number of applications running in order to provide full service to the users. For example, a Microsoft® Exchange server will always have antivirus software. Usually there will also be anti-spam and backup software, but each environment is different. When a failure occurs, the secondary server must be able to provide all the services of the primary with a similar level of security. A live Microsoft® Exchange server operating without fully updated virus protection is probably more dangerous and damaging than no service at all. HA / DR solutions have to protect both primary and supporting applications and ensure that whichever server is providing the service to the user, it is a true pair and therefore fully protected.

However, this is more complex than it may seem. Protection of primary applications (the main activity) and auxiliary applications (required for security etc) is touted by many HA and DR providers. In some upmarket clustering environments, it requires significant change by the authors of the application to make their products “cluster aware” for a particular clustering vendor. Having done this once, it then needs to be maintained. As a result, the cost of the application increases dramatically or alternatively (and more commonly) the software authors don’t offer “cluster aware” versions of their software. In the replication software market, often the delivery is through a consulting engineer undertaking bespoke scripting “on the fly” at the user’s site, on their critical server. More often than not, the scripting is
never documented or electronically captured and the result is an ever growing number of bespoke sites. This may be a saleable solution, but the purchaser rarely understands what is really being offered at the time of sale. Once discovered, it is too late. The result is the widely reported problems that undocumented bespoke software causes for support services, on-going consulting, upgrades etc. The consulting / bespoke approach can work, but it assumes that everything then remains the same. All too often the consequence is that the failover protection fails in the moment of need.

If the architecture recognises the requirement to deliver application protection, then the process should be very simple and enable the rapid development of application modules that protect specific applications. These application modules are products that can be developed, fully tested, swiftly deployed and easily maintained. They can be produced without any change to the software application, simplifying all aspects of support. As important, they can be updated to address changes that take place at a future date and made available to all within a normal support agreement. The Reliability and Monitoring services can be utilised to manage each application in the most efficient and effective manner as they utilise a common methodology. But most importantly, the solution ensures that changes and upgrades to the environment are maintained on both the primary and secondary servers, whether active or passive, whether local or remote. The result is a standard product approach that cuts out both the cost and the risk of undocumented bespoke software and leverages the greatest asset of the software industry: write once, sell many times and charge accordingly.

This area of HA and DR solutions is a potential minefield. A quick review of marketing collateral will show that many organisations seem to offer a fully productised “out of the box” application-specific solution. Yet a detailed review and some smart questioning will show that in most cases these are consulting driven engagements around a misleading marketing message. The problem for a prospective purchaser is that this aspect of HA / DR represents such a significant competitive advantage that disadvantaged suppliers are compelled to “over market” their solutions.

Does the supplier have a price list that includes products for protection of a number of software applications, including primary applications such as Microsoft® Exchange, File Server (etc)? Does that price list include secondary applications such as anti-virus software? Does their literature identify this as a critical component of their solution and provide all the relevant information? If still in doubt, ask the very specific question of how application protection is going to be achieved, and possibly find someone other than a salesperson to ask. (Nothing against sales people, as more often than not they are selling what they believe; it just is not always complete in detail. Technical staff can offer greater clarity.)

SWITCH / FAILOVER CRITERIA AND PERFORMANCE
As has been highlighted, a primary requirement of HA / DR solutions should be that the user keeps working, whatever happens. There are a number of reasons why it may be necessary or desirable to move processing from the primary server to the secondary (and back again):

1. The irrecoverable failure of a critical component of the primary server for whatever reason. This can be the hardware, operating system, application, auxiliary applications, network, site etc.
2. The enablement of maintenance work such as hardware or software upgrades on the primary server (followed by full testing) without loss of service to the users.
3. The regular testing of the solution as part of a best practice methodology on HA / DR strategy.

In each of these scenarios, the user should remain operational with minimal, if any, disruption. We have already covered the objective of avoiding a switch or failover in the Reliability and Monitoring section. If it is going to happen, what are the characteristics that would make the process as quick and painless as possible, with minimal risk? Surprisingly, the very first requirement is that the solution allows seamless switch back to be undertaken with the same level of automation and with minimal to zero disruption. Whilst this may seem to be a secondary requirement when the disaster occurs, it will become the next most urgent step in every failure, as the system needs to be fully protected again as soon as possible, with minimal disruption.

For many solutions, an automated failover will only occur when the primary computer has a catastrophic failure and can no longer be reached by a basic network ‘ping’ from the secondary. This approach fails to address any scenario in which the user becomes unable to use the system, yet the server still responds to a “dumb ping”. Whilst this is often referred to as a fully automated failover, such a claim is obviously overstated or under-explained.

There is a solution in the market that offers “automated failover and fail-back” using the “dumb ping” method to instigate a failover and then requiring the secondary server to be off-line during the data restoration to the primary server after a failure. With a large data set, this could mean 6 – 12 hours of user-downtime to get back to the primary. If a switchover was used to avoid downtime on an upgrade, having 12 hours of downtime to get back to the primary is hardly a sensible solution. But if you didn’t ask...

An automatic switch occurs when a failure has occurred that cannot be corrected successfully by the reliability / monitoring activity. In every event, there should be zero data loss in a switchover process, as by definition it is a controlled process. Any data backlog on the failing active server is transmitted to the passive server and the failing server is shut down cleanly, avoiding any further damage that might make repairs more difficult. The passive server starts up, connects to the network and the users are automatically connected to it. The length of time taken in this process is critical. A user may not notice any disruption for a minute or two whilst this process is taking place. However, an outage of five minutes or more is going to lead to dozens of problem calls.

A manual switchover / switch back occurs in the same way with two exceptions. The first is that it is on demand and the second is that the time taken is dependent on the ‘delta’ of data between the pair. In a short maintenance activity, the delta may just be a couple of minutes; however, a complete disk failure will require a full sync and verify process. What is important is that the process is fully automatic and the final switch process will not occur until the data is synced and verified.

Again, discuss the process in detail, ask to see it in action, quantify exactly what will happen and how long it may take. All too often purchasers discover the questions they should have asked only when the solution disappoints, and it is rarely a good time to explain this type of problem to senior management.
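The difference between a “dumb ping” trigger and one that also checks the application can be shown in a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the two checks are passed in as simple callables, and the TCP probe is only one example of what an application check might be (a real check would normally exercise the application itself).

```python
import socket
from typing import Callable


def tcp_service_answers(host: str, port: int, timeout: float = 3.0) -> bool:
    """Example probe: can a TCP connection to the service port be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ping_only_failover(host_reachable: Callable[[], bool]) -> bool:
    """'Dumb ping' rule: fail over only when the server cannot be reached."""
    return not host_reachable()


def application_aware_failover(
    host_reachable: Callable[[], bool],
    application_responds: Callable[[], bool],
) -> bool:
    """Fail over whenever the users cannot be served, for whatever reason."""
    if not host_reachable():
        return True
    return not application_responds()


# Scenario from the text: the server answers a ping, the application does not.
server_up = lambda: True
application_hung = lambda: False

print(ping_only_failover(server_up))
print(application_aware_failover(server_up, application_hung))
```

In this scenario the ping-only rule sees nothing wrong and takes no action, while the application-aware rule requests a failover, which is exactly the gap worth probing when a supplier describes failover as “fully automated”.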

Think about this scenario:
It is midnight and the IT department is at home in bed. The CEO is burning the midnight oil to complete sign off on the final proposal for a major tender that has to be delivered by 9.00am the following morning. And the Microsoft® Exchange server stops working (for whatever reason). There are two ways forward from this point. In one, someone in IT gets the call at 12.10am from the CEO, instantly followed by all the effort and stress that goes with resolving that particular scenario. In the other, the CEO never knows that the email service failed in the first place, because the service is maintained automatically. It is not hard to decide which is the optimal, ‘less stress’ solution. It is also clear which one avoided a potential disaster of the tender not being sent at all and the business opportunity being lost.

BANDWIDTH CONSIDERATIONS

There are implications for a solution which is based upon a replication model with regard to available bandwidth, both in a LAN and particularly in a WAN environment. The volume of data to be replicated should not be a mystery as long as a utility is provided to accurately measure it. It is very important that all data volume information is captured, as it is all too easy to miss critical activities such as a relatively simple “defrag”, which is massively system and data intensive. The risk of data loss is equal to the amount of “lag” between the active server processing data and the passive server receiving it, a delay caused by any bandwidth bottleneck that slows data transmission. This is usually 100% related to the amount of bandwidth available.

[Figure: bandwidth bottleneck]

LAN
In a LAN environment it is possible to use the existing network to carry this data volume; however, this is often not ideal. The link to the current network by a single NIC represents a single point of failure. It is also rarely desirable to add significant data volumes to existing networks unnecessarily. It makes sense to have a low cost dedicated channel, particularly if this is duplicated to remove any single point of failure. Very high local bandwidth with zero impact on the current network ensures absolute minimum data loss in the event of a catastrophic server failure.

WAN
However, moving to a WAN or extended LAN raises the issue of cost of bandwidth. It is essential to have an accurate picture of the data volumes generated and their implications in order to get an accurate cost of bandwidth needs. The cost of bandwidth will continue to fall in the future, but this will be countered by the likelihood of increasing volumes of data, so this is a long term expense and deserves close attention in the buying process. The decision on bandwidth is a practical one. If the bandwidth does not address the peak requirements, then a backlog of data is going to build up on the active server and would be lost if a catastrophic failure took place. But having bandwidth that is able to address peak data requirements is going to be expensive, with much of the bandwidth idle most of the time.
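The relationship between data volume, bandwidth and backlog can be explored with a small Python sketch. The workload figures and link capacities below are invented for illustration (in megabytes per minute); in practice they would come from measuring the real server, and the function name is likewise hypothetical.

```python
def simulate_backlog(generated_mb_per_min: list, link_mb_per_min: float) -> list:
    """Return the unsent backlog (MB) on the active server after each minute.

    Each minute the server generates new changes and the link drains up to
    its capacity. Whatever is still queued would be lost in a catastrophic
    failure at that moment.
    """
    backlog = 0.0
    history = []
    for generated in generated_mb_per_min:
        backlog = max(0.0, backlog + generated - link_mb_per_min)
        history.append(backlog)
    return history


# Hypothetical workload: a quiet server with a two-minute burst (e.g. a defrag).
workload = [5, 5, 40, 40, 5, 5, 5, 5]

# Try the same workload against three candidate link capacities.
for capacity in (10, 20, 40):
    history = simulate_backlog(workload, capacity)
    print(f"{capacity} MB/min link: peak backlog {max(history)} MB, "
          f"backlog at end {history[-1]} MB")
```

Running the same measured workload against several candidate capacities shows how large the backlog becomes during a burst and how long it takes to clear, which is the trade-off between data at risk and the cost of bandwidth that sits idle most of the time.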

How can bandwidth requirements be optimised?
The most obvious route is to reduce the volume of data being transmitted, as this has a direct effect on the bandwidth requirements. A data compression tool can reduce the volume of data, and therefore the bandwidth required, by 60 – 80%. The ability to review data backlog and likely clearance / catch up time allows the most sensible commercial decision on bandwidth. The ability to play “what if” scenarios on actual data requirements with different bandwidths ensures the most cost effective decision both initially and throughout the life of the solution as your requirements (and the cost of bandwidth) change.

SUMMARY

Downtime is about users, not systems and data.
User downtime represents very considerable cost and real business risk for a growing number of business applications, but not all. An HA and DR strategy should not be represented by a “one size fits all” approach. A mixed strategy from tape backup through replication to zero user-downtime makes commercial sense. Invest where the risk is greatest. The process of identifying and quantifying risk should be a standard business practice that is repeated on a regular basis, as the level of risk will constantly change. If the solution to remove downtime was free, everyone would have it because it would be stupid not to. A complete implemented HA and DR solution costs from just £6,000. If this cost is considerably less than the cost and risk of one extended period of user-downtime on a critical server during the next 3 – 5 years, “Don’t you think you should do something about it?”
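The comparison suggested above can be worked through as simple arithmetic. The Python sketch below uses the £6,000 entry price and the 12-hour restore time quoted in this guide; the number of users, the cost per user-hour and the function name are hypothetical figures chosen only to show the calculation, and each organisation would substitute its own.

```python
def downtime_cost(users_affected: int, hours_down: float,
                  cost_per_user_hour: float,
                  lost_revenue_per_hour: float = 0.0) -> float:
    """Estimate the cost of one outage: idle users plus any lost revenue."""
    return hours_down * (users_affected * cost_per_user_hour
                         + lost_revenue_per_hour)


solution_cost = 6000.0  # entry price quoted in this guide (GBP)

# Hypothetical outage: 100 users idle for a 12-hour restore at 25 GBP per hour.
one_outage = downtime_cost(users_affected=100, hours_down=12,
                           cost_per_user_hour=25.0)

print(f"Estimated cost of one outage: {one_outage:.0f} GBP")
print(f"Solution cost:                {solution_cost:.0f} GBP")
print("Protection is justified" if one_outage > solution_cost
      else "Review the figures")
```

The calculation deliberately leaves out harder-to-quantify items such as lost data, missed deadlines and reputational damage, so it is likely to understate the real exposure.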

THE CRITICAL COMPONENTS REQUIRED TO REMOVE DOWNTIME:
1. The first and most important is the silent inclusion of the phrase “without adding undue cost and complexity” to each and every aspect of the solution. The goal after all is to make this solution available to the widest possible market.
2. Undertake a detailed review of business critical applications and identify the largest area of risk to your business.
3. Quantify a value of that risk for cost justification purposes.
4. Obtain management buy-in to the value of a High Availability and Disaster Recovery strategy.
5. Avoid the common mistake of budgeting to protect everything and getting nothing. Select the greatest area of risk and address that. A step-by-step process carries less risk than all or nothing!
6. Commence a procurement process that reflects the size of risk / urgency of obtaining protection from downtime, both in terms of budget and the decision making process. Downtime is probably the greatest area of risk to the business.
7. Ensure that the focus of every aspect of the solution is geared to maintaining the user connectivity to a working application / data and not just protecting data.
8. Minimise the probability of a failure in the first place with the creation and maintenance of a healthy “self healing” environment, fixing minor failures on the fly before they lead to downtime.
9. Ensure there is “no single point of failure” within the solution.
10. Ensure that application protection is achieved through the use of products and not delivered through consulting and undocumented bespoke software unique to your installation.
11. Fully automate the process of a switchover and failover so that the users continue working without any action by the user or the IT department.
12. Ensure the recovery process to get back to the repaired primary server is fully automated (on demand) and with minimal to zero user-downtime.
13. Enable planned maintenance to be undertaken without causing downtime.
14. Minimise data loss to zero for a controlled switchover and to a minimum for catastrophic failures.
15. Enable the solution to address both LAN and WAN deployment within a single solution, thereby addressing the HA requirements and enabling DR cover on demand.
16. Make the decision and deploy.
17. Repeat as appropriate for other critical applications / servers over time.

ABOUT THE AUTHOR

Neil Robertson has more than 26 years’ experience within the IT industry. From joining the computer marketplace in 1979 with Olivetti’s electro-mechanical accounting systems, his experience spans the advent of word processing, spreadsheets, FAX machines and accounting software for small to medium sized businesses, the rise of Windows-based business applications and the domination of the Internet in today’s business world. After leaving Olivetti, he founded Team Systems Group in 1983, rapidly becoming the UK market leader for PC-based technology. Team was sold to Misys Plc in 1989 where he became an Operating Board Director with the role of Chief Executive of a number of Misys companies. Having left Misys, he joined Kewill Systems Plc to head up their newly formed Great Plains Dynamics operation. He engineered the repurchase of the distribution rights by Great Plains and set up its first off-shore subsidiary in the UK. Great Plains became the recognised leader in its market sector, signing a global distribution arrangement with Siebel Systems in 1999, offering CRM solutions through the Great Plains channel. He left Great Plains in 2000 to found 30/30 Vision, an eCRM consultancy that offered a unique insight into maximising the benefits of CRM and avoiding the common pitfalls. 30/30 Vision was sold to a Global Enterprise in August 2002. In September 2002, Neil joined the Neverfail Group as Group CEO with the mandate to migrate the business from a successful hardware based Disaster Recovery Company into a global software business. That remains the job at hand and this book is part of that process.

Neil has written four other books over the past decade. In each case, these books enabled senior management to quickly grasp new technology and methodologies and put them into practice.

Tricks of the Trade – a buyers guide to financial software (1996)

Tricks of the Trade II – an FD’s guide to cut through the sales pitch to get at the critical facts (1997)

E-business: The Invisible Revolution – A decision maker’s guide to the business benefits of e-business (1998)

The 29 Most Common Mistakes in purchasing and implementing eCRM (2001)

Copyright of the Neverfail Group 2005
All rights reserved. Except for the quotation of short passages for the purposes of criticism and review, no part of this publication may be reproduced, stored in a retrieval system or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior permission of the author or his agents. The right of Neil Robertson to be identified as author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. Any registered trademarks used are acknowledged and recognised as being the property of the organisations to whom they belong.
