
Tiered Infrastructure Maintenance Standards™ (TIMS)
For Mission-Critical Environments

By Bob Woolley, Author, and Mike Hagan, Co-Author
Lee Technologies
The Need for Tiered Infrastructure Maintenance Standards™
The world has changed. Technology has transformed every aspect of our lives.
And access to technology has gone from luxury to necessity. A generation ago, a
failure may have been simply inconvenient. Today, it can be devastating.
Envisioning life without technology is nearly impossible. However, each
hurricane season nature provides us powerful reminders. Services taken for
granted every day vanish without technology. The results can be tragic.
On a small scale, the failure of technological resources results in lost business
productivity. On a large scale or in critical applications, it can literally make the
difference between life and death.
Natural disasters and terrorism have raised public and private sector
awareness. The need to maximize uptime in mission-critical facilities has led to
increased focus on disaster recovery and business continuity planning efforts.
The Uptime Institute’s introduction of a tiered classification approach to data
center design provided an important first step in helping data centers achieve
increased reliability. This tiered system provided a benchmark for the design
criteria needed to achieve the high levels of reliability required by an
organization’s mission-critical applications, often measured as the number of
"9’s" in the percentage of uptime achieved (99.XXXX%).
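To make the "9’s" concrete, the short sketch below converts an uptime percentage into the downtime it allows per year. It is only an illustration: the function name is hypothetical, and a 365-day year is assumed.

```python
# Minutes in a 365-day year (assumption: leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year implied by an uptime percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = annual_downtime_minutes(pct)
        print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Each additional "9" cuts the allowable downtime by a factor of ten, which is why the maintenance effort needed to sustain it rises so sharply.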
Business and government have invested billions in pursuit of "buying" uptime.
However, an unavoidable fact of life remains: While beginning with a strong
foundation is crucial, ongoing maintenance will make the difference between
success and failure. And the more critical the mission, the more intense the
maintenance program needed.
Critical facility operators have worked valiantly to maintain their physical
infrastructure systems. However, all too often the effort is undermined by
several factors. First, at the highest levels of an organization, profit and
loss pressures often result in short-sighted budget decisions, made without a
full understanding of the long-term implications of maintenance failures. Second,
high-profile "state-of-the-art" technology catches upper-level management’s
interest, while less glamorous maintenance goes under-funded. Finally, no
formalized maintenance standards existed to provide facility operators a
benchmark on which to base their programs... until now.

Lee Technologies’ Tiered Infrastructure Maintenance
Standards (TIMS) For Mission-Critical Environments
Lee Technologies has created a system of infrastructure maintenance
practices and procedures for mission-critical facilities. These Tiered Infrastructure
Maintenance Standards (TIMS) offer a systematic approach to achieving synergy
between maintenance activity levels and the level of reliability expected of the
facility.
The goal of Tiered Infrastructure Maintenance Standards is to provide
organizations a means to evaluate their maintenance programs, understand their
level of risk and effectively allocate their resources.
Success, however, relies on management developing a maintenance
philosophy. This philosophy must align with the organization’s overall performance
goals and must be enforced and managed throughout every aspect of the
maintenance organization.
Every facility and organization is unique, and TIMS will need to be
adapted to the specific environment. However, TIMS provides essential guidelines
and rationale for all those involved in the operation, administration and
management of mission-critical environments.

Defining the Goal

A critical first step is to determine the level of risk acceptable to the
organization. This is a familiar concept to IT professionals. Acceptable risk levels
are regularly employed to determine the level of redundancy in both hardware
and software systems, as well as in the design of the data center infrastructure.
The challenge with setting such a baseline may be significant. Within most
organizations there are often multiple opinions on the impact and cost of
downtime. It is a balancing act, weighing the level of risk, the cost of failure, and
the cost of designing, building and maintaining a facility and infrastructure that
minimizes the risk.

Defining the Standards

Four Maintenance Service Tiers have been established:


• TIMS-1: Run to Fail
• TIMS-2: Unstructured
• TIMS-3: Structured
• TIMS-4: Facilitated

TIMS-1 Run to Fail

This level of service reflects the old adage, "If it isn’t broken, don’t fix it."
Maintenance at this level is essentially reactive. When a problem develops, a
vendor is called to perform the repair. When redundant systems are present,
there may be little or no impact to the critical load for an isolated event.
However, a lack of preventive maintenance often results in overall system
weakness and overloading. This creates a domino scenario, where multiple weak
links overwhelm and defeat system redundancies.
Operating at this level implies that the cost of an outage is low compared to
the cost of higher-level maintenance. Unfortunately, when budgets are tight,
deferring maintenance is often perceived as a way to cut costs. It is in essence a
form of gambling similar to forgoing insurance. Statistically, any perceived
short-term savings in maintenance costs will likely be overshadowed in the long
run by costly outages and expensive repairs.
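The insurance analogy can be expressed as a simple expected-cost comparison. The sketch below is illustrative only: the function name is hypothetical and every dollar figure and probability is an invented placeholder, not data from any facility. An organization would substitute its own estimates.

```python
def expected_annual_cost(maintenance_cost: float,
                         outage_probability: float,
                         outage_cost: float) -> float:
    """Maintenance spend plus the probability-weighted cost of an outage."""
    if not 0.0 <= outage_probability <= 1.0:
        raise ValueError("outage_probability must be between 0 and 1")
    return maintenance_cost + outage_probability * outage_cost


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the shape of the trade-off.
    run_to_fail = expected_annual_cost(0.0, 0.30, 500_000.0)
    preventive = expected_annual_cost(60_000.0, 0.05, 500_000.0)
    print(f"Run to fail: ${run_to_fail:,.0f} expected per year")
    print(f"Preventive:  ${preventive:,.0f} expected per year")
```

With these placeholder inputs the deferred-maintenance option looks cheaper on the budget line but costs more once the outage risk is priced in. The conclusion depends entirely on the estimates supplied.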

TIMS-2 Unstructured Maintenance

TIMS-2 involves the performance of basic preventative maintenance on
critical infrastructure equipment by a qualified vendor or in-house technical staff.
This is the industry norm. Unfortunately, a chain is only as strong as its weakest
link. The lack of structure typically found in this approach can create a false sense
of security. Even qualified personnel can overlook essential or little-known trouble
spots. This often leads to a combination of meticulously maintained components
that are nonetheless vulnerable to a series of small, undetected trouble spots.
Attending to equipment is no guarantee that all of the necessary steps are
being taken to maximize availability. If the maintenance program lacks a detailed
scope of work for each piece of equipment, chances are that important
maintenance items are being neglected. If methods of procedure (MOPs) are
not employed for critical systems, there’s an elevated risk of human error
occurring during maintenance events.
A common characteristic of Unstructured Maintenance is an over-reliance
on individual effort. It is reassuring to rely on a trusted individual who has been
providing maintenance services for years. However, it creates a high degree of
risk when an organization’s facility maintenance knowledge resides inside the
heads of individual technicians.
Unstructured, under-documented maintenance programs create an
environment in which maintenance is more haphazard, and the risk of human
error is elevated.

TIMS-3 Structured Maintenance

The goal of Structured Maintenance is to maximize uptime by eliminating
guesswork and minimizing human error. This is a complex task requiring discipline
and experience to execute. Every part of the maintenance process is closely
evaluated. Procedures are established to exert control over how information is
gathered, acted upon and recorded. Programs are created to identify, train,
supervise and evaluate qualified personnel. Procedures are developed to
precisely manage how and when work is performed.
Structured Maintenance brings together best practices for each maintenance
element and integrates them into a program that is more than the sum of its
parts. The goal is to systematically eliminate variables that can introduce errors.
This maintenance level is extremely proactive.
Characteristics of Structured Maintenance include a formal staff training
program, a document library that includes a scope of service and standard
operating procedures for all site equipment, a change management program that
utilizes methods of procedure for all maintenance activities, a robust vendor
management program, quality control procedures and specialized support
systems such as a Computerized Maintenance Management System (CMMS) and
a Document Management System (DMS).

TIMS-4 Facilitated Maintenance

The highest level of maintenance service is achieved when a Structured
Maintenance program is combined with an architecture that facilitates
maintenance by providing multiple power and cooling distribution paths with
redundant components. This design will allow individual equipment to be isolated
and maintained without a disruption in services. Another important component
is a Building Management System (BMS), which continually monitors the critical
infrastructure, trends equipment performance, alerts operators when conditions
fall outside preset parameters and allows automated control of equipment
sequencing. These systems also provide a controlled means for bringing
equipment on- and off-line for maintenance.
When Structured Maintenance is performed in this environment, the highest
level of reliability is achieved. Automated systems take much of the risk of human
error out of the equation, and can respond more quickly and accurately to
sudden changes. Monitoring of the critical systems and the ability to trend specific
operating parameters facilitates predictive maintenance. The ability to easily
isolate redundant system components for comprehensive testing and
maintenance greatly increases reliability while minimizing the risk of downtime.
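The "alerts operators when conditions fall outside preset parameters" behavior described above can be sketched in a few lines. This is not the interface of any particular BMS product: the point names, limits and function name are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Limit:
    """Preset acceptable range for one monitored point."""
    low: float
    high: float


def out_of_range(readings: Dict[str, float], limits: Dict[str, Limit]) -> List[str]:
    """Return the monitored points whose readings fall outside their preset limits."""
    alerts = []
    for point, value in readings.items():
        limit = limits.get(point)
        # Points without a configured limit are skipped rather than flagged.
        if limit is not None and not (limit.low <= value <= limit.high):
            alerts.append(point)
    return alerts


if __name__ == "__main__":
    # Hypothetical points and limits, for illustration only.
    limits = {
        "ups_output_voltage": Limit(470.0, 490.0),
        "supply_air_temp_c": Limit(18.0, 27.0),
    }
    readings = {"ups_output_voltage": 480.0, "supply_air_temp_c": 29.5}
    print(out_of_range(readings, limits))
```

A real system would also log each reading so that the same data can be trended over time.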

Categorizing Your Maintenance Program
Few maintenance programs fall neatly into a single category. More often,
there will be elements of two or more Maintenance Service Tiers present. One
common example would be a program that embraces Structured Maintenance
on the electrical side of the house, but might be unstructured on the mechanical
side. Another example would be a Tier III-designed facility being run without a
formal change management program. In cases such as these, the weakest link
principle applies: Your overall service level is only as high as the lowest level of
maintenance being performed in any area of your facility.
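The weakest link principle amounts to taking the minimum across areas. A minimal sketch, assuming each area of the facility has already been assigned a tier from 1 to 4 (the area names and function name are hypothetical):

```python
from typing import Dict


def overall_tims_level(area_levels: Dict[str, int]) -> int:
    """Return the facility-wide level: the lowest tier found in any area."""
    if not area_levels:
        raise ValueError("at least one area must be assessed")
    for area, level in area_levels.items():
        if level not in (1, 2, 3, 4):
            raise ValueError(f"{area}: TIMS level must be 1, 2, 3 or 4")
    return min(area_levels.values())


if __name__ == "__main__":
    # Structured on the electrical side, unstructured on the mechanical side.
    print(overall_tims_level({"electrical": 3, "mechanical": 2}))
```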
Without a detailed understanding of high-level maintenance procedures,
evaluating an organization’s maintenance program can be difficult, particularly in
cases where maintenance is managed internally. It is generally preferable to hire an
independent consultant to perform a Mission-Critical Infrastructure Site
Assessment and Risk Analysis.
Certified technicians will provide a comprehensive walkthrough of the facility,
identifying potential points of failure and creating a roadmap for improvement.
The consultant and internal staff will discuss known or suspected reliability
problems, discuss capacity/load planning and review up-to-date facility drawings,
specifications, operation documentation and maintenance records.
The results of such an audit may prove pivotal. The investment in the audit
will be returned many times over by identifying critical steps that will increase the
facility’s reliability.

Making the Most of your Existing Infrastructure

Adding equipment to improve redundancy is one possible path to
increased reliability. However, constraints in capital budgets or physical space often
limit the viability of this solution. In this case, the goal should be to elevate your
Maintenance Service Level to compensate. Applying TIMS-3 principles to an
existing infrastructure will minimize risk and maximize your bottom line. It could
be argued that a Tier II facility operating at TIMS-3 can be more reliable than a
TIMS-2, Tier III facility.

Titanic Tier Data Centers

Organizations with high-tier data centers typically allocate the appropriate
budget levels for maintenance. These organizations acknowledge the risks of an
outage and have invested heavily to add layers of protection. However, this can
lead to a sense of invulnerability, and the resulting complacency often creates a
lack of attention to detail in the maintenance category. In order to fully realize
and maintain the benefits of the original capital investment, time, energy and
resources must be allocated to developing a high TIMS level.
The day-to-day demands placed on internal IT and facilities staff often
preclude the development of high-level maintenance. Even the most
sophisticated organizations may lack the capabilities (skill sets, impartiality,
expertise, resources) to internally develop and manage a TIMS-2, TIMS-3 or
TIMS-4 program.

Considerations

Before a target TIMS can be selected and implemented, the organization’s
cost of downtime and risk-tolerance level must be established. This knowledge is
a prerequisite for the development of a realistic maintenance program.
Ultimately, the TIMS level achievable will be determined by the availability of
resources and the commitment of the organization’s management team.
In preparing to undertake the establishment of an effective TIMS program,
several general considerations must be taken into account:

1) Scope: What specific actions need to be taken to achieve the desired TIMS
tier?
2) Budget: Does your budget allow you to meet your chosen goals?
3) Skills: Do you have the internal skills to manage and perform the activities
required?
4) Impact: What is the impact on your business operation to implement the
plan, and what are the risks?

Conclusion

Now more than ever, the public and private sectors recognize the need to
maximize uptime in mission-critical facilities. Increased focus on disaster recovery
and business continuity has given organizations a much-needed emphasis
on system reliability, including the tiered classification approach to mission-critical
center design and construction.
However, it is time to strengthen the final link in the chain with a systematic
approach to the ongoing maintenance of the facility and its infrastructure. To be
effective, it must be realistic and provide management with a dollars-and-cents
justification to ensure a long-term commitment.
When evaluating the entire scope of the mission-critical enterprise, the
effectiveness of the maintenance program is one of the key components that
must be factored in to determine the true measure of sustained reliability. The
tremendous variability in how maintenance is implemented can make it difficult
to judge what constitutes the proper level of service in a given situation. Defining
maintenance levels is a tool for achieving such an understanding. Matching
high-reliability systems with high maintenance service levels will allow
organizations to achieve the highest levels of reliability and uptime.
The Tiered Infrastructure Maintenance Standards offer a systematic
approach to achieving synergy between maintenance activity levels and the level of
reliability expected of the facility. Applying these principles to your maintenance
program will go a long way toward attaining uptime and overall continuity goals.

About the Authors:

Bob Woolley, Author
Field Operations Manager, Western Region
Lee Technologies
Mr. Woolley has been involved in the critical facilities management field for
over 20 years. Bob served as Vice President of Data Center Operations for
Navisite, as well as Vice President of Engineering for COLO.COM. He was also a
Regional Manager for the Securities Industry Automation Corporation (SIAC)
telecommunications division and operated his own critical facilities consulting
practice. Mr. Woolley has extensive experience in building technical service
programs and developing operations programs for mission-critical operations in
both the telecommunications and data center environments. He may be reached
at bwoolley@leetechnologies.com or (925) 519-1739.

Mike Hagan, Co-Author
Senior Vice President
Lee Technologies
Mr. Hagan has worked in the mission-critical facility industry for over 20
years. From working in-depth with his client base, he has developed an extensive
knowledge of the complex issues and challenges clients face with the service
aspect of their mission-critical facilities and has become an industry advocate on
the importance of preventative maintenance in mission-critical facilities. Mr. Hagan
has a Bachelor of Science in Manufacturing Engineering from Miami University in
Oxford, Ohio. He may be reached at mhagan@leetechnologies.com or (703)
968-0300.

SERVICE                                                       TIMS-1  TIMS-2  TIMS-3  TIMS-4

Critical systems vendor response - 4 hour minimum               X       X       X       X
Spare parts on site (critical systems)                          X       X       X       X
Manufacturer recommended preventative maintenance performed     -       X       X       X
On-site facilities staff                                        -       X       X       X
Walk-through check list (daily)                                 -       X       X       X
Maintenance logs                                                -       X       X       X
Record drawings (up-to-date “as-builts”)                        -       X       X       X
Emergency escalation procedures (documented)                    -       X       X       X
MOPs/SOPs (all critical systems)                                -       -       X       X
Customized maintenance/service scopes                           -       -       X       X
Load bank testing (annual)                                      -       -       X       X
New component testing (prior to installation)                   -       -       X       X
Post-modification system testing                                -       -       X       X
Predictive maintenance program                                  -       -       X       X
Critical systems monitoring (24/7)                              -       -       X       X
Computerized Maintenance Management System (CMMS)               -       -       X       X
On-site facilities staff (24/7)                                 -       -       X       X
Vendor supervision (qualified personnel)                        -       -       X       X
Training program (mission-critical systems)                     -       -       X       X
Electronic Document Management System (DMS)                     -       -       X       X
Maintenance QA process                                          -       -       X       X
Vendor qualification and management program                     -       -       X       X
Building Management System (BMS)                                -       -       -       X
Breaker coordination study (annual)                             -       -       -       X
Integrated System Testing (annual)                              -       -       -       X
Tier III or IV facility design *                                -       -       -       X

*As specified by the Uptime Institute Tier Classification
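The service matrix can also be read as a cumulative checklist. The sketch below records the lowest tier at which each service is marked and reports the highest tier whose services are all in place. Treating the matrix this way, and returning 0 when even the TIMS-1 items are missing, are assumptions made for illustration; the function and variable names are hypothetical.

```python
from typing import Set

# Lowest TIMS tier at which each service in the matrix is marked.
MIN_TIER = {
    "Critical systems vendor response": 1,
    "Spare parts on site": 1,
    "Manufacturer recommended preventative maintenance": 2,
    "On-site facilities staff": 2,
    "Walk-through check list (daily)": 2,
    "Maintenance logs": 2,
    "Record drawings": 2,
    "Emergency escalation procedures": 2,
    "MOPs/SOPs": 3,
    "Customized maintenance/service scopes": 3,
    "Load bank testing (annual)": 3,
    "New component testing": 3,
    "Post-modification system testing": 3,
    "Predictive maintenance program": 3,
    "Critical systems monitoring (24/7)": 3,
    "CMMS": 3,
    "On-site facilities staff (24/7)": 3,
    "Vendor supervision": 3,
    "Training program": 3,
    "DMS": 3,
    "Maintenance QA process": 3,
    "Vendor qualification and management program": 3,
    "Building Management System (BMS)": 4,
    "Breaker coordination study (annual)": 4,
    "Integrated System Testing (annual)": 4,
    "Tier III or IV facility design": 4,
}


def achieved_tier(services_in_place: Set[str]) -> int:
    """Return the highest tier whose cumulative services are all present (0 if none)."""
    achieved = 0
    for tier in (1, 2, 3, 4):
        required = {name for name, minimum in MIN_TIER.items() if minimum <= tier}
        if required <= services_in_place:
            achieved = tier
        else:
            break  # a gap at this tier caps the result, whatever else is in place
    return achieved


if __name__ == "__main__":
    everything_but_logs = set(MIN_TIER) - {"Maintenance logs"}
    print(achieved_tier(everything_but_logs))
```

Because a single missing TIMS-2 item caps the result at TIMS-1 regardless of what else is in place, the check mirrors the weakest link principle described earlier.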

Glossary of Terms

Building Management System (BMS): A system designed and
implemented to control and monitor the functions of a building and its
associated plant.

Breaker Coordination: Coordinated adjustment of circuit breaker settings to
ensure that over-current conditions occurring anywhere in the system are
handled at the lowest possible level and do not cause out-of-sequence or
cascading circuit interruptions. These settings need to be checked regularly, as
changes in load may result in unbalanced conditions.

Critical Systems Monitoring: Electronic monitoring of critical equipment
status and major data points, such as generator crankcase temperature, UPS
voltage and current, etc. The main purpose of system monitoring is to report any
out-of-tolerance conditions that either indicate a service-affecting condition or
may be a precursor indication. Valuable historical operating information is also
accumulated in the process.

Critical Systems Vendor Response: The contractual SLA for critical system
vendor response time in the event of a facility emergency.

Customized Maintenance/Service Scopes: A detailed listing of all the
maintenance activities required for a specific piece of equipment and the
frequency of each activity. This list usually includes the manufacturer's suggested
maintenance, but may also take into account the equipment history, experience
of the service personnel and special application requirements.
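A scope of this kind, with each activity paired with its frequency, maps naturally onto a small data structure that a CMMS-style tool could use to flag overdue work. The sketch below is a simplified illustration: the activities, intervals and names are hypothetical, and an activity with no recorded completion is treated as overdue by assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class ScopeItem:
    """One maintenance activity and how often it must be performed."""
    activity: str
    interval_days: int


def next_due(item: ScopeItem, last_done: date) -> date:
    """Date on which the activity is next due."""
    return last_done + timedelta(days=item.interval_days)


def overdue(scope: List[ScopeItem], last_done: Dict[str, date], today: date) -> List[str]:
    """Return activities that are past due, or that have never been recorded."""
    late = []
    for item in scope:
        done = last_done.get(item.activity)
        if done is None or next_due(item, done) < today:
            late.append(item.activity)
    return late


if __name__ == "__main__":
    # Hypothetical scope for a single piece of equipment.
    scope = [ScopeItem("Infrared scan", 365), ScopeItem("Battery inspection", 90)]
    history = {"Battery inspection": date(2024, 1, 1)}
    print(overdue(scope, history, date(2024, 4, 1)))
```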

CMMS: Computerized Maintenance Management Systems (CMMS) are
software applications that schedule, track and monitor maintenance activities and
provide cost, personnel and other reporting data and history.

DMS: Document Management Systems (DMS) are software applications that
scan, store and retrieve documents that are used by an organization. In the
facilities environment, these documents are typically Operations and Maintenance
manuals, maintenance schedules, MOPs, SOPs, service reports, etc.

Emergency Escalation Procedures: Documented procedures for bringing
additional resources to bear in the event of an emergency. This list should be
hosted at a centralized operations center, such as a NOC, and be made available
to each member of the critical environment team.

Infrared Scan: A periodic thermal imaging of critical electrical components
such as circuit breakers. The results of this non-invasive testing can pinpoint hot
spots, which are indicators of poor connections or failing components, before
they become a service-affecting issue.

Integrated System Testing: Integrated System Testing (IST) verifies that all of
the mechanical, electrical, control and safety systems work as designed, during
normal operations as well as during multiple systems failures. IST provides
baseline information on the operation of the facility during all anticipated modes
of operation and can pinpoint any weaknesses that may not be discovered
during the normal commissioning process.

Load Bank Testing: The process of placing a resistive load on the output of an
electrical system in order to test its operation at various stages of loading. This is
employed to validate the operation of partially loaded systems, to test
equipment prior to placing it in service and to create high heat loads to load test
HVAC systems.

Maintenance Logs: Complete equipment lists, including serial numbers, service
provider, type of service contract, system location, system tag information, current
state of firmware/software and maintenance history.

Maintenance QA: Quality Assurance processes are used to verify the
expected outcome of a particular service activity and to reduce the risk of
service-related failures. This includes utilizing a MOP review process, pre-testing
components prior to installation, post-service testing and quality checking the
finished work.

Manufacturer's Recommended Service: Preventative maintenance
activities for specific pieces of equipment as set forth in the manufacturer's
Operations & Maintenance instructions.

MOP: A Method of Procedure (MOP) is a detailed work document that is
utilized to perform maintenance on critical systems. The MOP specifies what
equipment is being worked on, who will be performing the procedure and what
tools and safety procedures are necessary. It also describes the risk, lists the
step-by-step procedure, identifies backout procedures and escalation protocols,
contains authorization signatures and records maintenance data.

Onsite Facilities Staff: Dedicated on-site facilities staff that focuses on the
site critical systems. This group performs daily walkthroughs, manages vendors
and performs some level of self-performed service. The facilities staff is
responsible for creating and maintaining all of the site documentation, including
MOPs, SOPs and emergency procedures. This staff may or may not be providing
24x7 coverage, depending on the level of service required.

New Component Testing: Pre-testing of components prior to installation in
critical systems. This testing can be performed on-site when possible, but may
need to be done at the factory with appropriate documentation provided.

Post-modification System Testing: Comprehensive testing of a system that
has had a component change to ensure that the unit is performing within
specifications.

Predictive Maintenance: Maintenance activity that's designed to identify
precursor indications of equipment wear or failure. Early warning provided by
predictive maintenance can be used to budget and plan maintenance activities in
advance of the need to actually perform the service. This increases efficiency and
reduces the risk of unplanned outages.
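One simple way to turn trended data into an early warning is to fit a straight line to recent readings and estimate how long it will take to reach an alarm limit. The sketch below uses an ordinary least-squares slope over equally spaced samples. It is only one possible technique, not a method prescribed by TIMS, and the names, readings and limit are hypothetical.

```python
from typing import List, Optional


def slope_per_sample(values: List[float]) -> float:
    """Least-squares slope of equally spaced readings (units per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two readings are required")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def samples_until_limit(values: List[float], limit: float) -> Optional[float]:
    """Estimate samples remaining before the trend reaches the limit.

    Returns None when the readings are flat or falling.
    """
    slope = slope_per_sample(values)
    if slope <= 0:
        return None
    return max(0.0, (limit - values[-1]) / slope)


if __name__ == "__main__":
    # Hypothetical weekly temperature readings and alarm limit.
    readings = [40.0, 41.0, 42.0, 43.0]
    print(samples_until_limit(readings, limit=50.0))
```

An estimate like this lets the service be scheduled and budgeted before the alarm limit is reached, rather than after.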

Qualified Personnel: Maintenance personnel (either in-house or contracted)
that have undergone a qualification process to verify their competence on the
equipment that they are servicing. Typically this would involve some type of
certification or licensing.

Record Drawings: Up-to-date architectural, electrical, mechanical and
equipment layout drawings that accurately reflect the facility as it was actually
built, plus any adds, moves or changes that have occurred up to the present day.

SOP: A Standard Operating Procedure (SOP) is a document that is used to
describe specific steps to be taken to implement a well-understood and defined
process. An example would be putting a UPS into bypass or putting a fire system
in test mode.

Tier Rating System: A rating system developed by the Uptime Institute to
classify facility infrastructure reliability in four levels or tiers, from lowest (Tier I)
to highest (Tier IV).

Training Program: A formal and comprehensive staff training program that
defines various levels of qualification along with a rigorous testing and
certification process. This is used in conjunction with a matrix that identifies
specific maintenance tasks and the qualification level required to perform
each of them.

Vendor Management Program: A systematic program of vendor
identification, selection, management and evaluation. The purpose is to find
competent vendors, document their qualifications, clearly specify the scope of
their activities, obtain competitive pricing, monitor their performance and provide
feedback.

Walk-through Check List: A detailed list of critical systems and facility
infrastructure equipment, containing fields for inputting data (such as voltage,
temperature and pressure) or status checks. This list is used to perform periodic
walk-throughs of the facility to monitor status and create a written record of
critical system settings and values.

