Enterprise maintenance strategy for IBM AIX running on IBM Power Systems servers

Nathan Edwards March 11, 2013

Systems maintenance plays an essential role within every enterprise to ensure supportability
and system reliability in order to maximize the availability of the services that make up the
technical landscape. With multiple components to consider (IBM® AIX®, database, application,
and so on), juggling systems maintenance requirements with service level agreements (SLAs)
can be challenging, but by aligning a schedule of activities, the frequency of planned outages
can be reduced. This article looks at the importance of treating the service as an entity and with
it, creating an enterprise maintenance strategy that maximizes system availability.

Introduction
Systems software maintenance is similar to "Painting the Forth Bridge". When reaching the end of
a round of updates, it is time to start the next.

In a world where software updates are released almost daily and mean time between failures
(MTBF) is literally a waiting game before an item of hardware breaks, the endless cycle of planning
and fixing looks very much like it is here to stay.

Generally, when it comes to technology, lightning performance and around-the-clock service have
become an expectation, without exception.

Waiting for the response from a mouse-click or for a script or job to finish is frustrating, and system
unavailability is intolerable. What this means for any business is a necessity to set sail for Uptime
Utopia, to strive for seamless, instant, and endless availability.

Outages, regardless of the cause, have consequences. The ideal of "100% uptime" is eroded and
the cost to the business can be measured both in money and in reputation.

Many businesses are dependent on their IT systems to the point where extended outages have
dire commercial consequences. The dreaded word "downtime" has the uncanny ability to make
months and months of uptime and reliability appear relatively insignificant.

© Copyright IBM Corporation 2013 Trademarks


developerWorks® ibm.com/developerWorks/

What immediately comes to mind when there is an outage is that something has broken; a system
has failed, an application has crashed, and so on. In reality, planned outages are responsible for
a fair proportion of downtime and as such, all intentional disruption should be minimized as far as
possible.

Numerous articles outline the importance of system maintenance and the perils of not
addressing it. A very good example was written by Anthony English.

What is sometimes not obvious is that we do not simply manage AIX or the Virtual I/O Server
(VIOS), we manage parts of the service, and there are a number of other factors to consider.

This article expands on the other factors and components to help define an enterprise
maintenance strategy for AIX on Power Systems™ servers.

The key objectives of the strategy are:

• To minimize planned disruptions to the service
• To increase system availability through careful planning

Effective enterprise maintenance


Defining the service
Enterprise-class environments on IBM Power servers consist of multiple components. A typical
example is as follows:


Figure 1: Example of an enterprise-class environment

Regardless of whether the functionality of the application is hosting a company website or
processing a critical overnight batch, every component is required in order to deliver the outcome.

In the same way that a car is not of much use without an engine, and an engine without a car is
hardly convenient, if AIX is up but the application is down, functionality is affected.

Every component has some form of maintenance associated with it, whether it is related to
software, hardware, or both. It is common for large organizations to have multiple teams looking
after specific components, and from experience, it is not unusual for each team to have its own
strategy for maintenance.

The result is that throughout the course of the year, service can be affected by multiple planned
outages as a consequence of having multiple maintenance schedules.

Minimizing the number of planned outages increases service availability. Achieving this requires
every team to consider the service as an entity as opposed to the specific component that their
team manages.

Developing an enterprise maintenance strategy for AIX on Power (which addresses the key
objectives), requires an understanding of the maintenance plans and schedules relating to every
component. The outcome is to combine or rearrange activities so that a schedule of limited
outages can be developed, in accordance with specific business needs.
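The effect of combining activities can be illustrated with a minimal sketch. The component names and months below are hypothetical values chosen for illustration, not data from the environment described in this article, and the helper functions are assumptions rather than part of any IBM tool.

```python
from collections import defaultdict

# Hypothetical plan for one service: (component, month of planned outage).
# Each team has picked its own month in isolation.
activities = [
    ("firmware", 2),
    ("application", 4),
    ("aix", 6),
    ("database", 9),
]

def outages_by_month(plan):
    """Group components by the month in which they take the service down."""
    grouped = defaultdict(list)
    for component, month in plan:
        grouped[month].append(component)
    return dict(grouped)

def combine(plan, month):
    """Move every activity into a single agreed month."""
    return [(component, month) for component, _ in plan]

separate = outages_by_month(activities)
combined = outages_by_month(combine(activities, 6))

print(len(separate), "planned outages when scheduled in isolation")
print(len(combined), "planned outage when combined")
```

Counting the distinct months in each plan shows how many times the service is interrupted, which is the figure the strategy aims to reduce.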


A plain example
The following relates to a situation which was experienced when working with a very large IBM
Power environment.

The environment consisted of multiple Power servers, each of which ran multiple AIX logical
partitions (LPARs) for a variety of business units.

A planned outage that affected a Power server would impact multiple business units and therefore,
outages were very difficult to arrange.

Each business unit was represented by an application owner who was responsible for service
availability. One of the application owners represented the largest and most important applications
for the company, and during one particular meeting, it was highlighted that recent outages included
the following disruptions:

• IBM Power Systems™ firmware upgrade: Full Power Systems stop/start. Outage for the LPAR
and service.
• Application release: Outage for the application and service.
• AIX update: Reboot required. Outage for the LPAR and service.
• Database update: Outage for the application and service.

The application was in constant use around the world and any outage, planned or unplanned, had
a significant impact.

This prompted other application owners to evaluate outages affecting their services and the
outcome was similar: Multiple planned outages, each relating to a different component.

In each case, the application owners were asked at short notice for downtime, which was difficult
for them to arrange with the business community.

The beginning of a combined strategy

It was not too long before the IT Service Manager arranged a meeting with the technical teams to
discuss planned outages for the remainder of the year.

As the meeting progressed, it became clear that service-affecting work was being scheduled by
almost all of the teams, and in most cases, in complete isolation from one another. An objective was
set: Devise a calendar-based schedule, where updates and maintenance for the core components
of the service were planned.

Communication was built into the schedule to request outages to the service far in advance. This
eliminated the frustration associated with arranging outages at short notice.

The schedule incorporated flexibility to allow businesses to choose a suitable time within a four-
month window for planned outages.


An annual system-wide outage was built in, reserved for disruptive tasks such as a Power
firmware upgrade or hardware replacements that were not hot-swappable (such as processor
replacement, and so on). The outage would only be used if required, but it was essential to cater
for this type of maintenance, should the need arise.

Where possible, PowerHA failovers were used to minimize downtime ahead of a maintenance
window: clustered LPARs were failed over to the failover node so that services experienced
only the shorter outage of the failover, as opposed to the longer outage associated with the
maintenance. From IBM POWER6® onward, IBM PowerVM® Live Partition Mobility (LPM) became an
additional tool for keeping critical applications up, by moving LPARs away from the Power
server on which maintenance was scheduled.

After finalizing the strategy, it became possible to offer the businesses the option to combine
multiple components into a single planned outage per year, with the understanding that a second
outage might become necessary if there were technical reasons to apply fixes.

The enterprise maintenance strategy for AIX on Power was defined within the attached
spreadsheet:

Download spreadsheet

The spreadsheet is organized from the top row, starting at January and working across to
December.

Column A contains the components that collectively formed the service, with a schedule of
activities to occur throughout the year. For the sake of minimizing the number of components,
other software (such as PowerHA) was included in AIX Image Packaging and Delivering AIX
Package.

The maintenance strategy included important aspects of managing service availability in an
enterprise environment: The cycle of communication, evaluation of fixes, the testing and packaging
of updates, and finally, deployment.

Known disruptive outages were defined as follows:

• Power Systems outage: February was the only planned disruptive outage, reserved for
essential tasks that require an outage to the Power server, such as disruptive firmware
upgrades (concurrent updates were performed in preference). The planned disruptive
outage was also a time to correct hardware failures that had occurred throughout the year and
were not hot-swappable (processor failure, and so on).
• AIX update: Definite Technology Level update and a possible service pack update, with the
option to combine with any other outage.
• Database application updates: Option to combine with any other outage.
• Application updates: Option to combine with any other outage.

From an AIX and VIOS perspective, the strategy was to apply one update per year, with the
possibility of a second (service pack or important fix) update mid-year. The mid-year outage
was planned for in advance, but would only be used if it was considered necessary for technical
reasons.

Aligning the maintenance activities of various teams meant that testing periods were extended,
providing a more resilient base for the application.

Publishing the enterprise maintenance strategy to the business community provided benefits too.
The individual businesses could plan their application testing and release to coincide with any
other planned outages.

After implementing the enterprise maintenance strategy, it became clear that the infrastructure was
being managed seamlessly from an enterprise perspective as opposed to an isolated component-
based view.

The process of defining a maintenance strategy


The benefits of having an enterprise maintenance strategy include improved service availability
by combining maintenance tasks, increased reliability through regular maintenance, and positively
raising the profile of the technical teams responsible for managing the environment.

This section focuses on the processes and topics that assist with defining an enterprise
maintenance strategy for AIX on Power.

Define a maintenance strategy owner

A calendar-based schedule is likely to result in one or more activities taking place every month.
Defining a maintenance strategy owner (who is supported by management and the business
community) ensures that the schedule remains on track.

The maintenance strategy owner is responsible for the following activities:

• Ensures that outages are planned with the businesses in advance.
• Provides communication to the business one month in advance to confirm the impact of planned
outages.
• Liaises with the technical teams to ensure that the testing and packaging phases are
completed on schedule.
• Tracks the progress of updates for individual services and LPARs to ensure that every LPAR
is being updated on time.
• Reviews the enterprise strategy annually so that it remains relevant and achievable.

Plan for realistic scenarios

When devising a strategy, consider business needs in relation to the IBM AIX release strategy.

IBM releases one AIX Technology Level per year, approximately four service packs per year, and
AIX Technology Levels receive fixes for three years.

Refer to AIX release and service delivery strategy for further information.
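As a rough planning aid, the three-year fix period can be turned into a date check. This is a minimal sketch: the release date used below is a hypothetical placeholder, and the function names are assumptions for illustration. Actual dates should be taken from the AIX release and service delivery strategy.

```python
from datetime import date

# From the article: AIX Technology Levels receive fixes for three years.
FIX_SUPPORT_YEARS = 3

def fixes_end(release: date) -> date:
    """Return the date on which the three-year fix period ends."""
    target_year = release.year + FIX_SUPPORT_YEARS
    try:
        return release.replace(year=target_year)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return release.replace(year=target_year, day=28)

def still_receives_fixes(release: date, today: date) -> bool:
    """True while the Technology Level is inside its fix period."""
    return today < fixes_end(release)

# Hypothetical Technology Level release date, for illustration only.
tl_release = date(2012, 10, 1)
print(fixes_end(tl_release))
print(still_receives_fixes(tl_release, date(2013, 3, 11)))
```

A check like this helps the maintenance strategy owner spot LPARs whose Technology Level will leave its fix period before the next planned outage.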


As a guide:

• Plan for at least one outage per year and recognize that additional outages are
sometimes unavoidable.
Remain realistic: Emergency critical fixes that affect any one of the key components (AIX,
database, application, and firmware) may need to be applied to the environment at short
notice.
• Advising of "more uptime" is easier than requesting "more downtime".
This is important! If it is decided that the mid-year scheduled outage is no longer necessary, it
can be communicated to the business in advance, according to the schedule.
• Recognize planned outages as an important part of providing stability.
Under certain circumstances, it is necessary to arrange downtime in order to carry out
planned disruptive software maintenance as well as to replace hardware that has failed
unexpectedly. As the shared infrastructure becomes more complex, the need for a manageable
maintenance strategy increases.

Communication is essential

• Define an outage window for known service disruption and communicate in advance.
The attached spreadsheet indicates that planned outages were arranged 6 months in
advance, with a four-month window to schedule an outage. For example, communication was
sent to the individual businesses in June for them to choose outages from anywhere between
April and July the following year.
Some businesses preferred everything to be updated at once (one planned outage per
year) and other business units stipulated that each component be updated
individually, during separate outages. The choice was left with the individual business, and
they understood the increased complexities associated with problem analysis when changing
multiple components at the same time.
Advance planning can also assist with scheduling technical resources to carry out
maintenance. Trying to find technical resources on a Friday afternoon to perform upgrades on
Saturday is not advance planning!
Actively following the maintenance strategy keeps the technical environment consistent
(everything on the same software level) and up-to-date (supported by vendors).

• Define an outage window for potential service disruption and communicate in advance.
The spreadsheet shows that a window for mid-year service pack updates was defined to
apply unexpected fixes.
One month before the mid-year service pack update, it was evaluated whether or not an
outage was necessary. If a service pack or a critical fix was not required,
notification was sent to the businesses one month in advance that no outage was necessary.
Again, it is easier to advise the business of "more uptime" than to request "more downtime".
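The one-month review point for the mid-year window can be sketched as a small date calculation. The window start date and the message wording are hypothetical, and the function names are assumptions introduced for illustration only.

```python
import calendar
from datetime import date

def one_month_before(day: date) -> date:
    """Return the same day in the previous month, clamped to that month's length."""
    if day.month == 1:
        year, month = day.year - 1, 12
    else:
        year, month = day.year, day.month - 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))

def mid_year_notice(window_start: date, fix_required: bool) -> str:
    """Build the message sent to the businesses at the review point."""
    review = one_month_before(window_start)
    if fix_required:
        return f"{review}: outage confirmed for the window starting {window_start}"
    return f"{review}: no outage needed for the window starting {window_start}"

# Hypothetical start of the mid-year window, used only to exercise the functions.
print(mid_year_notice(date(2013, 7, 1), fix_required=False))
```

Keeping the review date tied to the window start means that a "more uptime" message always reaches the businesses with the same notice period as a confirmation would.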

Testing and packaging of enterprise components

• Packaging VIOS, AIX, and database components:
The maintenance strategy should include time for extended and thorough testing.
Fine-tuning, rehearsing, and automating the update process on test LPARs improves
confidence in the solution by reducing technical issues and minimizing downtime.
• Host bus adapter (HBA) and Ethernet microcode:
Although frequently overlooked, adding adapter microcode to the maintenance schedule
ensures a regular review against vendor interoperability matrices, for supportability and
important fixes.

• Implementation - AIX, database, and firmware:
Implementation windows are defined to achieve the following tasks:
• Implement necessary updates with an option for individual businesses to combine
components to reduce the number of planned outages, or arrange separate outages for
each component.
• Introduce a tightly-managed environment. After a maintenance strategy is defined and
followed, the technical environment as a whole becomes tightly managed by ensuring
that all services are at the defined supported level. Managing fewer software levels leads
to reduced testing and development time, as there will be fewer combinations to test.

Remain realistic

• Realize that there will be exceptions:
It is inevitable that one or more LPARs will not be updated, for a variety of reasons:
application code might restrict an LPAR to a specific database or AIX level, or a business
might be unable to commit to prearranged outages because of unforeseen commercial reasons.
These LPARs should be managed as exceptions, which can be reported on with a view to
updating them as soon as possible.
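An exception report of this kind can be sketched as a comparison between each LPAR's level and the defined target level. The sketch assumes level strings in the release-TL-SP-build form reported by the AIX `oslevel -s` command. The LPAR names, the levels, and the way the inventory is gathered are hypothetical.

```python
def parse_level(level: str) -> tuple:
    """Split a level such as '7100-02-02-1316' into (release, TL, SP) integers."""
    release, technology_level, service_pack = level.split("-")[:3]
    return (int(release), int(technology_level), int(service_pack))

def exceptions(inventory: dict, target: str) -> list:
    """Return the names of LPARs that are below the target level, sorted by name."""
    wanted = parse_level(target)
    return sorted(
        name for name, level in inventory.items() if parse_level(level) < wanted
    )

# Hypothetical inventory: LPAR name -> level string collected from each LPAR.
inventory = {
    "lpar01": "7100-02-02-1316",
    "lpar02": "7100-01-05-1228",
    "lpar03": "7100-01-04-1216",
}

print(exceptions(inventory, target="7100-02-02-1316"))
```

Comparing the parsed tuples rather than the raw strings keeps the ordering numeric, so the build-date suffix does not influence whether an LPAR is reported.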

Remain informed

IBM continually improves the reliability of services running on IBM Power servers by frequently
releasing software and firmware updates for components including the Hardware Management
Console (HMC), VIOS, AIX, PowerHA, Power server, adapter microcode, and so on. Some
updates enable features, others correct known issues.

Regularly reviewing available software updates is an important part of an effective
enterprise maintenance strategy, and this is simplified by taking advantage of the IBM My
Notifications subscription service.

The service has multiple configuration options that allow for receiving updates on topics relevant to
a specific environment, at a frequency that is right for you.

For example, by using the subscription service, it is possible to set up daily or weekly email alerts
for new system firmware on specific Power server types (Power 770, Power 780, and so on) as
well as email alerts for specific AIX and VIOS versions. The emails are delivered to your inbox
when a new update becomes available.


When IBM releases software or firmware, it is usually associated with a 'severity', which
indicates the importance of the update, along with a description of the problem.

Notable entries are as follows:

Acronym   Explanation              Definition
HIPER     High Impact/PERvasive    Should be installed as soon as possible.
SPE       SPEcial Attention        Should be installed at earliest convenience.
                                   Fixes for low-potential, high-impact problems.
ATT       ATTention                Should be installed at earliest convenience.
                                   Fixes for low-potential, low-to-medium impact
                                   problems.

If required, refer to the full list of definitions.

The information that IBM provides when releasing software updates allows you to evaluate the risk
to your environment and to be aware of the updates to include during scheduled maintenance.
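One way to use the severity when preparing a maintenance window is to sort the candidate fixes by it. This is a minimal sketch: the fix identifiers are hypothetical, and ranking SPE ahead of ATT is an assumption based on the impact wording in the definitions, not an ordering stated by IBM.

```python
# Lower number = review first. HIPER is "as soon as possible" per the definitions;
# placing SPE before ATT is an assumption based on high versus low-to-medium impact.
SEVERITY_ORDER = {"HIPER": 0, "SPE": 1, "ATT": 2}

def triage(fixes):
    """Sort (fix_id, severity) pairs by severity; unknown severities go last."""
    unknown_rank = len(SEVERITY_ORDER)
    return sorted(fixes, key=lambda fix: SEVERITY_ORDER.get(fix[1], unknown_rank))

# Hypothetical fix identifiers, for illustration only.
candidates = [
    ("fix-a", "ATT"),
    ("fix-b", "HIPER"),
    ("fix-c", "SPE"),
    ("fix-d", "OTHER"),
]

for fix_id, severity in triage(candidates):
    print(severity, fix_id)
```

Because the sort is stable, fixes with the same severity keep the order in which they were listed.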

Refer to the My Notifications website.

Summary
It is widely acknowledged that system maintenance is an integral part of managing an
environment, for supportability and stability reasons.

With increased reliance on IT systems and the accompanying demands for availability, the number
and frequency of outages can be minimized by introducing an enterprise maintenance strategy
that encompasses the components of a service.

Reviewing the latest software and firmware updates is an important part of the enterprise
maintenance strategy. IBM simplifies this process by providing access to the My Notifications
subscription service, an easy way to receive the latest information that is relevant to
your environment.

Outages can be avoided by using IBM PowerVM® Live Partition Mobility to relocate running services
to an alternative Power server. Planned downtime can be minimized by high availability solutions
so that clustered LPARs can be moved during a short planned window ahead of scheduled
maintenance. These factors can be incorporated into the enterprise maintenance strategy that is
relevant to your environment.

Resources
• IBM My Portal: Enables you to stay up-to-date with fixes by configuring a subscription service,
manage service requests, and more.
• AIX release and service delivery strategy: Provides information relating to the release strategy
of AIX.

• AIX Service Strategy Details and Best Practices: Provides a list of helpful resources and
information relating to the components of AIX, such as technology levels, service packs, and
so on.
• Fix Level Recommendation Tool: Enables you to access fixes and updates from IBM for any
component, including adapter microcode, system firmware, AIX and VIOS. For AIX, there is
a Compare Report option. A file that contains installed file sets is uploaded using the link and
the report advises on recommended updates and fixes which are not currently applied.
• System Software Maps: Provides information on software levels in relation to AIX, IBM i, and
PowerVM Virtual I/O Server.
• Power Code Matrix: Provides at-a-glance information relating to System firmware and HMC
code.

© Copyright IBM Corporation 2013 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)
