
RAC & ASM Best Practices

You Probably Need More than just RAC

Kirk McGowan
Technical Director RAC Pack
Oracle Server Technologies
Cluster and Parallel Storage Development

Agenda

- Operational Best Practices (IT MGMT 101)
- Background
  - Requirements
  - Why RAC implementations fail
  - Case study
  - Criticality of IT Service Management (ITIL) process
- Best Practices
  - People, Process, AND Technology

Why do people buy RAC?

- Low-cost scalability
  - Cost reduction, consolidation, infrastructure that can grow with the business
- High availability
  - Growing expectations for uninterrupted service

Why do RAC Implementations Fail?

- RAC (scale-out clustering) is new technology
  - Insufficient budget and effort is put toward filling the knowledge gap
- HA is difficult to do, and cannot be done with technology alone
  - Operational processes and discipline are critical success factors, but are not addressed sufficiently

Case Study

Based on true stories. Any resemblance, in full or in part, to your own experiences is intentional and expected. Names have been changed to protect the innocent.

Case Study
Background

- 8-12 months spent implementing 2 systems: somewhat different architectures, very different workloads, identical tech stacks
- Oracle expertise (Development) engaged to help flatten the tech learning curve
- Non-mission-critical systems, but important elements of a larger enterprise re-architecture effort
- Many technology issues encountered across the stack, and resolved over the 8-12 month implementation
  - HW, OS, storage, network, RDBMS, cluster, and application

Case Study
Situation

- New mission-critical deployment using the same technology stack
  - Distinct architecture, application development teams, and operations teams
  - Large staff turnover
- Major escalation, post-production
  - CIO: "Oracle products do not meet our business requirements"
  - "RAC is unstable"
  - "DG doesn't handle the workload"
  - "JDBC connections don't fail over"

Case Study
Operational Issues

- Requirements (aka SLOs) were not defined
  - e.g., a claimed failover time of 20s, while the application logic allowed 80s for failover and cluster failure detection time alone was set to 120s (detection alone already exceeded the stated objective)
- Inadequate test environments
  - Problems were encountered first in production, including the fact that SLOs could not be met
- Inadequate change control
  - Lessons learned in previous deployments were not applied to the new deployment, leading to rediscovery of the same problems
  - Some changes were implemented in test but never rolled into production, leading to recurring problems (outages) in production
  - No process for confirming that a change actually fixes the problem prior to implementing it in production

Case Study
More Operational Issues

- Poor knowledge transfer between internal teams
  - Configuration recommendations, patches, and fixes identified in previous deployments were not communicated
  - Evictions are a symptom, not the problem
- Inadequate system monitoring
  - OS-level statistics (CPU, I/O, memory) were not being captured
  - Impossible to perform root cause analysis on many problems without the ability to correlate cluster/database symptoms with system-level activity
- Inadequate support procedures
  - Inconsistent data capture
  - No on-site vendor support consistent with the criticality of the system
  - No operations manual covering:
    - Managing and responding to outages
    - Responding to and restoring service after outages

Overview of Operational Process Requirements

What are ITIL Guidelines?
ITIL (the IT Infrastructure Library) is the most widely accepted approach to IT service management in the world. ITIL provides a comprehensive and consistent set of best practices for IT service management, promoting a quality approach to achieving business effectiveness and efficiency in the use of information systems.

IT Service Management

IT Service Management = Service Delivery + Service Support
- Service Delivery is partly concerned with setting up agreements and monitoring the targets within those agreements.
- Service Support processes can be viewed as delivering the services laid down in those agreements.

Provisioning of IT Service Mgmt

- In all organizations, IT service management must be matched to current and rapidly changing business demands. The objective is to continually improve the quality of service, aligned to the business requirements, cost-effectively. To meet this objective, three areas need to be considered:
  - People with the right skills, appropriate training, and the right service culture
  - Effective and efficient Service Management processes
  - Good IT infrastructure in terms of tools and technology
- Unless People, Processes, and Technology are considered and implemented appropriately within a steering framework, the objectives of Service Management will not be realized.

Service Delivery

- Financial Management
- Service Level Management
  - Severity/priority definitions (e.g., Sev1, Sev2, Sev3, Sev4)
  - Response time guidelines
  - SLAs
- Capacity Management
- IT Service Continuity Management
- Availability Management

Service Support

- Incident Management
  - Incident documentation and reporting, incident handling, escalation procedures
- Problem Management
  - RCAs, QA, and process improvement
- Configuration Management
  - Standard configs, gold images, CEMLIs
- Change Management
  - Risk assessment, backout, software maintenance, decommission
- Release Management
  - New deployments, upgrades, emergency releases, component releases

BP: Set & Manage Expectations

- Why is this important?
  - Expectations with RAC are different at the outset
  - HA is as much (if not more) about processes and procedures as it is about the technology
  - No matter what technology stack you implement, on its own it is incapable of meeting stringent SLAs
- Must communicate what the technology can AND can't do
- Must be clear on what else needs to be in place to supplement the technology if HA business requirements are going to be met
- HA isn't cheap!

BP: Clearly Define SLOs

- Sufficiently granular
  - You cannot architect, design, OR manage a system without clearly understanding the SLOs
  - "24x7" is NOT an SLO
- Define HA/recovery time objectives, throughput, response time, data loss, etc. (see the sketch after this list)
  - These need to be established with an understanding of the cost of downtime for the system
  - RTO and RPO are key availability metrics
  - Response time and throughput are key performance metrics
- Must address different failure conditions
  - Planned vs. unplanned
  - Localized vs. site-wide
- Must be linked to the business requirements
  - Response time and resolution time
- Must be realistic
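As one way to make this concrete, the sketch below captures granular SLOs per failure condition. The ServiceLevelObjective structure, its field names, and every number in it are hypothetical illustrations, not recommendations from these slides.

```python
from dataclasses import dataclass

@dataclass
class ServiceLevelObjective:
    """One SLO per failure scenario; every value here is a made-up example."""
    scenario: str            # failure condition the objective applies to
    rto_seconds: int         # recovery time objective
    rpo_seconds: int         # recovery point objective (tolerable data loss window)
    response_ms: int         # response time target under the scenario
    min_throughput_tps: int  # minimum acceptable throughput

# Distinct objectives for planned vs. unplanned and localized vs. site-wide events
slos = [
    ServiceLevelObjective("unplanned, localized (single node failure)", 120, 0, 500, 800),
    ServiceLevelObjective("unplanned, site-wide (failover to DR site)", 3600, 30, 800, 400),
    ServiceLevelObjective("planned, localized (rolling patch)", 0, 0, 800, 600),
]

for slo in slos:
    print(f"{slo.scenario}: RTO={slo.rto_seconds}s RPO={slo.rpo_seconds}s "
          f"response<={slo.response_ms}ms throughput>={slo.min_throughput_tps}tps")
```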

Manage to the SLOs

- Definitions of problem severity levels
- Documented targets for both incident response time and resolution time, based on severity
- Classification of applications w.r.t. business criticality
- Establish an SLA with the business
  - Negotiated response and resolution times
  - Definition of metrics, e.g. "Application Availability shall be measured using the following formula: (Total Minutes in a Calendar Month minus Unscheduled Outage Minutes minus Scheduled Outage Minutes in such month) divided by Total Minutes in a Calendar Month" (see the sketch after this list)
  - Negotiated SLOs
  - Effectively documents expectations between IT and the business
- Incident log: date, time, description, duration, resolution
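A minimal sketch of that availability calculation, driven from a hypothetical incident log; the field names, dates, and durations are invented for illustration.

```python
def monthly_availability(total_minutes, unscheduled_outage_minutes, scheduled_outage_minutes):
    """Availability per the SLA formula above, as a fraction of the month."""
    return (total_minutes - unscheduled_outage_minutes - scheduled_outage_minutes) / total_minutes

# Hypothetical incident log entries for a 30-day month (43,200 minutes)
incident_log = [
    {"date": "2007-03-04", "time": "02:10", "description": "node eviction", "duration_min": 22, "scheduled": False},
    {"date": "2007-03-17", "time": "23:00", "description": "rolling patch", "duration_min": 45, "scheduled": True},
]

unscheduled = sum(i["duration_min"] for i in incident_log if not i["scheduled"])
scheduled = sum(i["duration_min"] for i in incident_log if i["scheduled"])

print(f"Application availability: {monthly_availability(43_200, unscheduled, scheduled):.4%}")
```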

Example Resolution Time Matrix

- Severity 1, Priority 1 and 2 SRs: < 1 hour
- Severity 1, Priority 3 SRs: < 13 hours
- Severity 2, Priority 1 SRs: < 14 hours
- Severity 2 SRs: < 132 hrs

Example Response Time Matrix

Status     | Sev1/P1 | Sev1/P2 | Sev2/P1 | Sev2    | Sev3/Sev4
New, XFR   | 15      | 30      | 15      | 30      | 60
ASG        | 15      | 60      | 15      | 30      | 60
IRR, 2CB   | 15      | 30      | 15      | 60      | 120
RVW, 1CB   | 15      | 60      | 15      | 60      | 120
PCR, RDV   | 60      | N/A     | 60      | 120     | 3 hrs
WIP        | 60      | 60      | 60      | 18 hrs  | 4 days
INT        | 60      | 60      | 120 min | 3 hrs   |
LMS, CUS   | 4       | 2 days  | 4 days  |         |
DEV        | 3 days  | 10 days |         |         |

BP: TEST, TEST, TEST

- Testing is a shared responsibility
- Functional, destructive, and stress testing
- Test environments must be representative of production
  - Both in terms of configuration and capacity
  - Separate from production
  - Building a test harness to mimic the production workload is a necessary, but non-trivial, effort
- Ideally, problems would never be encountered first in production
  - If they are, the first question should be: why didn't we catch the problem in test?
    - Exceeding some threshold?
    - A unique timing or race condition?
  - What can we do to catch this type of problem in the future?
  - Build a test case that can be reused as part of pre-production testing (see the sketch after this list)
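One possible shape for such a reusable test case is sketched below: it measures time-to-recover during a destructive test and asserts it against a failover SLO. The connect_fn callable, the probe query, and the 20-second target are placeholders, not part of the original material.

```python
import time

def measure_failover_seconds(connect_fn, probe_sql="select 1 from dual", timeout_s=300):
    """Probe the service repeatedly after a fault has been injected and return
    the elapsed seconds until the first successful round trip."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            conn = connect_fn()                # caller supplies a driver-specific connect callable
            conn.cursor().execute(probe_sql)   # any cheap statement proves the service is back
            conn.close()
            return time.monotonic() - start
        except Exception:
            time.sleep(1)                      # brief back-off before the next probe
    raise TimeoutError(f"service did not recover within {timeout_s}s")

def test_node_failure_meets_failover_slo(connect_fn):
    """Reusable destructive test: inject the fault out of band (e.g. power off one
    node), run this, and fail the test if the measured failover exceeds the SLO."""
    assert measure_failover_seconds(connect_fn) <= 20, "failover SLO (20s) not met"
```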

BP: Define, Document, and Adhere to Change Control Processes

- This amounts to self-discipline
- Applies to all changes at all levels of the tech stack
  - HW changes, configuration changes, patches and patchsets, upgrades, and even significant changes in workload
  - If no changes are introduced, the system will reach a steady state and function forever
- A well-designed system will be able to tolerate some fluctuations and faults
- A well-managed system will meet service levels
- If a problem (that was fixed) is encountered again elsewhere, it is a change management process problem, not a technology problem; i.e., rediscovery should not happen
  - Ensure fixes are applied across all nodes in a cluster, and to all environments to which the fix applies

BP: Plan for, and Execute, Knowledge Transfer

- New technology has a learning curve
- 10g, RAC, and ASM cross traditional job boundaries, so knowledge transfer must be executed across all affected groups
  - Architecture, development, and operations
  - Network admin, sysadmin, storage admin, DBA
- Learn how to identify and diagnose problems
  - e.g., evictions are not a problem, they are a symptom
- Learn how to use the various tools and interpret their output
  - Hanganalyze, system state dumps, truss, etc.
- Understand behaviour, and the distinction between cause and symptom
- Needs to occur pre-production
  - Operational readiness
Operational Readiness

BP: Monitor Your System

- Define key metrics and monitor them actively
- Establish a (performance) baseline
- Learn how to use Oracle-provided tools
  - RDA (+ RACDDT)
  - AWR/ADDM
  - Active Session History
  - OSWatcher
- Coordinate monitoring and collection of OS-level stats as well as DB-level stats (see the sketch after this list)
- Problems observed at one layer are often just symptoms of problems that exist at a different layer
- Don't jump to conclusions
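Where OSWatcher is not yet deployed, a minimal stand-in for timestamped OS-level collection might look like the sketch below, so samples can later be lined up against AWR/ASH snapshots. It uses the third-party psutil package, and the 30-second interval and CSV layout are arbitrary choices, not Oracle recommendations.

```python
import csv
import time
from datetime import datetime, timezone

import psutil  # third-party package: pip install psutil

def collect_os_stats(path="os_stats.csv", interval_s=30, samples=120):
    """Append timestamped CPU, memory, and disk I/O samples to a CSV file so they
    can later be correlated with database-level snapshots from the same window."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["utc_time", "cpu_pct", "mem_pct", "disk_read_bytes", "disk_write_bytes"])
        for _ in range(samples):
            io = psutil.disk_io_counters()
            writer.writerow([
                datetime.now(timezone.utc).isoformat(timespec="seconds"),
                psutil.cpu_percent(interval=None),
                psutil.virtual_memory().percent,
                io.read_bytes,
                io.write_bytes,
            ])
            f.flush()           # keep samples on disk even if the collector dies mid-run
            time.sleep(interval_s)

if __name__ == "__main__":
    collect_os_stats()
```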

BP: Define, Document, and Communicate Support Procedures

- Define corrective procedures for outages
- Routinely test corrective procedures
- HA process: prevent -> detect -> capture -> resume -> analyze -> fix
  - Classify high-priority systems, and the steps that need to be taken in each phase
- Keep an active log of every outage
- If we don't provide sufficient tools to get to root cause, then shame on us
- If you don't implement the diagnostic capabilities that are provided to help get to root cause, then shame on you
- Serious outages should never happen more than once

Summary

- Deficiencies in operational processes and procedures are the root cause of the vast majority of escalations
  - Address these, and you dramatically increase your chances of a successful RAC deployment and save yourself a lot of future pain
- Additional areas of challenge
  - Configuration Management: initial install and config, standardized gold-image deployment
  - Incident Management: diagnosing cluster-related problems
