
RAC & ASM Best Practices

You Probably Need More than just RAC

Kirk McGowan
Technical Director RAC Pack
Oracle Server Technologies
Cluster and Parallel Storage Development

Agenda

- Operational Best Practices (IT MGMT 101)
- Background
  - Requirements
  - Why RAC implementations fail
  - Case study
  - Criticality of IT Service Management (ITIL) process
- Best Practices
  - People, Process, AND Technology

Why do people buy RAC?

- Low-cost scalability
  - Cost reduction, consolidation, infrastructure that can grow with the business
- High availability
  - Growing expectations for uninterrupted service

Why do RAC Implementations Fail?

- RAC (scale-out clustering) is new technology
  - Insufficient budget and effort is put toward filling the knowledge gap
- HA is difficult to do, and cannot be done with technology alone
  - Operational processes and discipline are critical success factors, but are not addressed sufficiently

Case Study

Based on true stories. Any resemblance, in full or in part, to your own experiences is intentional and expected. Names have been changed to protect the innocent.

Case Study
Background

- 8-12 months spent implementing 2 systems: somewhat different architectures, very different workloads, identical tech stacks
- Oracle expertise (Development) engaged to help flatten the tech learning curve
- Non-mission-critical systems, but important elements of a larger enterprise re-architecture effort
- Many technology issues encountered across the stack, and resolved over the 8-12 month implementation
  - HW, OS, storage, network, RDBMS, cluster, and application

Case Study
Situation

- New mission-critical deployment using the same technology stack
  - Distinct architecture, application development teams, and operations teams
  - Large staff turnover
- Major escalation, post-production
  - CIO: "Oracle products do not meet our business requirements"
  - "RAC is unstable"
  - "DG doesn't handle the workload"
  - "JDBC connections don't fail over"

Case Study
Operational Issues

- Requirements (aka SLOs) were not defined
  - e.g., a claimed failover time of 20s, while the application logic allowed 80s for failover and cluster failure detection time alone was set to 120s (detection alone already exceeded the stated objective)
- Inadequate test environments
  - Problems were encountered first in production, including the fact that SLOs could not be met
- Inadequate change control
  - Lessons learned in previous deployments were not applied to the new deployment, leading to rediscovery of the same problems
  - Some changes were implemented in test but never rolled into production, leading to recurring problems (outages) in production
  - No process for confirming that a change actually fixes the problem prior to implementing it in production

Case Study
More Operational Issues

- Poor knowledge transfer between internal teams
  - Configuration recommendations, patches, and fixes identified in previous deployments were not communicated
  - Evictions are a symptom, not the problem
- Inadequate system monitoring
  - OS-level statistics (CPU, I/O, memory) were not being captured
  - Impossible to perform root cause analysis on many problems without the ability to correlate cluster/database symptoms with system-level activity
- Inadequate support procedures
  - Inconsistent data capture
  - No on-site vendor support consistent with the criticality of the system
  - No operations manual covering:
    - Managing and responding to outages
    - Responding to and restoring service after outages

Overview of Operational Process Requirements

What are ITIL Guidelines?
ITIL (the IT Infrastructure Library) is the most widely accepted approach to IT service management in the world. ITIL provides a comprehensive and consistent set of best practices for IT service management, promoting a quality approach to achieving business effectiveness and efficiency in the use of information systems.

IT Service Management

IT Service Management = Service Delivery + Service Support
- Service Delivery is partly concerned with setting up agreements and monitoring the targets within those agreements.
- Service Support processes can be viewed as delivering the services laid down in those agreements.

Provisioning of IT Service Mgmt

- In all organizations, IT service management must be matched to current and rapidly changing business demands. The objective is to continually improve the quality of service, aligned to the business requirements, cost-effectively. To meet this objective, three areas need to be considered:
  - People with the right skills, appropriate training, and the right service culture
  - Effective and efficient Service Management processes
  - Good IT infrastructure in terms of tools and technology
- Unless People, Processes, and Technology are considered and implemented appropriately within a steering framework, the objectives of Service Management will not be realized.

Service Delivery

- Financial Management
- Service Level Management
  - Severity/priority definitions (e.g., Sev1, Sev2, Sev3, Sev4)
  - Response time guidelines
  - SLAs
- Capacity Management
- IT Service Continuity Management
- Availability Management

Service Support

- Incident Management
  - Incident documentation and reporting, incident handling, escalation procedures
- Problem Management
  - RCAs, QA, and process improvement
- Configuration Management
  - Standard configs, gold images, CEMLIs
- Change Management
  - Risk assessment, backout, software maintenance, decommission
- Release Management
  - New deployments, upgrades, emergency releases, component releases

BP: Set & Manage Expectations

- Why is this important?
  - Expectations with RAC are different at the outset
  - HA is as much (if not more) about processes and procedures as it is about the technology
  - No matter what technology stack you implement, on its own it is incapable of meeting stringent SLAs
- Must communicate what the technology can AND can't do
- Must be clear on what else needs to be in place to supplement the technology if HA business requirements are going to be met
- HA isn't cheap!

BP: Clearly Define SLOs

- Sufficiently granular
  - You cannot architect, design, OR manage a system without clearly understanding the SLOs
  - "24x7" is NOT an SLO
- Define HA/recovery time objectives, throughput, response time, data loss, etc. (see the sketch after this list)
  - These need to be established with an understanding of the cost of downtime for the system
  - RTO and RPO are key availability metrics
  - Response time and throughput are key performance metrics
- Must address different failure conditions
  - Planned vs. unplanned
  - Localized vs. site-wide
- Must be linked to the business requirements
  - Response time and resolution time
- Must be realistic
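As one way to make this concrete, the sketch below captures granular SLOs per failure condition. The ServiceLevelObjective structure, its field names, and every number in it are hypothetical illustrations, not recommendations from these slides.

```python
from dataclasses import dataclass

@dataclass
class ServiceLevelObjective:
    """One SLO per failure scenario; every value here is a made-up example."""
    scenario: str            # failure condition the objective applies to
    rto_seconds: int         # recovery time objective
    rpo_seconds: int         # recovery point objective (tolerable data loss window)
    response_ms: int         # response time target under the scenario
    min_throughput_tps: int  # minimum acceptable throughput

# Distinct objectives for planned vs. unplanned and localized vs. site-wide events
slos = [
    ServiceLevelObjective("unplanned, localized (single node failure)", 120, 0, 500, 800),
    ServiceLevelObjective("unplanned, site-wide (failover to DR site)", 3600, 30, 800, 400),
    ServiceLevelObjective("planned, localized (rolling patch)", 0, 0, 800, 600),
]

for slo in slos:
    print(f"{slo.scenario}: RTO={slo.rto_seconds}s RPO={slo.rpo_seconds}s "
          f"response<={slo.response_ms}ms throughput>={slo.min_throughput_tps}tps")
```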

Manage to the SLOs

- Definitions of problem severity levels
- Documented targets for both incident response time and resolution time, based on severity
- Classification of applications w.r.t. business criticality
- Establish an SLA with the business
  - Negotiated response and resolution times
  - Definition of metrics, e.g. "Application Availability shall be measured using the following formula: (Total Minutes in a Calendar Month minus Unscheduled Outage Minutes minus Scheduled Outage Minutes in such month) divided by Total Minutes in a Calendar Month" (see the sketch after this list)
  - Negotiated SLOs
  - Effectively documents expectations between IT and the business
- Incident log: date, time, description, duration, resolution
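A minimal sketch of that availability calculation, driven from a hypothetical incident log; the field names, dates, and durations are invented for illustration.

```python
def monthly_availability(total_minutes, unscheduled_outage_minutes, scheduled_outage_minutes):
    """Availability per the SLA formula above, as a fraction of the month."""
    return (total_minutes - unscheduled_outage_minutes - scheduled_outage_minutes) / total_minutes

# Hypothetical incident log entries for a 30-day month (43,200 minutes)
incident_log = [
    {"date": "2007-03-04", "time": "02:10", "description": "node eviction", "duration_min": 22, "scheduled": False},
    {"date": "2007-03-17", "time": "23:00", "description": "rolling patch", "duration_min": 45, "scheduled": True},
]

unscheduled = sum(i["duration_min"] for i in incident_log if not i["scheduled"])
scheduled = sum(i["duration_min"] for i in incident_log if i["scheduled"])

print(f"Application availability: {monthly_availability(43_200, unscheduled, scheduled):.4%}")
```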

Example Resolution Time Matrix

- Severity 1, Priority 1 and 2 SRs: < 1 hour
- Severity 1, Priority 3 SRs: < 13 hours
- Severity 2, Priority 1 SRs: < 14 hours
- Severity 2 SRs: < 132 hrs

Example Response Time Matrix

Status     | Sev1/P1 | Sev1/P2 | Sev2/P1 | Sev2    | Sev3/Sev4
New, XFR   | 15      | 30      | 15      | 30      | 60
ASG        | 15      | 60      | 15      | 30      | 60
IRR, 2CB   | 15      | 30      | 15      | 60      | 120
RVW, 1CB   | 15      | 60      | 15      | 60      | 120
PCR, RDV   | 60      | N/A     | 60      | 120     | 3 hrs
WIP        | 60      | 60      | 60      | 18 hrs  | 4 days
INT        | 60      | 60      | 120 min | 3 hrs   |
LMS, CUS   | 4       | 2 days  | 4 days  |         |
DEV        | 3 days  | 10 days |         |         |

BP: TEST, TEST, TEST

- Testing is a shared responsibility
- Functional, destructive, and stress testing
- Test environments must be representative of production
  - Both in terms of configuration and capacity
  - Separate from production
  - Building a test harness to mimic the production workload is a necessary, but non-trivial, effort
- Ideally, problems would never be encountered first in production
  - If they are, the first question should be: why didn't we catch the problem in test?
    - Exceeding some threshold?
    - A unique timing or race condition?
  - What can we do to catch this type of problem in the future?
  - Build a test case that can be reused as part of pre-production testing (see the sketch after this list)
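One possible shape for such a reusable test case is sketched below: it measures time-to-recover during a destructive test and asserts it against a failover SLO. The connect_fn callable, the probe query, and the 20-second target are placeholders, not part of the original material.

```python
import time

def measure_failover_seconds(connect_fn, probe_sql="select 1 from dual", timeout_s=300):
    """Probe the service repeatedly after a fault has been injected and return
    the elapsed seconds until the first successful round trip."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            conn = connect_fn()                # caller supplies a driver-specific connect callable
            conn.cursor().execute(probe_sql)   # any cheap statement proves the service is back
            conn.close()
            return time.monotonic() - start
        except Exception:
            time.sleep(1)                      # brief back-off before the next probe
    raise TimeoutError(f"service did not recover within {timeout_s}s")

def test_node_failure_meets_failover_slo(connect_fn):
    """Reusable destructive test: inject the fault out of band (e.g. power off one
    node), run this, and fail the test if the measured failover exceeds the SLO."""
    assert measure_failover_seconds(connect_fn) <= 20, "failover SLO (20s) not met"
```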

BP: Define, Document, and Adhere to Change Control Processes

- This amounts to self-discipline
- Applies to all changes at all levels of the tech stack
  - HW changes, configuration changes, patches and patchsets, upgrades, and even significant changes in workload
  - If no changes are introduced, the system will reach a steady state and function forever
- A well-designed system will be able to tolerate some fluctuations and faults
- A well-managed system will meet service levels
- If a problem (that was fixed) is encountered again elsewhere, it is a change management process problem, not a technology problem; i.e., rediscovery should not happen
  - Ensure fixes are applied across all nodes in a cluster, and to all environments to which the fix applies

BP: Plan for, and Execute, Knowledge Transfer

- New technology has a learning curve
- 10g, RAC, and ASM cross traditional job boundaries, so knowledge transfer must be executed across all affected groups
  - Architecture, development, and operations
  - Network admin, sysadmin, storage admin, DBA
- Learn how to identify and diagnose problems
  - e.g., evictions are not a problem, they are a symptom
- Learn how to use the various tools and interpret their output
  - Hanganalyze, system state dumps, truss, etc.
- Understand behaviour, and the distinction between cause and symptom
- Needs to occur pre-production
  - Operational readiness
Operational Readiness

BP: Monitor Your System

- Define key metrics and monitor them actively
- Establish a (performance) baseline
- Learn how to use Oracle-provided tools
  - RDA (+ RACDDT)
  - AWR/ADDM
  - Active Session History
  - OSWatcher
- Coordinate monitoring and collection of OS-level stats as well as DB-level stats (see the sketch after this list)
- Problems observed at one layer are often just symptoms of problems that exist at a different layer
- Don't jump to conclusions
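Where OSWatcher is not yet deployed, a minimal stand-in for timestamped OS-level collection might look like the sketch below, so samples can later be lined up against AWR/ASH snapshots. It uses the third-party psutil package, and the 30-second interval and CSV layout are arbitrary choices, not Oracle recommendations.

```python
import csv
import time
from datetime import datetime, timezone

import psutil  # third-party package: pip install psutil

def collect_os_stats(path="os_stats.csv", interval_s=30, samples=120):
    """Append timestamped CPU, memory, and disk I/O samples to a CSV file so they
    can later be correlated with database-level snapshots from the same window."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["utc_time", "cpu_pct", "mem_pct", "disk_read_bytes", "disk_write_bytes"])
        for _ in range(samples):
            io = psutil.disk_io_counters()
            writer.writerow([
                datetime.now(timezone.utc).isoformat(timespec="seconds"),
                psutil.cpu_percent(interval=None),
                psutil.virtual_memory().percent,
                io.read_bytes,
                io.write_bytes,
            ])
            f.flush()           # keep samples on disk even if the collector dies mid-run
            time.sleep(interval_s)

if __name__ == "__main__":
    collect_os_stats()
```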

BP: Define, Document, and Communicate Support Procedures

- Define corrective procedures for outages
- Routinely test corrective procedures
- HA process: prevent -> detect -> capture -> resume -> analyze -> fix
  - Classify high-priority systems, and the steps that need to be taken in each phase
- Keep an active log of every outage
- If we don't provide sufficient tools to get to root cause, then shame on us
- If you don't implement the diagnostic capabilities that are provided to help get to root cause, then shame on you
- Serious outages should never happen more than once

Summary

- Deficiencies in operational processes and procedures are the root cause of the vast majority of escalations
  - Address these, and you dramatically increase your chances of a successful RAC deployment and save yourself a lot of future pain
- Additional areas of challenge
  - Configuration Management: initial install and config, standardized gold-image deployment
  - Incident Management: diagnosing cluster-related problems
