Kirk McGowan
Technical Director RAC Pack
Oracle Server Technologies
Cluster and Parallel Storage Development
Agenda
Requirements
Why RAC Implementations Fail
Case Study
Criticality of IT Service Management (ITIL) process
Best Practices
High Availability
Case Study
Based on true stories. Any resemblance, in
full or in part, to your own experiences is
intentional and expected.
Names have been changed to protect the
innocent.
Case Study: Background
Case Study: Situation
Case Study: Operational Issues
Case Study: More Operational Issues
Overview of Operational Process Requirements
What are ITIL Guidelines?
ITIL (the IT Infrastructure Library) is the most widely accepted
approach to IT service management in the world. ITIL
provides a comprehensive and consistent set of best
practices for IT service management, promoting a quality
approach to achieving business effectiveness and efficiency
in the use of information systems.
IT Service Management
IT Service Management = Service Delivery
+ Service Support
Service Delivery: partially concerned with
setting up agreements and monitoring the
targets within these agreements.
Service Support: processes can be viewed
as delivering services as laid down in
these agreements.
People with the right skills, appropriate training and the right
service culture
Effective and efficient Service Management processes
Good IT Infrastructure in terms of tools and technology.
Service Delivery
Financial Management
Service Level Management
Severity/priority definitions
e.g. Sev1, Sev2, Sev3, Sev4
Capacity Management
IT Service Continuity Management
Availability Management
Service Support
Incident Management
Problem Management
Configuration Management
Change Management
Release Management
HA isn't cheap!
Need to be established with an understanding of the cost of downtime for the system.
RTO and RPO are key availability metrics
Response time and throughput are key performance metrics
Planned vs unplanned
Localized vs site-wide
Must be realistic
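The relationship between an availability target, the downtime it allows, and the RTO/RPO metrics can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original slides: the names `allowed_downtime_hours`, `Outage`, and `meets_targets` are hypothetical, and the 1-hour RTO and 5-minute RPO are illustrative values only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Yearly downtime budget implied by an availability target (in percent)."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


@dataclass
class Outage:
    failed_at: datetime         # when the service was lost
    restored_at: datetime       # when the service was usable again
    last_recoverable: datetime  # newest point in time whose data survived


def meets_targets(outage: Outage, rto: timedelta, rpo: timedelta) -> Tuple[bool, bool]:
    """Return (RTO met, RPO met) for a single outage."""
    recovery_time = outage.restored_at - outage.failed_at
    data_loss_window = outage.failed_at - outage.last_recoverable
    return recovery_time <= rto, data_loss_window <= rpo


# Illustrative outage: 45 minutes to recover, 10 minutes of data lost.
example = Outage(
    failed_at=datetime(2024, 1, 1, 2, 0),
    restored_at=datetime(2024, 1, 1, 2, 45),
    last_recoverable=datetime(2024, 1, 1, 1, 50),
)
result = meets_targets(example, rto=timedelta(hours=1), rpo=timedelta(minutes=5))
print(result)  # (True, False): recovered in time, but lost more data than allowed
```

A 99.9% target, for instance, leaves roughly 8.76 hours of downtime per year, which is why the targets have to be weighed against what the business can actually afford.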
< 1 hour
< 13 Hours
< 14 hours
Severity 2 SRs
Status     Sev1/P1   Sev1/P2   Sev2/P1   Sev2      Sev3/Sev4
New,XFR    15        30        15        30        60
ASG        15        60        15        30        60
IRR,2CB    15        30        15        60        120
RVW,1CB    15        60        15        60        120
PCR,RDV    60        N/A       60        120       3 hrs
WIP        60        60        60        18 hrs    4 days

Rows with fewer entries in the source (column placement not recoverable):
INT        60, 60
LMS,CUS    4, 2 days, 4 days
DEV        3 days, 10 days
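A status-by-severity grid like the one above lends itself to a simple lookup, for example inside a monitoring or escalation script. The sketch below is only an illustration under stated assumptions: it assumes the values read left to right in header order, that "Sev3/" and "Sev4" form a single column, and it includes only the rows that have an entry for every column. Cell values are kept as the raw strings from the slide because the units of the bare numbers are not stated. The names `SEVERITIES`, `TARGETS`, and `lookup` are hypothetical.

```python
# Column order as it appears in the slide header (assumed: Sev3/Sev4 is one column).
SEVERITIES = ("Sev1/P1", "Sev1/P2", "Sev2/P1", "Sev2", "Sev3/Sev4")

# Only rows with a value for every column; cells are kept verbatim.
TARGETS = {
    "New,XFR": ("15", "30", "15", "30", "60"),
    "ASG":     ("15", "60", "15", "30", "60"),
    "IRR,2CB": ("15", "30", "15", "60", "120"),
    "RVW,1CB": ("15", "60", "15", "60", "120"),
    "PCR,RDV": ("60", "N/A", "60", "120", "3 hrs"),
    "WIP":     ("60", "60", "60", "18 hrs", "4 days"),
}


def lookup(status: str, severity: str) -> str:
    """Return the raw table cell for a status code and severity/priority."""
    if status not in TARGETS:
        raise KeyError(f"no complete row for status {status!r}")
    if severity not in SEVERITIES:
        raise KeyError(f"unknown severity column {severity!r}")
    return TARGETS[status][SEVERITIES.index(severity)]


print(lookup("WIP", "Sev2"))         # 18 hrs
print(lookup("PCR,RDV", "Sev1/P2"))  # N/A
```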
If they are, the first question should be: Why didn't we catch the problem in test?
Exceeding some threshold
Unique timing or race condition
What can we do so we catch this type of problem in the future?
Build a test case that can be reused as part of pre-production testing.
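As one illustration of a reusable test case for the "exceeding some threshold" category, the sketch below drives a component up to its limit and then one step past it, and checks that the failure is a controlled one. Everything here is hypothetical: `SessionPool` is a stand-in for whatever component failed in production, and the limit of 100 is an arbitrary value. The point is only that the reproduction is captured as code that can be rerun before each release.

```python
class SessionLimitExceeded(Exception):
    """Raised when a new session would exceed the configured limit."""


class SessionPool:
    """Hypothetical stand-in for the component that failed in production."""

    def __init__(self, max_sessions: int) -> None:
        self.max_sessions = max_sessions
        self.active = 0

    def open(self) -> None:
        if self.active >= self.max_sessions:
            raise SessionLimitExceeded(
                f"limit of {self.max_sessions} sessions reached"
            )
        self.active += 1


def threshold_regression_test(max_sessions: int = 100) -> bool:
    """Reproduce the production condition: fill the pool, then go one past it.

    Returns True when the overflow is rejected cleanly and the pool state
    is left consistent.
    """
    pool = SessionPool(max_sessions)
    for _ in range(max_sessions):
        pool.open()  # every session up to the limit must succeed
    try:
        pool.open()  # one past the threshold
    except SessionLimitExceeded:
        return pool.active == max_sessions
    return False  # the overflow was silently accepted


if __name__ == "__main__":
    print("PASS" if threshold_regression_test() else "FAIL")
```

Timing or race-condition problems generally need a different kind of harness (repeated runs under concurrency), so this sketch covers only the threshold case.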
Operational Readiness
RDA (+ RACDDT)
AWR/ADDM
Active Session History
OSWatcher
HA process:
Summary
Deficiencies in operational processes and procedures
are the root cause of the vast majority of escalations.