
A Closer Look inside Oracle ASM

UKOUG Conference 2007

Luca Canali, CERN IT

CERN - IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it

Outline
Oracle ASM for DBAs
- Introduction and motivations
- ASM is not a black box
  - Investigation of ASM internals
  - Focus on practical methods and troubleshooting
- ASM and VLDB
  - Metadata, rebalancing and performance
- Lessons learned from CERN's production DB services


Inside Oracle ASM, UKOUG Dec 2007 - 2

ASM
Oracle Automatic Storage Management
- Provides the functionality of a volume manager and filesystem for Oracle (DB) files
- Works with RAC
- Oracle 10g feature aimed at simplifying storage management
  - Together with Oracle Managed Files and the Flash Recovery Area
- An implementation of the S.A.M.E. (Stripe And Mirror Everything) methodology
  - Goal: increase performance and reduce cost


ASM for a Clustered Architecture
- Oracle architecture of redundant low-cost components:
  - Servers
  - SAN
  - Storage


ASM Disk Groups
- Example: HW = 4 disk arrays with 8 disks each
  - An ASM diskgroup is created using all available disks
  - The end result is similar to a file system on RAID 1+0
- ASM allows mirroring across storage arrays
- Oracle RDBMS processes directly access the storage (raw disk access)
[Diagram: an ASM diskgroup striped within each failgroup (Failgroup1, Failgroup2) and mirrored across them]

Files, Extents, and Failure Groups


[Diagrams: files and their extent pointers; failgroups and ASM mirroring]

ASM Is not a Black Box
- ASM is implemented as an Oracle instance
- Familiar operations for the DBA:
  - Configured with SQL commands
  - Info in V$ views
  - Logs in udump and bdump
- Some secret details hidden in X$ tables and underscore parameters


Selected V$ Views and X$ Tables


View Name         X$ Table            Description
----------------  ------------------  --------------------------------------------
V$ASM_DISKGROUP   X$KFGRP             performs disk discovery and lists diskgroups
V$ASM_DISK        X$KFDSK, X$KFKID    performs disk discovery, lists disks and
                                      their usage metrics
V$ASM_FILE        X$KFFIL             lists ASM files, including metadata
V$ASM_ALIAS       X$KFALS             lists ASM aliases, files and directories
V$ASM_TEMPLATE    X$KFTMTA            ASM templates and their properties
V$ASM_CLIENT      X$KFNCL             lists DB instances connected to ASM
V$ASM_OPERATION   X$KFGMG             lists current rebalancing operations
N.A.              X$KFKLIB            available libraries, includes asmlib
N.A.              X$KFDPARTNER        lists disk-to-partner relationships
N.A.              X$KFFXP             extent map table for all ASM files
N.A.              X$KFDAT             allocation table for all ASM disks

ASM Parameters
Notable ASM instance parameters:

*.asm_diskgroups='TEST1_DATADG1','TEST1_RECODG1'
*.asm_diskstring='/dev/mpath/itstor*p*'
*.asm_power_limit=5
*.shared_pool_size=70M
*.db_cache_size=50M
*.large_pool_size=50M
*.processes=100


More ASM Parameters
- Underscore parameters
  - Several undocumented parameters
  - Typically don't need tuning
  - Exception: _asm_ausize and _asm_stripesize may need tuning for VLDB in 10g
- New in 11g: diskgroup attributes
  - V$ASM_ATTRIBUTE, most notable:
    - disk_repair_time
    - au_size
  - X$KFENV shows underscore attributes


ASM Storage Internals
- ASM disks are divided into Allocation Units (AU)
  - Default size 1 MB (_asm_ausize)
  - Tunable diskgroup attribute in 11g
- ASM files are built as a series of extents
  - Extents are mapped to AUs using a file extent map
  - With normal redundancy, 2 mirrored extents are allocated, each on a different failgroup
  - RDBMS read operations access only the primary extent of a mirrored pair (unless there is an IO error)
  - In 10g the ASM extent size = AU size
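The extent placement described above can be pictured with a toy model. This is only an illustrative sketch (the disk names and the plain round-robin policy are simplifications of mine, not Oracle's actual allocator):

```python
# Illustrative model only: extents are laid out round-robin across the
# disks of each failgroup; with normal redundancy every extent gets one
# copy per failgroup. In 10g extent size = AU size = 1 MB.
AU_SIZE = 1024 * 1024

def place_extents(n_extents, failgroup1, failgroup2):
    """Return (primary_disk, mirror_disk) for each extent."""
    layout = []
    for i in range(n_extents):
        primary = failgroup1[i % len(failgroup1)]  # read by the RDBMS
        mirror = failgroup2[i % len(failgroup2)]   # used only on IO error
        layout.append((primary, mirror))
    return layout

print(place_extents(4, ["d1", "d2"], ["d3", "d4"]))
```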


ASM Metadata Walkthrough
Three examples follow of how to read data directly from ASM.
Motivations:
- Build confidence in the technology, i.e. get a feeling of how ASM works
- It may turn out useful one day to troubleshoot a production issue


Example 1: Direct File Access 1/2
Goal: reading ASM files with OS tools, using metadata information from X$ tables

1. Example: find the 2 mirrored extents of the RDBMS spfile
2. Query the extent map:

sys@+ASM1> select GROUP_KFFXP Group#, DISK_KFFXP Disk#,
           AU_KFFXP AU#, XNUM_KFFXP Extent#
           from X$KFFXP
           where number_kffxp=(select file_number from v$asm_alias
                               where name='spfiletest1.ora');

    GROUP#      DISK#        AU#    EXTENT#
---------- ---------- ---------- ----------
         1         16      17528          0
         1          4      14838          0


Example 1: Direct File Access 2/2
3. Find the disk path:

sys@+ASM1> select disk_number, path from v$asm_disk
           where group_number=1 and disk_number in (16,4);

DISK_NUMBER PATH
----------- ------------------------------
          4 /dev/mpath/itstor417_1p1
         16 /dev/mpath/itstor419_6p1

4. Read data from disk using dd:

dd if=/dev/mpath/itstor419_6p1 bs=1024k count=1 skip=17528 | strings
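The same read can be sketched in Python, which makes the seek arithmetic explicit. A minimal sketch (the device path is the one found in step 3; it needs read permission on the device):

```python
AU_SIZE = 1024 * 1024  # default ASM allocation unit size

def read_au(path, au_number, au_size=AU_SIZE):
    """Read one allocation unit: seek to au_number * au_size and return
    au_size raw bytes (the equivalent of dd's bs/skip/count)."""
    with open(path, "rb") as f:
        f.seek(au_number * au_size)
        return f.read(au_size)

# Usage, equivalent to the dd command above:
# data = read_au("/dev/mpath/itstor419_6p1", 17528)
```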


X$KFFXP

Column Name      Description
---------------  ------------------------------------------------------------
NUMBER_KFFXP     ASM file number. Join with v$asm_file and v$asm_alias
COMPOUND_KFFXP   File identifier. Join with compound_index in v$asm_file
INCARN_KFFXP     File incarnation id. Join with incarnation in v$asm_file
XNUM_KFFXP       ASM file extent number (mirrored extent pairs have the
                 same extent value)
PXN_KFFXP        Progressive file extent number
GROUP_KFFXP      ASM disk group number. Join with v$asm_disk and
                 v$asm_diskgroup
DISK_KFFXP       ASM disk number. Join with v$asm_disk
AU_KFFXP         Relative position of the allocation unit from the
                 beginning of the disk
LXN_KFFXP        0 -> primary extent, 1 -> mirror extent, 2 -> 2nd mirror
                 copy (high redundancy and metadata)


Example 2: A Different Way
A different metadata table to reach the same goal of reading ASM files directly from the OS:

sys@+ASM1> select GROUP_KFDAT Group#, NUMBER_KFDAT Disk#,
           AUNUM_KFDAT AU#
           from X$KFDAT
           where fnum_kfdat=(select file_number from v$asm_alias
                             where name='spfiletest1.ora');

    GROUP#      DISK#        AU#
---------- ---------- ----------
         1          4      14838
         1         16      17528


X$KFDAT

Column Name (subset)  Description
--------------------  ------------------------------------------------------
GROUP_KFDAT           Diskgroup number, join with v$asm_diskgroup
NUMBER_KFDAT          Disk number, join with v$asm_disk
COMPOUND_KFDAT        Disk compound_index, join with v$asm_disk
AUNUM_KFDAT           Disk allocation unit (relative position from the
                      beginning of the disk), join with x$kffxp.au_kffxp
V_KFDAT               Flag: V = this allocation unit is used; F = AU is free
FNUM_KFDAT            File number, join with v$asm_file
XNUM_KFDAT            Progressive file extent number, join with
                      x$kffxp.pxn_kffxp


Example 3: Yet Another Way
Using the internal package dbms_diskgroup:

declare
  fileType varchar2(50);
  fileName varchar2(50);
  fileSz   number;
  blkSz    number;
  hdl      number;
  plkSz    number;
  data_buf raw(4096);
begin
  fileName := '+TEST1_DATADG1/TEST1/spfiletest1.ora';
  dbms_diskgroup.getfileattr(fileName, fileType, fileSz, blkSz);
  dbms_diskgroup.open(fileName, 'r', fileType, blkSz, hdl, plkSz, fileSz);
  dbms_diskgroup.read(hdl, 1, blkSz, data_buf);
  dbms_output.put_line(data_buf);
end;
/


DBMS_DISKGROUP
- Can be used to read/write ASM files directly
- It's an Oracle internal package
- Does not require an RDBMS instance
- 11g's asmcmd cp command uses dbms_diskgroup

Procedure Name              Parameters
--------------------------  --------------------------------------------------
dbms_diskgroup.open         (:fileName, :openMode, :fileType, :blkSz, :hdl,
                            :plkSz, :fileSz)
dbms_diskgroup.read         (:hdl, :offset, :blkSz, :data_buf)
dbms_diskgroup.createfile   (:fileName, :fileType, :blkSz, :fileSz, :hdl,
                            :plkSz, :fileGenName)
dbms_diskgroup.close        (:hdl)
dbms_diskgroup.commitfile   (:handle)
dbms_diskgroup.resizefile   (:handle, :fsz)
dbms_diskgroup.remap        (:gnum, :fnum, :virt_extent_num)
dbms_diskgroup.getfileattr  (:fileName, :fileType, :fileSz, :blkSz)


File Transfer Between OS and ASM
- The supported tools (10g):
  - RMAN
  - DBMS_FILE_TRANSFER
  - FTP (XDB)
  - WebDAV (XDB)
  - They all require an RDBMS instance
- In 11g, all the above plus asmcmd's cp command
  - Works directly with the ASM instance


Strace and ASM 1/3
Goal: understand strace output when using ASM storage

Example:

read64(15, "#33\0@\"..., 8192, 473128960) = 8192

This is a read operation of 8 KB from file descriptor 15 at offset 473128960.
What is the segment name, type, file# and block#?


Strace and ASM 2/3
1. From /proc/<pid>/fd I find that FD=15 is /dev/mpath/itstor420_1p1
2. This is disk 20 of diskgroup 1 (from v$asm_disk)
3. From x$kffxp I find the ASM file# and extent#:

Note: offset 473128960 = 451 MB + 27 * 8 KB

sys@+ASM1> select number_kffxp, xnum_kffxp from x$kffxp
           where group_kffxp=1 and disk_kffxp=20 and au_kffxp=451;

NUMBER_KFFXP XNUM_KFFXP
------------ ----------
         268         17
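The offset arithmetic in the note can be checked mechanically (1 MB AUs and 8 KB DB blocks, as in this setup):

```python
AU_SIZE = 1024 * 1024  # ASM allocation unit (10g default)
BLOCK_SIZE = 8192      # DB block size used in the example

def locate(offset):
    """Split a byte offset seen in strace into (AU number on the disk,
    block number within that AU)."""
    return offset // AU_SIZE, (offset % AU_SIZE) // BLOCK_SIZE

print(locate(473128960))  # (451, 27): AU 451, block 27 within the AU
```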


Strace and ASM 3/3
4. From v$asm_alias I find the file alias for file 268: USERS.268.612033477
5. From the v$datafile view I find the RDBMS file#: 9
6. From dba_extents I finally find the owner and segment name relative to the original IO operation:

sys@TEST1> select owner, segment_name, segment_type
           from dba_extents
           where file_id=9 and
           27 + 17*1024*1024/8192 between block_id and block_id+blocks;

OWNER SEGMENT_NAME SEGMENT_TYPE
----- ------------ ------------
SCOTT EMP          TABLE


Investigation of Fine Striping

[Diagram: two rows of 1 MB extents, A0..A7 and B0..B7, AU = 1 MB]

An application: finding the layout of fine-striped files
- Explored using strace of an oracle session executing alter system dump logfile ..
- Result: round-robin distribution over 8 x 1 MB extents
- Fine striping size = 128 KB (1 MB / 8)
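The observed layout can be expressed as a mapping from file offset to position on disk. A sketch under the values measured here (128 KB stripes over a set of 8 extents; the function name is mine):

```python
STRIPE = 128 * 1024  # fine stripe size observed with strace
STRIPE_WIDTH = 8     # extents in one stripe set (A0..A7)

def fine_stripe_location(offset):
    """Map a file byte offset to (extent index within the stripe set,
    byte offset within that extent)."""
    stripe_no = offset // STRIPE
    extent = stripe_no % STRIPE_WIDTH   # round-robin over the set
    slot = stripe_no // STRIPE_WIDTH    # which 128 KB slot in the extent
    return extent, slot * STRIPE + offset % STRIPE

print(fine_stripe_location(128 * 1024))  # (1, 0): second stripe -> extent 1
```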


Metadata Files
- ASM diskgroups contain hidden files
  - Not listed in V$ASM_FILE (file# < 256)
  - Details are available in X$KFFIL
  - In addition the first 2 AUs of each disk are marked as file#=0 in X$KFDAT
- Example (10g):

    GROUP#      FILE# FILESIZE_AFTER_MIRR RAW_FILE_SIZE
---------- ---------- ------------------- -------------
         1          1             2097152       6291456
         1          2             1048576       3145728
         1          3           264241152     795869184
         1          4             1392640       6291456
         1          5             1048576       3145728
         1          6             1048576       3145728


ASM Metadata 1/2
- File#0, AU=0: disk header (disk name, etc.), Allocation Table (AT) and Free Space Table (FST)
- File#0, AU=1: Partner Status Table (PST)
- File#1: File Directory (files and their extent pointers)
- File#2: Disk Directory
- File#3: Active Change Directory (ACD)
  - The ACD is analogous to a redo log, where changes to the metadata are logged
  - Size = 42 MB * number of instances

Source: Oracle Automatic Storage Management, Oracle Press, Nov 2007, N. Vengurlekar, M. Vallath, R. Long


ASM Metadata 2/2
- File#4: Continuing Operation Directory (COD)
  - The COD is analogous to an undo tablespace. It maintains the state of active ASM operations such as disk or datafile drop/add. The COD log record is either committed or rolled back based on the success of the operation.
- File#5: Template Directory
- File#6: Alias Directory
- 11g, File#9: Attribute Directory
- 11g, File#12: Staleness Registry, created when needed to track offline disks


ASM Rebalancing
- Rebalancing is performed (and mandatory) after space management operations
  - Goal: balanced space allocation across disks
  - Not based on performance or utilization
  - ASM spreads every file across all disks in a diskgroup
- ASM instances are in charge of rebalancing
  - Extent pointer changes are communicated to the RDBMS
  - The RDBMS ASMB process keeps an open connection to ASM
  - This can be observed by running strace against ASMB
- In RAC, extra messages are passed between the cluster's ASM instances
  - LMD0 of the ASM instances is very active during rebalance


ASM Rebalancing and VLDB
- Performance of rebalancing is important for VLDB
  - An ASM instance can use parallel slaves
  - RBAL coordinates the rebalancing operations
  - ARBx processes pick up chunks of work; by default they log their activity in udump
- Does it scale?
  - In 10g serialization wait events can limit scalability
  - Even at maximum speed rebalancing is not always I/O bound


ASM Rebalancing Performance
- Tracing ASM rebalancing operations:
  - 10046 trace of the ARBx processes
  - oradebug setospid <pid>
  - oradebug event 10046 trace name context forever, level 12
  - Process log files (in bdump) with orasrp (tkprof will not work)
- Main wait events from my tests with RAC (6 nodes):
  - DFS lock handle: waiting for CI level 5 (cross instance lock)
  - buffer busy wait
  - unaccounted for
  - enq: AD - allocate AU
  - enq: AD - deallocate AU
  - log write(even)
  - log write(odd)


ASM Single Instance Rebalancing
- Single instance rebalance
  - Faster in RAC if you can rebalance with only 1 node up (I have observed 20% to 100% speed improvement)
- Buffer busy wait can be the main event
  - It seems to depend on the number of files in the diskgroup
  - Diskgroups with a small number of (large) files have more contention (ARBx processes operate concurrently on the same file)
  - Only seen in tests with 10g
- 11g has improvements regarding rebalancing contention


Rebalancing, an Example

[Chart omitted. Data: D. Wojcik, CERN IT]

Rebalancing Workload
- When ASM mirroring is used (e.g. with normal redundancy), rebalancing operations can move more data than expected
- Example:
  - 5 TB (allocated): ~100 disks, 200 GB each
  - A disk is replaced (diskgroup rebalance)
  - The total IO workload is 1.6 TB (8x the disk size!)
  - How to see this: query v$asm_operation; the column EST_WORK keeps growing during rebalance
- The issue: excessive repartnering
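The amplification in the example is easy to verify with the numbers straight from the slide:

```python
# Replacing one 200 GB disk drove ~1.6 TB of rebalance I/O: mirrored
# extents on the failed disk's partners get relocated too, so the work
# is a multiple of the disk size (the "excessive repartnering" issue).
disk_size_gb = 200
total_io_tb = 1.6
amplification = total_io_tb * 1024 / disk_size_gb
print(amplification)  # 8.192 -> roughly the "8x" quoted above
```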


ASM Disk Partners
- ASM diskgroup with normal redundancy: two copies of each extent are written to different failgroups
- Two ASM disks are partners when they have at least one extent set in common (they are the 2 sides of a mirror for some data)
- Each ASM disk has a limited number of partners
  - Typically 10 disk partners: see X$KFDPARTNER
  - Helps to reduce the risk associated with 2 simultaneous disk failures


Free and Usable Space
- When ASM mirroring is used, not all the free space should be occupied
- V$ASM_DISKGROUP.USABLE_FILE_MB: amount of free space that can be safely utilized taking mirroring into account, while still being able to restore redundancy after a disk failure
  - It's calculated for the worst-case scenario; in any case it is a best practice not to let it go negative (it can)
  - This can be a problem when deploying a small number of large LUNs and/or failgroups
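The idea behind USABLE_FILE_MB can be sketched as: reserve enough free space to re-mirror after losing the worst-case failure unit, then halve because normal redundancy writes every MB twice. This is a simplified model of mine (Oracle's exact formula varies by version, so the reserve is taken as an input rather than derived):

```python
def usable_file_mb(free_mb, required_mirror_free_mb):
    """Free space safely usable for new files in a normal-redundancy
    diskgroup: subtract the redundancy-restore reserve, halve for
    mirroring. Can go negative with few, large failgroups."""
    return (free_mb - required_mirror_free_mb) / 2

# With a small number of large failgroups the reserve dominates and
# the usable space can even go negative, as the slide warns:
print(usable_file_mb(free_mb=500, required_mirror_free_mb=800))  # -150.0
```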


Fast Mirror Resync
- ASM 10g with normal redundancy does not allow offlining part of the storage
  - A transient error in a storage array can cause several hours of rebalancing to drop and add disks
  - It is a limiting factor for scheduled maintenance
- 11g has a new feature, fast mirror resync
  - Redundant storage can be put offline for maintenance
  - Changes are accumulated in the staleness registry (file#12)
  - Changes are applied when the storage is back online


Read Performance, Random I/O
- IOPS measured with SQL (synthetic test): 8675 IOPS, ~130 IOPS per disk
- Destroking: only the external part of the disks is used


Read Performance, Sequential I/O
- Limited by the HBAs -> 4 x 2 Gb
- (measured with parallel query)


Implementation Details
- Multipathing: Linux Device Mapper (2.6 kernel)
- Block devices: RHEL4 and 10gR2 allow skipping the raw device mapping
  - External half of the disk used for data disk groups
- JBOD config: no HW RAID; ASM used to mirror across disk arrays
- HW:
  - Storage arrays (Infortrend): FC controller, SATA disks
  - FC (Qlogic): 4Gb switch and HBAs (2Gb in older HW)
  - Servers are 2x CPUs, 4GB RAM, 10.2.0.3 on RHEL4, RAC of 4 to 8 nodes


Conclusions
- CERN deploys RAC and ASM on Linux on commodity HW
  - 2.5 years of production, 110 Oracle 10g RAC nodes and 300 TB of raw disk space (Dec 2007)
- ASM metadata is the most critical part, especially during rebalancing
  - Knowledge of some ASM internals helps troubleshooting
- ASM on VLDB
  - Know and work around the pitfalls in 10g
  - 11g has important manageability and performance improvements


Q&A

Links:
http://cern.ch/phydb
http://twiki.cern.ch/twiki/bin/view/PSSGroup/ASM_Internals
http://www.cern.ch/canali

