Measurement and Calibration of System I/O

An Oracle White Paper
May 2007


EXECUTIVE OVERVIEW
This White Paper describes the use of Oracle's I/O load generation and
measurement tool, ORION. ORION provides a reasonable estimation of the I/O
capabilities of a particular hardware configuration, independent of any database
load. The focus of this work is to provide verification of I/O capability and to
assist in future capacity planning exercises.
INTRODUCTION
Over the last decade, relational databases have increased in size by many orders of
magnitude. This has meant that a smaller percentage of the database resides within
the database buffer cache at runtime. The impact of this, combined with ever
increasing workloads and queries that require access by table scans, means an
increased I/O workload. This increase in workload means that capacity planning
for storage should no longer be performed by disk space requirements, but is a
function of the required I/O operations (IOPS) and the bandwidth of the entire
I/O subsystem.
As a general rule, OLTP systems tend to be IOPS constrained and DSS systems
tend to be bandwidth constrained. The purpose of the tests executed as part of
the I/O calibration process is to determine effective real world I/O metrics for a
proposed hardware configuration. Please note that real world metrics for I/O often
result in I/O components working at less than their theoretical maximums.
Quantifying this shortfall is part of the calibration exercise.
To achieve effective metrics, most of the testing will be done on the read
component of the tests, because reads form the majority of database I/O and have
the most impact on the user experience (e.g. query elapsed times).
For OLTP it is common to see IOPS running into the 10,000s per second; however,
100,000s per second is rare. In these types of systems the usual determining factor
in achieving acceptable IOPS is the number of disks. For these workloads, the seek
time of the disk plays a very important part in the response time as well as overall
throughput. The I/O sizes in OLTP workloads are usually small and represent a
single database block, e.g. 8k.
On DSS systems the I/Os are random in nature but are large, often about 1MB in
size. For this reason, the bandwidth of the entire I/O stack (HBAs, Switches, Disk
Arrays and Disks) becomes the gating factor. In many cases, DSS workloads are
thought of as sequential workloads. This is a misunderstanding because
concurrent users and parallel queries result in large but random I/Os. In extreme
cases single disks may only produce 50% of their theoretical read rates (quoted for
large sequential reads).
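The contrast between the two workload types can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the Orion tooling: the function name is hypothetical, 1 MB is assumed to mean 2**20 bytes, and the figure of 2,000 large I/Os per second is an illustrative assumption.

```python
def throughput_mb_per_sec(iops: float, io_size_bytes: int) -> float:
    """Bandwidth implied by an I/O rate at a fixed I/O size (1 MB = 2**20 bytes)."""
    return iops * io_size_bytes / 2**20


# OLTP-style load: 10,000 small I/Os per second at 8 KB each.
oltp_mb = throughput_mb_per_sec(10_000, 8192)

# DSS-style load: 2,000 large I/Os per second at 1 MB each (assumed rate).
dss_mb = throughput_mb_per_sec(2_000, 1_048_576)

print(f"OLTP: {oltp_mb:.1f} MB/sec from 10,000 IOPS")
print(f"DSS:  {dss_mb:.1f} MB/sec from 2,000 IOPS")
```

Even a high small-block I/O rate moves little data, which is why OLTP sizing is about IOPS, while a modest rate of large I/Os can saturate the bandwidth of the I/O stack.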
DSS system I/O performance is usually measured in Gigabytes/sec, so please be
aware that many I/O components are in fact network components and are often
specified in Gigabits. Confusing Gigabytes and Gigabits is a common mistake
made in the capacity planning process and reinforces the need for good design
calculations when engineering a system. The horror stories of I/O sizing being
out by a factor of 8 cannot be told often enough!
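The factor of 8 is simply the number of bits in a byte. A minimal sketch of the conversion follows; the helper names are hypothetical, and the calculation uses raw line rates only, ignoring encoding and protocol overhead, so real links deliver somewhat less.

```python
import math

BITS_PER_BYTE = 8


def gbits_to_gbytes(gigabits_per_sec: float) -> float:
    """Convert a raw link rate from Gigabits/sec to Gigabytes/sec."""
    return gigabits_per_sec / BITS_PER_BYTE


def links_needed(target_gbytes_per_sec: float, link_gbits_per_sec: float) -> int:
    """Smallest number of links whose combined raw rate meets the target."""
    return math.ceil(target_gbytes_per_sec / gbits_to_gbytes(link_gbits_per_sec))


# Raw capacity of a single 4 Gbit/sec link, in Gigabytes/sec.
print(gbits_to_gbytes(4))

# Number of 4 Gbit/sec links required for a 2 Gigabytes/sec target.
print(links_needed(2, 4))
```

Treating a 4 Gbit/sec port as if it carried 4 Gigabytes/sec would under-provision the link count by exactly this factor.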
Because high bandwidth I/O subsystems are expensive and will dominate the
budget of a DSS system, we see two approaches to sizing.
The first approach is a pragmatic engineering approach, which is to
define the quantity of I/O required to keep a CPU or core busy. An estimate for
this value would be about 100 Megabytes/second/core. Please be aware that this
number is hugely workload specific and is intended as a rule of thumb. Also, as CPUs
get faster this number will increase proportionately.
A second and equally valid approach would be to configure a target amount of
I/O based upon industry trends. The current industry trend for the I/O required
for an Entry Level Enterprise Data Warehouse is 2 Gigabytes/second.
Configurations supporting 4 to 8 Gigabytes/second are not uncommon for more
demanding requirements.
If your goal is to build a truly scalable I/O subsystem for DSS workloads you
should have the ability to configure by CPU/core or by more pragmatic methods.
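The two sizing approaches can be compared with a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, the 100 Megabytes/second/core figure is the rule of thumb quoted above, and 1 Gigabyte is taken as 1000 Megabytes.

```python
import math

MB_PER_SEC_PER_CORE = 100  # rule of thumb from the text; hugely workload specific
MB_PER_GB = 1000           # assumption: decimal units


def bandwidth_for_cores(cores: int, per_core: float = MB_PER_SEC_PER_CORE) -> float:
    """Approach 1: I/O bandwidth (MB/sec) needed to keep the given cores busy."""
    return cores * per_core


def cores_fed_by(target_gb_per_sec: float, per_core: float = MB_PER_SEC_PER_CORE) -> int:
    """Approach 2 seen from the CPU side: cores a target bandwidth can keep busy."""
    return math.floor(target_gb_per_sec * MB_PER_GB / per_core)


# A 16-core configuration sized by the per-core rule of thumb.
print(bandwidth_for_cores(16), "MB/sec")

# Cores that an entry-level 2 Gigabytes/second target could keep busy.
print(cores_fed_by(2), "cores")
```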
MEASUREMENT AND CALIBRATION METHOD
In order to calibrate your system, Oracle provides a series of scripts built upon
Oracle's existing general-purpose utility that allows I/O testing without performing
a full database install. The goal of this testing is to provide quick and valid data
points for the hardware sizing and configuration process.
The process can be defined in the following steps:
1. Design and assemble hardware for I/O calibration. All hardware and
software components (non-Oracle) needed should be listed and
inventoried for the testing. In tests where scale up/out is being
demonstrated, all ratios of the components should be documented, e.g.
Host:#CPU:#Core:#HBA:#Switch:#Arrays:#Disks
2. Repeatedly run the I/O scripts to get I/O bandwidth and IOPS figures for
various size configurations (various configurations may be defined as
additional host machines or additional hardware within a single host).
3. Collect results and return to Oracle for validation and interpretation.
4. If all results look good (for a sample, see the Interpreting the Results section), the
configuration will be certified to be of a known I/O capability. If the
results show inconsistency or bottlenecks, it may require further work to
configure the I/O subsystem correctly. This may require debugging or the
addition of more hardware. Please note: to make the test effective, use all
of the available resources and try to achieve 8 Gigabytes/second for the
largest configuration. This will allow for a high level of confidence when
planning large Data Warehouses.
5. Oracle will produce a certification report (for a sample report, please refer to
Appendix C) that can be used for capacity planning and sales purposes.
HARDWARE INSTRUCTIONS
Collect the following information and provide a pictorial view of the configured
system from end-to-end
1. System vendor name and host model
2. CPU - Type/clock speed/# of cores/L1 and L2 cache sizes
3. RAM - total memory
4. HBA - vendor (e.g. Emulex, Brocade)/# of HBAs/ports per
HBA/Bandwidth of HBAs/Type (e.g. iSCSI, Fibre Channel)
5. Switches - vendor (e.g. Brocade, Emulex)/# of switches/ports per
switch/Bandwidth
6. Storage Array - Vendor Model number/# ports/amount of cache/write
or read cache
7. Disks - Vendor (e.g. Seagate, Hitachi), type (SATA, SAS, SCSI), capacity,
RPM, total number of disks used
8. Hardware level RAID and configuration - if used
9. Other important details like - HBA driver version, Storage array driver
version etc
SOFTWARE INSTRUCTIONS

Orion Tool
Orion (Oracle I/O Number) is a tool which very closely simulates the Oracle I/O
pattern for OLTP, DSS or mixed workloads without having to install the Oracle
database software. Orion is used to measure achievable IOPS or MBPS of a
configured system - it can be a large SMP or small multi-node RAC system. This
tool is available for most platforms and can be downloaded from Oracle
Technology Network (OTN) using the following link
http://www.oracle.com/technology/software/tech/orion/index.html
Orion Wrapper
Orion takes many command line arguments to cover an extensive range of
engineering data points (at times, this can become complex and long running) and
is not RAC aware (meaning it doesn't run automatically across RAC nodes).
To make Orion more end-user friendly, the Orion Wrapper (orion.pl) was written
and it can be downloaded from
http://realworld.us.oracle.com/twiki/bin/view/PerfWeb/MeasurementCalibrationSystemIO

The Orion Wrapper uses:
• A simple coníiguration íile to deri·e all the Orion command line options
• ssh and scp to run Orion across manv nodes
E.g. orion.pl -t dss -f dss_params.txt -d 600 -n dss_2node_test
Simulates large random IO reads across 2 nodes
Where
-t dss (simulate DSS type IO pattern)
-f dss_params.txt (parameters like node name, disk device
path etc is defined here)
-d 600 (test duration is 600 seconds)
-n dss_2node_test (name of the run and also the
subdirectory where results are stored)
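When many runs are scripted, it can help to assemble the wrapper invocation programmatically. The sketch below is a hypothetical helper, not part of orion.pl; it uses only the four options shown above (-t, -f, -d, -n) and accepts only the two test types that appear in the example invocations in this paper.

```python
import shlex


def wrapper_command(test_type, params_file, duration_sec, run_name, script="./orion.pl"):
    """Build the argument list for one Orion Wrapper run."""
    # Only the test types shown in this paper's example invocations are accepted.
    if test_type not in ("dss", "oltp"):
        raise ValueError(f"unsupported test type: {test_type!r}")
    return [script, "-t", test_type, "-f", params_file,
            "-d", str(duration_sec), "-n", run_name]


cmd = wrapper_command("dss", "dss_params.txt", 600, "dss_2node_test")

# Print the command as it would be typed in a shell.
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps each argument separate, so it can be handed to a process launcher without extra quoting.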

Parameter files (dss_params.txt or oltp_params.txt) use keywords commonly used in
the Oracle user community.
Sample dss_params.txt and oltp_params.txt files are provided in the following section
(a sample mixed_params.txt is provided in Appendix A).


INPUT FILE FORMAT

dss_params.txt
# Begin of dss_params.txt
# DSS workload parameter file
# Keywords are case in-sensitive, values are case sensitive

# Disk device or LUN path=number of spindles (one line per device)
/dev/raw/raw1=5
/dev/raw/raw2=5
/dev/raw/raw10=10
/dev/raw/raw11=12

# Default large random IO size, should be specified in bytes
dss_io_size=1048576

num_nodes=1
node_names=rac_n1
# If more than one node, then use comma separated node names
# node_names=rac_n1, rac_n2

# Degree Of Parallelism (dop) = #of cores * 2 * #of concurrent queries
dop_per_node=32
# If more than one node, then use comma separated dop_per_node
# In the following example, node 1 dop=32, node 2 dop=32
# dop_per_node=32,32

# Location of orion executable (`which orion' output should go here)
orion_location=/home/oracle/orion/orion

# End of dss_params.txt
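For sanity-checking a parameter file before a long run, the format above is simple enough to read with a few lines of code. The following is a minimal sketch, assuming the format is exactly as shown (comment lines start with #, device lines are path=spindles, everything else is keyword=value). It is a hypothetical reader, not how orion.pl itself parses the file.

```python
SAMPLE = """\
# Disk device or LUN path=number of spindles (one line per device)
/dev/raw/raw1=5
/dev/raw/raw2=5
/dev/raw/raw10=10
/dev/raw/raw11=12
dss_io_size=1048576
num_nodes=1
node_names=rac_n1
dop_per_node=32
orion_location=/home/oracle/orion/orion
"""


def parse_params(text: str) -> dict:
    """Split a parameter file into device spindle counts and keyword settings."""
    devices, settings = {}, {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key.startswith("/"):
            devices[key] = int(value)       # device path = number of spindles
        else:
            settings[key.lower()] = value   # keywords are case insensitive
    return {"devices": devices, "settings": settings}


params = parse_params(SAMPLE)
print("devices:", len(params["devices"]))
print("total spindles:", sum(params["devices"].values()))
print("dop_per_node:", params["settings"]["dop_per_node"])
```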


oltp_params.txt

# Begin of oltp_params.txt
# OLTP workload parameter file
# Keywords are case in-sensitive, values are case sensitive

# Disk device or LUN path=number of spindles (one line per device)
/dev/raw/raw1=5
/dev/raw/raw2=5
/dev/raw/raw10=10
/dev/raw/raw11=12

# Default small random IO size, should be specified in bytes
oltp_io_size=8192

num_nodes=1
node_names=rac_n1
# If more than one node, then use comma separated node names
# node_names=rac_n1, rac_n2

# users_per_node = number of cores or 2 x number of cores
# And aim for IOPS with latency less than 6 or 7 ms
users_per_node=8
# If more than one node, then use comma separated users_per_node
# In the following example, node 1 users=8, node 2 users=8
# users_per_node=8,8

# Location of orion executable (`which orion' output should go here)
orion_location=/home/oracle/orion/orion

# End of oltp_params.txt

HOW TO RUN THE TESTS

Setup
Login as oracle on any node (or any other user, but not as root). Assuming the oracle
home directory is /home/oracle:
mkdir -p orion; cd orion
Download and extract both the Orion executable and the Orion wrapper to the
/home/oracle/orion directory. Subdirectories created under /home/oracle/orion
will be results, params, doc. Files in the orion directory will be the orion executable
(rename it as orion) and orion.pl (edit and replace the perl path, the very first line, as per
your system perl bin location).

There is NO NEED to download or install the Orion executable and Orion Wrapper script
on other nodes. But you need to configure passwordless ssh (please refer to Appendix
B for details).

Orion Wrapper invokes Orion to get only one data point and it is dependent on
dop_per_node (for DSS I/O simulation) or users_per_node (for OLTP I/O
simulation). Good starting values for these parameters are
dop_per_node = number of cores * 2 * 5 (Why 5? Because we have mostly seen 5
concurrent DSS queries)
users_per_node = # of cores (or 2 * number of cores)
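The starting values above reduce to two small formulas. A minimal sketch follows; the function names are hypothetical and the default of 5 concurrent queries is the figure quoted in the text.

```python
def starting_dop_per_node(cores: int, concurrent_queries: int = 5) -> int:
    """dop_per_node = number of cores * 2 * number of concurrent queries."""
    return cores * 2 * concurrent_queries


def starting_users_per_node(cores: int, double: bool = False) -> int:
    """users_per_node = number of cores, or 2 * number of cores if double is set."""
    return cores * 2 if double else cores


# Example for a node with 4 cores.
print("dop_per_node   =", starting_dop_per_node(4))
print("users_per_node =", starting_users_per_node(4),
      "or", starting_users_per_node(4, double=True))
```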

Running DSS test
Edit params/dss_params.txt and modify all values as appropriate to your system.
Replace the device path to reflect your system device or LUN path. Invoke Orion
Wrapper
./orion.pl -t dss -f params/dss_params.txt -d 600 -n dss_test_run1
All temporary scripts and results are stored in the node_name:/tmp directory while the
test is running. After the test completes, scripts and results from all nodes are saved
in the /home/oracle/orion/results/dss_test_run1 directory of the node that invoked
orion.pl.

Running OLTP test
Edit params/oltp_params.txt and modify all values as appropriate to your system.
Replace the device path to reflect your system device or LUN path. Invoke Orion
Wrapper
./orion.pl -t oltp -f params/oltp_params.txt -d 600 -n oltp_test_run1
Results from all nodes are saved in the /home/oracle/orion/results/oltp_test_run1
directory of the node that invoked orion.pl.

OUTPUT RESULTS FILE FORMAT

The test on each node produces the following files
orion_[test_type]_[node_name]_iops
orion_[test_type]_[node_name]_lat.csv
orion_[test_type]_[node_name]_summary.txt
orion_[test_type]_[node_name]_iostat.txt
orion_[test_type]_[node_name]_mbps.csv
orion_[test_type]_[node_name]_trace.txt
orion_[test_type]_[node_name].lun
orion_[test_type]_[node_name].sh
orion_[test_type]_[node_name]_iostat.sh

The most interesting information is in the
orion_[test_type]_[node_name]_summary.txt file. This file contains
• Input parameters used
• Throughput (MBPS) observed for the large random I/O
• I/O rate (IOPS) observed for the small random I/O
• Latency observed for the small random I/O
INTERPRETING THE RESULTS
After the test completes on all nodes, Orion Wrapper displays MBPS or IOPS and
latency on the screen for each node. Results are also logged in the oltp_summary.txt or
dss_summary.txt file.
Sample results for DSS
Results from node rac_n1
Maximum Large MBPS=600.89 @ Small=0 and Large=48

Results from node rac_n2
Maximum Large MBPS=590.12 @ Small=0 and Large=48

Sample results for OLTP
Results from node rac_n1
Maximum Small IOPS=1156 @ Small=8 and Large=0
Minimum Small Latency=6.59 @ Small=8 and Large=0

Results from node rac_n2
Maximum Small IOPS=1245 @ Small=8 and Large=0
Minimum Small Latency=7.01 @ Small=8 and Large=0
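Result lines in the form shown above are regular enough to extract with a pattern match, which is convenient when comparing many runs. This is a sketch that assumes the lines look exactly like the samples; the pattern and function name are hypothetical and not part of the Orion Wrapper.

```python
import re

# Matches lines such as "Maximum Large MBPS=600.89 @ Small=0 and Large=48".
RESULT_LINE = re.compile(
    r"^(?:Maximum|Minimum) (?:Large|Small) (?P<metric>\w+)=(?P<value>[\d.]+)"
    r" @ Small=(?P<small>\d+) and Large=(?P<large>\d+)$"
)


def parse_result_line(line: str):
    """Return the metric name, its value and the load point, or None if no match."""
    m = RESULT_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "metric": m["metric"],
        "value": float(m["value"]),
        "small": int(m["small"]),
        "large": int(m["large"]),
    }


for sample in (
    "Maximum Large MBPS=600.89 @ Small=0 and Large=48",
    "Maximum Small IOPS=1156 @ Small=8 and Large=0",
    "Minimum Small Latency=6.59 @ Small=8 and Large=0",
    "Results from node rac_n1",
):
    print(parse_result_line(sample))
```

Lines that are not measurements, such as the "Results from node" headers, return None and can be skipped by the caller.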

Good Case results (no problem at a node level and scales well when more nodes are added)
For example, consider a 4-node system, with each node configured to provide 500
MBPS
As a first step, run Orion on each node separately or stand-alone (How to do this - edit
dss_params.txt and specify num_nodes=1 and node_names=rac_n1, run test, then for the 2nd
node, replace node_names=rac_n2, run test, repeat this for all nodes one at a time)
Node 1 Orion MBPS = 490
Node 2 Orion MBPS = 485
Node 3 Orion MBPS = 495
Node 4 Orion MBPS = 480
Good result
Now run Orion on all 4 nodes concurrently (How to do this - edit dss_params.txt and specify
num_nodes=4 and node_names=rac_n1, rac_n2, rac_n3, rac_n4, then run test, test is run concurrently on all
4 nodes)
Node 1 Orion MBPS = 480
Node 2 Orion MBPS = 470
Node 3 Orion MBPS = 485
Node 4 Orion MBPS = 475
Good result (a drop of <5% is acceptable)
Problem case results
Case 1 (doesn't scale well when more nodes are added)
As a first step, run Orion on each node separately or stand-alone.
Node 1 Orion MBPS = 490
Node 2 Orion MBPS = 485
Node 3 Orion MBPS = 495
Node 4 Orion MBPS = 480
Good result

Now run Orion for all 4 nodes concurrently
Node 1 Orion MBPS = 390
Node 2 Orion MBPS = 340
Node 3 Orion MBPS = 380
Node 4 Orion MBPS = 410
Not good (drop of ~100 MBPS), we may be limited at Switch or Disk Array
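The comparison between stand-alone and concurrent runs can be expressed as a percentage drop per node and checked against the 5% guideline from the good case. A minimal sketch using the sample figures above; the function names are hypothetical.

```python
def percent_drop(standalone: float, concurrent: float) -> float:
    """Throughput lost when running concurrently, as a percentage of stand-alone."""
    return (standalone - concurrent) / standalone * 100.0


def scaling_report(standalone, concurrent, max_drop_pct: float = 5.0):
    """Return (node number, drop in %, within limit?) for each node."""
    report = []
    for node, (alone, together) in enumerate(zip(standalone, concurrent), start=1):
        drop = percent_drop(alone, together)
        report.append((node, round(drop, 1), drop < max_drop_pct))
    return report


standalone = [490, 485, 495, 480]   # each node run on its own
good_case = [480, 470, 485, 475]    # concurrent run, good case
case_1 = [390, 340, 380, 410]       # concurrent run, problem Case 1

for label, concurrent in (("Good case", good_case), ("Case 1", case_1)):
    print(label)
    for node, drop, ok in scaling_report(standalone, concurrent):
        print(f"  Node {node}: drop {drop}% -> {'OK' if ok else 'investigate'}")
```

Using a percentage rather than an absolute MBPS difference keeps the check meaningful when nodes are configured for different throughput targets.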

Case 2 (problem at a given node)
As a first step, run Orion on each node separately or stand-alone
Node 1 Orion MBPS = 490
Node 2 Orion MBPS = 485
Node 3 Orion MBPS =
Node 4 Orion MBPS = 480
Not good (drop of ~100 MBPS on node 3), we may be limited at Node 3
HBA/Switch/Disk Array/Number of Disks
APPENDIX A

mixed_params.txt
# Begin of mixed_params.txt
# MIXED workload parameter file
# Keywords are case in-sensitive, values are case sensitive

# Disk device or LUN path = number of spindles (one line per device)
/dev/raw/raw1=5
/dev/raw/raw2=5
/dev/raw/raw10=10
/dev/raw/raw11=12

# Specify both dss_io_size and oltp_io_size
dss_io_size=1048576
oltp_io_size=8192

num_nodes=1
node_names=rac_n1

# Specify both dop_per_node and users_per_node
dop_per_node=32
users_per_node=8

orion_location=/home/oracle/orion/orion

# End of mixed_params.txt

APPENDIX B

Generating Authorized keys on each node
# Login as oracle or any other user (not as root)
# On node1 (e.g. node name: rac_n1)
1. mkdir ~/.ssh
2. chmod 700 ~/.ssh
3. touch ~/.ssh/authorized_keys
4. chmod 600 ~/.ssh/authorized_keys
5. cd ~/.ssh
6. /usr/bin/ssh-keygen -t rsa
# Accept default location and enter password (e.g. welcome1)
7. /usr/bin/ssh-keygen -t dsa
# Accept default location and enter password (e.g. welcome1)
8. ssh rac_n1 cat ~/.ssh/id_rsa.pub >>authorized_keys
9. ssh rac_n1 cat ~/.ssh/id_dsa.pub >>authorized_keys

# On node2 (e.g. node name rac_n2)
1. Repeat steps 1 to 7 from above
2. ssh rac_n2 cat ~/.ssh/id_rsa.pub >>authorized_keys
3. ssh rac_n2 cat ~/.ssh/id_dsa.pub >>authorized_keys

# Similarly generate the authorized_keys on each node

Pulling authorized_keys file from each node
# On node1 (rac_n1)
1. cd ~/.ssh
2. scp rac_n2:~/.ssh/authorized_keys authorized_keys.node2
# Repeat step 2 for other nodes as well
3. cat authorized_keys.node2 >>authorized_keys
# Repeat step 3 for other nodes as well

Pushing back the concatenated authorized_keys file to all nodes
# On node1 (rac_n1)
1. cd ~/.ssh
2. scp authorized_keys rac_n2:~/.ssh/.
# Repeat step 2 for other nodes as well

Running SSH Agent to avoid password prompt
# On node1
1. exec /usr/bin/ssh-agent $SHELL
2. /usr/bin/ssh-add
# Now verify things work without password prompt
3. ssh rac_n2 touch /tmp/xx
4. ssh rac_n2 ls -l /tmp/xx


APPENDIX C

System Configuration Details
Vendor Name/Host Model

System Components | System Total | Per Node (or block)
Nodes | 4 | n.a
Processors (e.g. Dual core Intel Xeon CPU - 2.66 GHz etc) | 8 | 2
Cores | 16 | 4
Memory | 32 GB | 8 GB
Host Bus Adapters (e.g. Brocade, Emulex GBits/sec 1 port FC) | 8 | 2
Storage Array Switch (e.g. Brocade DS4100 - 4 GBits/sec FC switch with 32 ports) | 2 | n.a
Storage Subsystem (e.g. vendor name and other details) | 4 | 1
Storage Subsystem Disk Drives (e.g. Seagate, Hitachi, 15k rpm FC drives. Size 146GB total raw capacity, 133GB usable) | 120 | 30
Hardware level RAID-5 (4+1), total LUNs presented to server | 24 | 6
Total storage space | 9.2 TB | 2.3 TB
Operating System (e.g. RHEL 4.0 U4, HP-UX 11.31 etc) | 4 | 1
Others (e.g. HBA driver version, load balancing software details, Orion tool version) | 4 | 1
Date completed:

Results | 1 Node Total | 2 Node Total | 4 Node Total
Throughput (MB/sec) | 550 | 1000 (500+500) | 2000 (500+500+500+500)
IOPS/Avg. Latency (msec) | 1200/6.00 | 2400/6.00 (1200+1200) (6.00+6.00) | 4800/6.00 (1200+1200+1200+1200) (6.00+6.00+6.00+6.00)

Pictorial representation of the System Configuration

Measurement and Calibration of System I/O
May 2007
Author: Real World Performance Group
Contributing Authors: Andrew Holdsworth, Vinayagam Djegaradjane

Oracle Corporation
World Headquarters
500 Oracle Parkway
Redwood Shores, CA 94065
U.S.A.

Worldwide Inquiries:
Phone: +1.650.506.7000
Fax: +1.650.506.7200
oracle.com

Copyright © 2007, Oracle. All rights reserved.
This document is provided for information purposes only and the
contents hereof are subject to change without notice.
This document is not warranted to be error-free, nor subject to any
other warranties or conditions, whether expressed orally or implied
in law, including implied warranties and conditions of merchantability
or fitness for a particular purpose. We specifically disclaim any
liability with respect to this document and no contractual obligations
are formed either directly or indirectly by this document. This document
may not be reproduced or transmitted in any form or by any means,
electronic or mechanical, for any purpose, without our prior written permission.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.