
Analyzing Oracle’s impact using simple userland tools – Storage subsystem (specific to Datawarehousing)
Krishna Manoharan
krishmanoh@gmail.com

1
Introduction
Every application impacts the host operating system and connected sub-systems in a unique way.
In order to profile an application and understand its impact on the environment, there are a number of userland tools provided within the OS.
Many of these tools do not require super-user privileges, enabling ordinary users such as DBAs or application developers to see and gauge the impact of the application on the system.

2
Subsystems in an environment

One needs to analyze the impact of an application on all the major subsystems in an environment:
• CPU
• Memory
• Storage
• Network

3
Profiling an application
To profile an application, one needs to know:
What to observe (Metrics)
How to observe (Tools to gather these metrics)
And finally, how to interpret the results (correlate, compare and draw conclusions)
4
Storage Subsystem
The storage subsystem normally consists of the following layers, from the application down to the physical disks:
• Application – Oracle.
• Host.
• Filesystem – Created for applications; can have different block sizes.
• Volume – Created from luns; can be of different layouts – stripe, concat, mirror etc.
• HBA (host bus adapter).
• Luns through cache – Cache is the memory (staging area) on the array. Luns are carved from Raid Groups on the array and are also of different layouts – mirror, concat, stripe etc.
• Array controller – Array management.
• Disks – Actual disks on the array.

5
Storage Subsystem – Contd.
Disks refer to the actual hard drives which we are all familiar
with. Disks are of different capacities – 72GB, 146GB, 300GB,
different kinds – FC, SATA, SAS and finally different speeds –
7200 RPM, 10K RPM, 15K RPM.
Cache refers to the memory on the array to which all writes are
staged. Cache also contains pre-fetch data. The controller is
the intelligence behind the Array.
Raid groups are created using the disks on the array. Luns are
carved out of the Raid Groups and assigned to the host. Luns
can be of any size, as can volumes and filesystems. Raid
groups can be of different layouts – mirror, stripe, stripe-mirror,
Raid 5 and Luns inherit the same layout. Luns are normally
multipathed (with 2 or more paths).
Volumes are created from luns – Volumes can be created as
concat, mirror, stripe-mirror, mirror-stripe, Raid5 etc.
Filesystems have different block sizes – For vxfs, the block
sizes are 1K to 8K.

6
Storage Subsystem – Metrics.
An IO request starts with the application issuing an IO system call (read, write).
Based on the current activity of the system, the request may be processed immediately or routed to a queue of requests (similar to a run queue on a CPU – wait column in iostat).
It waits in the queue until it can be dispatched (wait time – wsvc_t column in iostat).
It then executes on the disk, taking time to complete (response time – asvc_t column in iostat).
Corresponding to the above activities, there is also the size of the IO operation (derived from iostat counters, as shown below), the bandwidth (kr/s + kw/s columns in iostat) and the number of IO operations (r/s + w/s in iostat).
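For example, the average IO size over an interval can be derived from the iostat counters as:
average IO size (KB) = (kr/s + kw/s) / (r/s + w/s)
As an illustration with made-up numbers, a lun showing 100 r/s and 12800 kr/s (with negligible writes) is averaging 128 KB per read.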

7
Storage Subsystem – Metrics (Contd.)

The common metrics used to describe storage performance are:
Wait – Average number of transactions waiting to be serviced (similar to the run queue on a CPU).
Wait time – Average time spent in the wait queue.
Service time – The time in milliseconds a lun spends servicing a request.
IOPS – Number of IO operations per second.
IO sizes – The average size of an IO operation in KB or MB.
Throughput – The average bandwidth available in MB/sec.

8
Storage Subsystem – Tools
The following tools are used to capture storage statistics. Both run-time and historical data are essential.
Run Time data
iostat – Gives statistics at a lun level (service time, IOPS, IO
sizes, throughput)
vxstat – Gives statistics at a volume level (assuming we are
using Veritas Volume Manager). Statistics available are again
service time, IOPS and IO sizes.
vxdmpadm – Gives statistics at a lun level (service time, IOPS
and IO sizes)
odmstat – Gives statistics for oracle datafiles (if using Veritas
ODM).
swat – Sun Storedge Workload Analysis Tool – Gives statistics
at a lun level (service time, IOPS, Throughput, IO sizes).
Oracle v$views – Historical and run time (Not at a lun level)

9
Storage Subsystem – Tools (Contd.)

Historical data capture tools
swat – Sun Storedge Workload Analysis Tool – Gives statistics at a lun level (service time, IOPS, throughput, IO sizes).
sar – sar also captures disk statistics.
Oracle v$views – Historical and run time (Not at a lun level)

10
Storage Subsystem – Tools (Contd.)
Of the tools listed, only iostat, odmstat and swat can be run by a non-privileged user. Oracle v$views can be viewed by anyone with the appropriate Oracle privileges.
Normally the luns assigned to a host are small and numerous, so using tools such as iostat can be cumbersome.
The most user-friendly tool is swat, which is ideal for collecting and analyzing data over the long term. It collects and graphs the data for easy analysis.
Using iostat with the extended options (iostat -xnM <interval>) will give the most useful information. The columns to look for are wait, asvc_t, r/s, w/s, Mr/s and Mw/s.

11
Storage Subsystem – Data Collection
mkrishna@viveka:> iostat -xnM 1
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.2 0 0 c1t10d0
0.6 1.6 0.0 0.0 0.0 0.0 0.0 19.2 0 1 c0t8d0
0.0 0.1 0.0 0.0 0.0 0.0 0.0 11.2 0 0 c8t20d53

r/s (number of reads/second), w/s (number of writes/second), Mr/s (MB read/second) and Mw/s (MB written/second) are indicators of the workload.
wait – This is the run queue. It shows the pending operations waiting to be serviced. Normally it should be 0.
wsvc_t – The wait time for the above statistic. Should be 0.
asvc_t – The average response time to a lun. It can vary anywhere from 1 ms to 10 ms. The average response time for a volume in a datawarehouse system should be 20-40 ms. During heavy loads, volume response times can vary between 20-100 ms.
12
Storage Subsystem – Data Collection (Contd.)
From an Oracle perspective, the views which contain IO data are
v$filestat – Specific to Oracle datafiles
v$sysstat – At the instance level
v$segstat – At the segment level
v$tempstat – Temporary tablespace file stats
dba_hist_sysmetric_summary – Data from the AWR snapshots
It is easy to write a SQL query which groups by mount point or filename and reports (see the sketch below)
Number of IOPS
Response time
Throughput
However, Oracle does not report statistics at a lun level. These need to come from the OS.
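A minimal sketch of such a query, assuming the leading directory of the datafile path identifies the mount point (the SUBSTR/INSTR expression is illustrative and may need adjusting for your layout); the counters are cumulative, so take before/after snapshots to turn them into rates:

select substr(d.name, 1, instr(d.name, '/', -1) - 1) as mount_point,
       sum(f.phyrds + f.phywrts)                      as io_operations,
       round(avg(f.avgiotim) * 10, 1)                 as avg_io_ms,          -- hundredths of a second to ms
       sum(f.phyblkrd + f.phyblkwrt)                  as blocks_transferred
from   v$filestat f, v$datafile d
where  f.file# = d.file#
group by substr(d.name, 1, instr(d.name, '/', -1) - 1);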

13
Storage Subsystem – Data Collection (Contd.)
v$filestat – Specific to Oracle datafiles only (no redo or temp files).
It appears to be event driven and so should be accurate.
Since it is cumulative, one needs to take a snapshot of v$filestat before and after a load.
The relevant columns are
PHYRDS + PHYWRTS = IO operations against the file
AVGIOTIM – Average response time for the file, in hundredths of a second. Multiply by 10 to report in ms. These timings can vary between 1 to 30 ms depending on the size of the file and the kind of activity. I would assume that 25 ms is about the maximum you should ever see.
MAXIORTM – Maximum time spent on a single read, in hundredths of a second. Multiply by 10 to report in ms. This shows the slowest read ever on the file. Anything greater than 30 ms would raise alarms.
MAXIOWTM – Maximum time spent on a single write, in hundredths of a second. Multiply by 10 to report in ms. This shows the slowest write ever on the file. Anything greater than 30 ms would raise alarms.
It is important to look at MAXIORTM and MAXIOWTM as these show the poorest performance for the datafile.
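A minimal sketch of the before/after snapshot approach (the working table name filestat_before is illustrative):

create table filestat_before as
  select file#, phyrds, phywrts from v$filestat;

-- run the load, then compare

select f.file#,
       f.phyrds  - b.phyrds   as delta_reads,
       f.phywrts - b.phywrts  as delta_writes
from   v$filestat f, filestat_before b
where  f.file# = b.file#;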

14
Storage Subsystem – Data Collection (Contd.)
v$sysstat – Reports cumulative statistics at the instance level. Again, in order to understand the impact of a load, a snapshot of v$sysstat needs to be taken before and after the load.
It appears to be event driven and so should be accurate.
physical read total IO requests + physical write total IO requests = IO operations (divide the delta by the elapsed time to get IOPS)
physical read total bytes + physical write total bytes = bytes transferred (divide the delta by the elapsed time to get throughput)
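A minimal sketch of pulling these counters (statistic names can vary slightly between Oracle versions):

select name, value
from   v$sysstat
where  name in ('physical read total IO requests',
                'physical write total IO requests',
                'physical read total bytes',
                'physical write total bytes');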

15
Storage Subsystem – Data Collection (Contd.)
v$tempstat – Cumulative temporary tablespace file stats. Again, in order to understand the impact of a load, a snapshot of v$tempstat needs to be taken before and after the load.
It appears to be event driven and so should be accurate.
The relevant columns are
PHYRDS + PHYWRTS = IO operations against the file
AVGIOTIM – Average response time for the file, in hundredths of a second. Multiply by 10 to report in ms. These timings can vary between 1 to 30 ms depending on the size of the file and the kind of activity. I would assume that 25 ms is about the maximum you should ever see.
MAXIORTM – Maximum time spent on a single read, in hundredths of a second. Multiply by 10 to report in ms. This shows the slowest read ever on the file. Anything greater than 30 ms would raise alarms.
MAXIOWTM – Maximum time spent on a single write, in hundredths of a second. Multiply by 10 to report in ms. This shows the slowest write ever on the file. Anything greater than 30 ms would raise alarms.
It is important to look at MAXIORTM and MAXIOWTM as these show the poorest performance for the tempfile.
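The same kind of per-file query works for temp files, joining to v$tempfile for the name (a sketch, with the same hundredths-of-a-second assumption as above):

select t.name,
       s.phyrds + s.phywrts  as io_operations,
       s.avgiotim * 10       as avg_io_ms,
       s.maxiortm * 10       as max_read_ms,
       s.maxiowtm * 10       as max_write_ms
from   v$tempstat s, v$tempfile t
where  s.file# = t.file#;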

16
Storage Subsystem – Data Collection (Contd.)
dba_hist_sysmetric_summary – Data from the AWR
snapshots. The accuracy of the data in this table is
debatable. I have noticed discrepancies in the data
reported in this table.
The data is reported by snapshot number.
Assume it is the average during the entire
snapshot interval.
Physical Read Total IO Requests Per Sec +
Physical Write Total IO Requests Per Sec =
IOPS
Physical Read Total Bytes Per Sec + Physical
Write Total Bytes Per Sec = Throughput
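A minimal sketch of pulling these metrics by snapshot (the AVERAGE column holds the per-second average for the interval):

select snap_id, metric_name, round(average, 1) as avg_per_sec
from   dba_hist_sysmetric_summary
where  metric_name in ('Physical Read Total IO Requests Per Sec',
                       'Physical Write Total IO Requests Per Sec',
                       'Physical Read Total Bytes Per Sec',
                       'Physical Write Total Bytes Per Sec')
order by snap_id, metric_name;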

17
Storage Subsystem – Data Collection (Contd.)
odmstat gives file-level IO details if Veritas ODM is enabled.
[oracle@viveka] $ /opt/VRTS/bin/odmstat *dbf
OPERATIONS FILE BLOCKS AVG TIME(ms)
FILE NAME READ WRITE READ WRITE READ WRITE
APD_01.dbf 36 3 1152 96 2.2 0.0
ARD_10.dbf 31 9 1056 320 5.8 1.1

Operations refers to the number of IO operations.
File Blocks refers to the amount of data transferred, reported in sectors (1 sector = 512 bytes).
Avg Time refers to the service time.
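For example, from the APD_01.dbf row above: 1152 blocks over 36 reads is 32 sectors per read, i.e. 32 x 512 bytes = 16 KB average read size.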

18
Storage Subsystem – Data Correlation
Data (IOPS, service time etc.) needs to be collected from both the OS and database perspectives and correlated.
Data needs to be analyzed over a period of time to profile the nature of the workload.

19
Profile of a typical Datawarehouse system - Storage
It is very difficult to generalize access patterns and the storage profile for a Datawarehouse system. However, the below is probably a good guideline.
For a moderately used enterprise Datawarehouse at a large company not in the retail industry, and configured properly:
The IO profile would show a smaller number of IOPS, but large IO sizes.
Typical rates would be in the range of 5K-10K IOPS during heavy usage. During normal hours, one can see around 2K-3K IOPS.
For a block size of 16K and db_file_multiblock_read_count set to 64, you can expect to see IO sizes from 16K up to 1MB (64 x 16K = 1MB). Luns used for redo logs will show significantly larger write sizes during peak DML activity.
A large number of direct reads/writes and multiblock reads/writes.
A high number of parallel operations and heavy PGA activity.
Average throughput would be in the range of 450-600 MB/sec.
Heavy temporary tablespace usage.
Heavy redo log activity during periods of DML. It is important to note that redo log group members are generally small – 512M to 1.5GB – and so typically the Storage/Unix administrator will re-use luns when creating redo log filesystems. This is not good practice. You will probably notice that the redo log luns see the biggest IO sizes for write operations.

20
Storage from an Oracle perspective (Datawarehouse)
For datawarehousing, the critical components are
Storage
Memory
CPU
Storage plays a critical role in performance.
Do not skimp on storage. Plan ahead for 2 years and lay out the file systems appropriately. It is normal to have 3-4x overhead for initial sizing.
Appropriate sizing and configuration are very important.
Most operations process huge amounts of data, resulting in considerable IO.
Datawarehousing depends more on throughput and less on response time.

21
Datawarehousing - Array
Cheaper modular arrays (such as the HDS AMS1000) work better for datawarehousing-type loads than high-end arrays.
Go for the fastest drives (15K RPM). Use 72GB drives instead of 146GB drives.
Since Oracle does read-ahead into the buffer cache, disable or minimize read cache on the array. Try to assign the maximum possible cache for staging writes.
Avoid striping on the array (Raid 5, Raid 10). Array-based striping does not offer big stripe widths (1M or greater); most are limited to 384K (HDS). Stripe width refers to the width of the stripe on a single disk.
Go for disk mirroring (Raid 1), preferably 1D+1P as the Raid Group configuration.
Use the entire Raid Group as a single lun.
Share the luns across as many controllers as possible.

22
Datawarehousing - System
There is a lot of sorting/merging of IO requests happening at every layer (volume manager, HBA, array controller) to minimize head movement. To make the best use of it, ensure that 32M is set as the maximum size of a single IO request that can be passed down from the driver to the HBA.
Configure volumes as stripes with big stripe widths (1M or greater). Striping can be used for all file systems (redo, archive, data, temporary).
Use an even number of luns when creating volumes (2, 4, 6 or 8).
If using 146GB drives, then the usable space is only about 90-100GB. Do not exceed this.
Configure multipathing such that all the active paths to a lun are written to at the same time.
If using vxfs, set the block size to 8K (the maximum).

23
Datawarehousing - Oracle
Basics
Veritas ODM is a must (async + direct IO).
Make sure that all async IO patches specific to the platform are applied.
Do not re-use luns. That is, once a lun is used for a filesystem, it is not to be used for any other requirement or other filesystems. There will be considerable wastage; however, it is well worth it.
Do not intermix data, redo, archive and index files. Keep them on separate filesystems. This makes it easier to maintain and also to troubleshoot performance.
Use the appropriate block size as required.
Redo configuration
Use 4 redo log groups with 2 members each. Place the redo logs on dedicated filesystems. Make sure the members are big (> 1500MB).
Temporary tablespaces
You can use either raw volumes or ODM-enabled datafiles.
Solid state memory is best suited for temporary tablespaces. However, if it is not available, 72GB 15K RPM drives can be used.
Create temporary tablespace groups and assign the temporary tablespaces to groups accordingly; hopefully Oracle will use the temporary tablespaces without conflict. (A sketch of the redo and temporary tablespace setup follows below.)
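A minimal sketch of the redo and temporary tablespace setup described above (paths, sizes and names are illustrative):

alter database add logfile group 1
  ('/redo01/dw/redo_1a.log', '/redo02/dw/redo_1b.log') size 2000m;
-- repeat for groups 2, 3 and 4

create temporary tablespace temp01
  tempfile '/temp01/dw/temp01_01.dbf' size 20g
  tablespace group temp_grp
  extent management local uniform size 100m;

alter database default temporary tablespace temp_grp;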

24
Datawarehousing – Oracle (Contd.)
Tablespaces and datafiles
Look ahead for 2 years and plan as below. Beyond 2 years, offload old data to static instances.
Identify the number of tablespaces for a schema.
Identify the number of datafiles for a tablespace (depending on the size of the objects and projected growth). If you know the growth, then pre-create the appropriate datafiles as required.
Oracle round-robins across datafiles when creating extents for objects, so the more datafiles available in a tablespace when creating an object, the better the striping of the extents will be.
Use fixed-size datafiles (10G or 20G). Do not enable auto-extend. Use uniform extent sizing. Disabling auto-extend and using uniform extent sizing reduce fragmentation and are a lot more efficient for the database – especially when doing updates/deletes. (A sketch follows after this list.)
Use multiples of the extent size to match the stripe width on the volume. Especially for big tables (> 15-20GB), use big extents such as 200M or higher as required.
Split the datafiles across all available filesystems.
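A minimal sketch of a fixed-size, uniform-extent tablespace along the lines above (names, paths and sizes are illustrative):

create tablespace dw_data
  datafile '/data01/dw/dw_data_01.dbf' size 20g autoextend off,
           '/data02/dw/dw_data_02.dbf' size 20g autoextend off
  extent management local uniform size 200m;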

25
Datawarehousing – Oracle (Contd.)
Multiple block sizes
Multiple block sizes are a mixed bag, useful only when you know your data very well. On Solaris, the maximum block size is 32K.
Enabling multiple block sizes requires you to set aside a portion of the SGA (buffer cache) for each specific block size. (A sketch follows below.)
Writes –
Updates are a very costly operation with bigger block sizes.
Inserts using bulk loads will be very fast. Conventional loading depends on the memory set aside.
Reads –
Index scans – would probably give good performance. But is retrieving 32K of data really necessary when you need only 8K?
Direct reads – I guess these would be fast as they bypass the buffer cache.
I do not know the impact of multiple block sizes on the undo tablespace.
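A minimal sketch of enabling a second block size, assuming an 8K default block size and a 16K tablespace (the cache size, name and path are illustrative):

-- set aside buffer cache for the non-default block size
alter system set db_16k_cache_size = 512m scope=both;

-- create a tablespace that uses the 16K block size
create tablespace dw_16k
  datafile '/data01/dw/dw_16k_01.dbf' size 10g autoextend off
  blocksize 16k
  extent management local uniform size 200m;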

26
Oracle and Storage statistics
Help! I have set everything up as discussed. How do I know if Oracle is performing adequately? I do not have access to run privileged commands (vxstat etc.) and I want to see statistics from Oracle’s perspective. I am not happy with odmstat. I want more data.
Oracle does collect IO statistics at all levels (object, datafile and instance) – I assume most, if not all, are event driven. If they are event driven, then they are extremely accurate. If time sampled, then they are only indicators.
Event-driven statistics are the wait events – sequential reads, scattered reads, log archive IO etc.
Oracle also captures the number of IO operations, throughput, physical reads, writes, response times etc.
All this data is stored in the v$ views and the dba_hist tables.
As to how accurate the numbers are, we can only guess.
I have personally seen discrepancies in Oracle’s reporting, so it is best to correlate with OS statistics.

27
Conclusion

Get storage right the first time or the datawarehouse solution will fail.
Take the time to do a proper assessment, along with testing, before deploying the instance.
Do not skimp on storage.
Storage is the most important component and also the most easily forgotten.
Always correlate Oracle statistics with OS statistics.

28
Questions?

29
