
AIX Performance Tuning for Databases

December 2, 2005
Mathew Accapadi



AIX Performance Tools
 Monitoring/Analysis tools for CPU
– Profiling tools: gprof, xprofiler, prof, tprof, time, timex
– Monitoring tools: vmstat, iostat, sar, emstat, alstat, mpstat, lparstat, topas, nmon, PTX, jtopas, ps, wlmmon, pprof, procmon
– Trace tools: trace, trcrpt, curt, splat, truss
– Hardware counter tools: PMAPI, tcount

 Monitoring/Analysis tools for memory
– Profiling tools: svmon, MALLOCDEBUG, ps
– Monitoring tools: svmon, vmstat, topas, jtopas, nmon, PTX, lsps, ps
– Trace tools: trace, trcrpt
– Hardware counter tools: PMAPI, tcount



AIX Performance Tools cont.
 Monitoring/Analysis tools for Network
– netstat, nfsstat, netpmon
– iptrace, ipreport, ipfilter, tcpdump
– topas, jtopas, nmon, PTX
– trace, trcrpt, curt

 Monitoring/Analysis tools for I/O
– iostat, vmstat
– filemon, fileplace, lvmstat
– topas, jtopas, nmon, PTX
– trace, trcrpt, curt



AIX Performance Tools cont.

 /proc tools
– proccred, procfiles, procflags, procldd, procmap, procrun, procsig, procstack, procstop, proctree, procwait, procwdx
 SPMI
– provides access to all PTX metrics
– allows application metrics to be exported to PTX tools
 Hardware counter tools
– PMAPI
– hpmcount, hpmstat, libhpm



AIX Tuning Tools
 CPU
– schedo (scheduler options)
– priority tools: nice/renice
– affinity tools: bindprocessor, bindintcpu, rset tools
– ulimit (cpu limit)

 Memory
– vmo (virtual memory options)
– ioo (I/O options that affect memory)
– fdpr, chdev (sys0 device), ulimit (data/stack/rss limits)



AIX Tuning Tools cont.
 Network
– no (network options), nfso (NFS options)
– chdev (network adapter tuning), ifconfig (interface tuning)

 I/O
– ioo (I/O options)
– lvmo (LVM options)
– chdev (hdisk, adapter tuning)
– migratepv, migratelp, reorgvg



Standard Tuning Framework
 Starting with AIX 5.2, the main tuning tools are part of a standard framework
– Tools end with ‘o’ (schedo, vmo, ioo, no, nfso, …)
– Each command has the following options:
• -a displays all the parameters
• -p sets the value now and makes it permanent
• -r applies the value at the next reboot and makes it permanent
• -o specifies the tunable parameter with its value
• -h shows help information for the tunable
• -L shows current/default values and ranges for the tunable
• -d resets a tunable to its default value
• -D resets all tunables to default values
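For example, the same option set works across all of the ‘o’ commands; a sketch (the tunable names and values here are illustrative, not recommendations):

# vmo -L minfree                 (show current/default/range for one tunable)
# schedo -p -o maxspin=16384     (set now and persist across reboots)
# no -r -o sb_max=1048576        (apply at the next reboot)
# nfso -d nfs_v3_pdts            (reset one tunable to its default)
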
Standard Tuning Framework cont.
 Tuning commands store values in the /etc/tunables directory
– Standard tuning framework commands modify the following files:
• /etc/tunables/nextboot – used to apply tunable values when the system boots
• /etc/tunables/lastboot – contains the values that were set at the last system boot
• /etc/tunables/lastboot.log – contains log information for any tunable that was changed
– Tunables file commands:
• tunrestore – sets tunables based on parameters in a file (used in /etc/inittab to set tunables from /etc/tunables/nextboot)
• tuncheck – validates the parameter values in a file
• tunsave – saves tunable values to a stanza file
• tundefault – resets tunable parameters to default values
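A minimal sketch of the stanza-file workflow (the file name is illustrative):

# tunsave -f /etc/tunables/mytunables      (save current values to a stanza file)
# tuncheck -f /etc/tunables/mytunables     (validate the values in the file)
# tunrestore -f /etc/tunables/mytunables   (apply the values from the file)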



Semaphore Tuning

 Other OS’s may require tuning of semaphore parameters
 Oracle on AIX uses a post/wait mechanism instead of semaphores
– No need to tune semaphores on AIX
 AIX does not provide tunables for semaphore parameters, and none are needed since the limits are already set as high as they can go:
– semmni (131072) – max number of semaphore IDs
– semmsl (65535) – max number of semaphores per ID
– semopm (1024) – max number of operations per semop call
– semume (1024) – max number of undo entries per process
– semvmx (32767) – maximum value of a semaphore
 Use ipcs to see semaphores/message queues/shared memory
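For example (standard ipcs flags):

# ipcs -s     (semaphore sets)
# ipcs -q     (message queues)
# ipcs -m     (shared memory segments)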



Message Queue Tuning

– Message queue structures are also dynamically scaled by AIX up to their maximum values
– Message queue parameters do not need to be tuned
– Upper limits of message queue parameters:
• msgmax (4194304) – maximum size of message in bytes
• msgmnb (4194304) – maximum number of bytes on a queue
• msgmni (131072) - maximum number of message queue IDs
• msgmnm (524288) - maximum number of messages per queue



Shared Memory Tuning
 The following shared memory parameters do not need to be tuned on AIX (shared memory structures are scaled dynamically up to upper limits)
– shmmni (131072) – number of shared memory IDs
– shmmin (1) – minimum shared memory segment size in bytes
– shmmax (3.25 GB) – maximum shared memory region size for a 32-bit process
– shmmax64 (32 TB) – maximum shared memory region size for a 64-bit process on a 64-bit kernel
– shmmax64 (1 TB) – maximum shared memory region size for a 64-bit process on a 32-bit kernel
 An application can request the size of the shared memory area, and whether or not to use large pages or pin memory, using shmget()



Pinned Shared Memory

 Pinning shared memory requires two steps:
– vmo –p –o v_pinshm=1
– set the application parameter that pins the memory (e.g., lock_sga=TRUE for Oracle)
• The application must specify SHM_PIN in the shmget() system call
 Make sure that pinned memory does not go above the maxpin value (defaults to 80% of RAM but can be changed with vmo –p –o maxpin%=value)
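To sanity-check pinned memory against the limit, a quick sketch:

# vmstat -v | grep -i pin    (compare ‘pinned pages’ with the ‘maxpin percentage’ of total pages)
# svmon -G                   (the ‘pin’ row shows pinned frames)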



Using Large Pages for the Buffer Cache

 Large pages (16 MB) can make a noticeable improvement in performance when the buffer cache is large
 Steps needed to use large pages:
– To enable the use of large pages for shared memory, the vmo parameter v_pinshm must be set to 1
– Use the vmo parameters lgpg_regions and lgpg_size to specify how many large pages to pin, then run bosboot and reboot to enable large pages
– The application must be configured to use large pages (e.g., lock_sga=TRUE for Oracle)
– The application user ID must be configured with permission to use large pages:
• chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE <user id>
– Verify that large pages are there using ‘vmstat –l’ or ‘svmon –G’
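Putting the steps together, a sketch (the region count and user ID are illustrative; 256 regions x 16 MB pins 4 GB):

# vmo -p -o v_pinshm=1
# vmo -r -o lgpg_size=16777216 -o lgpg_regions=256
# bosboot -ad /dev/ipldevice && shutdown -Fr
# chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE oracle
# vmstat -l     (after reboot, verify the large-page counts)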



vmo Parameters
# vmo -a
memory_frames = 1048576
pinnable_frames = 999141
maxfree = 152
minfree = 144
minperm% = 20
minperm = 200476
maxperm% = 50
maxperm = 501190
strict_maxperm = 0
maxpin% = 80
maxpin = 838861
maxclient% = 50
lrubucket = 131072
defps = 1
nokilluid = 0
numpsblks = 131072
npskill = 1024
npswarn = 4096
v_pinshm = 0



vmo Parameters cont
 pta_balance_threshold = 50
 pagecoloring = 0
 framesets = 2
 mempools = 1
 lgpg_size = 0
 lgpg_regions = 0
 num_spec_dataseg = n/a
 spec_dataseg_int = n/a
 memory_affinity = n/a
 htabscale = -1
 force_relalias_lite = 0
 relalias_percentage = 0
 data_stagger_interval = 161
 large_page_heap_size = n/a
 kernel_heap_psize = n/a
 soft_min_lgpgs_vmpool = 0
 vmm_fork_policy = 0
 low_ps_handling = 1
 mbuf_heap_psize = n/a
 strict_maxclient = 1
 cpu_scale_memp = 8



vmo Additional Parameters (5.3)
 memplace_data = 2
 memplace_mapped_file = 2
 memplace_shm_anonymous = 2
 memplace_shm_named = 2
 memplace_stack = 2
 memplace_text = 2
 memplace_unmapped_file = 2
 npsrpgmax = 12288
 npsrpgmin = 9216
 npsscrubmax = 12288
 npsscrubmin = 9216
 rpgclean = 0
 rpgcontrol = 2
 scrub = 0
 scrubclean = 0



ioo Parameters
 # ioo -a
 minpgahead = 2
 maxpgahead = 32
 pd_npages = 65536
 maxrandwrt = 0
 numclust = 1
 numfsbufs = 186
 sync_release_ilock = 0
 lvm_bufcnt = 9
 j2_minPageReadAhead = 2
 j2_maxPageReadAhead = 128
 j2_nBufferPerPagerDevice = 512
 j2_nPagesPerWriteBehindCluster = 32
 j2_maxRandomWrite = 0
 j2_nRandomCluster = 0
 jfs_clread_enabled = 0
 jfs_use_read_lock = 1
 hd_pvs_opn = 2
 hd_pbuf_cnt = 384
 j2_inodeCacheSize = 400
 j2_metadataCacheSize = 400
 j2_dynamicBufferPreallocation = 16
 j2_maxUsableMaxTransfer = 512
 pgahd_scale_thresh = 0
 pv_min_pbuf = 512 (5.3)



schedo Parameters
# schedo -a
v_repage_hi = 0
v_repage_proc = 4
v_sec_wait = 1
v_min_process = 2
v_exempt_secs = 2
pacefork = 10
sched_D = 16
sched_R = 16
timeslice = 1
maxspin = 16384
%usDelta = 100
affinity_lim = 7
idle_migration_barrier = 4
fixed_pri_global = 0
big_tick_size = 1
force_grq = 0



schedo Additional Parameters (5.3)
 krlock_confer2self = 0
 krlock_conferb4alloc = 0
 krlock_enable = 1
 krlock_spinb4alloc = 1
 krlock_spinb4confer = 1024
 n_idle_loop_vlopri = n/a
 search_globalrq_mload = n/a
 search_smtrunq_mload = n/a
 setnewrq_sidle_mload = n/a
 shed_primrunq_mload = n/a
 sidle_S1runq_mload = n/a
 sidle_S2runq_mload = n/a
 sidle_S3runq_mload = n/a
 sidle_S4runq_mload = n/a
 slock_spinb4confer = 1024
 smt_snooze_delay = n/a
 smtrunq_load_diff = n/a
 unboost_inflih = n/a



I/O Layers

 Database Application --> Library
 Asynchronous I/O (optional)
 Filesystem
 Virtual Memory Manager (optional)
 LVM
 Disk Subsystem powerpath/vpath layer (optional)
 Disk Driver
 Fibre Channel Protocol
 Fibre Channel Device Driver
 Host Bus Adapter -> Switch -> Disk Subsystem



Application I/O layer

 Databases can do reads using read(), readv(), pread(), lio_listio(), aio_read()
 Databases can do writes with write(), writev(), pwrite(),
lio_listio(), aio_write()
 Database I/O sizes (with exceptions for the redo log writer and archiver) are based on the database block buffer size and parameters such as db_multiblock_read_count
 Usually, biggest improvements in performance are achieved
through tuning at the application/database layer



Database Logical vs Physical I/Os

 It’s more efficient to rely on database caching than on filesystem caching
– No system call required to do reads
– Database logical I/Os usually referred to as ‘buffer gets’
– Database physical I/Os handled via read system call but may
not be a physical I/O from an operating system perspective
(could be a logical I/O on the OS side since the data may be
retrieved from the filesystem cache)
• OS physical I/Os (seen in output of iostat command) always go to
the disk layer whereas DB physical I/Os can be retrieved from the
filesystem cache (DB physical I/Os >= OS physical I/Os)
• If filesystem cache is not used, DB physical I/Os would be equal to
the number of OS physical I/Os



Asynchronous I/O Layer

 AIO to raw LVM devices can take a fastpath and bypass the AIO servers
– No tuning needs to be done in this case
 AIO to filesystems currently uses a set of AIO queues and AIO server threads
– AIO server threads take I/Os off the queues and submit them to the filesystem
– The number of AIO server threads can be tuned (maxservers, a per-CPU value)
– An AIO server thread does synchronous or non-synchronous I/O based on the file open flags
– The AIO parameter ‘maxreqs’ specifies how many AIOs can be in progress and on the queues at any one time
• Once the limit is reached, an EAGAIN error is returned to the application



Asynchronous I/O Parameters

# lsattr -E -l aio0
autoconfig available STATE to be configured at system restart True
fastpath   enable    State of fast path                       True
kprocprio  39        Server PRIORITY                          True
maxreqs    4096      Maximum number of REQUESTS               True
maxservers 10        MAXIMUM number of servers per cpu        True
minservers 1         MINIMUM number of servers                True



Asynchronous I/O Tuning

 maxservers and maxreqs may need to be tuned
 AIO server threads are created dynamically as needed up to the maxservers parameter and stay in existence from then on
 AIO parameters can be tuned on a permanent basis using
SMIT or chdev –l aio0 –a parameter=value
 AIO parameters can be dynamically increased temporarily
using the aioo command
– aioo –o parameter=value
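A sketch of both methods (the values are illustrative starting points):

# chdev -l aio0 -a maxservers=30 -a maxreqs=8192   (permanent; takes effect after reboot)
# aioo -o maxservers=30                            (temporary; lasts until reboot)
# pstat -a | grep -c aios                          (count the AIO server kprocs currently running)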



Filesystem Sequential Read Tuning

 Sequential reads benefit from readahead parameters
– minpgahead, maxpgahead for JFS
– j2_minPageReadAhead, j2_maxPageReadAhead for JFS2
– Increasing the max readahead parameters can benefit sequential reads (e.g., table scans); see the sketch below
• Need to increase the maxfree parameter also to ensure that LRU keeps up with readahead
– The ‘rbr’ mount option releases pages from the file cache once they are read into the application buffers (only for sequential reads)
– Filesystem readahead is only available when using the filesystem cache (must rely on database readahead otherwise)
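A minimal sketch for a JFS2 filesystem doing large sequential scans (values are illustrative):

# ioo -p -o j2_maxPageReadAhead=512
# vmo -p -o minfree=960 -o maxfree=1472    (keep maxfree >= minfree + j2_maxPageReadAhead)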



Filesystem Sequential Write Tuning

 The ‘rbw’ mount option will release pages on writes if the writes are sequential and the file was opened non-synchronously
 Non-synchronous sequential writes may benefit from write-behind tuning
– numclust parameter for JFS, j2_nPagesPerWriteBehindCluster for JFS2
– In most cases, databases will not be doing such writes since most database files are opened with a sync flag



Filesystem Buffer Tuning

 I/Os to filesystems use buffer structures called bufstructs
– Each filesystem preallocates a pool of bufstructs when the filesystem is mounted
– If a bufstruct is not available, the I/O is queued until an already submitted I/O completes and releases its bufstruct
– A counter is incremented for each filesystem type when a bufstruct is not available; run ‘vmstat –v’ to examine the counters



Filesystem Buffer Tuning cont.
 vmstat -v
 ---------
 1572864 memory pages
 1509213 lruable pages
 18938 free pages
 3 memory pools
 211903 pinned pages
 80.1 maxpin percentage
 5.0 minperm percentage
 5.0 maxperm percentage
 58.4 numperm percentage
 882475 file pages
 0.0 compressed percentage
 0 compressed pages
 27.2 numclient percentage
 5.0 maxclient percentage
 411210 client pages
 0 remote pageouts scheduled
 0 pending disk I/Os blocked with no pbuf
 0 paging space I/Os blocked with no psbuf
 6263 filesystem I/Os blocked with no fsbuf
 0 client filesystem I/Os blocked with no fsbuf
 0 external pager filesystem I/Os blocked with no fsbuf



Filesystem Buffer Tuning cont.

 Increase the ioo parameter numfsbufs for JFS filesystems if the ‘filesystem I/Os blocked with no fsbuf’ counter continues to increase
 Increase the ioo parameter j2_nBufferPerPagerDevice if the ‘external pager filesystem I/Os blocked with no fsbuf’ counter continues to increase
– Starting with 5.2 ML4, it should be rare that this parameter needs to be tuned since JFS2 does dynamic buffer allocation (increases bufstructs as needed)
– The ioo parameter j2_dynamicBufferPreallocation controls how many Kbytes worth of bufstructs are allocated each time
• Can be set to 0 to disable dynamic buffer allocation
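For example (the new values are illustrative; numfsbufs takes effect when the filesystem is remounted):

# ioo -p -o numfsbufs=372
# ioo -p -o j2_nBufferPerPagerDevice=1024
# vmstat -v | grep fsbuf     (re-check the blocked-I/O counters afterward)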



Filesystem Direct I/O

 By default, JFS and JFS2 filesystem I/Os are cached in memory
 Caching can be bypassed through the use of Direct I/O or Concurrent I/O
 Direct I/O is used by default with Oracle on GPFS filesystems by specifying an open() flag (O_DIO)
 Direct I/O is enabled at the filesystem level by using the ‘dio’ mount option (example below)
– I/O size must be a multiple of the filesystem block size
 The Direct I/O mount option is supported by JFS, JFS2, and GPFS
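A sketch of enabling DIO on a mount (the device and mount-point names are illustrative):

# mount -o dio /dev/datalv /oradata     (mount with Direct I/O for this session)
# chfs -a options=dio /oradata          (record the option permanently in /etc/filesystems)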



Performance Impacts of Direct I/O

 Bypassing caching eliminates VMM overhead
 Direct I/O reads will not be able to take advantage of readahead at the filesystem layer (the disk subsystem may provide readahead)
readahead at the filesystem layer (disk subsystem may
provide readahead)
– Database may provide its own prefetch mechanism
 Useful for random I/O and I/O that would normally not have a
high cache hit rate
 Direct I/O writes will not be considered complete until they
have made it to disk
– Many databases open their files with a sync flag, so writes
must go to disk each time anyway



Filesystem Concurrent I/O

 Concurrent I/O is a JFS2 feature which is Direct I/O without inode locking
– Inode lock is still held when the file is being extended
– Inode locking by the filesystem is not necessary if the
application is performing proper serialization
• Application vendor must provide support for Concurrent I/O
– Concurrent I/O has all characteristics of Direct I/O
• No readahead, so it may not be optimal in all cases
– No inode locking can provide a large increase in performance
(when DB has a lot of writes and reads to the same files)



Concurrent I/O Enablement

 Concurrent I/O is enabled either through the ‘cio’ mount option or the O_CIO flag in the open() system call
 Oracle 10g will open files with O_CIO if the Oracle parameter filesystemio_options is set to ‘setall’
 The directory where the application executables and libraries reside should not be mounted with CIO (e.g., $ORACLE_HOME – Oracle doesn’t use latches on files there)
– A ‘namefs’ mount can be used to mount a subdirectory using ‘cio’
• mount –v namefs –o cio /filesystem/subdir /filesystem/subdir
 For Oracle redo logs, put the redo logs in a separate filesystem created with a 512-byte block size and mount it with ‘cio’ (see the sketch below)
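A sketch of creating such a redo-log filesystem (the VG, LV, and mount-point names are illustrative):

# crfs -v jfs2 -g datavg -m /oralog -a agblksize=512 -A yes
# mount -o cio /oralog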



Filesystem File Caching

 When DIO or CIO is not used, files are cached in real memory
– The size of the file cache is based on the ‘vmo’ parameters maxperm (for JFS) and maxclient (for JFS2)
• strict_maxperm=0 (default) makes the JFS cache size a soft limit
• strict_maxclient=1 (default) makes the JFS2 cache size a hard limit
• With soft limits, the number of file pages in RAM can exceed the limit, but if page replacement needs to occur, only file pages are stolen/replaced



Page Replacement (LRU)

 Page replacement, also known as LRU or Least Recently Used, is handled by one or more threads called lrud (a multi-threaded kernel process)
 LRU runs when certain thresholds are reached:
– If the number of file pages in memory comes within ‘minfree’ pages of the file cache limit (if the limit is strict)
• LRU stops when the number of file pages in memory is within ‘maxfree’ pages of the file cache limit
– If the number of free memory pages on the VMM freelist reaches ‘minfree’
• LRU stops when the number of free memory pages on the freelist reaches ‘maxfree’
– If a WLM class has reached its limit



LRU scanning

 LRU scans the page frame table looking for eligible pages using a simple least-recently-used criterion
– If the page has its reference bit set, it is not stolen, but the reference bit is reset for the next pass of LRU
– If the reference bit is not set, the page may be stolen if it meets certain criteria
• If the page is a file page and the number of file pages is above the file cache limit, the page can be stolen
• If the number of file pages in memory is between the minperm value and the maxperm/maxclient value, then repaging rates are used to determine if the page can be stolen
– If the lru_file_repage parameter is set to 0, then if the number of file pages in memory is above the minperm value, only file pages are stolen
> Recommendation: set lru_file_repage=0, minperm%=1
• If the number of file pages in memory is below minperm, then LRU steals an unreferenced page regardless of whether it’s a file page or a computational page



LRU Scanning cont.

 If not enough eligible pages are found after scanning ‘lrubucket’ worth of pages (default is 131072 pages), LRU starts over in that bucket and scans again
– The scan rate can be viewed under the ‘sr’ column in the output of the vmstat command
 When pages are stolen, they may be freed immediately if the pages were not modified, or freed after the pages are written out to disk
– The free rate can be viewed under the ‘fr’ column in the output of the vmstat command



Page Replacement and Paging Space

 Computational pages that are modified and stolen must be paged out to paging space
– Paging space I/O activity can be seen in the ‘pi’ (page-in rate) and ‘po’ (page-out rate) columns of vmstat
– With vmstat –I, file pages read in from the filesystem (for cached filesystems) show page-in rates under the ‘fi’ column; file writes show page-out rates under the ‘fo’ column
 Performance can be noticeably impacted if computational pages (such as the database buffer caches or the process heap) are paged out and have to be paged in again



vmstat -I
# vmstat -I 10
System Configuration: lcpu=32 mem=191488MB
kthr memory page faults cpu
r b p avm fre fi fo pi po fr sr in sy cs us sy id wa
8 1 0 12945134 4213 151 7488 0 21 179 193 711 202851 203538 15 9 76 1
9 1 0 12945157 3926 25 6423 0 23 179 277 453 116885 191633 14 8 78 1
8 1 0 12945194 5759 15 9065 0 24 231 463 2008 125516 190439 14 9 76 1
12 1 0 12945211 5486 31 9958 0 15 243 428 3799 117624 189488 14 18 64 4
10 1 0 12945247 4280 29 6193 0 7 140 224 427 113468 190980 12 8 79 0
11 1 0 12945258 3921 10 5845 0 0 0 0 484 112393 191256 11 8 80 0
11 0 0 12945262 4092 12 5823 0 3 51 89 407 112539 191034 12 8 80 0
7 2 0 12946529 4025 88 6353 0 32 383 493 541 114747 191927 11 9 79 1
6 1 0 12945285 3868 80 6564 0 19 218 433 622 118519 190818 14 11 74 1
9 1 0 12945301 4663 60 9375 0 17 165 240 3114 118963 192304 13 10 77 1
8 7 0 12945308 4282 11 9270 0 0 0 0 1878 109050 185043 10 16 72 2
9 1 0 12945398 3898 10 5835 0 0 0 0 499 113986 193613 12 8 79 0

avm = 12945406 pages, or 12945406 * 4K bytes = 49 GB of Active Virtual Memory
This server has 187 GB of RAM.



How Many File Pages in Memory?
# vmstat -v
49020928 memory pages
47128656 lruable pages
5807 free pages
4 memory pools
2404954 pinned pages
80.0 maxpin percentage
20.0 minperm percentage
80.0 maxperm percentage
77.9 numperm percentage
36732037 file pages
0.0 compressed percentage
0 compressed pages
78.0 numclient percentage
80.0 maxclient percentage
36767316 client pages
0 remote pageouts scheduled
321640 pending disk I/Os blocked with no pbuf
763 paging space I/Os blocked with no psbuf
2888 filesystem I/Os blocked with no fsbuf
9832 client filesystem I/Os blocked with no fsbuf
2038066 external pager filesystem I/Os blocked with no fsbuf



Tuning to Prevent Paging

 Assuming that memory is not over-committed, tuning vmo parameters may eliminate paging
 For JFS, lowering maxperm below numperm while keeping it a soft limit should eliminate paging
 For JFS2, maxclient can be lowered, but it would have to be changed to a soft limit
– maxperm would have to be lowered to the same value as or higher than maxclient
 The best solution is to simply disable the use of repage counters



Disabling Use of Repage Counters

 The vmo parameter ‘lru_file_repage’ can be set to 0, which means do not use the repage counters
 If the value is 1 (the default):
– if numperm is between minperm and maxperm, or if numclient is between minperm and maxclient, repage counters are used
 If the value is 0:
– if numperm is higher than minperm (for JFS), or if numclient is higher than minperm (for JFS2), only file pages are stolen
 Best solution for paging issues when using filesystems: lower minperm to a low value like 5% and set lru_file_repage=0 (commands below)
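The corresponding commands, as a sketch:

# vmo -p -o lru_file_repage=0    (steal only file pages while numperm/numclient exceed minperm)
# vmo -p -o minperm%=5           (lower the floor below which computational pages also become eligible)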



Result of Setting lru_file_repage=0
# vmstat -I 10
System Configuration: lcpu=16 mem=191488MB

kthr memory page faults cpu


r b p avm fre fi fo pi po fr sr in sy cs us sy id wa
9 2 0 35226223 4272 950 9879 0 0 1016 1810 3939 99613 53246 37 7 55 1
6 0 0 35226288 4284 272 8256 0 0 618 1088 2883 63720 47746 30 5 66 0
7 1 0 35226356 4194 469 8078 0 0 758 1214 2805 59068 45870 26 5 69 0
5 0 0 35226431 4320 542 8101 0 0 865 1387 2886 58182 43960 28 4 68 0
7 0 0 35226479 4338 640 8002 0 0 923 1561 2779 54933 40913 28 6 66 0
8 1 0 35226556 4355 565 22850 0 0 959 1928 9190 74983 48209 40 9 49 2
9 1 0 35226899 3910 379 8232 0 0 756 1717 2889 63565 48098 31 5 64 0
8 1 0 35226971 3968 489 8351 0 0 878 2190 3177 67044 50490 32 5 63 0
9 0 0 35228294 3965 632 8473 0 0 1284 2755 2923 71085 48734 33 5 61 0
8 0 0 35227083 4423 639 8406 0 0 1113 1807 2597 58393 42845 29 6 64 1
5 1 0 35227125 3905 876 8059 0 0 1092 1845 3029 55988 42565 26 5 69 0
5 1 0 35227164 4240 1898 9557 0 0 2502 4365 4544 65081 45162 29 6 64 1
8 2 0 35227229 3960 840 21796 0 0 1097 2011 9045 59728 44146 34 9 54 3
7 1 0 35227279 4693 750 8321 0 0 1321 2156 2827 58103 43630 30 7 63 1

avm = 35227279 pages, or 35227279 * 4K bytes = 134 GB of Active Virtual Memory
This server has 187 GB of RAM.



Memory Over-Commitment

 Memory is considered over-committed if the working storage (computational) requirements exceed the real memory size
– The ‘avm’ column in the output of vmstat shows the number of working storage pages
• Multiply this by 4K; if it is greater than RAM, then memory is over-committed
– If memory is over-committed, it is recommended to reduce the workload or add more real memory
– It is important to have sufficient paging space in the case of memory over-commitment
• Should be at least the size of ‘avm’ prior to AIX 5.3
• AIX 5.3 provides paging space garbage collection



Page Replacement by Memory Pool

 By default (if memory_affinity=1), each chip module (MCM on POWER4 or DCM on POWER5) will have at least one memory pool
 Each memory pool has its own LRU daemon to do page replacement, and VMM parameters such as minfree, maxfree, minperm, maxperm, and maxclient apply on a per-pool basis
 LRU runs on its own pool when thresholds are reached
 The number of memory pools in a chip module is based on the amount of RAM on the chip module (if memory_affinity=1) and the number of CPUs in the LPAR
 If memory_affinity is disabled (set to 0 using ‘vmo’, bosboot, reboot), then the number of pools is based on the total amount of RAM and the number of CPUs
– This method guarantees evenly sized memory pools, which is desirable
– AIX 5.3 will not allow disabling of memory affinity until ML3



Monitoring Memory Usage

 ‘vmstat –v’ can be used to show the number of file pages in memory
 The ‘avm’ column in vmstat shows working storage/computational memory usage
 The ‘fre’ column in vmstat shows free real memory
– Note that in other OS’s like Solaris, free memory may not really be free but may also include the filesystem cache
 Process memory usage can be monitored using commands such as ‘ps’, ‘svmon’, ‘topas’, ‘nmon’, or PTX (Performance Toolbox)



Monitoring Process Memory Usage using PS
 ps reports memory in 1 KB units

 # ps gv
 PID PGIN SIZE RSS TRS DRS C PRI NI %CPU TIME CMD
 0 7 64 64 0 64 120 16 -- 0.1 2:25 swapper
 1 108 844 880 36 844 0 60 20 0.0 0:03 init
 8196 0 48 48 0 48 120 255 -- 27.0 954:21 wait
 12294 0 48 48 0 48 120 255 -- 26.2 926:39 wait
 16392 0 48 48 0 48 120 255 -- 26.0 918:13 wait
 20490 0 48 48 0 48 0 255 -- 0.0 0:00 wait
 24588 0 56 56 0 56 120 17 -- 0.0 0:33 reaper
 28686 0 92 92 0 92 0 16 -- 0.3 12:01 lrud



Monitoring System Memory Usage using svmon

 # svmon -G
 size inuse free pin virtual
 memory 1572864 1554348 18516 211932 652201
 pg space 1048576 5363

 work pers clnt lpage
 pin 211932 0 0 0
 in use 652220 495384 406744 0



Monitoring Process Memory Using svmon
 # svmon –P 978946
 Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd LPage
 978946 oracle 50541 3840 0 46773 Y N N

 Vsid Esid Type Description LPage Inuse Pin Pgsp Virtual
 73ece 70000000 work default shmat/mmap - 36255 0 0 36255
 0 0 work kernel segment - 6696 3818 0 6696
 5becb 10 pers text data BSS heap, - 3750 0 - -
 /dev/lv63:86028
 9bed3 11 work text data BSS heap - 1789 0 0 1789
 688ad 90000000 work loader segment - 1221 0 0 1221
 7124e 90020014 work shared library text - 420 0 0 420
 3bee7 9001000a work shared library text - 121 0 0 121
 e3ebc 80020014 work private load - 103 0 0 103
 c3eb8 8001000a work private load - 90 0 0 90
 13ec2 f00000002 work process private - 29 22 0 29
 808f0 9ffffffe work other kernel segments - 29 0 0 29
 49249 9fffffff pers shared library text, - 11 0 - -
 /dev/hd2:76083
 53eca 8fffffff work private load - 10 0 0 10
 d3eba ffffffff work application stack - 10 0 0 10
 bbed7 - pers /dev/lv63:32787 - 6 0 - -
 6becd - pers /dev/app_oracle:10518 - 1 0 - -



LVM Tuning

 LVM stores data in LTG units (default 128K)
– I/Os sent to the LVM larger than the LTG size are broken up into multiple I/Os, but the disk layer can coalesce them back into larger I/Os
 LVM initiates I/O to the disk layer once a buffer structure called a pbuf is available
– Shortages of pbufs can be viewed using ‘vmstat –v’
– Pbufs can be dynamically increased by increasing the value of hd_pbuf_cnt (using ioo in AIX 5.2)
– AIX 5.3 has a pbuf pool per volume group
• Shortages are seen in the output of the ‘lvmstat’ command
• Per-VG pbufs can be increased using the lvmo command (see the sketch below)
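A sketch of the AIX 5.3 per-VG path (the VG name is illustrative):

# lvmo -v datavg -a                        (check pervg_blocked_io_count for the VG)
# lvmo -v datavg -o pv_pbuf_count=1024     (raise pbufs per physical volume in that VG)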



LVM Tuning cont.

 Sequential I/O can benefit from LVM striping across multiple disks
– Stripe size and width should take into account the typical application I/O size
 LVM can also provide mirroring; however, it is usually more efficient to do hardware mirroring if available
 lvmstat can be used to monitor LVM hotspots
– Individual LVM partitions can be moved from one hdisk to another using the migratelp command
– Entire LVM devices can be moved from one hdisk to another using the migratepv command, even while the LVM device is in use



lvmo - LVM tuning command

 # lvmo –a

 vgname = rootvg
 pv_pbuf_count = 512
 total_vg_pbufs = 512
 max_vg_pbuf_count = 16384
 pervg_blocked_io_count = 17
 global_pbuf_count = 512
 global_blocked_io_count = 17



Disk Tuning

 Disks may have a max_coalesce and/or a max_transfer parameter, which are upper limits on the size of a disk I/O
– Increasing max_coalesce and/or max_transfer can allow coalescing of sequential I/Os into larger I/Os
 Disks may have a tunable queue_depth parameter, which places a limit on how many I/Os can be sent to the disk at a time
– I/Os not queued to the disk may be coalesced by the device driver
– Logical disks backed by many physical disks can benefit from larger queue_depth values (see the sketch below)
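For example (attribute names and limits vary by disk driver; the values are illustrative):

# lsattr -E -l hdisk4 | egrep 'queue_depth|max_transfer|max_coalesce'
# chdev -l hdisk4 -a queue_depth=32 -a max_transfer=0x100000 -P    (-P defers the change to the next reboot)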



Disk Bottlenecks

 High disk utilizations (viewed via iostat) can be an indication of a bottleneck
 Logical hdisks may not have enough physical disks or may
not be optimally organized in the disk subsystem (striping, #
of LUNs, # of paths, # of ports)
 Disk multipathing software may have bottlenecks (could be
limited by disk queue depths or by multipath process)
 Other servers attached to the SAN may affect the I/O
response time of a server
 Bottleneck can also occur on the SAN switch
 Disk subsystem monitoring software should be used to
detect bottlenecks



iostat
 # iostat 5
 tty: tin tout avg-cpu: % user % sys % idle % iowait
 0.0 5.7 6.7 5.9 59.9 27.6

 Disks: % tm_act Kbps tps Kb_read Kb_wrtn
 hdisk1144 8.6 127.5 15.3 284 992
 hdisk1145 7.1 110.0 11.3 108 993
 hdisk2 0.0 0.0 0.0 0 0
 hdisk3 0.0 0.0 0.0 0 0
 hdisk11 0.0 0.0 0.0 0 0
 hdisk13 0.0 0.0 0.0 0 0
 hdisk14 0.0 0.0 0.0 0 0
 hdisk15 0.0 0.0 0.0 0 0
 hdisk35 0.2 2.8 0.7 0 28
 hdisk41 0.0 0.0 0.0 0 0
 hdisk42 0.0 0.0 0.0 0 0
 hdisk43 0.0 0.0 0.0 0 0
 hdisk46 0.0 0.0 0.0 0 0
 hdisk47 23.3 77.9 19.5 780 0



iostat –a, iostat -s
 # iostat -a 1

 System configuration: lcpu=2 drives=3

 tty: tin tout avg-cpu: % user % sys % idle % iowait
 0.0 37.3 0.0 0.5 99.5 0.0

 Adapter: Kbps tps Kb_read Kb_wrtn
 scsi0 0.0 0.0 0 0

 Disks: % tm_act Kbps tps Kb_read Kb_wrtn
 hdisk0 0.0 0.0 0.0 0 0
 hdisk1 0.0 0.0 0.0 0 0



iostat –D (detailed disk stats)
 # iostat –D 5
 hdisk11 xfer: %tm_act bps tps bread bwrtn
 0.0 0.0 0.0 0.0 0.0
 read: rps avgserv minserv maxserv timeouts fails
 0.0 0.0 0.0 0.0 0 0
 write: wps avgserv minserv maxserv timeouts fails
 0.0 0.0 0.0 0.0 0 0
 queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
 0.0 0.0 0.0 0.0 0.0 0
 hdisk12 xfer: %tm_act bps tps bread bwrtn
 0.0 0.0 0.0 0.0 0.0
 read: rps avgserv minserv maxserv timeouts fails
 0.0 0.0 0.0 0.0 0 0
 write: wps avgserv minserv maxserv timeouts fails
 0.0 0.0 0.0 0.0 0 0
 queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
 0.0 0.0 0.0 0.0 0.0 0



iostat –A (Asynchronous I/O stats)
 # iostat -A 1

 System configuration: lcpu=2 drives=3

 aio: avgc avfc maxg maxf maxr avg-cpu: % user % sys % idle % iowait
 0 0 0 0 4096 0.0 0.0 100.0 0.0

 Disks: % tm_act Kbps tps Kb_read Kb_wrtn
 hdisk0 0.0 0.0 0.0 0 0
 hdisk1 0.0 0.0 0.0 0 0
 cd0 0.0 0.0 0.0 0 0

 avgc Average global AIO request count per second for the specified interval (filesystem)
 avfc Average fastpath request count per second for the specified interval (raw LV)
 maxg Maximum global AIO request count since the last time this value was fetched (filesystem)
 maxf Maximum fastpath request count since the last time this value was fetched (raw LV)
 maxr Maximum number of AIO requests allowed (the maxreqs attribute)



Network Layers

 Transmit
– Application -> socket -> TCP or UDP -> IP -> Interface ->
Device Driver -> Adapter -> wire
 Receive
– Wire -> Adapter -> Device Driver -> Demux -> IP -> TCP
or UDP -> socket -> Application



Network Tuning

 Databases that transmit data over a network (local or remote) may run into network bottlenecks
 Parameters such as tcp_sendspace and tcp_recvspace can be used to submit larger network I/Os without blocking
 Parameters such as tcp_nodelayack and tcp_nagle_limit can be used to eliminate delays that can occur due to algorithms such as Nagle and delayed acknowledgements (see the example below)
– Set tcp_nodelayack=1
– Set tcp_nagle_limit=0
 Check for media mismatches (sometimes non-zero CRC errors in the output of ‘netstat –v’ are a clue)
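As ‘no’ commands (the buffer sizes are illustrative; values this large may also require raising sb_max):

# no -p -o tcp_nodelayack=1 -o tcp_nagle_limit=0
# no -p -o tcp_sendspace=262144 -o tcp_recvspace=262144
# netstat -v | grep -i crc     (check adapter statistics for media/duplex mismatches)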



CPU monitoring

 A server is considered starved for CPU resources if the number of runnable threads exceeds the number of logical CPUs and CPU utilization is 100%
 CPU utilization in this case refers to sum of %user and
%system
 %iowait is simply a form of idle time
– Indicates % of time the CPU was idle but there was at least
one I/O in progress
 CPU monitoring tools have new options for SMT and
SPLPAR



CPU Utilization in an SMT enabled LPAR
 Each physical CPU has two hardware threads; each hardware thread is viewed as a logical processor by AIX
– each logical processor still collects 100 utilization samples per second
– "ticks" will still be collected in per-logical-processor cpuinfo structures (for binary compatibility)
– additional PURR-based metrics (from the PURR registers) will be collected in new structures
– and sorted in the same four categories: user, sys, iowait, and idle
– values are accumulated PURR ticks

 New "physical" CPU utilization calculation
– current metrics can be misleading unless they’re modified to use PURR
– the case of one hardware thread 100% busy and one hardware thread idle would show 50% utilization with the old method
• but the physical processor is really 100% busy
– displayed %user/%sys/%idle/%wait are now calculated using the PURR-based metrics
• in the case of one thread 100% busy, PURR-based utilization would be 100%
• one thread would receive (almost) all the PURR increments, the other (practically) none
• practically 100% of the PURR increments would go into the %user and %sys buckets



topas
 Topas Monitor for host: server EVENTS/QUEUES FILE/TTY
 Tue Apr 12 19:36:59 2005 Interval: 2 Cswitch 177 Readch 1901
 Syscall 200 Writech 686
 Kernel 0.4 |# | Reads 4 Rawin 0
 User 0.0 |# | Writes 1 Ttyout 678
 Wait 10.7 |#### | Forks 0 Igets 0
 Idle 89.0 |######################### | Execs 0 Namei 14
 Runqueue 0.0 Dirblk 0
 Network KBPS I-Pack O-Pack KB-In KB-Out Waitqueue 0.0
 en2 0.8 2.5 0.5 0.1 0.7
 en7 0.0 0.0 0.0 0.0 0.0 PAGING MEMORY
 Faults 69 Real,MB 128256
 Disk Busy% KBPS TPS KB-Read KB-Writ Steals 0 % Comp 7.0
 hdisk2 0.0 0.0 0.0 0.0 0.0 PgspIn 0 % Noncomp 0.5
 hdisk0 0.0 12.0 2.5 0.0 12.0 PgspOut 0 % Client 0.5
 PageIn 0
 Name PID CPU% PgSp Owner PageOut 0 PAGING SPACE
 syncd 78034 0.3 0.5 root Sios 0 Size,MB 512
 topas 270336 0.0 1.4 root % Used 3.5
 gil 25156 0.0 0.1 root NFS (calls/sec) % Free 96.4
 sched 4920 0.0 0.1 root ServerV2 0
 sched 4652 0.0 0.1 root ClientV2 0 Press:
 sched 4384 0.0 0.1 root ServerV3 0 "h" for help
 aixmibd 320434 0.0 0.6 root ClientV3 0 "q" to quit
 nfsd 373102 0.0 0.2 root
 rpc.lock 73904 0.0 0.2 root



Performance Data Collection

 If a performance problem requires IBM support, a tool called PERFPMR is used to collect performance data
 PERFPMR is downloadable from a public FTP site:
– ftp ftp.software.ibm.com using anonymous ftp
– cd /aix/tools/perftools/perfpmr/perfXX (where XX is the AIX release)
– Get the compressed tar file in that directory and install it using the directions in the provided README file
– PERFPMR is updated periodically, so it’s advisable to check the FTP site for the most recent version



Running PERFPMR

 Once PERFPMR has been installed, you can run it in any directory
– To determine the amount of space needed, estimate at least 20MB per logical CPU plus an extra 50MB of space
– Run “perfpmr.sh <# of seconds>” at a time when the performance problem is occurring (see the sketch below)
– A pair of 5-second traces is collected first
– Then various monitoring tools are run for the duration of time specified as a parameter to perfpmr.sh
– After this, tprof, filemon, iptrace, and tcpdump data is collected
– Finally, system config data is collected
– The data can be tar’d up and sent to testcase.software.ibm.com with the pmr# in the filename
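A typical invocation, as a sketch (600 seconds is an illustrative window; the working directory is arbitrary):

# mkdir /tmp/perfdata && cd /tmp/perfdata
# perfpmr.sh 600     (traces, then monitors for 600 seconds, then tprof/filemon/iptrace/tcpdump, then config)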

